{"doi":"10.1101/2020.06.02.130955","title":"SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes","abstract":"Despite its overwhelming clinical importance, the SARS-CoV-2 gene set remains unresolved, hindering dissection of COVID-19 biology. Here, we use comparative genomics to provide a high-confidence protein-coding gene set, characterize protein-level and nucleotide-level evolutionary constraint, and prioritize functional mutations from the ongoing COVID-19 pandemic. We select 44 complete Sarbecovirus genomes at evolutionary distances ideally-suited for protein-coding and non-coding element identification, create whole-genome alignments, and quantify protein-coding evolutionary signatures and overlapping constraint. We find strong protein-coding signatures for all named genes and for 3a, 6, 7a, 7b, 8, 9b, and also ORF3c, a novel alternate-frame gene. By contrast, ORF10, and overlapping-ORFs 9c, 3b, and 3d lack protein-coding signatures or convincing experimental evidence and are not protein-coding. Furthermore, we show no other protein-coding genes remain to be discovered. Cross-strain and within-strain evolutionary pressures largely agree at the gene, amino-acid, and nucleotide levels, with some notable exceptions, including fewer-than-expected mutations in nsp3 and Spike subunit S1, and more-than-expected mutations in Nucleocapsid. The latter also shows a cluster of amino-acid-changing variants in otherwise-conserved residues in a predicted B-cell epitope, which may indicate positive selection for immune avoidance. Several Spike-protein mutations, including D614G, which has been associated with increased transmission, disrupt otherwise-perfectly-conserved amino acids, and could be novel adaptations to human hosts. 
The resulting high-confidence gene set and evolutionary-history annotations provide valuable resources and insights on COVID-19 biology, mutations, and evolution.","journal":null,"year":2020,"id":3157,"datarank":1.561890375723201,"base_score":3.2188758248682006,"endowment":3.2188758248682006,"self_citation_contribution":0.48283137373023016,"citation_network_contribution":1.079059001992971,"self_endowment_contribution":0.48283137373023016,"citer_contribution":1.079059001992971,"corpus_percentile":62.40846216436127,"corpus_rank":463,"citation_count":24,"citer_count":22,"citers_with_citation_signal":19,"citers_with_endowment":19,"datacite_reuse_total":0,"is_dataset":true,"is_dataset_confidence":0.5006,"is_oa":true,"file_count":0,"downloads":0,"has_version_chain":false,"published_date":"2020-06-03","fair_score":39.375,"fair_percentile":20.030782761653473,"algorithm_id":"datarank_citation_only_1hop_v6","ranking_scope":"data_only","authors":[{"id":5957,"name":"Rachel Sealfon","orcid":"0000-0002-3007-4698","position":1,"is_corresponding":false},{"id":14693,"name":"Sharon L. R. 
Kardia","orcid":"0000-0002-9853-3379","position":2,"is_corresponding":false},{"id":292,"name":"Irwin Jungreis","orcid":"0000-0002-3197-5367","position":0,"is_corresponding":true}],"reference_count":57,"raw_metadata":{"citation_network_status":"fetched"},"created_at":"2026-03-01T18:20:47.508186Z","pmid":"32577641","pmcid":"PMC7302193","fwci":null,"citation_percentile":null,"influential_citations":0,"oa_status":"green","license":"cc-by","views":0,"total_file_size_bytes":0,"version_count":0,"fair_f":52.5,"fair_a":55.0,"fair_i":25.0,"fair_r":25.0,"fair_zscore":-0.5279,"fair_rationale":{"fair_score":39.38,"has_llm":true,"dimensions":{"F":{"name":"Findable","score":52.5,"criteria":[{"key":"f_has_doi","label":"Has a persistent DOI","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"DOI present","rationale":null},{"key":"f_repository_presence","label":"Indexed in repositories / literature DBs","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"datacite=0, pmcid=True, pmid=True","rationale":null},{"key":"f_persistent_ids","label":"Resolvable scholarly identifiers (OpenAlex)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no OpenAlex id","rationale":null},{"key":"f_metadata_richness","label":"Rich, machine-readable metadata","kind":"llm","weight":1.0,"fraction":0.25,"signal":null,"rationale":"The paper does not provide machine-readable metadata; only human-readable text and supplementary files with no structured metadata (e.g., JSON-LD, schema.org) are mentioned."}]},"A":{"name":"Accessible","score":55.0,"criteria":[{"key":"a_open_access","label":"Open Access / files deposited","kind":"deterministic","weight":1.5,"fraction":1.0,"signal":"Open Access","rationale":null},{"key":"a_retrievable","label":"Free full text retrievable","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"0 OA location(s)","rationale":null},{"key":"a_access_protocol","label":"Clear data/code access 
protocol","kind":"llm","weight":1.0,"fraction":0.5,"signal":null,"rationale":"The paper describes access to some data via UCSC Genome Browser track hubs and supplementary files, but does not provide a clear, step-by-step protocol for accessing the underlying code or all data in a persistent repository with authentication details."}]},"I":{"name":"Interoperable","score":25.0,"criteria":[{"key":"i_linked_data","label":"Linked datasets / DataCite relations","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"linked_datasets=0, datacite=0","rationale":null},{"key":"i_standard_ids","label":"References data via standard accessions","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"accessions=0, trials=0","rationale":null},{"key":"i_standards","label":"Standard formats, vocabularies & identifiers","kind":"llm","weight":1.0,"fraction":0.5,"signal":null,"rationale":"The paper uses standard formats (e.g., FASTA, BED, GFF) and refers to standard identifiers (e.g., NCBI accession numbers), but does not specify adherence to community-endorsed vocabularies (e.g., OBO Foundry) or data-model standards (e.g., ISA-Tab) beyond common bioinformatics file formats."}]},"R":{"name":"Reusable","score":25.0,"criteria":[{"key":"r_license","label":"Clear, open reuse license","kind":"deterministic","weight":1.5,"fraction":0.0,"signal":"no license","rationale":null},{"key":"r_downloads","label":"Demonstrated reuse (downloads)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"downloads=0","rationale":null},{"key":"r_version","label":"Versioned / maintained","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no version chain","rationale":null},{"key":"r_dataset","label":"Classified as a data resource","kind":"deterministic","weight":0.5,"fraction":1.0,"signal":"is_dataset","rationale":null},{"key":"r_reusability","label":"Data-availability statement, license & reproducibility","kind":"llm","weight":2.0,"fraction":0.333,"signal":null,"rationale":"The 
paper has a CC-BY 4.0 license for the preprint and mentions data availability, but lacks an explicit data-availability statement for the final peer-reviewed article, does not provide a code availability statement, and does not guarantee long-term storage in a dedicated repository or include a software license for custom scripts."}]}},"suggestions":["Add machine-readable metadata (e.g., JSON-LD or schema.org) to the paper's HTML/PDF for better findability by search engines.","Provide a permanent DOI for the data (e.g., via Zenodo) and include a clear code-availability statement with a software license (e.g., MIT) and versioned repository links.","Use standardized vocabularies (e.g., EDAM for bioinformatics operations) and data-model standards (e.g., ISA-Tab) for experimental metadata in supplementary files.","Include a detailed data-access protocol table in the paper, specifying repository URLs, access conditions, and a persistent identifier for the code.","Deposit all supplementary tables and scripts in a FAIR-aligned repository (e.g., Figshare) with descriptive metadata and a license file."],"model":"deepseek/deepseek-v4-flash","agent_version":"fair_agent_v2","fulltext_source":"epmc_xml"},"fair_model":"deepseek/deepseek-v4-flash","fair_agent_version":"fair_agent_v2","fair_fulltext_source":"epmc_xml","fair_has_llm":true,"fair_computed_at":"2026-06-18T00:46:56.544562Z","clinical_trials":[],"software_tools":[],"db_accessions":[],"linked_datasets":[],"topics":[]}