{"doi":"10.1093/nar/gkw569","title":"NCBI prokaryotic genome annotation pipeline","abstract":"Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.","journal":"Nucleic Acids Research","year":2016,"id":1426,"datarank":14.619716944753138,"base_score":8.794218683679489,"endowment":8.794218683679489,"self_citation_contribution":1.3191328025519236,"citation_network_contribution":13.300584142201215,"self_endowment_contribution":1.3191328025519236,"citer_contribution":13.300584142201215,"corpus_percentile":86.57445077298617,"corpus_rank":166,"citation_count":6925,"citer_count":194,"citers_with_citation_signal":194,"citers_with_endowment":194,"datacite_reuse_total":0,"is_dataset":true,"is_dataset_confidence":0.5867,"is_oa":true,"file_count":0,"downloads":0,"has_version_chain":false,"published_date":"2016-06-24","fair_score":39.375,"fair_percentile":20.030782761653473,"algorithm_id":"datarank_citation_only_1hop_v6","ranking_scope":"data_only","authors":[{"id":17114,"name":"Michael DiCuccio","orcid":"0000-0003-0585-2862","position":1,"is_corresponding":false},{"id":17115,"name":"Azat Badretdin","orcid":null,"position":2,"is_corresponding":false},{"id":17116,"name":"Vyacheslav Chetvernin","orcid":null,"position":3,"is_corresponding":false},{"id":17117,"name":"Eric P. Nawrocki","orcid":"0000-0002-2497-3427","position":4,"is_corresponding":false},{"id":17118,"name":"Leonid Zaslavsky","orcid":"0000-0001-5873-4873","position":5,"is_corresponding":false},{"id":17119,"name":"Alexandre Lomsadze","orcid":null,"position":6,"is_corresponding":false},{"id":2100,"name":"Kim D. Pruitt","orcid":"0000-0001-7950-1374","position":7,"is_corresponding":false},{"id":17120,"name":"Mark Borodovsky","orcid":"0000-0002-1401-4046","position":8,"is_corresponding":false},{"id":17121,"name":"James Ostell","orcid":null,"position":9,"is_corresponding":false},{"id":11703,"name":"Tatiana Tatusova","orcid":null,"position":0,"is_corresponding":true}],"reference_count":36,"raw_metadata":{"citation_network_status":"fetched"},"created_at":"2026-03-01T18:20:47.508186Z","pmid":"27342282","pmcid":"PMC5001611","fwci":null,"citation_percentile":null,"influential_citations":0,"oa_status":"gold","license":"public-domain","views":0,"total_file_size_bytes":0,"version_count":0,"fair_f":52.5,"fair_a":42.5,"fair_i":37.5,"fair_r":25.0,"fair_zscore":-0.5279,"fair_rationale":{"fair_score":39.38,"has_llm":true,"dimensions":{"F":{"name":"Findable","score":52.5,"criteria":[{"key":"f_has_doi","label":"Has a persistent DOI","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"DOI present","rationale":null},{"key":"f_repository_presence","label":"Indexed in repositories / literature DBs","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"datacite=0, pmcid=True, pmid=True","rationale":null},{"key":"f_persistent_ids","label":"Resolvable scholarly identifiers (OpenAlex)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no OpenAlex id","rationale":null},{"key":"f_metadata_richness","label":"Rich, machine-readable metadata","kind":"llm","weight":1.0,"fraction":0.25,"signal":null,"rationale":"The paper provides URLs to external resources but does not include machine-readable metadata for the pipeline or its outputs."}]},"A":{"name":"Accessible","score":42.5,"criteria":[{"key":"a_open_access","label":"Open Access / files deposited","kind":"deterministic","weight":1.5,"fraction":1.0,"signal":"Open Access","rationale":null},{"key":"a_retrievable","label":"Free full text retrievable","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"0 OA location(s)","rationale":null},{"key":"a_access_protocol","label":"Clear data/code access protocol","kind":"llm","weight":1.0,"fraction":0.25,"signal":null,"rationale":"The paper references external websites but does not provide a direct, clear protocol for accessing the pipeline code or the annotated data."}]},"I":{"name":"Interoperable","score":37.5,"criteria":[{"key":"i_linked_data","label":"Linked datasets / DataCite relations","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"linked_datasets=0, datacite=0","rationale":null},{"key":"i_standard_ids","label":"References data via standard accessions","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"accessions=0, trials=0","rationale":null},{"key":"i_standards","label":"Standard formats, vocabularies & identifiers","kind":"llm","weight":1.0,"fraction":0.75,"signal":null,"rationale":"The pipeline uses standard formats (ASN.1, GenBank flat file), standard identifiers (NCBI Taxonomy, RefSeq), and follows community guidelines (INSDC, UniProt naming), but does not explicitly mention use of formal ontologies."}]},"R":{"name":"Reusable","score":25.0,"criteria":[{"key":"r_license","label":"Clear, open reuse license","kind":"deterministic","weight":1.5,"fraction":0.0,"signal":"no license","rationale":null},{"key":"r_downloads","label":"Demonstrated reuse (downloads)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"downloads=0","rationale":null},{"key":"r_version","label":"Versioned / maintained","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no version chain","rationale":null},{"key":"r_dataset","label":"Classified as a data resource","kind":"deterministic","weight":0.5,"fraction":1.0,"signal":"is_dataset","rationale":null},{"key":"r_reusability","label":"Data-availability statement, license & reproducibility","kind":"llm","weight":2.0,"fraction":0.333,"signal":null,"rationale":"The paper lacks a data-availability statement for the pipeline code, does not specify a license for the software, and while the pipeline is described, the code is not openly accessible for full reproducibility."}]}},"suggestions":["Provide a clear data-availability statement for the pipeline code, including a repository link and license.","Include machine-readable metadata (e.g., schema.org annotations) for the pipeline and its outputs.","Specify the exact version of the pipeline and provide a DOI for the code.","Use formal ontologies for functional annotation to enhance interoperability.","Provide a clear protocol for accessing the pipeline, e.g., via a web service API or container image."],"model":"deepseek/deepseek-v4-flash","agent_version":"fair_agent_v2","fulltext_source":"epmc_xml"},"fair_model":"deepseek/deepseek-v4-flash","fair_agent_version":"fair_agent_v2","fair_fulltext_source":"epmc_xml","fair_has_llm":true,"fair_computed_at":"2026-06-18T00:27:48.008645Z","clinical_trials":[],"software_tools":[],"db_accessions":[],"linked_datasets":[],"topics":[]}