{"doi":"10.1371/journal.pcbi.0010031","title":"Functional Coverage of the Human Genome by Existing Structures, Structural Genomics Targets, and Homology Models","abstract":"The bias in protein structure and function space resulting from experimental limitations and targeting of particular functional classes of proteins by structural biologists has long been recognized, but never continuously quantified. Using the Enzyme Commission and the Gene Ontology classifications as a reference frame, and integrating structure data from the Protein Data Bank (PDB), target sequences from the structural genomics projects, structure homology derived from the SUPERFAMILY database, and genome annotations from Ensembl and NCBI, we provide a quantified view, both at the domain and whole-protein levels, of the current and projected coverage of protein structure and function space relative to the human genome. Protein structures currently provide at least one domain that covers 37% of the functional classes identified in the genome; whole structure coverage exists for 25% of the genome. If all the structural genomics targets were solved (twice the current number of structures in the PDB), it is estimated that structures of one domain would cover 69% of the functional classes identified and complete structure coverage would be 44%. Homology models from existing experimental structures extend the 37% coverage to 56% of the genome as single domains and 25% to 31% for complete structures. Coverage from homology models is not evenly distributed by protein family, reflecting differing degrees of sequence and structure divergence within families. While these data provide coverage, conversely, they also systematically highlight functional classes of proteins for which structures should be determined. Current key functional families without structure representation are highlighted here; updated information on the \"most wanted list\" that should be solved is available on a weekly basis from http://function.rcsb.org:8080/pdb/function_distribution/index.html.","journal":"PLoS Computational Biology","year":2005,"id":1921,"datarank":3.450512731121688,"base_score":4.189654742026425,"endowment":4.189654742026425,"self_citation_contribution":0.6284482113039639,"citation_network_contribution":2.822064519817724,"self_endowment_contribution":0.6284482113039639,"citer_contribution":2.822064519817724,"corpus_percentile":68.51098454027665,"corpus_rank":388,"citation_count":65,"citer_count":58,"citers_with_citation_signal":54,"citers_with_endowment":54,"datacite_reuse_total":0,"is_dataset":true,"is_dataset_confidence":0.5818,"is_oa":true,"file_count":0,"downloads":0,"has_version_chain":false,"published_date":"2005-08-19","fair_score":42.5,"fair_percentile":21.459982409850483,"algorithm_id":"datarank_citation_only_1hop_v6","ranking_scope":"data_only","authors":[{"id":125,"name":"Philip  E. Bourne","orcid":"0000-0002-7618-7292","position":1,"is_corresponding":false},{"id":150,"name":"Lei Xie","orcid":"0000-0001-9051-2111","position":0,"is_corresponding":true}],"reference_count":53,"raw_metadata":null,"created_at":"2026-03-01T18:20:47.508186Z","pmid":"16118666","pmcid":"PMC1188274","fwci":null,"citation_percentile":null,"influential_citations":0,"oa_status":"gold","license":"cc-by","views":0,"total_file_size_bytes":0,"version_count":0,"fair_f":52.5,"fair_a":55.0,"fair_i":37.5,"fair_r":25.0,"fair_zscore":-0.2453,"fair_rationale":{"fair_score":42.5,"has_llm":true,"dimensions":{"F":{"name":"Findable","score":52.5,"criteria":[{"key":"f_has_doi","label":"Has a persistent DOI","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"DOI present","rationale":null},{"key":"f_repository_presence","label":"Indexed in repositories / literature DBs","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"datacite=0, pmcid=True, pmid=True","rationale":null},{"key":"f_persistent_ids","label":"Resolvable scholarly identifiers (OpenAlex)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no OpenAlex id","rationale":null},{"key":"f_metadata_richness","label":"Rich, machine-readable metadata","kind":"llm","weight":1.0,"fraction":0.25,"signal":null,"rationale":"The paper provides rich descriptive metadata in the text (e.g., EC, GO, PDB identifiers) but no evidence of machine-readable metadata (e.g., structured data, schema.org, or RDF) in the paper itself."}]},"A":{"name":"Accessible","score":55.0,"criteria":[{"key":"a_open_access","label":"Open Access / files deposited","kind":"deterministic","weight":1.5,"fraction":1.0,"signal":"Open Access","rationale":null},{"key":"a_retrievable","label":"Free full text retrievable","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"0 OA location(s)","rationale":null},{"key":"a_access_protocol","label":"Clear data/code access protocol","kind":"llm","weight":1.0,"fraction":0.5,"signal":null,"rationale":"The paper mentions a web resource (http://function.rcsb.org:8080/pdb/function_distribution/index.html) for data access, but does not provide a clear, persistent protocol for accessing the underlying data or code (e.g., no DOI, repository, or download link for the dataset)."}]},"I":{"name":"Interoperable","score":37.5,"criteria":[{"key":"i_linked_data","label":"Linked datasets / DataCite relations","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"linked_datasets=0, datacite=0","rationale":null},{"key":"i_standard_ids","label":"References data via standard accessions","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"accessions=0, trials=0","rationale":null},{"key":"i_standards","label":"Standard formats, vocabularies & identifiers","kind":"llm","weight":1.0,"fraction":0.75,"signal":null,"rationale":"The paper uses standard community vocabularies (EC, GO, SCOP, Pfam) and identifiers (UniProt, PDB, OMIM), but does not explicitly state that the data are provided in standard machine-readable formats (e.g., CSV, XML, RDF) or use formal ontologies beyond those mentioned."}]},"R":{"name":"Reusable","score":25.0,"criteria":[{"key":"r_license","label":"Clear, open reuse license","kind":"deterministic","weight":1.5,"fraction":0.0,"signal":"no license","rationale":null},{"key":"r_downloads","label":"Demonstrated reuse (downloads)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"downloads=0","rationale":null},{"key":"r_version","label":"Versioned / maintained","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no version chain","rationale":null},{"key":"r_dataset","label":"Classified as a data resource","kind":"deterministic","weight":0.5,"fraction":1.0,"signal":"is_dataset","rationale":null},{"key":"r_reusability","label":"Data-availability statement, license & reproducibility","kind":"llm","weight":2.0,"fraction":0.333,"signal":null,"rationale":"The paper is open-access under CC BY 4.0, but lacks a formal data-availability statement, a license for the data/code, and sufficient detail to reproduce the analysis (e.g., exact software versions, parameters, or a repository for scripts)."}]}},"suggestions":["Provide a persistent identifier (e.g., DOI) for the dataset and code, and deposit them in a public repository (e.g., Zenodo, Figshare).","Include a formal data-availability statement specifying the license and access conditions for the data and code.","Add machine-readable metadata (e.g., JSON-LD or RDFa) to the paper to improve findability by automated systems.","Document the analysis pipeline with exact software versions, parameters, and a workflow (e.g., in a README or Jupyter notebook) to enhance reproducibility.","Provide the data in standard, non-proprietary formats (e.g., CSV, TSV, or XML) alongside the paper or in a repository."],"model":"deepseek/deepseek-v4-flash","agent_version":"fair_agent_v2","fulltext_source":"epmc_xml"},"fair_model":"deepseek/deepseek-v4-flash","fair_agent_version":"fair_agent_v2","fair_fulltext_source":"epmc_xml","fair_has_llm":true,"fair_computed_at":"2026-06-18T00:41:59.157592Z","clinical_trials":[],"software_tools":[],"db_accessions":[],"linked_datasets":[],"topics":[]}