{"doi":"10.1101/gr.221202","title":"Structural Characterization of the Human Proteome","abstract":"<jats:p>This paper reports an analysis of the encoded proteins (the proteome) of the genomes of human, fly, worm, yeast, and representatives of bacteria and archaea in terms of the three-dimensional structures of their globular domains together with a general sequence-based study. We show that 39% of the human proteome can be assigned to known structures. We estimate that for 77% of the proteome, there is some functional annotation, but only 26% of the proteome can be assigned to standard sequence motifs that characterize function. Of the human protein sequences, 13% are transmembrane proteins, but only 3% of the residues in the proteome form membrane-spanning regions. There are substantial differences in the composition of globular domains of transmembrane proteins between the proteomes we have analyzed. Commonly occurring structural superfamilies are identified within the proteome. The frequencies of these superfamilies enable us to estimate that 98% of the human proteome evolved by domain duplication, with four of the 10 most duplicated superfamilies specific for multicellular organisms. The zinc-finger superfamily is massively duplicated in human compared to fly and worm, and occurrence of domains in repeats is more common in metazoa than in single cellular organisms. Structural superfamilies over- and underrepresented in human disease genes have been identified. Data and results can be downloaded and analyzed via web-based applications at<jats:ext-link xmlns:xlink=\"http://www.w3.org/1999/xlink\" ext-link-type=\"uri\" xlink:href=\"http://www.sbg.bio.ic.ac.uk\" xlink:type=\"simple\">http://www.sbg.bio.ic.ac.uk</jats:ext-link>.</jats:p><jats:p>[Supplemental material is available online at <jats:ext-link xmlns:xlink=\"http://www.w3.org/1999/xlink\" ext-link-type=\"uri\" xlink:href=\"http://www.genome.org\" xlink:type=\"simple\">http://www.genome.org</jats:ext-link>.]</jats:p>","journal":"Genome Research","year":2002,"id":23895,"datarank":4.132477581862563,"base_score":4.394449154672439,"endowment":4.394449154672439,"self_citation_contribution":0.6591673732008659,"citation_network_contribution":3.473310208661697,"self_endowment_contribution":0.6591673732008659,"citer_contribution":3.473310208661697,"corpus_percentile":71.2,"corpus_rank":390,"citation_count":80,"citer_count":79,"citers_with_citation_signal":69,"citers_with_endowment":69,"datacite_reuse_total":0,"is_dataset":true,"is_dataset_confidence":null,"is_oa":false,"file_count":0,"downloads":0,"has_version_chain":false,"published_date":null,"fair_score":40.0,"fair_percentile":20.40457343887423,"algorithm_id":"datarank_citation_only_1hop_v6","ranking_scope":"data_only","authors":[{"id":145066,"name":"Robert M. MacCallum","orcid":null,"position":1,"is_corresponding":false},{"id":3514,"name":"Michael J.E. Sternberg","orcid":"0000-0002-1884-5445","position":2,"is_corresponding":false},{"id":145065,"name":"Arne Müller","orcid":null,"position":0,"is_corresponding":false}],"reference_count":0,"raw_metadata":{"has_enrichment":true,"base_score":4.394449154672439,"endowment":4.394449154672439,"datacite_reuse_total":0,"file_count":0,"downloads":0,"views":0,"has_version_chain":false,"is_dataset":false,"is_oa":false,"pmid":"12421749","pmcid":"PMC187559","openalex_id":"https://openalex.org/W2145037531","authors":[],"funders":[],"total_grants":0,"fwci":null,"citation_percentile":null,"influential_citations":3,"citation_trend":[{"year":2012,"count":1},{"year":2013,"count":6},{"year":2014,"count":2},{"year":2015,"count":5},{"year":2016,"count":2},{"year":2017,"count":5},{"year":2019,"count":2},{"year":2020,"count":1},{"year":2021,"count":1},{"year":2022,"count":1},{"year":2023,"count":1},{"year":2024,"count":1},{"year":2025,"count":1},{"year":2026,"count":1}],"oa_status":"bronze","license":null,"oa_locations":[{"url":"https://genome.cshlp.org/content/12/11/1625.full.pdf","host_type":"journal"},{"url":"https://genome.cshlp.org/content/12/11/1625.full.pdf","host_type":"HYBRID"},{"url":"https://genome.cshlp.org/content/12/11/1625.full.pdf","host_type":"publisher"},{"url":"https://syndication.highwire.org/content/doi/10.1101/gr.221202","host_type":"publisher"},{"url":"https://doi.org/10.1101/gr.221202","host_type":"journal"},{"url":"https://pubmed.ncbi.nlm.nih.gov/12421749","host_type":"repository"},{"url":"http://europepmc.org/articles/PMC187559","host_type":"repository"},{"url":"https://www.ncbi.nlm.nih.gov/pmc/articles/187559","host_type":"repository"}],"fields_of_study":["Genomics and Phylogenetic Studies","Machine Learning in Bioinformatics","RNA and protein synthesis mechanisms","Medicine","Biology","Algorithms","Animals","Archaeal Proteins","Bacterial Proteins","Caenorhabditis elegans Proteins","Databases, Genetic","Drosophila Proteins","Escherichia coli Proteins","Gene Duplication","Genetic Diseases, Inborn","Humans","Markov Chains","Membrane Proteins","Online Systems","Phylogeny","Protein Structure, Quaternary","Proteome","Saccharomyces cerevisiae Proteins"],"mesh_terms":["Algorithms","Animals","Bacterial Proteins","Humans","Markov Chains","Membrane Proteins","Online Systems","Phylogeny","Archaeal Proteins","Gene Duplication","Proteome","Protein Structure, Quaternary","Saccharomyces cerevisiae Proteins","Drosophila Proteins","Caenorhabditis elegans Proteins","Escherichia coli Proteins","Genetic Diseases, Inborn","Databases, Genetic"],"keywords":["Proteome","Biology","Human proteome project","Computational biology","Genome","Human genome","Genetics","Proteomics","Transmembrane protein","Protein domain","Human proteins","Archaea","Gene"],"sdg_mappings":[],"linked_datasets":[],"clinical_trials":[],"software_tools":[],"database_accessions":[{"name":"interpro"}],"source":"live","citation_network_status":"fetched"},"created_at":"2026-06-07T20:50:31.919201Z","pmid":"12421749","pmcid":"PMC187559","fwci":null,"citation_percentile":null,"influential_citations":0,"oa_status":null,"license":null,"views":0,"total_file_size_bytes":0,"version_count":0,"fair_f":69.0,"fair_a":66.0,"fair_i":5.0,"fair_r":20.0,"fair_zscore":-0.4714,"fair_rationale":{"fair_score":40.0,"has_llm":true,"dimensions":{"F":{"name":"Findable","score":69.0,"criteria":[{"key":"f_has_doi","label":"Has a persistent DOI","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"DOI present","rationale":null},{"key":"f_repository_presence","label":"Indexed in repositories / literature DBs","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"datacite=0, pmcid=True, pmid=True","rationale":null},{"key":"f_persistent_ids","label":"Resolvable scholarly identifiers (OpenAlex)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no OpenAlex id","rationale":null},{"key":"f_metadata_richness","label":"Rich, machine-readable metadata","kind":"llm","weight":1.0,"fraction":0.25,"signal":null,"rationale":"The text provides a high-level summary of analysis types and a raw URL for data download, but does not describe structured, machine-readable metadata (e.g., JSON-LD, schema.org markup, or formal metadata schema)."}]},"A":{"name":"Accessible","score":66.0,"criteria":[{"key":"a_open_access","label":"Open Access / files deposited","kind":"deterministic","weight":1.5,"fraction":0.5,"signal":"files/OA location present but not flagged OA","rationale":null},{"key":"a_retrievable","label":"Free full text retrievable","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"8 OA location(s)","rationale":null},{"key":"a_access_protocol","label":"Clear data/code access protocol","kind":"llm","weight":1.0,"fraction":0.5,"signal":null,"rationale":"The paper states that data and results can be downloaded via a web-based application at a given URL and mentions supplemental material at a second URL, but does not specify a persistent identifier (e.g., DOI) for the data or a formal access protocol (e.g., API, structured repository)."}]},"I":{"name":"Interoperable","score":5.0,"criteria":[{"key":"i_linked_data","label":"Linked datasets / DataCite relations","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"linked_datasets=0, datacite=0","rationale":null},{"key":"i_standard_ids","label":"References data via standard accessions","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"accessions=0, trials=0","rationale":null},{"key":"i_standards","label":"Standard formats, vocabularies & identifiers","kind":"llm","weight":1.0,"fraction":0.25,"signal":null,"rationale":"The paper discusses structural superfamilies and sequence motifs but does not mention use of standard formats (e.g., FASTA, PDB), controlled vocabularies, or resolvable identifiers (e.g., UniProt IDs, PDB IDs) in the downloadable data."}]},"R":{"name":"Reusable","score":20.0,"criteria":[{"key":"r_license","label":"Clear, open reuse license","kind":"deterministic","weight":1.5,"fraction":0.0,"signal":"no license","rationale":null},{"key":"r_downloads","label":"Demonstrated reuse (downloads)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"downloads=0","rationale":null},{"key":"r_version","label":"Versioned / maintained","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no version chain","rationale":null},{"key":"r_dataset","label":"Classified as a data resource","kind":"deterministic","weight":0.5,"fraction":1.0,"signal":"is_dataset","rationale":null},{"key":"r_reusability","label":"Data-availability statement, license & reproducibility","kind":"llm","weight":2.0,"fraction":0.333,"signal":null,"rationale":"The paper reports analysis results and provides a download link, but lacks a formal data-availability statement, a license, or explicit details on how to reproduce the analysis (e.g., code, parameters)."}]}},"suggestions":["Deposit the underlying data (domain assignments, sequence motifs, structural superfamilies) in a recognized repository (e.g., Zenodo, Figshare) with a persistent identifier like a DOI.","Provide a formal data-availability statement that includes license information (e.g., Creative Commons) and conditions for reuse.","Use standard machine-readable formats (e.g., FASTA for sequences, PDB/mmCIF for structures, JSON-LD for metadata) with resolvable identifiers (e.g., UniProt accessions, PDB codes).","Include a code repository (e.g., GitHub) with versioned scripts and a README explaining how to reproduce all analyses and figures.","Add structured metadata (e.g., schema.org/Dataset markup) to the paper's landing page to enable automated discovery and indexing."],"model":"deepseek/deepseek-v4-flash","agent_version":"fair_agent_v2","fulltext_source":"abstract_only"},"fair_model":"deepseek/deepseek-v4-flash","fair_agent_version":"fair_agent_v2","fair_fulltext_source":"abstract_only","fair_has_llm":true,"fair_computed_at":"2026-06-18T00:40:55.998697Z","clinical_trials":[],"software_tools":[],"db_accessions":[],"linked_datasets":[],"topics":[]}