{"doi":"10.1016/j.jbi.2011.03.007","title":"Comparison of automated and human assignment of MeSH terms on publicly-available molecular datasets","abstract":"Publicly available molecular datasets can be used for independent verification or investigative repurposing, but depends on the presence, consistency and quality of descriptive annotations. Annotation and indexing of molecular datasets using well-defined controlled vocabularies or ontologies enables accurate and systematic data discovery, yet the majority of molecular datasets available through public data repositories lack such annotations. A number of automated annotation methods have been developed; however few systematic evaluations of the quality of annotations supplied by application of these methods have been performed using annotations from standing public data repositories. Here, we compared manually-assigned Medical Subject Heading (MeSH) annotations associated with experiments by data submitters in the PRoteomics IDEntification (PRIDE) proteomics data repository to automated MeSH annotations derived through the National Center for Biomedical Ontology Annotator and National Library of Medicine MetaMap programs. These programs were applied to free-text annotations for experiments in PRIDE. As many submitted datasets were referenced in publications, we used the manually curated MeSH annotations of those linked publications in MEDLINE as \"gold standard\". Annotator and MetaMap exhibited recall performance 3-fold greater than that of the manual annotations. We connected PRIDE experiments in a network topology according to shared MeSH annotations and found 373 distinct clusters, many of which were found to be biologically coherent by network analysis. The results of this study suggest that both Annotator and MetaMap are capable of annotating public molecular datasets with a quality comparable, and often exceeding, that of the actual data submitters, highlighting a continuous need to improve and apply automated methods to molecular datasets in public data repositories to maximize their value and utility.","journal":"Journal of Biomedical Informatics","year":2011,"id":5247,"datarank":0.7681982744357967,"base_score":2.70805020110221,"endowment":2.70805020110221,"self_citation_contribution":0.40620753016533157,"citation_network_contribution":0.3619907442704652,"self_endowment_contribution":0.40620753016533157,"citer_contribution":0.3619907442704652,"corpus_percentile":55.65500406834825,"corpus_rank":546,"citation_count":14,"citer_count":13,"citers_with_citation_signal":8,"citers_with_endowment":8,"datacite_reuse_total":0,"is_dataset":true,"is_dataset_confidence":0.7876,"is_oa":true,"file_count":0,"downloads":0,"has_version_chain":false,"published_date":"2011-12-01","fair_score":40.5,"fair_percentile":20.5584872471416,"algorithm_id":"datarank_citation_only_1hop_v6","ranking_scope":"data_only","authors":[{"id":52640,"name":"Michael Mbagwu","orcid":"0000-0001-6079-3119","position":1,"is_corresponding":false},{"id":48,"name":"Joel T. Dudley","orcid":"0000-0002-7036-6492","position":2,"is_corresponding":false},{"id":52641,"name":"Vijay Krishnan","orcid":"0000-0003-3685-5249","position":3,"is_corresponding":false},{"id":51,"name":"Atul Janardhan Butte","orcid":"0000-0002-7433-2740","position":4,"is_corresponding":false},{"id":65477,"name":"David J. Ruau","orcid":null,"position":0,"is_corresponding":true}],"reference_count":21,"raw_metadata":{"citation_network_status":"fetched"},"created_at":"2026-03-01T18:20:47.508186Z","pmid":"21420508","pmcid":"PMC3155012","fwci":null,"citation_percentile":null,"influential_citations":0,"oa_status":"hybrid","license":"cc-by-nc-nd","views":0,"total_file_size_bytes":0,"version_count":0,"fair_f":74.0,"fair_a":53.0,"fair_i":15.0,"fair_r":20.0,"fair_zscore":-0.4262,"fair_rationale":{"fair_score":40.5,"has_llm":true,"dimensions":{"F":{"name":"Findable","score":74.0,"criteria":[{"key":"f_has_doi","label":"Has a persistent DOI","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"DOI present","rationale":null},{"key":"f_repository_presence","label":"Indexed in repositories / literature DBs","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"datacite=0, pmcid=True, pmid=True","rationale":null},{"key":"f_persistent_ids","label":"Resolvable scholarly identifiers (OpenAlex)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no OpenAlex id","rationale":null},{"key":"f_metadata_richness","label":"Rich, machine-readable metadata","kind":"llm","weight":1.0,"fraction":0.5,"signal":null,"rationale":"The paper discusses MeSH annotations but does not provide machine-readable metadata or detail the metadata format used for the datasets."}]},"A":{"name":"Accessible","score":53.0,"criteria":[{"key":"a_open_access","label":"Open Access / files deposited","kind":"deterministic","weight":1.5,"fraction":1.0,"signal":"Open Access","rationale":null},{"key":"a_retrievable","label":"Free full text retrievable","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"0 OA location(s)","rationale":null},{"key":"a_access_protocol","label":"Clear data/code access protocol","kind":"llm","weight":1.0,"fraction":0.25,"signal":null,"rationale":"No clear protocol for accessing the data or code used in the comparison is provided; only general reference to public repositories is given."}]},"I":{"name":"Interoperable","score":15.0,"criteria":[{"key":"i_linked_data","label":"Linked datasets / DataCite relations","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"linked_datasets=0, datacite=0","rationale":null},{"key":"i_standard_ids","label":"References data via standard accessions","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"accessions=0, trials=0","rationale":null},{"key":"i_standards","label":"Standard formats, vocabularies & identifiers","kind":"llm","weight":1.0,"fraction":0.75,"signal":null,"rationale":"Standard vocabularies (MeSH, NCBO Annotator, MetaMap) are used, but no mention of standardized file formats or unique identifiers for the datasets."}]},"R":{"name":"Reusable","score":20.0,"criteria":[{"key":"r_license","label":"Clear, open reuse license","kind":"deterministic","weight":1.5,"fraction":0.0,"signal":"no license","rationale":null},{"key":"r_downloads","label":"Demonstrated reuse (downloads)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"downloads=0","rationale":null},{"key":"r_version","label":"Versioned / maintained","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no version chain","rationale":null},{"key":"r_dataset","label":"Classified as a data resource","kind":"deterministic","weight":0.5,"fraction":1.0,"signal":"is_dataset","rationale":null},{"key":"r_reusability","label":"Data-availability statement, license & reproducibility","kind":"llm","weight":2.0,"fraction":0.333,"signal":null,"rationale":"No data-availability statement, license, or explicit reproducibility details are present; the datasets are described as publicly available but specifics are lacking."}]}},"suggestions":["Include a supplementary file with machine-readable metadata (e.g., JSON-LD) describing the datasets and annotations.","Provide explicit access instructions and a persistent identifier (e.g., DOI) for the code and evaluation scripts.","State a reuse license (e.g., CC0) and include a formal data-availability statement.","Use standardized file formats (e.g., CSV, RDF) with schema definitions to enhance interoperability."],"model":"deepseek/deepseek-v4-flash","agent_version":"fair_agent_v2","fulltext_source":"abstract_only"},"fair_model":"deepseek/deepseek-v4-flash","fair_agent_version":"fair_agent_v2","fair_fulltext_source":"abstract_only","fair_has_llm":true,"fair_computed_at":"2026-06-18T00:48:53.132002Z","clinical_trials":[],"software_tools":[],"db_accessions":[],"linked_datasets":[],"topics":[]}