{"doi":"10.1093/nargab/lqad070","title":"Single-cell reference mapping to construct and extend cell-type hierarchies","abstract":"Single-cell genomics is now producing an ever-increasing amount of datasets that, when integrated, could provide large-scale reference atlases of tissue in health and disease. Such large-scale atlases increase the scale and generalizability of analyses and enable combining knowledge generated by individual studies. Specifically, individual studies often differ regarding cell annotation terminology and depth, with different groups specializing in different cell type compartments, often using distinct terminology. Understanding how these distinct sets of annotations are related and complement each other would mark a major step towards a consensus-based cell-type annotation reflecting the latest knowledge in the field. Whereas recent computational techniques, referred to as 'reference mapping' methods, facilitate the usage and expansion of existing reference atlases by mapping new datasets (i.e. queries) onto an atlas; a systematic approach towards harmonizing dataset-specific cell-type terminology and annotation depth is still lacking. Here, we present 'treeArches', a framework to automatically build and extend reference atlases while enriching them with an updatable hierarchy of cell-type annotations across different datasets. We demonstrate various use cases for treeArches, from automatically resolving relations between reference and query cell types to identifying unseen cell types absent in the reference, such as disease-associated cell states. We envision treeArches enabling data-driven construction of consensus atlas-level cell-type hierarchies and facilitating efficient usage of reference atlases.","journal":"NAR Genomics and Bioinformatics","year":2023,"id":8807,"datarank":0.9523474259871697,"base_score":3.4657359027997265,"endowment":3.4657359027997265,"self_citation_contribution":0.519860385419959,"citation_network_contribution":0.43248704056721066,"self_endowment_contribution":0.519860385419959,"citer_contribution":0.43248704056721066,"corpus_percentile":57.93327908868999,"corpus_rank":518,"citation_count":34,"citer_count":25,"citers_with_citation_signal":17,"citers_with_endowment":17,"datacite_reuse_total":0,"is_dataset":true,"is_dataset_confidence":0.8528,"is_oa":true,"file_count":0,"downloads":38,"has_version_chain":false,"published_date":"2023-07-05","fair_score":48.75,"fair_percentile":44.94283201407212,"algorithm_id":"datarank_citation_only_1hop_v6","ranking_scope":"data_only","authors":[{"id":1694,"name":"Mohammad Lotfollahi","orcid":"0000-0001-6858-7985","position":1,"is_corresponding":false},{"id":11446,"name":"Daniel Strobl","orcid":"0000-0002-5516-7057","position":2,"is_corresponding":false},{"id":52013,"name":"Avi Srivastava","orcid":"0000-0001-9798-2079","position":3,"is_corresponding":false},{"id":37194,"name":"Marcel J.T. Reinders","orcid":"0000-0002-1148-1562","position":4,"is_corresponding":false},{"id":42,"name":"Fabian Joachim Theis","orcid":"0000-0002-2419-1943","position":5,"is_corresponding":false},{"id":37171,"name":"Ahmed Mahfouz","orcid":"0000-0001-8601-2149","position":6,"is_corresponding":false},{"id":40531,"name":"Lieke Michielsen","orcid":"0000-0003-4615-1309","position":0,"is_corresponding":true}],"reference_count":38,"raw_metadata":{"citation_network_status":"fetched"},"created_at":"2026-03-01T18:20:47.508186Z","pmid":"37502708","pmcid":"PMC10370450","fwci":null,"citation_percentile":null,"influential_citations":0,"oa_status":"gold","license":"cc-by","views":0,"total_file_size_bytes":0,"version_count":0,"fair_f":52.5,"fair_a":67.5,"fair_i":25.0,"fair_r":50.0,"fair_zscore":0.3201,"fair_rationale":{"fair_score":48.75,"has_llm":true,"dimensions":{"F":{"name":"Findable","score":52.5,"criteria":[{"key":"f_has_doi","label":"Has a persistent DOI","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"DOI present","rationale":null},{"key":"f_repository_presence","label":"Indexed in repositories / literature DBs","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"datacite=0, pmcid=True, pmid=True","rationale":null},{"key":"f_persistent_ids","label":"Resolvable scholarly identifiers (OpenAlex)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no OpenAlex id","rationale":null},{"key":"f_metadata_richness","label":"Rich, machine-readable metadata","kind":"llm","weight":1.0,"fraction":0.25,"signal":null,"rationale":"The paper provides DOIs and URLs for datasets and code, but does not describe any machine-readable metadata (e.g., structured metadata files, schema.org annotations, or formal metadata standards)."}]},"A":{"name":"Accessible","score":67.5,"criteria":[{"key":"a_open_access","label":"Open Access / files deposited","kind":"deterministic","weight":1.5,"fraction":1.0,"signal":"Open Access","rationale":null},{"key":"a_retrievable","label":"Free full text retrievable","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"0 OA location(s)","rationale":null},{"key":"a_access_protocol","label":"Clear data/code access protocol","kind":"llm","weight":1.0,"fraction":0.75,"signal":null,"rationale":"The paper clearly states that code is available on GitHub and Zenodo with DOIs, and data are deposited in repositories with DOIs, but does not specify authentication or access protocols beyond open access."}]},"I":{"name":"Interoperable","score":25.0,"criteria":[{"key":"i_linked_data","label":"Linked datasets / DataCite relations","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"linked_datasets=0, datacite=0","rationale":null},{"key":"i_standard_ids","label":"References data via standard accessions","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"accessions=0, trials=0","rationale":null},{"key":"i_standards","label":"Standard formats, vocabularies & identifiers","kind":"llm","weight":1.0,"fraction":0.5,"signal":null,"rationale":"The paper uses standard file formats (e.g., count matrices) and references Cell Ontology, but does not demonstrate use of standard identifiers or controlled vocabularies for all data elements."}]},"R":{"name":"Reusable","score":50.0,"criteria":[{"key":"r_license","label":"Clear, open reuse license","kind":"deterministic","weight":1.5,"fraction":0.0,"signal":"no license","rationale":null},{"key":"r_downloads","label":"Demonstrated reuse (downloads)","kind":"deterministic","weight":0.5,"fraction":1.0,"signal":"downloads=38","rationale":null},{"key":"r_version","label":"Versioned / maintained","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no version chain","rationale":null},{"key":"r_dataset","label":"Classified as a data resource","kind":"deterministic","weight":0.5,"fraction":1.0,"signal":"is_dataset","rationale":null},{"key":"r_reusability","label":"Data-availability statement, license & reproducibility","kind":"llm","weight":2.0,"fraction":0.667,"signal":null,"rationale":"The paper includes a data availability statement with DOIs, a license (CC BY 4.0), and reproducibility code, but lacks explicit licensing for the code and does not provide a formal software citation or versioning for all components."}]}},"suggestions":["Include machine-readable metadata (e.g., JSON-LD or schema.org annotations) for datasets and code to improve findability.","Specify access protocols (e.g., whether authentication is required) for each data repository.","Use standard identifiers (e.g., ORCID for authors, RRID for cell types) and controlled vocabularies consistently.","Provide explicit software licenses (e.g., MIT or Apache 2.0) for the code repositories.","Add a formal software citation with version and DOI for the treeArches framework itself."],"model":"deepseek/deepseek-v4-flash","agent_version":"fair_agent_v2","fulltext_source":"epmc_xml"},"fair_model":"deepseek/deepseek-v4-flash","fair_agent_version":"fair_agent_v2","fair_fulltext_source":"epmc_xml","fair_has_llm":true,"fair_computed_at":"2026-06-18T00:44:07.948911Z","clinical_trials":[],"software_tools":[],"db_accessions":[],"linked_datasets":[],"topics":[]}