{"doi":"10.1038/sdata.2018.273","title":"Columbia Open Health Data, clinical concept prevalence and co-occurrence from electronic health records","abstract":"Columbia Open Health Data (COHD) is a publicly accessible database of electronic health record (EHR) prevalence and co-occurrence frequencies between conditions, drugs, procedures, and demographics. COHD was derived from Columbia University Irving Medical Center's Observational Health Data Sciences and Informatics (OHDSI) database. The lifetime dataset, derived from all records, contains 36,578 single concepts (11,952 conditions, 12,334 drugs, and 10,816 procedures) and 32,788,901 concept pairs from 5,364,781 patients. The 5-year dataset, derived from records from 2013-2017, contains 29,964 single concepts (10,159 conditions, 10,264 drugs, and 8,270 procedures) and 15,927,195 concept pairs from 1,790,431 patients. Exclusion of rare concepts (count ≤ 10) and Poisson randomization enable data sharing by eliminating risks to patient privacy. EHR prevalences are informative of healthcare consumption rates. Analysis of co-occurrence frequencies via relative frequency analysis and observed-expected frequency ratio are informative of associations between clinical concepts, useful for biomedical research tasks such as drug repurposing and pharmacovigilance. COHD is publicly accessible through a web application-programming interface (API) and downloadable from the Figshare repository. The code is available on GitHub.","journal":"Scientific Data","year":2018,"id":12039,"datarank":1.7510233341220571,"base_score":3.9889840465642745,"endowment":3.9889840465642745,"self_citation_contribution":0.5983476069846413,"citation_network_contribution":1.152675727137416,"self_endowment_contribution":0.5983476069846413,"citer_contribution":1.152675727137416,"corpus_percentile":63.30349877949553,"corpus_rank":452,"citation_count":53,"citer_count":39,"citers_with_citation_signal":29,"citers_with_endowment":29,"datacite_reuse_total":0,"is_dataset":true,"is_dataset_confidence":0.9467,"is_oa":true,"file_count":0,"downloads":0,"has_version_chain":false,"published_date":"2018-11-27","fair_score":64.375,"fair_percentile":96.06420404573439,"algorithm_id":"datarank_citation_only_1hop_v6","ranking_scope":"data_only","authors":[{"id":74,"name":"Michel J. Dumontier","orcid":"0000-0003-4727-9435","position":1,"is_corresponding":false},{"id":2010,"name":"George Hripcsak","orcid":"0000-0003-2664-7614","position":2,"is_corresponding":false},{"id":2011,"name":"Nicholas P. Tatonetti","orcid":"0000-0002-2700-2597","position":3,"is_corresponding":false},{"id":2012,"name":"Chunhua Weng","orcid":"0000-0002-9624-0214","position":4,"is_corresponding":false},{"id":2009,"name":"Casey Ta","orcid":"0000-0002-4679-805X","position":5,"is_corresponding":false}],"reference_count":37,"raw_metadata":null,"created_at":"2026-03-01T18:20:47.508186Z","pmid":"30480666","pmcid":"PMC6257042","fwci":null,"citation_percentile":null,"influential_citations":0,"oa_status":"gold","license":"cc-by","views":0,"total_file_size_bytes":0,"version_count":0,"fair_f":77.5,"fair_a":80.0,"fair_i":50.0,"fair_r":50.0,"fair_zscore":1.7335,"fair_rationale":{"fair_score":64.38,"has_llm":true,"dimensions":{"F":{"name":"Findable","score":77.5,"criteria":[{"key":"f_has_doi","label":"Has a persistent DOI","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"DOI present","rationale":null},{"key":"f_repository_presence","label":"Indexed in repositories / literature DBs","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"datacite=0, pmcid=True, pmid=True","rationale":null},{"key":"f_persistent_ids","label":"Resolvable scholarly identifiers (OpenAlex)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no OpenAlex id","rationale":null},{"key":"f_metadata_richness","label":"Rich, machine-readable metadata","kind":"llm","weight":1.0,"fraction":0.75,"signal":null,"rationale":"The paper describes the data via OMOP standard concepts, provides concept definitions, and uses JSON from the API, but no formal machine-readable metadata like schema.org or DCAT is mentioned."}]},"A":{"name":"Accessible","score":80.0,"criteria":[{"key":"a_open_access","label":"Open Access / files deposited","kind":"deterministic","weight":1.5,"fraction":1.0,"signal":"Open Access","rationale":null},{"key":"a_retrievable","label":"Free full text retrievable","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"0 OA location(s)","rationale":null},{"key":"a_access_protocol","label":"Clear data/code access protocol","kind":"llm","weight":1.0,"fraction":1.0,"signal":null,"rationale":"Data is clearly accessible via a public API (cohd.io) with documented endpoints and downloadable flat-files from Figshare, fulfilling a clear access protocol."}]},"I":{"name":"Interoperable","score":50.0,"criteria":[{"key":"i_linked_data","label":"Linked datasets / DataCite relations","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"linked_datasets=0, datacite=0","rationale":null},{"key":"i_standard_ids","label":"References data via standard accessions","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"accessions=0, trials=0","rationale":null},{"key":"i_standards","label":"Standard formats, vocabularies & identifiers","kind":"llm","weight":1.0,"fraction":1.0,"signal":null,"rationale":"The data uses OMOP CDM standard concept IDs, references established vocabularies (SNOMED, RxNorm, ICD), and formats like JSON, ensuring high interoperability."}]},"R":{"name":"Reusable","score":50.0,"criteria":[{"key":"r_license","label":"Clear, open reuse license","kind":"deterministic","weight":1.5,"fraction":0.0,"signal":"no license","rationale":null},{"key":"r_downloads","label":"Demonstrated reuse (downloads)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"downloads=0","rationale":null},{"key":"r_version","label":"Versioned / maintained","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no version chain","rationale":null},{"key":"r_dataset","label":"Classified as a data resource","kind":"deterministic","weight":0.5,"fraction":1.0,"signal":"is_dataset","rationale":null},{"key":"r_reusability","label":"Data-availability statement, license & reproducibility","kind":"llm","weight":2.0,"fraction":0.833,"signal":null,"rationale":"A clear CC-BY license is stated, code is on GitHub, data is on Figshare, and usage notes are provided, but reproducibility is limited because the original EHR data cannot be shared and the code requires an OMOP database."}]}},"suggestions":["Add structured metadata (e.g., schema.org/Dataset) to the API and Figshare page to enhance machine findability.","Include a persistent data-level identifier (e.g., DOI) for the data on Figshare, as the paper only mentions a citation.","Provide a tutorial or containerized environment (e.g., Docker) to allow easier reproduction of the analysis without requiring an OMOP database.","Document the exact version of the OMOP CDM and the date of extraction to improve provenance tracking.","Offer the association analysis results (chi-square, etc.) as precomputed files to reduce computational burden for reusers."],"model":"deepseek/deepseek-v4-flash","agent_version":"fair_agent_v2","fulltext_source":"epmc_xml"},"fair_model":"deepseek/deepseek-v4-flash","fair_agent_version":"fair_agent_v2","fair_fulltext_source":"epmc_xml","fair_has_llm":true,"fair_computed_at":"2026-06-18T00:42:47.691852Z","clinical_trials":[],"software_tools":[],"db_accessions":[],"linked_datasets":[],"topics":[]}