{"doi":"10.7759/cureus.97260","title":"Apples-to-Apples: Age-Sex Standardisation of Public Chest X-ray Datasets","abstract":"Background Public chest radiograph datasets are widely used for model development and benchmarking, but differences in patient demographics can inflate apparent between-dataset differences in disease label prevalence. Objective To quantify the proportion of NIH ChestX-ray14 versus CheXpert prevalence differences that is explained by age and sex alone. Methods A cross-sectional analysis of NIH ChestX-ray14 (n=112,120 studies) and CheXpert (n=223,413) databases was performed. Sex was harmonised to Male/Female and age was categorised as 0-17, 18-39, 40-59, 60-79, and ≥80 years. Five shared labels were assessed: consolidation, atelectasis, pleural effusion, edema, and cardiomegaly. For CheXpert, label uncertainty (-1) was treated as negative in the primary analysis. For each label, we calculated crude prevalence with Wilson 95% confidence intervals and compared datasets using a two-proportion z-test. We then performed direct standardisation by reweighting CheXpert age-sex strata to the NIH age-sex distribution and reported the reduction in the crude prevalence gap attributable to age-sex adjustment. Results Crude prevalence was higher in CheXpert than NIH for all labels (all p<0.001). After age-sex standardisation, CheXpert prevalence decreased for every label, indicating that demographics account for a substantial share of between-dataset differences. For consolidation, the crude gap of 1.96 percentage points (6.12% vs 4.16%) decreased to a standardised gap of 1.47 percentage points (CheXpert standardised 5.63% vs NIH 4.16%), representing approximately a 25% reduction. For atelectasis, the gap declined from 4.85 to 2.84 percentage points (41% reduction approx.). For pleural effusion, the gap declined from 28.10 to 19.03 percentage points (32% reduction approx.). For edema, the gap declined from 21.70 to 14.78 percentage points (32% reduction approx.). For cardiomegaly, the gap declined from 9.45 to 6.55 percentage points (31% reduction approx.). Across labels, age-sex standardisation explained approximately 25% to 40% of the crude prevalence differences. Conclusion A simple age-sex standardisation step explains a large proportion of apparent label prevalence differences between NIH ChestX-ray14 and CheXpert. Routine reporting of standardised prevalence alongside crude estimates and demographic composition can improve fairness and interpretability in cross-dataset benchmarking and reduce the risk of attributing demographic composition effects to labelling or model performance.","journal":"Cureus","year":2025,"id":11134,"datarank":0.0,"base_score":0.0,"endowment":0.0,"self_citation_contribution":0.0,"citation_network_contribution":0.0,"self_endowment_contribution":0.0,"citer_contribution":0.0,"corpus_percentile":0.0,"corpus_rank":765,"citation_count":0,"citer_count":0,"citers_with_citation_signal":0,"citers_with_endowment":0,"datacite_reuse_total":0,"is_dataset":true,"is_dataset_confidence":0.6465,"is_oa":true,"file_count":0,"downloads":0,"has_version_chain":false,"published_date":"2025-11-19","fair_score":41.4583,"fair_percentile":20.734388742304308,"algorithm_id":"datarank_citation_only_1hop_v6","ranking_scope":"data_only","authors":[{"id":89929,"name":"Maiar Elhariry","orcid":"0009-0006-3011-9303","position":1,"is_corresponding":false},{"id":89930,"name":"Amrit Chirrimar","orcid":null,"position":2,"is_corresponding":false},{"id":89931,"name":"Ashrit Chohan","orcid":null,"position":3,"is_corresponding":false},{"id":89932,"name":"Ahmed Badawy","orcid":"0000-0002-1112-3001","position":4,"is_corresponding":false},{"id":89928,"name":"Amr Badawy","orcid":null,"position":0,"is_corresponding":true}],"reference_count":17,"raw_metadata":{"citation_network_status":"fetched"},"created_at":"2026-03-01T18:20:47.508186Z","pmid":"41426766","pmcid":"PMC12716848","fwci":null,"citation_percentile":null,"influential_citations":0,"oa_status":"diamond","license":"cc-by","views":0,"total_file_size_bytes":0,"version_count":0,"fair_f":52.5,"fair_a":55.0,"fair_i":25.0,"fair_r":33.3333,"fair_zscore":-0.3395,"fair_rationale":{"fair_score":41.46,"has_llm":true,"dimensions":{"F":{"name":"Findable","score":52.5,"criteria":[{"key":"f_has_doi","label":"Has a persistent DOI","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"DOI present","rationale":null},{"key":"f_repository_presence","label":"Indexed in repositories / literature DBs","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"datacite=0, pmcid=True, pmid=True","rationale":null},{"key":"f_persistent_ids","label":"Resolvable scholarly identifiers (OpenAlex)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no OpenAlex id","rationale":null},{"key":"f_metadata_richness","label":"Rich, machine-readable metadata","kind":"llm","weight":1.0,"fraction":0.25,"signal":null,"rationale":"The paper describes use of publicly available datasets but does not provide any machine-readable metadata (e.g., structured schema, vocabulary annotations, or DataCite metadata) for the derived data or analysis outputs."}]},"A":{"name":"Accessible","score":55.0,"criteria":[{"key":"a_open_access","label":"Open Access / files deposited","kind":"deterministic","weight":1.5,"fraction":1.0,"signal":"Open Access","rationale":null},{"key":"a_retrievable","label":"Free full text retrievable","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"0 OA location(s)","rationale":null},{"key":"a_access_protocol","label":"Clear data/code access protocol","kind":"llm","weight":1.0,"fraction":0.5,"signal":null,"rationale":"The paper cites the original data sources and their licenses but does not provide a direct link to the analysis code (Google Sheets) or a clear protocol for accessing the exact version of data used, limiting transparent access."}]},"I":{"name":"Interoperable","score":25.0,"criteria":[{"key":"i_linked_data","label":"Linked datasets / DataCite relations","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"linked_datasets=0, datacite=0","rationale":null},{"key":"i_standard_ids","label":"References data via standard accessions","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"accessions=0, trials=0","rationale":null},{"key":"i_standards","label":"Standard formats, vocabularies & identifiers","kind":"llm","weight":1.0,"fraction":0.5,"signal":null,"rationale":"Age categories and sex are harmonised but the paper does not reference standard terminologies (e.g., RadLex, SNOMED) for the radiological labels, nor does it use persistent identifiers for the datasets beyond URLs."}]},"R":{"name":"Reusable","score":33.33,"criteria":[{"key":"r_license","label":"Clear, open reuse license","kind":"deterministic","weight":1.5,"fraction":0.0,"signal":"no license","rationale":null},{"key":"r_downloads","label":"Demonstrated reuse (downloads)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"downloads=0","rationale":null},{"key":"r_version","label":"Versioned / maintained","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no version chain","rationale":null},{"key":"r_dataset","label":"Classified as a data resource","kind":"deterministic","weight":0.5,"fraction":1.0,"signal":"is_dataset","rationale":null},{"key":"r_reusability","label":"Data-availability statement, license & reproducibility","kind":"llm","weight":2.0,"fraction":0.5,"signal":null,"rationale":"The method is clearly described and the paper is open-licensed (CC-BY), but the analysis code (Google Sheets) is not shared, and the processed data (e.g., standardised prevalence tables per stratum) are not provided as downloadable files, reducing reproducibility."}]}},"suggestions":["Publish the Google Sheets link or a GitHub repository containing the exact formulae and intermediate data used for standardisation.","Provide a supplementary CSV file with the age-sex stratum weights and standardised prevalence calculations to enable independent reuse.","Assign DOIs or persistent identifiers to the specific versions of NIH ChestX-ray14 and CheXpert metadata used, and reference them in the paper.","Include a machine-readable DataCite metadata record for the study outputs, describing variables, licenses, and access conditions."],"model":"deepseek/deepseek-v4-flash","agent_version":"fair_agent_v2","fulltext_source":"epmc_xml"},"fair_model":"deepseek/deepseek-v4-flash","fair_agent_version":"fair_agent_v2","fair_fulltext_source":"epmc_xml","fair_has_llm":true,"fair_computed_at":"2026-06-18T06:47:28.225669Z","clinical_trials":[],"software_tools":[],"db_accessions":[],"linked_datasets":[],"topics":[]}