{"doi":"10.7717/peerj.1621","title":"Cross-platform normalization of microarray and RNA-seq data for machine learning applications","abstract":"Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. 
We also provide a TDM package for the R programming language.","journal":"PeerJ","year":2016,"id":4772,"datarank":3.7089562870858392,"base_score":4.61512051684126,"endowment":4.61512051684126,"self_citation_contribution":0.692268077526189,"citation_network_contribution":3.0166882095596503,"self_endowment_contribution":0.692268077526189,"citer_contribution":3.0166882095596503,"corpus_percentile":69.16192026037429,"corpus_rank":380,"citation_count":102,"citer_count":87,"citers_with_citation_signal":72,"citers_with_endowment":72,"datacite_reuse_total":0,"is_dataset":true,"is_dataset_confidence":0.6598,"is_oa":true,"file_count":0,"downloads":0,"has_version_chain":false,"published_date":"2016-01-21","fair_score":35.2083,"fair_percentile":17.282321899736147,"algorithm_id":"datarank_citation_only_1hop_v6","ranking_scope":"data_only","authors":[{"id":18253,"name":"Jie Tan","orcid":"0000-0002-8893-4566","position":1,"is_corresponding":false},{"id":308,"name":"Casey S. Greene","orcid":"0000-0001-8713-9213","position":2,"is_corresponding":false},{"id":18254,"name":"Jeffrey Thompson","orcid":"0000-0002-0876-2582","position":3,"is_corresponding":false}],"reference_count":38,"raw_metadata":{"citation_network_status":"fetched"},"created_at":"2026-03-01T18:20:47.508186Z","pmid":"26844019","pmcid":"PMC4736986","fwci":null,"citation_percentile":null,"influential_citations":0,"oa_status":"gold","license":"cc-by","views":0,"total_file_size_bytes":0,"version_count":0,"fair_f":40.0,"fair_a":55.0,"fair_i":12.5,"fair_r":33.3333,"fair_zscore":-0.9048,"fair_rationale":{"fair_score":35.21,"has_llm":true,"dimensions":{"F":{"name":"Findable","score":40.0,"criteria":[{"key":"f_has_doi","label":"Has a persistent DOI","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"DOI present","rationale":null},{"key":"f_repository_presence","label":"Indexed in repositories / literature DBs","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"datacite=0, pmcid=True, 
pmid=True","rationale":null},{"key":"f_persistent_ids","label":"Resolvable scholarly identifiers (OpenAlex)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no OpenAlex id","rationale":null},{"key":"f_metadata_richness","label":"Rich, machine-readable metadata","kind":"llm","weight":1.0,"fraction":0.0,"signal":null,"rationale":"No machine-readable metadata (e.g., structured schema.org markup, standardized data descriptors) is mentioned in the paper text."}]},"A":{"name":"Accessible","score":55.0,"criteria":[{"key":"a_open_access","label":"Open Access / files deposited","kind":"deterministic","weight":1.5,"fraction":1.0,"signal":"Open Access","rationale":null},{"key":"a_retrievable","label":"Free full text retrievable","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"0 OA location(s)","rationale":null},{"key":"a_access_protocol","label":"Clear data/code access protocol","kind":"llm","weight":1.0,"fraction":0.5,"signal":null,"rationale":"The paper provides DOIs and URLs for the TDM R package and source code (zenodo.32852, zenodo.32851, github.com/greenelab/TDM, github.com/greenelab/TDMresults), but does not specify a formal access protocol or repository with persistent access guarantees beyond these links."}]},"I":{"name":"Interoperable","score":12.5,"criteria":[{"key":"i_linked_data","label":"Linked datasets / DataCite relations","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"linked_datasets=0, datacite=0","rationale":null},{"key":"i_standard_ids","label":"References data via standard accessions","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"accessions=0, trials=0","rationale":null},{"key":"i_standards","label":"Standard formats, vocabularies & identifiers","kind":"llm","weight":1.0,"fraction":0.25,"signal":null,"rationale":"The paper uses standard file formats (e.g., R packages, supplemental files) but does not mention use of standard controlled vocabularies or identifiers for data elements (e.g., gene symbols, 
ontologies) beyond generic references."}]},"R":{"name":"Reusable","score":33.33,"criteria":[{"key":"r_license","label":"Clear, open reuse license","kind":"deterministic","weight":1.5,"fraction":0.0,"signal":"no license","rationale":null},{"key":"r_downloads","label":"Demonstrated reuse (downloads)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"downloads=0","rationale":null},{"key":"r_version","label":"Versioned / maintained","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no version chain","rationale":null},{"key":"r_dataset","label":"Classified as a data resource","kind":"deterministic","weight":0.5,"fraction":1.0,"signal":"is_dataset","rationale":null},{"key":"r_reusability","label":"Data-availability statement, license & reproducibility","kind":"llm","weight":2.0,"fraction":0.5,"signal":null,"rationale":"The paper includes a data-availability statement with DOIs for code and a BSD-3 license for the R package, but lacks explicit mention of a data repository for the underlying datasets (only simulated and TCGA data are referenced) and does not provide full reproducibility details (e.g., exact versions of all dependencies)."}]}},"suggestions":["Add machine-readable metadata (e.g., JSON-LD or schema.org markup) to the paper's landing page to describe datasets and methods.","Deposit all raw and processed data in a persistent repository (e.g., Zenodo, Figshare) with a formal access protocol and license.","Use standard ontologies (e.g., Gene Ontology, EFO) for gene and condition identifiers to enhance interoperability.","Provide a complete computational environment (e.g., Dockerfile or Binder) to ensure full reproducibility of results.","Include a formal data-availability statement that explicitly states where all datasets (including simulated) are archived and under what 
license."],"model":"deepseek/deepseek-v4-flash","agent_version":"fair_agent_v2","fulltext_source":"epmc_xml"},"fair_model":"deepseek/deepseek-v4-flash","fair_agent_version":"fair_agent_v2","fair_fulltext_source":"epmc_xml","fair_has_llm":true,"fair_computed_at":"2026-06-18T00:39:55.549269Z","clinical_trials":[],"software_tools":[],"db_accessions":[],"linked_datasets":[],"topics":[]}