{"doi":"10.1101/2024.06.12.598655","title":"Signals in the Cells: Multimodal and Contextualized Machine Learning Foundations for Therapeutics","abstract":"Drug discovery AI datasets and benchmarks have not traditionally included single-cell analysis biomarkers. While benchmarking efforts in single-cell analysis have recently released collections of single-cell tasks, they have yet to comprehensively release datasets, models, and benchmarks that integrate a broad range of therapeutic discovery tasks with cell-type-specific biomarkers. Therapeutics Commons (TDC-2) presents datasets, tools, models, and benchmarks integrating cell-type-specific contextual features with ML tasks across therapeutics. We present four tasks for contextual learning at single-cell resolution: drug-target nomination, genetic perturbation response prediction, chemical perturbation response prediction, and protein-peptide interaction prediction. We introduce datasets, models, and benchmarks for these four tasks. Finally, we detail the advancements and challenges in machine learning and biology that drove the implementation of TDC-2 and how they are reflected in its architecture, datasets and benchmarks, and foundation model tooling.","journal":null,"year":2024,"id":2493,"datarank":0.5321429271892798,"base_score":2.302585092994046,"endowment":2.302585092994046,"self_citation_contribution":0.3453877639491069,"citation_network_contribution":0.18675516324017286,"self_endowment_contribution":0.3453877639491069,"citer_contribution":0.18675516324017286,"corpus_percentile":51.912123677786816,"corpus_rank":592,"citation_count":12,"citer_count":11,"citers_with_citation_signal":6,"citers_with_endowment":6,"datacite_reuse_total":0,"is_dataset":true,"is_dataset_confidence":0.9364,"is_oa":true,"file_count":0,"downloads":0,"has_version_chain":false,"published_date":"2024-06-14","fair_score":49.7917,"fair_percentile":77.9023746701847,"algorithm_id":"datarank_citation_only_1hop_v6","ranking_scope":"data_only","authors":[{"id":29861,"name":"Xiang Lin","orcid":"0000-0002-7634-5780","position":1,"is_corresponding":false},{"id":29862,"name":"Michelle M. Li","orcid":"0000-0003-0223-7485","position":2,"is_corresponding":false},{"id":29863,"name":"Kexin Huang","orcid":"0000-0001-6693-8390","position":3,"is_corresponding":false},{"id":29864,"name":"Wenhao Gao","orcid":"0000-0002-6506-8044","position":4,"is_corresponding":false},{"id":29865,"name":"Tianfan Fu","orcid":"0000-0002-5574-2541","position":5,"is_corresponding":false},{"id":29866,"name":"Bradley L. Pentelute","orcid":"0000-0002-7242-801X","position":6,"is_corresponding":false},{"id":14693,"name":"Sharon L. R. Kardia","orcid":"0000-0002-9853-3379","position":7,"is_corresponding":false},{"id":29867,"name":"Marinka Zitnik","orcid":"0000-0001-8530-7228","position":8,"is_corresponding":false},{"id":29860,"name":"Alejandro Velez-Arce","orcid":"0009-0009-2303-6114","position":0,"is_corresponding":true}],"reference_count":164,"raw_metadata":null,"created_at":"2026-03-01T18:20:47.508186Z","pmid":"38948789","pmcid":"PMC11212894","fwci":null,"citation_percentile":null,"influential_citations":0,"oa_status":"green","license":"cc-by","views":0,"total_file_size_bytes":0,"version_count":0,"fair_f":65.0,"fair_a":67.5,"fair_i":25.0,"fair_r":41.6667,"fair_zscore":0.4143,"fair_rationale":{"fair_score":49.79,"has_llm":true,"dimensions":{"F":{"name":"Findable","score":65.0,"criteria":[{"key":"f_has_doi","label":"Has a persistent DOI","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"DOI present","rationale":null},{"key":"f_repository_presence","label":"Indexed in repositories / literature DBs","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"datacite=0, pmcid=True, pmid=True","rationale":null},{"key":"f_persistent_ids","label":"Resolvable scholarly identifiers (OpenAlex)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no OpenAlex id","rationale":null},{"key":"f_metadata_richness","label":"Rich, machine-readable metadata","kind":"llm","weight":1.0,"fraction":0.5,"signal":null,"rationale":"The paper provides URLs and DOIs for datasets and code (e.g., GitHub, Dataverse, website), which are human-readable identifiers but lacks formal machine-readable metadata (e.g., structured schema.org annotations, JSON-LD, or RDF) in the text."}]},"A":{"name":"Accessible","score":67.5,"criteria":[{"key":"a_open_access","label":"Open Access / files deposited","kind":"deterministic","weight":1.5,"fraction":1.0,"signal":"Open Access","rationale":null},{"key":"a_retrievable","label":"Free full text retrievable","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"0 OA location(s)","rationale":null},{"key":"a_access_protocol","label":"Clear data/code access protocol","kind":"llm","weight":1.0,"fraction":0.75,"signal":null,"rationale":"Clear data and code access protocols are given via URLs (GitHub, Dataverse, website), and the CC BY 4.0 license is stated, but there is no explicit description of authentication or authorization steps for accessing the API or data."}]},"I":{"name":"Interoperable","score":25.0,"criteria":[{"key":"i_linked_data","label":"Linked datasets / DataCite relations","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"linked_datasets=0, datacite=0","rationale":null},{"key":"i_standard_ids","label":"References data via standard accessions","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"accessions=0, trials=0","rationale":null},{"key":"i_standards","label":"Standard formats, vocabularies & identifiers","kind":"llm","weight":1.0,"fraction":0.5,"signal":null,"rationale":"The paper uses standard file formats (CSV, Python code) and references common ontologies (e.g., Open Targets, CellXGene), but does not specify use of formal standard vocabularies or unique identifiers (e.g., ORCID, PubChem CID, UniProt AC) in the dataset descriptions."}]},"R":{"name":"Reusable","score":41.67,"criteria":[{"key":"r_license","label":"Clear, open reuse license","kind":"deterministic","weight":1.5,"fraction":0.0,"signal":"no license","rationale":null},{"key":"r_downloads","label":"Demonstrated reuse (downloads)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"downloads=0","rationale":null},{"key":"r_version","label":"Versioned / maintained","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no version chain","rationale":null},{"key":"r_dataset","label":"Classified as a data resource","kind":"deterministic","weight":0.5,"fraction":1.0,"signal":"is_dataset","rationale":null},{"key":"r_reusability","label":"Data-availability statement, license & reproducibility","kind":"llm","weight":2.0,"fraction":0.667,"signal":null,"rationale":"A data-availability statement with persistent identifiers (DOI, GitHub) and a clear CC BY 4.0 license are provided, but reproducibility instructions are incomplete (e.g., not all hyperparameters or environment details are fully specified), and the paper lacks a formal reproducibility checklist or containerization."}]}},"suggestions":["Add machine-readable metadata (e.g., schema.org JSON-LD) to the dataset landing pages and paper.","Document authentication/authorization steps for API access (e.g., API key requirements) in the paper or repository.","Use persistent standardized identifiers (e.g., UniProt IDs, PubChem CIDs) in all dataset examples and tables.","Include a full reproducibility capsule (e.g., Dockerfile, conda environment.yml, or Binder link) in the GitHub repository.","Provide a formal data dictionary or schema for all datasets, specifying field types and controlled vocabularies."],"model":"deepseek/deepseek-v4-flash","agent_version":"fair_agent_v2","fulltext_source":"epmc_xml"},"fair_model":"deepseek/deepseek-v4-flash","fair_agent_version":"fair_agent_v2","fair_fulltext_source":"epmc_xml","fair_has_llm":true,"fair_computed_at":"2026-06-18T00:49:42.024429Z","clinical_trials":[],"software_tools":[],"db_accessions":[],"linked_datasets":[],"topics":[]}