{"doi":"10.1038/s43588-024-00627-2","title":"MISATO: machine learning dataset of protein–ligand complexes for structure-based drug discovery","abstract":"Large language models have greatly enhanced our ability to understand biology and chemistry, yet robust methods for structure-based drug discovery, quantum chemistry and structural biology are still sparse. Precise biomolecule-ligand interaction datasets are urgently needed for large language models. To address this, we present MISATO, a dataset that combines quantum mechanical properties of small molecules and associated molecular dynamics simulations of ~20,000 experimental protein-ligand complexes with extensive validation of experimental data. Starting from the existing experimental structures, semi-empirical quantum mechanics was used to systematically refine these structures. A large collection of molecular dynamics traces of protein-ligand complexes in explicit water is included, accumulating over 170 μs. We give examples of machine learning (ML) baseline models proving an improvement of accuracy by employing our data. An easy entry point for ML experts is provided to enable the next generation of drug discovery artificial intelligence models.","journal":"Nature Computational Science","year":2024,"id":5380,"datarank":1.3928220601343537,"base_score":3.828641396489095,"endowment":3.828641396489095,"self_citation_contribution":0.5742962094733643,"citation_network_contribution":0.8185258506609895,"self_endowment_contribution":0.5742962094733643,"citer_contribution":0.8185258506609895,"corpus_percentile":61.35069161920261,"corpus_rank":476,"citation_count":64,"citer_count":62,"citers_with_citation_signal":39,"citers_with_endowment":39,"datacite_reuse_total":0,"is_dataset":true,"is_dataset_confidence":0.9132,"is_oa":true,"file_count":0,"downloads":0,"has_version_chain":false,"published_date":"2024-05-10","fair_score":52.9167,"fair_percentile":79.11169744942832,"algorithm_id":"datarank_citation_only_1hop_v6","ranking_scope":"data_only","authors":[{"id":3368,"name":"Filipe Menezes","orcid":"0000-0002-7630-5447","position":1,"is_corresponding":false},{"id":3369,"name":"Sabrina Benassou","orcid":null,"position":2,"is_corresponding":false},{"id":53240,"name":"Kieran Didi","orcid":"0000-0001-6839-3320","position":4,"is_corresponding":false},{"id":53241,"name":"André Santos Dias Mourão","orcid":null,"position":5,"is_corresponding":false},{"id":53242,"name":"Radosław Kitel","orcid":"0000-0003-4718-8082","position":6,"is_corresponding":false},{"id":53243,"name":"Pietro Liò","orcid":null,"position":7,"is_corresponding":false},{"id":3371,"name":"Stefan Kesselheim","orcid":"0000-0003-0940-5752","position":8,"is_corresponding":false},{"id":3372,"name":"Marie Piraud","orcid":"0000-0002-4917-2458","position":9,"is_corresponding":false},{"id":42,"name":"Fabian Joachim Theis","orcid":"0000-0002-2419-1943","position":10,"is_corresponding":false},{"id":3373,"name":"Michael Sattler","orcid":"0000-0002-1594-0527","position":11,"is_corresponding":false},{"id":3374,"name":"Grzegorz M. Popowicz","orcid":"0000-0003-2818-7498","position":12,"is_corresponding":false},{"id":3370,"name":"Erinç Merdivan","orcid":"0009-0004-9213-7393","position":13,"is_corresponding":false},{"id":53244,"name":"André Mourão","orcid":"0000-0003-0764-9868","position":14,"is_corresponding":false},{"id":53245,"name":"Píetro Lió","orcid":"0000-0002-0540-5053","position":15,"is_corresponding":false},{"id":3367,"name":"Till Siebenmorgen","orcid":"0009-0008-5160-8100","position":0,"is_corresponding":true}],"reference_count":78,"raw_metadata":null,"created_at":"2026-03-01T18:20:47.508186Z","pmid":"38730184","pmcid":"PMC11136668","fwci":null,"citation_percentile":null,"influential_citations":0,"oa_status":"hybrid","license":"cc-by","views":0,"total_file_size_bytes":0,"version_count":0,"fair_f":65.0,"fair_a":67.5,"fair_i":37.5,"fair_r":41.6667,"fair_zscore":0.697,"fair_rationale":{"fair_score":52.92,"has_llm":true,"dimensions":{"F":{"name":"Findable","score":65.0,"criteria":[{"key":"f_has_doi","label":"Has a persistent DOI","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"DOI present","rationale":null},{"key":"f_repository_presence","label":"Indexed in repositories / literature DBs","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"datacite=0, pmcid=True, pmid=True","rationale":null},{"key":"f_persistent_ids","label":"Resolvable scholarly identifiers (OpenAlex)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no OpenAlex id","rationale":null},{"key":"f_metadata_richness","label":"Rich, machine-readable metadata","kind":"llm","weight":1.0,"fraction":0.5,"signal":null,"rationale":"The paper describes many computed properties (e.g., QM, MD) but does not provide machine-readable metadata (e.g., schema.org, JSON-LD) or structured metadata beyond the text."}]},"A":{"name":"Accessible","score":67.5,"criteria":[{"key":"a_open_access","label":"Open Access / files deposited","kind":"deterministic","weight":1.5,"fraction":1.0,"signal":"Open Access","rationale":null},{"key":"a_retrievable","label":"Free full text retrievable","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"0 OA location(s)","rationale":null},{"key":"a_access_protocol","label":"Clear data/code access protocol","kind":"llm","weight":1.0,"fraction":0.75,"signal":null,"rationale":"The paper states the dataset is on Zenodo and code on GitHub with instructions, but does not specify a formal access protocol (e.g., API, authentication) or persistent identifier for the code beyond a GitHub URL."}]},"I":{"name":"Interoperable","score":37.5,"criteria":[{"key":"i_linked_data","label":"Linked datasets / DataCite relations","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"linked_datasets=0, datacite=0","rationale":null},{"key":"i_standard_ids","label":"References data via standard accessions","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"accessions=0, trials=0","rationale":null},{"key":"i_standards","label":"Standard formats, vocabularies & identifiers","kind":"llm","weight":1.0,"fraction":0.75,"signal":null,"rationale":"The paper uses standard formats (H5, MOL2, PDB) and identifiers (PDB IDs), but does not mention use of standard vocabularies or ontologies for the metadata."}]},"R":{"name":"Reusable","score":41.67,"criteria":[{"key":"r_license","label":"Clear, open reuse license","kind":"deterministic","weight":1.5,"fraction":0.0,"signal":"no license","rationale":null},{"key":"r_downloads","label":"Demonstrated reuse (downloads)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"downloads=0","rationale":null},{"key":"r_version","label":"Versioned / maintained","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no version chain","rationale":null},{"key":"r_dataset","label":"Classified as a data resource","kind":"deterministic","weight":0.5,"fraction":1.0,"signal":"is_dataset","rationale":null},{"key":"r_reusability","label":"Data-availability statement, license & reproducibility","kind":"llm","weight":2.0,"fraction":0.667,"signal":null,"rationale":"The paper provides a data-availability statement, a CC-BY 4.0 license, and code, but lacks a formal reproducibility statement and does not specify software versions or dependencies in a machine-readable way."}]}},"suggestions":["Add machine-readable metadata (e.g., JSON-LD or schema.org markup) to the dataset landing page.","Provide a persistent identifier (e.g., DOI) for the code repository and specify an access protocol (e.g., API or direct download link).","Use standard ontologies (e.g., CHEBI, SIO) to annotate the computed properties and metadata.","Include a formal reproducibility statement with exact software versions, dependencies, and container specifications.","Add a license file to the code repository and specify the license in the README."],"model":"deepseek/deepseek-v4-flash","agent_version":"fair_agent_v2","fulltext_source":"epmc_xml"},"fair_model":"deepseek/deepseek-v4-flash","fair_agent_version":"fair_agent_v2","fair_fulltext_source":"epmc_xml","fair_has_llm":true,"fair_computed_at":"2026-06-18T00:42:13.501250Z","clinical_trials":[],"software_tools":[],"db_accessions":[],"linked_datasets":[],"topics":[]}