{"doi":"10.1101/2023.05.24.542082","title":"MISATO - Machine learning dataset of protein-ligand complexes for structure-based drug discovery","abstract":"Large language models (LLMs) have greatly enhanced our ability to understand biology and chemistry. Yet, relatively few robust methods have been reported for structure-based drug discovery. Highly precise biomolecule-ligand interaction datasets are urgently needed in particular for LLMs, that require extensive training data. We present MISATO, the first dataset that combines quantum mechanics properties of small molecules and associated molecular dynamics simulations of about 20000 experimental protein-ligand complexes. Starting from the PDBbind dataset, semi-empirical quantum mechanics was used to systematically refine these structures. The largest collection to date of molecular dynamics traces of protein-ligand complexes in explicit water are included, accumulating to 170 μs. We give ML baseline models and simple Python data loaders, and aim to foster a thriving community around MISATO ( https://github.com/t7morgen/misato-dataset ). An easy entry point for ML experts is provided without the need of deep domain expertise to enable the next generation of drug discovery AI models.","journal":null,"year":2023,"id":3881,"datarank":1.0465582158928872,"base_score":3.1354942159291497,"endowment":3.1354942159291497,"self_citation_contribution":0.47032413238937254,"citation_network_contribution":0.5762340835035146,"self_endowment_contribution":0.47032413238937254,"citer_contribution":0.5762340835035146,"corpus_percentile":58.665581773799836,"corpus_rank":509,"citation_count":23,"citer_count":22,"citers_with_citation_signal":17,"citers_with_endowment":17,"datacite_reuse_total":0,"is_dataset":true,"is_dataset_confidence":0.9118,"is_oa":true,"file_count":0,"downloads":0,"has_version_chain":false,"published_date":"2023-05-24","fair_score":53.125,"fair_percentile":79.94722955145119,"algorithm_id":"datarank_citation_only_1hop_v6","ranking_scope":"data_only","authors":[{"id":3368,"name":"Filipe Menezes","orcid":"0000-0002-7630-5447","position":1,"is_corresponding":false},{"id":3369,"name":"Sabrina Benassou","orcid":null,"position":2,"is_corresponding":false},{"id":3371,"name":"Stefan Kesselheim","orcid":"0000-0003-0940-5752","position":4,"is_corresponding":false},{"id":3372,"name":"Marie Piraud","orcid":"0000-0002-4917-2458","position":5,"is_corresponding":false},{"id":42,"name":"Fabian Joachim Theis","orcid":"0000-0002-2419-1943","position":6,"is_corresponding":false},{"id":3373,"name":"Michael Sattler","orcid":"0000-0002-1594-0527","position":7,"is_corresponding":false},{"id":3374,"name":"Grzegorz M. Popowicz","orcid":"0000-0003-2818-7498","position":8,"is_corresponding":false},{"id":3370,"name":"Erinç Merdivan","orcid":"0009-0004-9213-7393","position":9,"is_corresponding":false},{"id":3367,"name":"Till Siebenmorgen","orcid":"0009-0008-5160-8100","position":0,"is_corresponding":true}],"reference_count":77,"raw_metadata":null,"created_at":"2026-03-01T18:20:47.508186Z","pmid":null,"pmcid":null,"fwci":null,"citation_percentile":null,"influential_citations":0,"oa_status":"green","license":"cc-by","views":0,"total_file_size_bytes":0,"version_count":0,"fair_f":45.0,"fair_a":80.0,"fair_i":37.5,"fair_r":50.0,"fair_zscore":0.7158,"fair_rationale":{"fair_score":53.12,"has_llm":true,"dimensions":{"F":{"name":"Findable","score":45.0,"criteria":[{"key":"f_has_doi","label":"Has a persistent DOI","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"DOI present","rationale":null},{"key":"f_repository_presence","label":"Indexed in repositories / literature DBs","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"datacite=0, pmcid=False, pmid=False","rationale":null},{"key":"f_persistent_ids","label":"Resolvable scholarly identifiers (OpenAlex)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no OpenAlex id","rationale":null},{"key":"f_metadata_richness","label":"Rich, machine-readable metadata","kind":"llm","weight":1.0,"fraction":0.5,"signal":null,"rationale":"The dataset uses H5 files with defined properties but lacks explicit machine-readable metadata standards or schema documentation."}]},"A":{"name":"Accessible","score":80.0,"criteria":[{"key":"a_open_access","label":"Open Access / files deposited","kind":"deterministic","weight":1.5,"fraction":1.0,"signal":"Open Access","rationale":null},{"key":"a_retrievable","label":"Free full text retrievable","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"0 OA location(s)","rationale":null},{"key":"a_access_protocol","label":"Clear data/code access protocol","kind":"llm","weight":1.0,"fraction":1.0,"signal":null,"rationale":"The paper provides direct links to Zenodo and GitHub with clear download and usage instructions, ensuring open access."}]},"I":{"name":"Interoperable","score":37.5,"criteria":[{"key":"i_linked_data","label":"Linked datasets / DataCite relations","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"linked_datasets=0, datacite=0","rationale":null},{"key":"i_standard_ids","label":"References data via standard accessions","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"accessions=0, trials=0","rationale":null},{"key":"i_standards","label":"Standard formats, vocabularies & identifiers","kind":"llm","weight":1.0,"fraction":0.75,"signal":null,"rationale":"The dataset uses standard PDB identifiers and common file formats (H5, MOL2, pdbqt) and standard computational methods, but lacks explicit reference to community metadata standards."}]},"R":{"name":"Reusable","score":50.0,"criteria":[{"key":"r_license","label":"Clear, open reuse license","kind":"deterministic","weight":1.5,"fraction":0.0,"signal":"no license","rationale":null},{"key":"r_downloads","label":"Demonstrated reuse (downloads)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"downloads=0","rationale":null},{"key":"r_version","label":"Versioned / maintained","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no version chain","rationale":null},{"key":"r_dataset","label":"Classified as a data resource","kind":"deterministic","weight":0.5,"fraction":1.0,"signal":"is_dataset","rationale":null},{"key":"r_reusability","label":"Data-availability statement, license & reproducibility","kind":"llm","weight":2.0,"fraction":0.833,"signal":null,"rationale":"The paper provides open access to data and code with detailed methods and reproducibility support, but the dataset license is not explicitly stated in the text."}]}},"suggestions":["Provide a formal metadata schema (e.g., JSON-LD) and use community standards like DCAT or schema.org to enhance machine-readability.","Explicitly state the license for the dataset (e.g., CC BY 4.0) in the data availability statement to clarify reuse terms.","Use standard ontologies for chemical and biological entities (e.g., ChEBI, UniProt) and provide mappings to improve interoperability.","Add a persistent identifier for the dataset version (e.g., a DOI for each release) to support precise citation and access.","Include a comprehensive data descriptor or README file with detailed metadata, usage examples, and provenance information."],"model":"deepseek/deepseek-v4-flash","agent_version":"fair_agent_v2","fulltext_source":"unpaywall_pdf"},"fair_model":"deepseek/deepseek-v4-flash","fair_agent_version":"fair_agent_v2","fair_fulltext_source":"unpaywall_pdf","fair_has_llm":true,"fair_computed_at":"2026-06-18T00:47:06.479118Z","clinical_trials":[],"software_tools":[],"db_accessions":[],"linked_datasets":[],"topics":[]}