{"doi":"10.1093/jamiaopen/ooad045","title":"A certified de-identification system for all clinical text documents for information extraction at scale","abstract":"<h4>Objectives</h4>Clinical notes are a veritable treasure trove of information on a patient's disease progression, medical history, and treatment plans, yet are locked in secured databases accessible for research only after extensive ethics review. Removing personally identifying and protected health information (PII/PHI) from the records can reduce the need for additional Institutional Review Boards (IRB) reviews. In this project, our goals were to: (1) develop a robust and scalable clinical text de-identification pipeline that is compliant with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule for de-identification standards and (2) share routinely updated de-identified clinical notes with researchers.<h4>Materials and methods</h4>Building on our open-source de-identification software called Philter, we added features to: (1) make the algorithm and the de-identified data HIPAA compliant, which also implies type 2 error-free redaction, as certified via external audit; (2) reduce over-redaction errors; and (3) normalize and shift date PHI. We also established a streamlined de-identification pipeline using MongoDB to automatically extract clinical notes and provide truly de-identified notes to researchers with periodic monthly refreshes at our institution.<h4>Results</h4>To the best of our knowledge, the Philter V1.0 pipeline is currently the <i>first</i> and <i>only</i> certified, de-identified redaction pipeline that makes clinical notes available to researchers for nonhuman subjects' research, without further IRB approval needed. To date, we have made over 130 million certified de-identified clinical notes available to over 600 UCSF researchers. These notes were collected over the past 40 years, and represent data from 2757016 UCSF patients.","journal":"JAMIA Open","year":2023,"id":5956,"datarank":1.108771855063139,"base_score":3.8712010109078907,"endowment":3.8712010109078907,"self_citation_contribution":0.5806801516361837,"citation_network_contribution":0.5280917034269551,"self_endowment_contribution":0.5806801516361837,"citer_contribution":0.5280917034269551,"corpus_percentile":null,"corpus_rank":null,"citation_count":47,"citer_count":32,"citers_with_citation_signal":16,"citers_with_endowment":16,"datacite_reuse_total":0,"is_dataset":false,"is_dataset_confidence":0.1074,"is_oa":true,"file_count":0,"downloads":0,"has_version_chain":false,"published_date":"2023-07-04","fair_score":null,"fair_percentile":null,"algorithm_id":"datarank_citation_only_1hop_v6","ranking_scope":"data_only","authors":[{"id":56634,"name":"Gundolf Schenk","orcid":"0000-0003-1240-9949","position":1,"is_corresponding":false},{"id":56635,"name":"Kathleen Muenzen","orcid":"0000-0002-0840-614X","position":2,"is_corresponding":false},{"id":21592,"name":"Boris Oskotsky","orcid":"0000-0002-9364-7051","position":3,"is_corresponding":false},{"id":56636,"name":"Habibeh Ashouri Choshali","orcid":null,"position":4,"is_corresponding":false},{"id":56637,"name":"Thomas Plunkett","orcid":null,"position":5,"is_corresponding":false},{"id":37890,"name":"Sharat Israni","orcid":"0000-0003-0100-5826","position":6,"is_corresponding":false},{"id":51,"name":"Atul Janardhan Butte","orcid":"0000-0002-7433-2740","position":7,"is_corresponding":false},{"id":51598,"name":"Lakshmi Radhakrishnan","orcid":"0009-0007-9409-1044","position":0,"is_corresponding":true}],"reference_count":17,"raw_metadata":{"citation_network_status":"fetched"},"created_at":"2026-03-01T18:20:47.508186Z","pmid":null,"pmcid":null,"fwci":null,"citation_percentile":null,"influential_citations":0,"oa_status":null,"license":null,"views":0,"total_file_size_bytes":0,"version_count":0,"fair_f":null,"fair_a":null,"fair_i":null,"fair_r":null,"fair_zscore":null,"fair_rationale":null,"fair_model":null,"fair_agent_version":null,"fair_fulltext_source":null,"fair_has_llm":null,"fair_computed_at":null,"clinical_trials":[],"software_tools":[],"db_accessions":[],"linked_datasets":[],"topics":[]}