{"doi":"10.1128/msystems.00341-22","title":"Computationally Efficient Assembly of Pseudomonas aeruginosa Gene Expression Compendia","abstract":"Thousands of Pseudomonas aeruginosa RNA sequencing (RNA-seq) gene expression profiles are publicly available via the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). In this work, the transcriptional profiles from hundreds of studies performed by over 75 research groups were reanalyzed in aggregate to create a powerful tool for hypothesis generation and testing. Raw sequence data were uniformly processed using the Salmon pseudoaligner, and this read mapping method was validated by comparison to a direct alignment method. We developed filtering criteria to exclude samples with aberrant levels of housekeeping gene expression or an unexpected number of genes with no reported values and normalized the filtered compendia using the ratio-of-medians method. The filtering and normalization steps greatly improved gene expression correlations for genes within the same operon or regulon across the 2,333 samples. Since the RNA-seq data were generated using diverse strains, we report the effects of mapping samples to noncognate reference genomes by separately analyzing all samples mapped to cDNA reference genomes for strains PAO1 and PA14, two divergent strains that were used to generate most of the samples. Finally, we developed an algorithm to incorporate new data as they are deposited into the SRA. Our processing and quality control methods provide a scalable framework for taking advantage of the troves of biological information hibernating in the depths of microbial gene expression data and yield useful tools for P. aeruginosa RNA-seq data to be leveraged for diverse research goals. <b>IMPORTANCE</b> Pseudomonas aeruginosa is a causative agent of a wide range of infections, including chronic infections associated with cystic fibrosis. These P. aeruginosa infections are difficult to treat and often have negative outcomes. To aid in the study of this problematic pathogen, we mapped, filtered for quality, and normalized thousands of P. aeruginosa RNA-seq gene expression profiles that were publicly available via the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). The resulting compendia facilitate analyses across experiments, strains, and conditions. Ultimately, the workflow that we present could be applied to analyses of other microbial species.","journal":"mSystems","year":2023,"id":10777,"datarank":0.34182725961498284,"base_score":1.9459101490553132,"endowment":1.9459101490553132,"self_citation_contribution":0.29188652235829704,"citation_network_contribution":0.0499407372566858,"self_endowment_contribution":0.29188652235829704,"citer_contribution":0.0499407372566858,"corpus_percentile":48.57607811228641,"corpus_rank":633,"citation_count":6,"citer_count":2,"citers_with_citation_signal":2,"citers_with_endowment":2,"datacite_reuse_total":0,"is_dataset":true,"is_dataset_confidence":0.9329,"is_oa":true,"file_count":0,"downloads":0,"has_version_chain":false,"published_date":"2023-02-23","fair_score":49.7917,"fair_percentile":77.9023746701847,"algorithm_id":"datarank_citation_only_1hop_v6","ranking_scope":"data_only","authors":[{"id":452,"name":"Alexandra J. Lee","orcid":"0000-0002-0208-3730","position":1,"is_corresponding":false},{"id":42147,"name":"Samuel L. Neff","orcid":"0000-0002-5993-8445","position":2,"is_corresponding":false},{"id":453,"name":"Taylor Reiter","orcid":"0000-0002-7388-421X","position":3,"is_corresponding":false},{"id":88382,"name":"Jacob D. Holt","orcid":null,"position":4,"is_corresponding":false},{"id":65140,"name":"BRUCE A. STANTON","orcid":"0000-0002-1661-407X","position":5,"is_corresponding":false},{"id":308,"name":"Casey S. Greene","orcid":"0000-0001-8713-9213","position":6,"is_corresponding":false},{"id":456,"name":"Deborah A. Hogan","orcid":"0000-0002-6366-2971","position":7,"is_corresponding":false},{"id":454,"name":"Georgia Doing","orcid":"0000-0002-0835-6955","position":0,"is_corresponding":true}],"reference_count":61,"raw_metadata":{"citation_network_status":"fetched"},"created_at":"2026-03-01T18:20:47.508186Z","pmid":"36541761","pmcid":"PMC9948711","fwci":null,"citation_percentile":null,"influential_citations":0,"oa_status":"gold","license":"cc-by","views":0,"total_file_size_bytes":0,"version_count":0,"fair_f":65.0,"fair_a":67.5,"fair_i":25.0,"fair_r":41.6667,"fair_zscore":0.4143,"fair_rationale":{"fair_score":49.79,"has_llm":true,"dimensions":{"F":{"name":"Findable","score":65.0,"criteria":[{"key":"f_has_doi","label":"Has a persistent DOI","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"DOI present","rationale":null},{"key":"f_repository_presence","label":"Indexed in repositories / literature DBs","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"datacite=0, pmcid=True, pmid=True","rationale":null},{"key":"f_persistent_ids","label":"Resolvable scholarly identifiers (OpenAlex)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no OpenAlex id","rationale":null},{"key":"f_metadata_richness","label":"Rich, machine-readable metadata","kind":"llm","weight":1.0,"fraction":0.5,"signal":null,"rationale":"The paper provides human-readable metadata (e.g., strain, media, treatment) for about half the studies via GEO, but does not provide machine-readable metadata (e.g., structured JSON-LD, schema.org annotations) for the compendia themselves."}]},"A":{"name":"Accessible","score":67.5,"criteria":[{"key":"a_open_access","label":"Open Access / files deposited","kind":"deterministic","weight":1.5,"fraction":1.0,"signal":"Open Access","rationale":null},{"key":"a_retrievable","label":"Free full text retrievable","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"0 OA location(s)","rationale":null},{"key":"a_access_protocol","label":"Clear data/code access protocol","kind":"llm","weight":1.0,"fraction":0.75,"signal":null,"rationale":"The paper clearly states that data are available at OSF (https://osf.io/s9gyu/) and code at GitHub (https://github.com/hoganlab-dartmouth/pa-seq-compendia), but does not specify a formal access protocol (e.g., API, authentication requirements) for the OSF repository."}]},"I":{"name":"Interoperable","score":25.0,"criteria":[{"key":"i_linked_data","label":"Linked datasets / DataCite relations","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"linked_datasets=0, datacite=0","rationale":null},{"key":"i_standard_ids","label":"References data via standard accessions","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"accessions=0, trials=0","rationale":null},{"key":"i_standards","label":"Standard formats, vocabularies & identifiers","kind":"llm","weight":1.0,"fraction":0.5,"signal":null,"rationale":"The paper uses standard formats (FASTQ, TPM, counts) and standard identifiers (NCBI SRA, GEO, KEGG, GO), but does not use community-standard vocabularies for metadata (e.g., MIAME, MINSEQE) and does not provide a formal data dictionary or ontology mapping."}]},"R":{"name":"Reusable","score":41.67,"criteria":[{"key":"r_license","label":"Clear, open reuse license","kind":"deterministic","weight":1.5,"fraction":0.0,"signal":"no license","rationale":null},{"key":"r_downloads","label":"Demonstrated reuse (downloads)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"downloads=0","rationale":null},{"key":"r_version","label":"Versioned / maintained","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no version chain","rationale":null},{"key":"r_dataset","label":"Classified as a data resource","kind":"deterministic","weight":0.5,"fraction":1.0,"signal":"is_dataset","rationale":null},{"key":"r_reusability","label":"Data-availability statement, license & reproducibility","kind":"llm","weight":2.0,"fraction":0.667,"signal":null,"rationale":"The paper provides a clear data-availability statement, a CC-BY 4.0 license, and code for reproducibility, but the license is only mentioned for supplementary materials, not explicitly for the compendia themselves, and the metadata are incomplete (only ~50% of studies annotated)."}]}},"suggestions":["Add machine-readable metadata (e.g., JSON-LD with schema.org/Dataset) to the OSF repository to improve findability.","Provide a formal data-access protocol (e.g., REST API or direct download instructions) for the OSF repository.","Use community-standard metadata schemas (e.g., MIAME or MINSEQE) and controlled vocabularies (e.g., NCBI BioSample attributes) for all samples.","Explicitly state the license (CC-BY 4.0) for the compendia data files, not just for supplementary materials.","Complete metadata annotation for all 277 BioProjects, not just the ~50% in GEO, to enhance reusability."],"model":"deepseek/deepseek-v4-flash","agent_version":"fair_agent_v2","fulltext_source":"epmc_xml"},"fair_model":"deepseek/deepseek-v4-flash","fair_agent_version":"fair_agent_v2","fair_fulltext_source":"epmc_xml","fair_has_llm":true,"fair_computed_at":"2026-06-18T00:51:59.532650Z","clinical_trials":[],"software_tools":[],"db_accessions":[],"linked_datasets":[],"topics":[]}