{"doi":"10.3758/s13428-021-01698-z","title":"LOCO: The 88-million-word language of conspiracy corpus","abstract":"The spread of online conspiracy theories represents a serious threat to society. To understand the content of conspiracies, here we present the language of conspiracy (LOCO) corpus. LOCO is an 88-million-token corpus composed of topic-matched conspiracy (N = 23,937) and mainstream (N = 72,806) documents harvested from 150 websites. Mimicking internet user behavior, documents were identified using Google by crossing a set of seed phrases with a set of websites. LOCO is hierarchically structured, meaning that each document is cross-nested within websites (N = 150) and topics (N = 600, on three different resolutions). A rich set of linguistic features (N = 287) and metadata includes upload date, measures of social media engagement, measures of website popularity, size, and traffic, as well as political bias and factual reporting annotations. We explored LOCO's features from different perspectives showing that documents track important societal events through time (e.g., Princess Diana's death, Sandy Hook school shooting, coronavirus outbreaks), while patterns of lexical features (e.g., deception, power, dominance) overlap with those extracted from online social media communities dedicated to conspiracy theories. By computing within-subcorpus cosine similarity, we derived a subset of the most representative conspiracy documents (N = 4,227), which, compared to other conspiracy documents, display prototypical and exaggerated conspiratorial language and are more frequently shared on Facebook. We also show that conspiracy website users navigate to websites via more direct means than mainstream users, suggesting confirmation bias. LOCO and related datasets are freely available at https://osf.io/snpcg/ .","journal":"Behavior Research Methods","year":2021,"id":2039,"datarank":0.7890012261362598,"base_score":3.367295829986474,"endowment":3.367295829986474,"self_citation_contribution":0.5050943744979712,"citation_network_contribution":0.2839068516382886,"self_endowment_contribution":0.5050943744979712,"citer_contribution":0.2839068516382886,"corpus_percentile":55.98047192839707,"corpus_rank":542,"citation_count":30,"citer_count":19,"citers_with_citation_signal":11,"citers_with_endowment":11,"datacite_reuse_total":0,"is_dataset":true,"is_dataset_confidence":0.9274,"is_oa":true,"file_count":0,"downloads":0,"has_version_chain":false,"published_date":"2021-10-25","fair_score":60.2083,"fair_percentile":92.52418645558487,"algorithm_id":"datarank_citation_only_1hop_v6","ranking_scope":"data_only","authors":[{"id":3862,"name":"Thomas Hills","orcid":"0000-0003-0322-5822","position":1,"is_corresponding":false},{"id":23949,"name":"Adrian Bangerter","orcid":"0000-0001-6989-8654","position":2,"is_corresponding":false},{"id":23950,"name":"Thomas T. Hills","orcid":"0000-0003-3842-2076","position":3,"is_corresponding":false},{"id":23948,"name":"Alessandro Miani","orcid":"0000-0001-6610-3510","position":0,"is_corresponding":true}],"reference_count":109,"raw_metadata":null,"created_at":"2026-03-01T18:20:47.508186Z","pmid":"34697754","pmcid":"PMC8545361","fwci":null,"citation_percentile":null,"influential_citations":0,"oa_status":"hybrid","license":"cc-by","views":0,"total_file_size_bytes":0,"version_count":0,"fair_f":77.5,"fair_a":80.0,"fair_i":25.0,"fair_r":58.3333,"fair_zscore":1.3566,"fair_rationale":{"fair_score":60.21,"has_llm":true,"dimensions":{"F":{"name":"Findable","score":77.5,"criteria":[{"key":"f_has_doi","label":"Has a persistent DOI","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"DOI present","rationale":null},{"key":"f_repository_presence","label":"Indexed in repositories / literature DBs","kind":"deterministic","weight":1.0,"fraction":1.0,"signal":"datacite=0, pmcid=True, pmid=True","rationale":null},{"key":"f_persistent_ids","label":"Resolvable scholarly identifiers (OpenAlex)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no OpenAlex id","rationale":null},{"key":"f_metadata_richness","label":"Rich, machine-readable metadata","kind":"llm","weight":1.0,"fraction":0.75,"signal":null,"rationale":"The paper provides rich metadata (e.g., lexical features, topic labels, website metrics, political bias, factual reporting), but there is no evidence of machine-readable semantic markup or structured metadata that would enable automated discovery without parsing JSON files."}]},"A":{"name":"Accessible","score":80.0,"criteria":[{"key":"a_open_access","label":"Open Access / files deposited","kind":"deterministic","weight":1.5,"fraction":1.0,"signal":"Open Access","rationale":null},{"key":"a_retrievable","label":"Free full text retrievable","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"0 OA location(s)","rationale":null},{"key":"a_access_protocol","label":"Clear data/code access protocol","kind":"llm","weight":1.0,"fraction":1.0,"signal":null,"rationale":"The data are freely available at a persistent URL (OSF), and the paper clearly describes the download location, file formats (JSON, PDF), and a standard license (CC BY 4.0), meeting the highest standard for accessibility."}]},"I":{"name":"Interoperable","score":25.0,"criteria":[{"key":"i_linked_data","label":"Linked datasets / DataCite relations","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"linked_datasets=0, datacite=0","rationale":null},{"key":"i_standard_ids","label":"References data via standard accessions","kind":"deterministic","weight":1.0,"fraction":0.0,"signal":"accessions=0, trials=0","rationale":null},{"key":"i_standards","label":"Standard formats, vocabularies & identifiers","kind":"llm","weight":1.0,"fraction":0.5,"signal":null,"rationale":"The corpus uses standard formats (JSON, PDF) and standard linguistic tools (LIWC, Empath, LDA), but does not employ community-endorsed controlled vocabularies, established identifiers for topics/features, or interoperability standards beyond file format."}]},"R":{"name":"Reusable","score":58.33,"criteria":[{"key":"r_license","label":"Clear, open reuse license","kind":"deterministic","weight":1.5,"fraction":0.0,"signal":"no license","rationale":null},{"key":"r_downloads","label":"Demonstrated reuse (downloads)","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"downloads=0","rationale":null},{"key":"r_version","label":"Versioned / maintained","kind":"deterministic","weight":0.5,"fraction":0.0,"signal":"no version chain","rationale":null},{"key":"r_dataset","label":"Classified as a data resource","kind":"deterministic","weight":0.5,"fraction":1.0,"signal":"is_dataset","rationale":null},{"key":"r_reusability","label":"Data-availability statement, license & reproducibility","kind":"llm","weight":2.0,"fraction":1.0,"signal":null,"rationale":"The data are openly licensed (CC BY 4.0) with a clear data-availability statement and persistent access, and the paper provides extensive methodological detail to support reproducibility, achieving full reuse potential."}]}},"suggestions":["Add a machine-readable metadata schema (e.g., schema.org or DCAT) to the corpus landing page to improve automated discovery.","Assign persistent identifiers (e.g., DOIs) to each individual sub-dataset (e.g., LOCO.json, lexical features) to facilitate fine-grained citation and reuse.","Include a data dictionary or codebook as a standardized, machine-parseable table (e.g., CSV or RDF) to enhance interpretability.","Adopt community-standard controlled vocabularies (e.g., from linguistics or NLP) for topic labels and lexical categories to improve cross-corpus interoperability.","Provide an explicit linkage of lexical categories to widely used ontologies (e.g., WordNet, BFO) to enhance semantic interoperability."],"model":"deepseek/deepseek-v4-flash","agent_version":"fair_agent_v2","fulltext_source":"epmc_xml"},"fair_model":"deepseek/deepseek-v4-flash","fair_agent_version":"fair_agent_v2","fair_fulltext_source":"epmc_xml","fair_has_llm":true,"fair_computed_at":"2026-06-18T00:44:58.182134Z","clinical_trials":[],"software_tools":[],"db_accessions":[],"linked_datasets":[],"topics":[]}