Contents
Platform overview
Cross-referencing methodology
Full-text search
IRS 990 nonprofit filings
DAF grant identification
Congress members
Congressional Record
Roll call votes
Congressional stock trades
Legislation
Campaign finance (FEC)
Federal Register
Regulations.gov
Code of Federal Regulations
Lobbying disclosures
Foreign agents (FARA)
Federal spending
General limitations
Independence statement
Corrections and feedback

Platform overview

DataDawn operates two databases covering 13 distinct public records datasets. All data is sourced exclusively from U.S. federal government agencies and published in its original form with no editorial filtering, scoring, or ranking applied.

990 Database (data.datadawn.org)

IRS nonprofit filings: 990, 990-EZ, 990-PF, and 990-T returns from tax years 2015–2025, plus foundation grants, DAF disbursements, and the IRS Business Master File. Approximately 5 million filings, 12.1 million grants, and 1.25 million DAF grants.

OpenRegs Database (regs.datadawn.org)

Congressional and regulatory data: members of Congress, floor speeches, roll call votes, stock trades, legislation, campaign finance (FEC), Federal Register documents, Regulations.gov dockets and comments, lobbying disclosures, FARA foreign agent registrations, federal spending, and the Code of Federal Regulations.

Both databases are powered by Datasette, providing interactive browsing, arbitrary SQL queries, JSON API access, and full data downloads in SQLite and CSV formats. No account, API key, or registration is required.
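As a rough illustration of the JSON API access described above, the sketch below builds a Datasette SQL query URL without sending any request. The /{database}.json?sql=... path and the _shape=array parameter follow Datasette's documented conventions. The database name "openregs" and the column names in the query are assumptions for illustration and may not match the deployed instance.

```python
from urllib.parse import urlencode


def datasette_query_url(base_url: str, database: str, sql: str) -> str:
    """Build a Datasette JSON API URL for an arbitrary read-only SQL query."""
    # _shape=array asks Datasette to return a plain JSON list of row objects.
    params = urlencode({"sql": sql, "_shape": "array"})
    return f"{base_url.rstrip('/')}/{database}.json?{params}"


# "openregs" is a hypothetical database name; check the site for the real one.
url = datasette_query_url(
    "https://regs.datadawn.org",
    "openregs",
    "select bioguide_id, party, state from congress_members limit 5",
)
print(url)
```

The resulting URL can be fetched with any HTTP client; no API key or header is needed according to the description above.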

Cross-referencing methodology

The distinctive feature of DataDawn's OpenRegs database is that datasets are linked through shared identifiers, enabling queries that span multiple data sources. The primary linkage key is bioguide_id, the Biographical Directory of the United States Congress identifier assigned to every member of Congress.

bioguide_id (universal member key)
│
├── congress_members: identity, party, state, terms
├── congressional_record: floor speeches (via crec_speakers)
├── member_votes: roll call voting record
├── stock_trades: financial disclosures
├── legislation: sponsored/cosponsored bills
└── fec_candidate_crosswalk → FEC contributions

docket_id (regulatory chain)
│
├── dockets: regulatory docket metadata
├── documents: rules, proposed rules, notices
└── comments: public comments

bill_id ({congress}-{type}-{number})
│
├── legislation: bill text and status
├── crec_bills: floor speech references
└── lobbying_activities: lobbying on specific bills

document_number (Federal Register)
│
└── fr_regs_crossref → dockets/documents

FEC-to-bioguide crosswalk

The Federal Election Commission uses its own candidate ID system, separate from the Congressional bioguide_id. DataDawn maintains a fec_candidate_crosswalk table with 1,711 verified mappings between FEC candidate IDs and bioguide IDs, enabling queries that link campaign contributions to legislators' voting records, stock trades, and floor statements.
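A minimal sketch of how such a crosswalk join might look, using an in-memory SQLite database with toy rows. The table names come from this page; every column name and all sample values are assumptions for illustration, not the published schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column names below are illustrative assumptions, not the published schema.
    CREATE TABLE congress_members (bioguide_id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE fec_candidate_crosswalk (fec_candidate_id TEXT, bioguide_id TEXT);
    CREATE TABLE fec_contributions (fec_candidate_id TEXT, committee TEXT, amount REAL);
    CREATE TABLE stock_trades (bioguide_id TEXT, ticker TEXT);

    INSERT INTO congress_members VALUES ('X000001', 'Example Member');
    INSERT INTO fec_candidate_crosswalk VALUES ('H0XX00001', 'X000001');
    INSERT INTO fec_contributions VALUES ('H0XX00001', 'Example PAC', 5000),
                                         ('H0XX00001', 'Other PAC', 2500),
                                         ('S9ZZ00009', 'Unmapped PAC', 1000);
    INSERT INTO stock_trades VALUES ('X000001', 'ABC'), ('X000001', 'XYZ');
""")

# Each aggregate is computed in its own correlated subquery so that joining
# contributions and trades together does not multiply rows.
rows = conn.execute("""
    SELECT m.name,
           (SELECT SUM(c.amount)
              FROM fec_contributions c
              JOIN fec_candidate_crosswalk x USING (fec_candidate_id)
             WHERE x.bioguide_id = m.bioguide_id) AS total_contributions,
           (SELECT COUNT(*)
              FROM stock_trades t
             WHERE t.bioguide_id = m.bioguide_id) AS trade_count
      FROM congress_members m
""").fetchall()
print(rows)
```

Contributions to an FEC candidate ID with no crosswalk entry (the "Unmapped PAC" row here) simply do not appear in the member-level total.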

Congressional Record speaker linkage

Floor speeches in the Congressional Record are linked to members via the crec_speakers table. 99.6% of speaker entries have a verified bioguide_id, enabling reliable attribution of floor statements to individual legislators.

Stock trade linkage

Congressional stock trade disclosures are linked to members' bioguide IDs through name and chamber matching. 85.5% of trades are currently linked. Unlinked trades are primarily from historical filings with ambiguous or variant name formatting.

Linkage transparency: All cross-reference rates are reported as-is. Where linkage is incomplete (e.g., 85.5% for stock trades), the unlinked records remain in the database and are queryable; they simply lack a bioguide_id join key. We do not impute or guess linkages.
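One way a user might check a linkage rate and inspect the unlinked rows is sketched below on toy data. It assumes unlinked records carry a NULL bioguide_id; the member_name column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stock_trades (
        id INTEGER PRIMARY KEY,
        member_name TEXT,      -- hypothetical column
        bioguide_id TEXT       -- assumed NULL when no link was established
    );
    INSERT INTO stock_trades (member_name, bioguide_id) VALUES
        ('Member A', 'A000001'),
        ('Member B', 'B000002'),
        ('Member A', 'A000001'),
        ('J. Variant', NULL);
""")

# COUNT(column) skips NULLs, so it counts only linked rows; COUNT(*) counts all.
linked, total = conn.execute(
    "SELECT COUNT(bioguide_id), COUNT(*) FROM stock_trades"
).fetchone()
linkage_rate = 100.0 * linked / total

# Unlinked rows stay queryable; they just cannot be joined on bioguide_id.
unlinked = conn.execute(
    "SELECT member_name FROM stock_trades WHERE bioguide_id IS NULL"
).fetchall()
print(linkage_rate, unlinked)
```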

Full-text search

DataDawn builds SQLite FTS5 (Full-Text Search) indexes on key text fields across both databases, enabling instant search across millions of records.

Dataset | Fields indexed
990 Returns | Organization name
Foundation Grants | Recipient name, purpose
DAF Grants | Recipient name, funder name
Federal Register | Title, abstract
Regulations.gov Dockets | Title
Regulations.gov Documents | Title
Regulations.gov Comments | Title, submitter name
Congressional Record | Full speech text
CFR Sections | Full regulatory text
Lobbying Filings | Filing descriptions
FARA Registrants | Registrant name, business
FARA Foreign Principals | Registrant, principal, country
FEC Employers | Employer name
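For readers unfamiliar with FTS5, the following sketch shows the general shape of a full-text index and MATCH query in SQLite. The virtual table name grants_fts, its columns, and the sample rows are assumptions; the actual index names in the DataDawn databases may differ. It requires an SQLite build that includes the FTS5 extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical FTS5 index over two text columns of a grants table.
conn.execute("CREATE VIRTUAL TABLE grants_fts USING fts5(recipient_name, purpose)")
conn.executemany(
    "INSERT INTO grants_fts (recipient_name, purpose) VALUES (?, ?)",
    [
        ("Riverbend Animal Shelter", "general operating support"),
        ("City Food Bank", "emergency food assistance"),
        ("Coastal Wildlife Trust", "shelter construction for rescued animals"),
    ],
)


def search(term: str) -> list[str]:
    """Return recipient names whose indexed text matches the FTS5 query."""
    rows = conn.execute(
        "SELECT recipient_name FROM grants_fts WHERE grants_fts MATCH ? ORDER BY rank",
        (term,),
    ).fetchall()
    return [r[0] for r in rows]


print(search("shelter"))        # matches in either indexed column
print(search("purpose:food"))   # column filter: match only in purpose
```

A bare term searches all indexed columns, while the column:term form restricts the match to one field.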

IRS 990 nonprofit filings

Data source

All data is extracted from IRS electronic filings (e-files) published on Amazon S3 at s3://irs-form-990/. This is the same public dataset used by ProPublica, GuideStar, and academic researchers. The IRS releases new batches periodically throughout the year, typically monthly.

DataDawn currently covers filings from tax years 2015 through 2025, with earlier years having sparser coverage due to lower e-filing adoption rates before the IRS mandate took effect. Coverage is most complete for tax years 2016 onward.

Form types parsed

Form | Filed by | What we extract
990 | Public charities (revenue > $200K or assets > $500K) | Revenue, expenses, assets, officers, contractors, program activities
990-EZ | Small public charities (revenue < $200K) | Revenue, expenses, assets, officers
990-PF | Private foundations | Revenue, assets, grants paid, officers, investments, contributors
990-T | Orgs with unrelated business income | Basic filing data

Database tables

Raw filings are parsed into 12 structured tables. No editorial filtering is applied: if the IRS published it, we parsed it.

Table | Records | Description
returns | ~5.0M | Core filing data: org name, EIN, state, revenue, expenses, assets, return type, tax year
grants | ~12.1M | Foundation grants from 990-PF filings: recipient, amount, purpose, date, location
schedule_i_grants | ~1.25M | Schedule I disbursements (DAF sponsors and public charities)
bmf | ~1.9M | IRS Business Master File: NTEE codes, ruling dates, asset codes
officers | varies | Officers, directors, trustees, and key employees with compensation
contractors | varies | Independent contractors receiving > $100K
contributors | varies | Contributors to private foundations (from 990-PF Schedule B)
top_employees | varies | Highest-compensated employees
investments | varies | Foundation investments (from 990-PF Part II)
program_investments | varies | Program-related investments (PRIs)
capital_gains | varies | Capital gains and losses from 990-PF
program_activities | varies | Program service accomplishments and expenses

Extraction pipeline

IRS e-files are XML documents following IRS-defined schemas that have evolved across filing years. DataDawn's extraction scripts handle schema variations across years, mapping different XML element paths to consistent database columns.
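The idea of mapping several year-specific element paths to one column can be sketched as follows. This is not DataDawn's extraction code: the element names are illustrative stand-ins for schema variants, and the sample documents omit the XML namespace that real e-files declare.

```python
import xml.etree.ElementTree as ET

# Candidate element names for one logical field, tried in order.
# These names are illustrative assumptions, not a verified schema mapping.
REVENUE_TAGS = ["CYTotalRevenueAmt", "TotalRevenueCurrentYear"]


def extract_total_revenue(xml_text: str):
    """Return total revenue as an int, or None if no known element is present."""
    root = ET.fromstring(xml_text)
    for tag in REVENUE_TAGS:
        node = root.find(f".//{tag}")
        if node is not None and node.text:
            return int(node.text)
    return None


newer = "<Return><IRS990><CYTotalRevenueAmt>125000</CYTotalRevenueAmt></IRS990></Return>"
older = "<Return><IRS990><TotalRevenueCurrentYear>98000</TotalRevenueCurrentYear></IRS990></Return>"
print(extract_total_revenue(newer), extract_total_revenue(older))
```

Keeping the path list in one place makes it straightforward to add another variant when a new schema year appears.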

  1. Download: New XML batches are synced from the IRS S3 bucket. Batch completion is tracked with marker files to prevent reprocessing.
  2. Parse: Three extraction scripts process 990/990-EZ returns, 990-PF detail filings (grants, investments, contributors), and Schedule I grants respectively.
  3. Deduplicate: Filings are keyed on a combination of EIN and object ID to prevent duplicate insertion from overlapping IRS releases.
  4. Index: Full-text search indexes (SQLite FTS5) are built on organization names and grant recipient names for instant search.
  5. Publish: The public database is built from an allowlist of raw data tables. No analysis or curated tables are included in the public release.
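The deduplication step (step 3) can be approximated with a composite primary key and INSERT OR IGNORE, as in the sketch below. This is one plausible way to implement it, not necessarily how DataDawn's scripts do; the column names and sample filings are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE returns (
        ein TEXT NOT NULL,
        object_id TEXT NOT NULL,
        org_name TEXT,
        tax_year INTEGER,
        PRIMARY KEY (ein, object_id)   -- composite key blocks duplicate filings
    )
""")


def load_batch(filings) -> int:
    """Insert filings, skipping any (ein, object_id) already present.

    Returns the number of rows actually added.
    """
    before = conn.execute("SELECT COUNT(*) FROM returns").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO returns VALUES (:ein, :object_id, :org_name, :tax_year)",
        filings,
    )
    after = conn.execute("SELECT COUNT(*) FROM returns").fetchone()[0]
    return after - before


batch_1 = [
    {"ein": "000000001", "object_id": "obj-A", "org_name": "Example Org", "tax_year": 2021},
    {"ein": "000000002", "object_id": "obj-B", "org_name": "Other Org", "tax_year": 2021},
]
# A later release that overlaps with batch_1 and adds one new filing.
batch_2 = [
    {"ein": "000000001", "object_id": "obj-A", "org_name": "Example Org", "tax_year": 2021},
    {"ein": "000000001", "object_id": "obj-C", "org_name": "Example Org", "tax_year": 2022},
]
added_1 = load_batch(batch_1)
added_2 = load_batch(batch_2)
print(added_1, added_2)
```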

Known limitations: 990 data

E-file only

DataDawn only includes electronically filed returns. Paper filings (roughly one-third of all 990s) are not included. E-filing rates have increased over time, so recent years have better coverage than earlier years.

Filing lag

Organizations file 990s after their fiscal year ends, and the IRS publishes e-files on a rolling basis. The most recent tax year will always have incomplete data.

Sparse early years

Tax year 2015, the earliest year covered, has limited coverage because the IRS e-filing mandate was not yet in full effect. Coverage is most reliable from 2016 onward.

Grant dates

Foundation grant dates come from the filer's reported grant date field. Some foundations report the approval date, others the payment date, and some leave it blank. Year-level analysis is more reliable than month-level.

Name matching

Organization names are as reported on the filing. The same organization may appear under slightly different names across years. DataDawn does not perform entity resolution; search results should be verified by checking the EIN.
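A simple check for name variation is to group name-search results by EIN, as sketched here on invented rows. The column names are assumptions based on the returns table description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE returns (ein TEXT, org_name TEXT, tax_year INTEGER);
    INSERT INTO returns VALUES
        ('000000001', 'EXAMPLE ANIMAL SOCIETY', 2019),
        ('000000001', 'Example Animal Society Inc', 2020),
        ('000000001', 'EXAMPLE ANIMAL SOCIETY', 2021),
        ('000000002', 'Example Animal Society', 2020);
""")

# LIKE is case-insensitive for ASCII text in SQLite by default.
# Grouping by EIN shows which hits are the same filer under different names
# and which are a different organization with a similar name.
rows = conn.execute("""
    SELECT ein,
           COUNT(DISTINCT org_name) AS name_variants,
           COUNT(*) AS filings
      FROM returns
     WHERE org_name LIKE '%animal society%'
     GROUP BY ein
     ORDER BY ein
""").fetchall()
print(rows)
```

Here the first EIN filed under two spellings across three years, while the second EIN is a separate filer that a name-only search would have merged with it.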

Amount discrepancies

Financial figures reflect what was reported on the filing. Amended returns may not overwrite original filings. In rare cases, both an original and amended filing for the same tax year may appear.
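To spot cases where more than one filing exists for the same organization and tax year, a grouped query along these lines may help. The schema and values are illustrative assumptions, and the query cannot tell which filing is the amendment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE returns (ein TEXT, tax_year INTEGER, object_id TEXT, total_revenue INTEGER);
    INSERT INTO returns VALUES
        ('000000001', 2021, 'obj-A', 100000),   -- possibly the original
        ('000000001', 2021, 'obj-B', 120000),   -- possibly the amended return
        ('000000002', 2021, 'obj-C', 50000);
""")

# Flag EIN/tax-year pairs with multiple filings and show the reported range.
rows = conn.execute("""
    SELECT ein, tax_year, COUNT(*) AS filings,
           MIN(total_revenue) AS min_revenue,
           MAX(total_revenue) AS max_revenue
      FROM returns
     GROUP BY ein, tax_year
    HAVING COUNT(*) > 1
""").fetchall()
print(rows)
```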

DAF grant identification

Donor-advised fund (DAF) disbursements are extracted from Schedule I of 990 filings submitted by DAF sponsor organizations. DataDawn identifies and parses grants from major DAF sponsors including Vanguard Charitable, Fidelity Charitable, Schwab Charitable, National Philanthropic Trust, Silicon Valley Community Foundation, and others.

These are grants made by DAF sponsors to recipient nonprofits. They do not identify the individual donors who recommended the grants; that information is not available in any public filing.

Why this matters: DAF grants represent a large and growing share of philanthropic funding, but because they flow through intermediary sponsors, they are difficult to trace using traditional 990-PF data alone. Combining 990-PF grants with Schedule I DAF data provides a more complete picture of institutional funding flows.
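Combining the two grant sources for a single recipient could look roughly like the sketch below. The table names are taken from this page, but the columns and rows are assumptions. The example matches on recipient name for brevity; a stable identifier such as an EIN would be more reliable where one is available.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, assumed columns for both grant tables.
    CREATE TABLE grants (funder_name TEXT, recipient_name TEXT, amount INTEGER);
    CREATE TABLE schedule_i_grants (funder_name TEXT, recipient_name TEXT, amount INTEGER);
    INSERT INTO grants VALUES ('Example Family Foundation', 'City Food Bank', 50000);
    INSERT INTO schedule_i_grants VALUES
        ('Example DAF Sponsor', 'City Food Bank', 20000),
        ('Example DAF Sponsor', 'Other Org', 10000);
""")

# UNION ALL stacks both sources; the literal "source" column records the origin.
rows = conn.execute("""
    SELECT source, funder_name, amount
      FROM (
        SELECT '990-PF' AS source, funder_name, recipient_name, amount FROM grants
        UNION ALL
        SELECT 'Schedule I', funder_name, recipient_name, amount FROM schedule_i_grants
      )
     WHERE recipient_name = 'City Food Bank'
     ORDER BY amount DESC
""").fetchall()
print(rows)
```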

Congress members

The congress_members table is the universal identity table for the OpenRegs database. It contains 12,763 members of Congress, both historical and current, sourced from the Congress.gov API and the Biographical Directory of the United States Congress.

Each member is identified by their bioguide_id, which serves as the primary join key across all member-related datasets. The table includes name, party, state, chamber, number of terms served, and service dates.

A precomputed member_stats table provides aggregate counts per member (total trades, speeches, bills sponsored, votes cast) for quick summary views. Committee assignments (3,908 current assignments across 233 committees and subcommittees) are maintained in separate committees and committee_memberships tables with leadership title indicators.

Congressional Record

Data source

Floor proceedings from the Congressional Record, sourced from the Government Publishing Office (GPO) via govinfo.gov bulk data. Coverage spans 1994 to present, encompassing speeches, debates, remarks, and other floor proceedings from both chambers.

Tables

Table | Records | Description
congressional_record | 878,583 | Floor proceedings with full text, date, chamber, section
crec_speakers | 944,216 | Speaker-to-speech linkage, 99.6% with bioguide_id
crec_bills | 1,560,000 | Bill references extracted from floor proceedings

Known limitations

The Congressional Record is not a verbatim transcript. Members may revise and extend their remarks after delivery. The "Extensions of Remarks" section includes statements that were not delivered orally on the floor. Speaker attribution relies on GPO markup, which occasionally misattributes statements in colloquy or debate.

Roll call votes

Data source

Roll call voting data from both chambers, sourced from the Congress.gov API and official House/Senate clerk records.

Tables

Table | Records | Description
roll_call_votes | 26,359 | Vote metadata: question, result, date, congress, chamber
member_votes | 8,300,000 | Individual vote records: Yea, Nay, Present, Not Voting per member per vote

Known limitations

Roll call votes capture only recorded votes, not voice votes or unanimous consent agreements. Many legislative actions proceed without a recorded vote. "Not Voting" may indicate absence, abstention, or recusal; the data does not distinguish between these.
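A per-member tally of vote positions might be computed as in this sketch. The vote_id and position column names and the sample rows are assumptions. Note that "Not Voting" is counted as a single category, consistent with the limitation described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE member_votes (vote_id INTEGER, bioguide_id TEXT, position TEXT);
    INSERT INTO member_votes VALUES
        (1, 'X000001', 'Yea'),
        (2, 'X000001', 'Nay'),
        (3, 'X000001', 'Not Voting'),
        (4, 'X000001', 'Yea'),
        (1, 'Y000002', 'Nay');
""")

# Count how often one member took each recorded position.
tally = conn.execute("""
    SELECT position, COUNT(*) AS votes
      FROM member_votes
     WHERE bioguide_id = ?
     GROUP BY position
     ORDER BY position
""", ("X000001",)).fetchall()
print(tally)
```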

Congressional stock trades

Data source

Financial disclosure data from both chambers: House Periodic Transaction Reports (PTRs, parsed from PDF filings) and Senate electronic Financial Disclosures (eFD, scraped from the Senate disclosure website).

Coverage

95,621 transactions are currently in the database. 85.5% are linked to a bioguide_id. Trades include ticker symbol, transaction date, transaction type (purchase/sale/exchange), amount range, and source (House PTR or Senate eFD).

Known limitations

Disclosure amounts are reported in ranges (e.g., $1,001–$15,000), not exact figures. Filing deadlines allow up to 45 days after a transaction, and extensions are common. Trades by spouses and dependent children are included in disclosures but may not always be clearly distinguished from the member's own trades. The 14.5% of trades lacking bioguide linkage are available in the database but cannot be joined to other member-level datasets.
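Because amounts are ranges, analysis usually needs explicit lower and upper bounds. The helper below is a sketch of one way to parse a range string. The exact text format stored in the database is an assumption, as is the handling of open-ended values such as "Over $50,000,000".

```python
import re


def parse_amount_range(text: str):
    """Return (low, high) in dollars for a range like '$1,001 - $15,000'.

    high is None when only one dollar figure is present (open-ended range).
    Returns None if no dollar figure is found.
    """
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\$\s*(\d[\d,]*)", text)]
    if len(numbers) >= 2:
        return numbers[0], numbers[1]
    if len(numbers) == 1:
        return numbers[0], None
    return None


print(parse_amount_range("$1,001–$15,000"))
print(parse_amount_range("Over $50,000,000"))
```

Summing the low and high bounds separately gives a defensible minimum and maximum for a set of trades; a single midpoint figure would imply precision the disclosures do not have.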

Legislation

Data source

Bill data from the Congress.gov API covering Congresses 108–119 (2003–present).

Tables

Table | Records | Description
legislation | 167,507 | Bills with title, sponsor, policy area, latest action, status
legislation_cosponsors | 2,070,000 | Cosponsor records with bioguide_id linkage
legislation_actions | 1,100,000 | Action steps: introduced, referred, passed, signed
legislation_subjects | 1,500,000 | Subject tags assigned by the Congressional Research Service

Bills are identified by a composite bill_id in the format {congress}-{type}-{number} (e.g., 118-hr-1234), which links to floor speech references in crec_bills and lobbying activity records in lobbying_activities.
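The composite key format is simple enough to build and parse with a few lines of code. The helper names below are hypothetical, and lowercasing the bill type is an assumption based on the 118-hr-1234 example.

```python
from typing import NamedTuple


class BillId(NamedTuple):
    congress: int
    bill_type: str
    number: int


def parse_bill_id(bill_id: str) -> BillId:
    """Split a '{congress}-{type}-{number}' key into typed parts."""
    congress, bill_type, number = bill_id.split("-")
    return BillId(int(congress), bill_type.lower(), int(number))


def format_bill_id(congress: int, bill_type: str, number: int) -> str:
    """Build the composite key; the bill type is lowercased (assumed convention)."""
    return f"{congress}-{bill_type.lower()}-{number}"


parsed = parse_bill_id("118-hr-1234")
print(parsed, format_bill_id(118, "HR", 1234))
```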

Campaign finance (FEC)

Data source

Federal Election Commission bulk data files covering candidates, committees, and contributions.

Tables

Table | Records | Description
fec_candidates | 64,700 | FEC-registered candidates
fec_committees | 155,000 | PACs, party committees, campaign committees
fec_contributions | 4,400,000 | PAC/committee-to-candidate contributions
fec_candidate_crosswalk | 1,711 | Verified FEC candidate ID to bioguide_id mappings

Employer-aggregated donations

A separate fec_employers database aggregates individual contributions by employer name, with zero personally identifiable information. This enables queries like "which employers' employees donated most to members of a specific committee" without exposing individual donor records.
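The general technique of aggregating away personal identifiers can be sketched as follows. The source table, its columns, and the UPPER/TRIM name normalization are all assumptions for illustration; the page does not describe how employer names are actually normalized.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical local table of individual contributions (contains PII).
    CREATE TABLE individual_contributions (
        donor_name TEXT, employer TEXT, committee_id TEXT, amount INTEGER
    );
    INSERT INTO individual_contributions VALUES
        ('Donor One',   'Acme Corp',  'C001', 500),
        ('Donor Two',   'ACME CORP ', 'C001', 250),
        ('Donor Three', 'Widget LLC', 'C001', 100);

    -- Published aggregate: donor_name is never selected, so it cannot leak.
    CREATE TABLE fec_employers AS
        SELECT UPPER(TRIM(employer)) AS employer,
               committee_id,
               COUNT(*) AS contribution_count,
               SUM(amount) AS total_amount
          FROM individual_contributions
         GROUP BY UPPER(TRIM(employer)), committee_id;
""")

rows = conn.execute(
    "SELECT * FROM fec_employers ORDER BY total_amount DESC"
).fetchall()
columns = [r[1] for r in conn.execute("PRAGMA table_info(fec_employers)")]
print(columns, rows)
```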

Known limitations

The full FEC individual contributions file (104M records, 49GB) is processed locally but is not deployed to the public database due to size and PII considerations. Only the employer-aggregated and committee-level contribution data are published. FEC data has its own filing lag: contributions may not appear for weeks or months after they are made.

Federal Register

Data source

The Federal Register API (federalregister.gov/api), which provides structured data for every document published in the Federal Register.

Tables

Table | Records | Description
federal_register | 993,703 | Rules, proposed rules, notices, presidential documents with title, abstract, dates, PDF/HTML URLs
federal_register_agencies | 1,500,000 | Agency tags (many documents have multiple agencies)
presidential_documents | 5,904 | Executive orders, proclamations, memoranda
fr_regs_crossref | varies | Links Federal Register document numbers to Regulations.gov dockets

Regulations.gov

Data source

The Regulations.gov API (api.regulations.gov), the federal government's public comment and rulemaking system.

Tables

Table | Records | Description
dockets | 86,706 | Regulatory dockets from EPA, FDA, USDA, FWS, APHIS, DOT, DOE, HHS, DOL, and others
documents | 727,510 | Regulatory documents: rules, proposed rules, notices, supporting materials
comments | 3,677,962 | Public comment headers: submitter, date, agency, docket
comment_details | 36,191 | Full-text comment bodies (organizational comments, growing via ongoing download)

Known limitations

The Regulations.gov API has strict rate limits. Full-text comment bodies (comment_details) are being downloaded incrementally using a dual-key approach and currently cover a fraction of total comments, prioritizing organizational submissions. Comment header data (submitter name, date, docket) is complete for all 3.7M comments. Some agencies do not publish all comments through Regulations.gov.

Code of Federal Regulations

Data source

Bulk XML downloads from the Electronic Code of Federal Regulations (eCFR) at ecfr.gov.

Coverage

123,480 regulatory sections from five key CFR titles relevant to environmental, agricultural, and public health regulation: Agriculture (Title 7), Animals and Animal Products (Title 9), Food and Drugs (Title 21), Protection of Environment (Title 40), and Wildlife and Fisheries (Title 50). Full regulatory text is indexed for full-text search.

Known limitations

Only five of 50 CFR titles are currently included. The CFR is updated continuously as agencies publish final rules; the DataDawn snapshot reflects the eCFR as of the most recent bulk download. Regulations that have been proposed but not finalized are not included in the CFR data (they appear in the Federal Register).

Lobbying disclosures

Data source

Senate Lobbying Disclosure Act (LDA) filings, downloaded from the Senate Office of Public Records bulk data system.

Tables

Table | Records | Description
lobbying_filings | 1,170,000 | Disclosure filings: client, registrant, income/expenses, year
lobbying_activities | 2,080,000 | Activity records: issue codes, descriptions, specific bills lobbied
lobbying_lobbyists | 2,720,000 | Lobbyist entries, many with covered_position (revolving door indicator)
lobbying_issue_codes | 79 | Standard issue category codes

Revolving door

The covered_position field in lobbyist records identifies individuals who previously held government positions (the "revolving door" between government service and lobbying). This field is self-reported by the registrant.

Known limitations

Data currently covers 1999–2017 with ongoing download of 2018+ (approximately 85% complete). LDA filings are self-reported by registrants and are not independently audited. Income and expense figures are reported in ranges on some filing types. The lobbying_activities table links to specific bill numbers when reported, but lobbyists are not required to list every bill they lobby on.

Foreign agents (FARA)

Data source

Foreign Agents Registration Act data from the Department of Justice FARA database at fara.gov.

Tables

Table | Records | Description
fara_registrants | 7,035 | Registered foreign agents (firms and individuals)
fara_foreign_principals | 17,627 | Foreign government and entity clients
fara_short_forms | 44,363 | Individual agents working under registrations
fara_registrant_docs | 151,348 | Filed documents with PDF links

Known limitations

FARA registration is self-reported and enforcement has historically been limited. The DOJ has acknowledged that compliance rates are uncertain. Some entities that may be required to register under FARA instead register under the LDA, which has less stringent disclosure requirements. Cross-referencing FARA registrants with lobbying filings (by firm name) can reveal some of these overlaps but is not definitive.

Federal spending

Data source

USAspending.gov bulk award data covering grants, contracts, and other federal awards across 20 agencies.

Coverage

863,632 awards including recipient name, award amount, funding agency, award type, and date ranges. Linkable to agencies referenced in Federal Register documents and lobbying filings.

Known limitations

USAspending.gov data has known reporting quality issues acknowledged by the government itself. Not all agencies report at the same level of detail or timeliness. Sub-award data is not currently included. The 20-agency scope covers the most active federal funders but is not comprehensive across all federal agencies.

General limitations

Data as reported

DataDawn publishes data as reported in source filings and government databases. We do not correct, impute, or editorialize. Errors in source filings propagate to our database. Where we are aware of systematic data quality issues, they are documented in the dataset-specific sections above.

No entity resolution

The same real-world entity may appear under different names across datasets (e.g., "ASPCA" vs "American Society for the Prevention of Cruelty to Animals" in 990 data, or variant name spellings across FEC and Congressional records). DataDawn does not perform automated entity resolution. Users should verify matches using stable identifiers like EIN, bioguide_id, or FEC candidate ID.

Point-in-time snapshots

Each dataset reflects the state of its source at the time of DataDawn's most recent extraction. Government agencies update their data on different schedules. The database is not a real-time feed.

Correlation is not causation

Cross-referencing datasets enables powerful queries (e.g., stock trades within 30 days of floor speeches on related topics), but temporal or thematic proximity does not establish a causal or improper relationship. DataDawn provides the data; interpretation is the user's responsibility.
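A query of the kind mentioned above (trades within 30 days of a floor speech by the same member) might be written as in this sketch. The table names come from this page; the column names and rows are assumptions, and the example does not attempt any topic matching. As the paragraph above stresses, a match indicates proximity in time only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed, simplified columns; dates stored as ISO 8601 text.
    CREATE TABLE congressional_record (id INTEGER PRIMARY KEY, date TEXT, title TEXT);
    CREATE TABLE crec_speakers (bioguide_id TEXT, speech_id INTEGER);
    CREATE TABLE stock_trades (bioguide_id TEXT, ticker TEXT, transaction_date TEXT);

    INSERT INTO congressional_record VALUES (1, '2023-03-01', 'Remarks on energy policy');
    INSERT INTO crec_speakers VALUES ('X000001', 1);
    INSERT INTO stock_trades VALUES
        ('X000001', 'ABC', '2023-03-15'),   -- 14 days after the speech
        ('X000001', 'XYZ', '2023-06-01'),   -- 92 days after: outside the window
        ('Y000002', 'ABC', '2023-03-02');   -- different member: no join match
""")

# julianday() converts ISO dates to day numbers so they can be subtracted.
rows = conn.execute("""
    SELECT t.ticker, t.transaction_date, r.date AS speech_date,
           CAST(julianday(t.transaction_date) - julianday(r.date) AS INTEGER) AS days_apart
      FROM stock_trades t
      JOIN crec_speakers s ON s.bioguide_id = t.bioguide_id
      JOIN congressional_record r ON r.id = s.speech_id
     WHERE ABS(julianday(t.transaction_date) - julianday(r.date)) <= 30
""").fetchall()
print(rows)
```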

Update schedule

The 990 database updates as the IRS publishes new e-file batches, typically monthly. OpenRegs datasets are updated on varying schedules depending on source API availability and data volume. The current databases were built in March 2026 from all available source data as of that date.

Independence statement

DataDawn is an independent project with no institutional affiliations. It receives no funding from any nonprofit, foundation, government agency, or organization represented in its datasets. All data is sourced exclusively from public records filed with federal government agencies.

DataDawn does not endorse, evaluate, or rank any organization, legislator, or entity. The platform provides raw data and search tools. Interpretation and analysis are the responsibility of the user.

All source code, extraction pipelines, and database schemas are published on GitHub under a CC0 1.0 Universal (public domain) license.

Corrections and feedback

If you find a data quality issue, parsing error, or have questions about the methodology, you can reach DataDawn at info@datadawn.org.