Platform overview
DataDawn operates two databases covering 13 distinct public records datasets. All data is sourced exclusively from U.S. federal government agencies and published in its original form with no editorial filtering, scoring, or ranking applied.
990 Database (data.datadawn.org)
IRS nonprofit filings: 990, 990-EZ, 990-PF, and 990-T returns from tax years 2015–2025, plus foundation grants, DAF disbursements, and the IRS Business Master File. Approximately 5 million filings, 12.1 million grants, and 1.25 million DAF grants.
OpenRegs Database (regs.datadawn.org)
Congressional and regulatory data: members of Congress, floor speeches, roll call votes, stock trades, legislation, campaign finance (FEC), Federal Register documents, Regulations.gov dockets and comments, lobbying disclosures, FARA foreign agent registrations, federal spending, and the Code of Federal Regulations.
Both databases are powered by Datasette, providing interactive browsing, arbitrary SQL queries, JSON API access, and full data downloads in SQLite and CSV formats. No account, API key, or registration is required.
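As a rough illustration of the JSON API access described above, the sketch below builds a query URL using standard Datasette conventions (`/<database>.json?sql=...`, `_shape=array`, and named SQL parameters passed as query-string arguments). The database name `openregs` in the path and the `name` and `state` column names are assumptions, not confirmed by this page; check the live site for the actual names. No network request is made.

```python
from urllib.parse import urlencode

def datasette_query_url(base_url: str, database: str, sql: str, **params: str) -> str:
    """Build a Datasette JSON API URL for a read-only SQL query."""
    query = {"sql": sql, "_shape": "array"}
    query.update(params)  # named SQL parameters, e.g. state="CA" fills :state
    return f"{base_url.rstrip('/')}/{database}.json?{urlencode(query)}"

url = datasette_query_url(
    "https://regs.datadawn.org",
    "openregs",  # assumed database name
    "select bioguide_id, name from congress_members where state = :state limit 5",
    state="CA",
)
print(url)
```

Fetching that URL with any HTTP client should return a JSON array of row objects if the assumed names match the deployed schema.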
Cross-referencing methodology
The distinctive feature of DataDawn's OpenRegs database is that datasets are linked
through shared identifiers, enabling queries that span multiple data sources.
The primary linkage key is bioguide_id, the Biographical Directory of
the United States Congress identifier assigned to every member of Congress.
FEC-to-bioguide crosswalk
The Federal Election Commission uses its own candidate ID system, separate from
the Congressional bioguide_id. DataDawn maintains a fec_candidate_crosswalk
table with 1,711 verified mappings between FEC candidate IDs
and bioguide IDs, enabling queries that link campaign contributions to legislators'
voting records, stock trades, and floor statements.
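A minimal sketch of how the crosswalk could be used in a join, run against a throwaway in-memory SQLite database. The table names come from this page, but every column name and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE congress_members (bioguide_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE fec_candidate_crosswalk (fec_candidate_id TEXT, bioguide_id TEXT);
CREATE TABLE fec_contributions (fec_candidate_id TEXT, committee_id TEXT, amount REAL);
INSERT INTO congress_members VALUES ('X000001', 'Example Member');
INSERT INTO fec_candidate_crosswalk VALUES ('H0XX00001', 'X000001');
INSERT INTO fec_contributions VALUES
  ('H0XX00001', 'C00000001', 2500),
  ('H0XX00001', 'C00000002', 1000),
  ('H0XX99999', 'C00000001', 5000);  -- candidate with no crosswalk entry
""")

# Contributions -> crosswalk -> members: only mapped candidates survive the join.
rows = conn.execute("""
SELECT m.bioguide_id, m.name, SUM(c.amount) AS total_received
FROM fec_contributions AS c
JOIN fec_candidate_crosswalk AS x ON x.fec_candidate_id = c.fec_candidate_id
JOIN congress_members AS m ON m.bioguide_id = x.bioguide_id
GROUP BY m.bioguide_id, m.name
""").fetchall()
print(rows)
```

Because this is an inner join, contributions to candidates absent from the crosswalk drop out of the result. Switch to a LEFT JOIN from `fec_contributions` to keep them visible.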
Congressional Record speaker linkage
Floor speeches in the Congressional Record are linked to members via the
crec_speakers table. 99.6% of speaker entries
have a verified bioguide_id, enabling reliable attribution of floor statements
to individual legislators.
Stock trade linkage
Congressional stock trade disclosures are linked to members' bioguide IDs through name and chamber matching. 85.5% of trades are currently linked. Unlinked trades are primarily from historical filings with ambiguous or variant name formatting.
Linkage transparency: All cross-reference rates are reported as-is. Where linkage is incomplete (e.g., 85.5% for stock trades), the unlinked records remain in the database and are queryable; they simply lack a bioguide_id join key. We do not impute or guess linkages.
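One way a linkage rate like this could be checked is sketched below. It assumes unlinked rows carry a NULL `bioguide_id`; the `stock_trades` table name, its columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stock_trades (id INTEGER PRIMARY KEY, member_name TEXT, bioguide_id TEXT);
INSERT INTO stock_trades (member_name, bioguide_id) VALUES
  ('Example Member', 'X000001'),
  ('Example Member', 'X000001'),
  ('Hon. E. Member', NULL),      -- variant name formatting, left unlinked
  ('Other Member', 'X000002');
""")

# COUNT(column) skips NULLs, COUNT(*) does not, so the ratio is the linkage rate.
linked, total = conn.execute(
    "SELECT COUNT(bioguide_id), COUNT(*) FROM stock_trades"
).fetchone()
linkage_rate = 100.0 * linked / total

# Unlinked rows remain queryable on their own columns.
unlinked = conn.execute(
    "SELECT member_name FROM stock_trades WHERE bioguide_id IS NULL"
).fetchall()
print(f"{linkage_rate:.1f}% linked; unlinked: {unlinked}")
```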
Full-text search
DataDawn builds SQLite FTS5 (Full-Text Search) indexes on key text fields across both databases, enabling instant search across millions of records.
| Dataset | Fields indexed |
|---|---|
| 990 Returns | Organization name |
| Foundation Grants | Recipient name, purpose |
| DAF Grants | Recipient name, funder name |
| Federal Register | Title, abstract |
| Regulations.gov Dockets | Title |
| Regulations.gov Documents | Title |
| Regulations.gov Comments | Title, submitter name |
| Congressional Record | Full speech text |
| CFR Sections | Full regulatory text |
| Lobbying Filings | Filing descriptions |
| FARA Registrants | Registrant name, business |
| FARA Foreign Principals | Registrant, principal, country |
| FEC Employers | Employer name |
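The sketch below shows how an FTS5 index over two of the fields listed above might be built and queried. The `_fts` suffix follows a common Datasette convention and is an assumption here, as are the column names and sample rows. It requires a Python build whose bundled SQLite includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE federal_register (id INTEGER PRIMARY KEY, title TEXT, abstract TEXT);
INSERT INTO federal_register (title, abstract) VALUES
  ('Pesticide Tolerances', 'Establishes tolerances for residues in food.'),
  ('Airworthiness Directives', 'Adopts a new directive for certain airplanes.');

-- External-content FTS5 table: the index stores tokens, the base table stores text.
CREATE VIRTUAL TABLE federal_register_fts USING fts5(
  title, abstract, content='federal_register', content_rowid='id'
);
INSERT INTO federal_register_fts (rowid, title, abstract)
  SELECT id, title, abstract FROM federal_register;
""")

# MATCH runs against the index; joining on rowid recovers the base-table row.
hits = conn.execute("""
SELECT fr.id, fr.title
FROM federal_register_fts
JOIN federal_register AS fr ON fr.id = federal_register_fts.rowid
WHERE federal_register_fts MATCH ?
ORDER BY rank
""", ("pesticide OR residues",)).fetchall()
print(hits)
```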
IRS 990 nonprofit filings
Data source
All data is extracted from IRS electronic filings (e-files) published on
Amazon S3 at s3://irs-form-990/. This is the same public dataset
used by ProPublica, GuideStar, and academic researchers. The IRS releases new
batches periodically throughout the year, typically monthly.
DataDawn currently covers filings from tax years 2015 through 2025, with earlier years having sparser coverage due to lower e-filing adoption rates before the IRS mandate took effect. Coverage is most complete for tax years 2016 onward.
Form types parsed
| Form | Filed by | What we extract |
|---|---|---|
| 990 | Public charities (revenue > $200K or assets > $500K) | Revenue, expenses, assets, officers, contractors, program activities |
| 990-EZ | Small public charities (revenue < $200K and assets < $500K) | Revenue, expenses, assets, officers |
| 990-PF | Private foundations | Revenue, assets, grants paid, officers, investments, contributors |
| 990-T | Orgs with unrelated business income | Basic filing data |
Database tables
Raw filings are parsed into 12 structured tables. No editorial filtering is applied: if the IRS published it, we parsed it.
| Table | Records | Description |
|---|---|---|
| returns | ~5.0M | Core filing data: org name, EIN, state, revenue, expenses, assets, return type, tax year |
| grants | ~12.1M | Foundation grants from 990-PF filings: recipient, amount, purpose, date, location |
| schedule_i_grants | ~1.25M | Schedule I disbursements (DAF sponsors and public charities) |
| bmf | ~1.9M | IRS Business Master File: NTEE codes, ruling dates, asset codes |
| officers | varies | Officers, directors, trustees, and key employees with compensation |
| contractors | varies | Independent contractors receiving > $100K |
| contributors | varies | Contributors to private foundations (from 990-PF Schedule B) |
| top_employees | varies | Highest-compensated employees |
| investments | varies | Foundation investments (from 990-PF Part II) |
| program_investments | varies | Program-related investments (PRIs) |
| capital_gains | varies | Capital gains and losses from 990-PF |
| program_activities | varies | Program service accomplishments and expenses |
Extraction pipeline
IRS e-files are XML documents following IRS-defined schemas that have evolved across filing years. DataDawn's extraction scripts handle schema variations across years, mapping different XML element paths to consistent database columns.
- Download: New XML batches are synced from the IRS S3 bucket. Batch completion is tracked with marker files to prevent reprocessing.
- Parse: Three extraction scripts process 990/990-EZ returns, 990-PF detail filings (grants, investments, contributors), and Schedule I grants respectively.
- Deduplicate: Filings are keyed on a combination of EIN and object ID to prevent duplicate insertion from overlapping IRS releases.
- Index: Full-text search indexes (SQLite FTS5) are built on organization names and grant recipient names for instant search.
- Publish: The public database is built from an allowlist of raw data tables. No analysis or curated tables are included in the public release.
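The parse and deduplicate steps above can be sketched roughly as follows. This is not DataDawn's actual extraction code: the element paths, column names, and object IDs are illustrative, and XML namespaces are omitted for brevity. The idea is to try several element paths per column and let a composite primary key on (EIN, object ID) reject repeats.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Illustrative element paths: the same value may sit under different
# element names depending on the schema year. Namespaces are omitted here.
REVENUE_PATHS = [
    "ReturnData/IRS990/CYTotalRevenueAmt",
    "ReturnData/IRS990/TotalRevenueCurrentYear",
]

def first_text(root, paths):
    """Return the text of the first path that exists, else None."""
    for path in paths:
        node = root.find(path)
        if node is not None and node.text:
            return node.text.strip()
    return None

def parse_return(xml_text, object_id):
    """Map one e-file XML document to a row with consistent column names."""
    root = ET.fromstring(xml_text)
    revenue = first_text(root, REVENUE_PATHS)
    return {
        "ein": first_text(root, ["ReturnHeader/Filer/EIN"]),
        "object_id": object_id,
        "total_revenue": int(revenue) if revenue is not None else None,
    }

def insert_return(conn, row):
    """Insert a row unless (ein, object_id) is already present."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO returns VALUES (:ein, :object_id, :total_revenue)",
        row,
    )
    return cur.rowcount == 1  # False when the duplicate was skipped

def sample_xml(revenue_tag, amount):
    """Build a tiny stand-in document for demonstration purposes."""
    return (
        "<Return><ReturnHeader><Filer><EIN>123456789</EIN></Filer></ReturnHeader>"
        f"<ReturnData><IRS990><{revenue_tag}>{amount}</{revenue_tag}></IRS990>"
        "</ReturnData></Return>"
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE returns (ein TEXT, object_id TEXT, total_revenue INTEGER,"
    " PRIMARY KEY (ein, object_id))"
)

newer = sample_xml("CYTotalRevenueAmt", 500000)
older = sample_xml("TotalRevenueCurrentYear", 420000)
inserted = [
    insert_return(conn, parse_return(newer, "obj-2")),
    insert_return(conn, parse_return(newer, "obj-2")),  # overlapping release
    insert_return(conn, parse_return(older, "obj-1")),
]
print(inserted)
```

Keeping the path variants in a single list per column means a new schema year only needs one more entry in that list.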
Known limitations: 990 data
E-file only
DataDawn only includes electronically filed returns. Paper filings (roughly one-third of all 990s) are not included. E-filing rates have increased over time, so recent years have better coverage than earlier years.
Filing lag
Organizations file 990s after their fiscal year ends, and the IRS publishes e-files on a rolling basis. The most recent tax year will always have incomplete data.
Sparse early years
Tax years 2014–2015 have limited coverage because the IRS e-filing mandate was not yet in full effect. Coverage is most reliable from 2016 onward.
Grant dates
Foundation grant dates come from the filer's reported grant date field. Some foundations report the approval date, others the payment date, and some leave it blank. Year-level analysis is more reliable than month-level.
Name matching
Organization names are as reported on the filing. The same organization may appear under slightly different names across years. DataDawn does not perform entity resolution β search results should be verified by checking the EIN.
Amount discrepancies
Financial figures reflect what was reported on the filing. Amended returns may not overwrite original filings. In rare cases, both an original and amended filing for the same tax year may appear.
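A user who wants to spot cases where an original and an amended filing coexist could group by EIN, tax year, and return type, as sketched here. The column names follow the description of the `returns` table but are assumptions, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE returns (
  ein TEXT, tax_year INTEGER, return_type TEXT, total_revenue INTEGER, object_id TEXT
);
INSERT INTO returns VALUES
  ('123456789', 2021, '990',   500000, 'A1'),
  ('123456789', 2021, '990',   525000, 'A2'),  -- possible amended return
  ('123456789', 2022, '990',   610000, 'A3'),
  ('987654321', 2021, '990EZ',  90000, 'B1');
""")

# More than one filing for the same EIN, tax year, and form type.
duplicates = conn.execute("""
SELECT ein, tax_year, COUNT(*) AS filings
FROM returns
GROUP BY ein, tax_year, return_type
HAVING COUNT(*) > 1
""").fetchall()
print(duplicates)
```

Summing revenue across such rows would double-count, so flagged pairs are worth inspecting before any aggregation.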
DAF grant identification
Donor-advised fund (DAF) disbursements are extracted from Schedule I of 990 filings submitted by DAF sponsor organizations. DataDawn identifies and parses grants from major DAF sponsors including Vanguard Charitable, Fidelity Charitable, Schwab Charitable, National Philanthropic Trust, Silicon Valley Community Foundation, and others.
These are grants made by DAF sponsors to recipient nonprofits. They do not identify the individual donors who recommended the grants β that information is not available in any public filing.
Why this matters: DAF grants represent a large and growing share of philanthropic funding, but because they flow through intermediary sponsors, they are difficult to trace using traditional 990-PF data alone. Combining 990-PF grants with Schedule I DAF data provides a more complete picture of institutional funding flows.
Congress members
The congress_members table is the universal identity table for the OpenRegs
database. It contains 12,763 members of Congress,
both historical and current, sourced from the Congress.gov API and the Biographical
Directory of the United States Congress.
Each member is identified by their bioguide_id, which serves as the
primary join key across all member-related datasets. The table includes name,
party, state, chamber, number of terms served, and service dates.
A precomputed member_stats table provides aggregate counts per member
(total trades, speeches, bills sponsored, votes cast) for quick summary views.
Committee assignments (3,908 current assignments
across 233 committees and subcommittees) are maintained in separate
committees and committee_memberships tables with
leadership title indicators.
Congressional Record
Data source
Floor proceedings from the Congressional Record, sourced from the Government Publishing Office (GPO) via govinfo.gov bulk data. Coverage spans 1994 to present, encompassing speeches, debates, remarks, and other floor proceedings from both chambers.
Tables
| Table | Records | Description |
|---|---|---|
| congressional_record | 878,583 | Floor proceedings with full text, date, chamber, section |
| crec_speakers | 944,216 | Speaker-to-speech linkage, 99.6% with bioguide_id |
| crec_bills | 1,560,000 | Bill references extracted from floor proceedings |
Known limitations
The Congressional Record is not a verbatim transcript. Members may revise and extend their remarks after delivery. The "Extensions of Remarks" section includes statements that were not delivered orally on the floor. Speaker attribution relies on GPO markup, which occasionally misattributes statements in colloquy or debate.
Roll call votes
Data source
Roll call voting data from both chambers, sourced from the Congress.gov API and official House/Senate clerk records.
Tables
| Table | Records | Description |
|---|---|---|
| roll_call_votes | 26,359 | Vote metadata: question, result, date, congress, chamber |
| member_votes | 8,300,000 | Individual vote records: Yea, Nay, Present, Not Voting per member per vote |
Known limitations
Roll call votes capture only recorded votes, not voice votes or unanimous consent agreements. Many legislative actions proceed without a recorded vote. "Not Voting" may indicate absence, abstention, or recusal; the data does not distinguish between these.
Congressional stock trades
Data source
Financial disclosure data from both chambers: House Periodic Transaction Reports (PTRs, parsed from PDF filings) and Senate electronic Financial Disclosures (eFD, scraped from the Senate disclosure website).
Coverage
95,621 transactions are currently in the database. 85.5% are linked to a bioguide_id. Trades include ticker symbol, transaction date, transaction type (purchase/sale/exchange), amount range, and source (House PTR or Senate eFD).
Known limitations
Disclosure amounts are reported in ranges (e.g., $1,001–$15,000), not exact figures. Filing deadlines allow up to 45 days after a transaction, and extensions are common. Trades by spouses and dependent children are included in disclosures but may not always be clearly distinguished from the member's own trades. The 14.5% of trades lacking bioguide linkage are available in the database but cannot be joined to other member-level datasets.
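Because amounts are ranges, any totals are best expressed as lower and upper bounds. The helper below is a hypothetical sketch for turning a range string into numeric bounds. How the range is actually stored in the database is not specified on this page, so the input formats shown are assumptions.

```python
import re

def parse_amount_range(text):
    """Return (low, high) in whole dollars; high is None for open-ended ranges."""
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", text)]
    if not numbers:
        raise ValueError(f"no dollar amount found in {text!r}")
    low = numbers[0]
    high = numbers[1] if len(numbers) > 1 else None
    return low, high

bounds = [
    parse_amount_range("$1,001–$15,000"),
    parse_amount_range("$15,001 - $50,000"),
    parse_amount_range("Over $50,000,000"),
]
print(bounds)
```

Reporting both bounds avoids implying a precision that the disclosures do not have.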
Legislation
Data source
Bill data from the Congress.gov API covering Congresses 108–119 (2003–present).
Tables
| Table | Records | Description |
|---|---|---|
| legislation | 167,507 | Bills with title, sponsor, policy area, latest action, status |
| legislation_cosponsors | 2,070,000 | Cosponsor records with bioguide_id linkage |
| legislation_actions | 1,100,000 | Action steps: introduced, referred, passed, signed |
| legislation_subjects | 1,500,000 | Subject tags assigned by the Congressional Research Service |
Bills are identified by a composite bill_id in the format
{congress}-{type}-{number} (e.g., 118-hr-1234),
which links to floor speech references in crec_bills and lobbying
activity records in lobbying_activities.
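A small sketch of building and parsing the composite key, assuming the lowercase `{congress}-{type}-{number}` format shown in the example above. The function names are hypothetical.

```python
import re
from typing import NamedTuple

class BillId(NamedTuple):
    congress: int
    bill_type: str
    number: int

_BILL_ID = re.compile(r"^(\d+)-([a-z]+)-(\d+)$")

def parse_bill_id(bill_id: str) -> BillId:
    """Split a composite bill_id such as '118-hr-1234' into its parts."""
    match = _BILL_ID.match(bill_id)
    if match is None:
        raise ValueError(f"unrecognized bill_id: {bill_id!r}")
    congress, bill_type, number = match.groups()
    return BillId(int(congress), bill_type, int(number))

def format_bill_id(congress: int, bill_type: str, number: int) -> str:
    """Build the composite key from its parts, lowercasing the bill type."""
    return f"{congress}-{bill_type.lower()}-{number}"

parsed = parse_bill_id("118-hr-1234")
rebuilt = format_bill_id(118, "HR", 1234)
print(parsed, rebuilt)
```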
Campaign finance (FEC)
Data source
Federal Election Commission bulk data files covering candidates, committees, and contributions.
Tables
| Table | Records | Description |
|---|---|---|
| fec_candidates | 64,700 | FEC-registered candidates |
| fec_committees | 155,000 | PACs, party committees, campaign committees |
| fec_contributions | 4,400,000 | PAC/committee-to-candidate contributions |
| fec_candidate_crosswalk | 1,711 | Verified FEC candidate ID to bioguide_id mappings |
Employer-aggregated donations
A separate fec_employers database aggregates individual contributions
by employer name, with zero personally identifiable information.
This enables queries like "which employers' employees donated most to members of
a specific committee" without exposing individual donor records.
Known limitations
The full FEC individual contributions file (104M records, 49GB) is processed locally but is not deployed to the public database due to size and PII considerations. Only the employer-aggregated and committee-level contribution data are published. FEC data has its own filing lag: contributions may not appear for weeks or months after they are made.
Federal Register
Data source
The Federal Register API (federalregister.gov/api), which provides
structured data for every document published in the Federal Register.
Tables
| Table | Records | Description |
|---|---|---|
| federal_register | 993,703 | Rules, proposed rules, notices, presidential documents with title, abstract, dates, PDF/HTML URLs |
| federal_register_agencies | 1,500,000 | Agency tags (many documents have multiple agencies) |
| presidential_documents | 5,904 | Executive orders, proclamations, memoranda |
| fr_regs_crossref | varies | Links Federal Register document numbers to Regulations.gov dockets |
Regulations.gov
Data source
The Regulations.gov API (api.regulations.gov), the federal government's
public comment and rulemaking system.
Tables
| Table | Records | Description |
|---|---|---|
| dockets | 86,706 | Regulatory dockets from EPA, FDA, USDA, FWS, APHIS, DOT, DOE, HHS, DOL, and others |
| documents | 727,510 | Regulatory documents: rules, proposed rules, notices, supporting materials |
| comments | 3,677,962 | Public comment headers: submitter, date, agency, docket |
| comment_details | 36,191 | Full-text comment bodies (organizational comments, growing via ongoing download) |
Known limitations
The Regulations.gov API has strict rate limits. Full-text comment bodies
(comment_details) are being downloaded incrementally using a dual-key
approach and currently cover a fraction of total comments, prioritizing
organizational submissions. Comment header data (submitter name, date, docket)
is complete for all 3.7M comments. Some agencies do not publish all comments
through Regulations.gov.
Code of Federal Regulations
Data source
Bulk XML downloads from the Electronic Code of Federal Regulations (eCFR)
at ecfr.gov.
Coverage
123,480 regulatory sections from five key CFR titles relevant to environmental, agricultural, and public health regulation: Agriculture (Title 7), Animals and Animal Products (Title 9), Food and Drugs (Title 21), Protection of Environment (Title 40), and Wildlife and Fisheries (Title 50). Full regulatory text is indexed for full-text search.
Known limitations
Only five of 50 CFR titles are currently included. The CFR is updated continuously as agencies publish final rules; the DataDawn snapshot reflects the eCFR as of the most recent bulk download. Regulations that have been proposed but not finalized are not included in the CFR data (they appear in the Federal Register).
Lobbying disclosures
Data source
Senate Lobbying Disclosure Act (LDA) filings, downloaded from the Senate Office of Public Records bulk data system.
Tables
| Table | Records | Description |
|---|---|---|
| lobbying_filings | 1,170,000 | Disclosure filings: client, registrant, income/expenses, year |
| lobbying_activities | 2,080,000 | Activity records: issue codes, descriptions, specific bills lobbied |
| lobbying_lobbyists | 2,720,000 | Lobbyist entries, many with covered_position (revolving door indicator) |
| lobbying_issue_codes | 79 | Standard issue category codes |
Revolving door
The covered_position field in lobbyist records identifies individuals
who previously held government positions: the "revolving door" between government
service and lobbying. This field is self-reported by the registrant.
Known limitations
Data currently covers 1999–2017 with ongoing download of 2018+
(approximately 85% complete). LDA filings are self-reported by registrants and
are not independently audited. Income and expense figures are reported in ranges
on some filing types. The lobbying_activities table links to
specific bill numbers when reported, but lobbyists are not required to list
every bill they lobby on.
Foreign agents (FARA)
Data source
Foreign Agents Registration Act data from the Department of Justice FARA
database at fara.gov.
Tables
| Table | Records | Description |
|---|---|---|
| fara_registrants | 7,035 | Registered foreign agents (firms and individuals) |
| fara_foreign_principals | 17,627 | Foreign government and entity clients |
| fara_short_forms | 44,363 | Individual agents working under registrations |
| fara_registrant_docs | 151,348 | Filed documents with PDF links |
Known limitations
FARA registration is self-reported and enforcement has historically been limited. The DOJ has acknowledged that compliance rates are uncertain. Some entities that may be required to register under FARA instead register under the LDA, which has less stringent disclosure requirements. Cross-referencing FARA registrants with lobbying filings (by firm name) can reveal some of these overlaps but is not definitive.
Federal spending
Data source
USAspending.gov bulk award data covering grants, contracts, and other federal awards across 20 agencies.
Coverage
863,632 awards including recipient name, award amount, funding agency, award type, and date ranges. Linkable to agencies referenced in Federal Register documents and lobbying filings.
Known limitations
USAspending.gov data has known reporting quality issues acknowledged by the government itself. Not all agencies report at the same level of detail or timeliness. Sub-award data is not currently included. The 20-agency scope covers the most active federal funders but is not comprehensive across all federal agencies.
General limitations
Data as reported
DataDawn publishes data as reported in source filings and government databases. We do not correct, impute, or editorialize. Errors in source filings propagate to our database. Where we are aware of systematic data quality issues, they are documented in the dataset-specific sections above.
No entity resolution
The same real-world entity may appear under different names across datasets (e.g., "ASPCA" vs "American Society for the Prevention of Cruelty to Animals" in 990 data, or variant name spellings across FEC and Congressional records). DataDawn does not perform automated entity resolution. Users should verify matches using stable identifiers like EIN, bioguide_id, or FEC candidate ID.
Point-in-time snapshots
Each dataset reflects the state of its source at the time of DataDawn's most recent extraction. Government agencies update their data on different schedules. The database is not a real-time feed.
Correlation is not causation
Cross-referencing datasets enables powerful queries (e.g., stock trades within 30 days of floor speeches on related topics), but temporal or thematic proximity does not establish a causal or improper relationship. DataDawn provides the data; interpretation is the user's responsibility.
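The 30-day window mentioned above could be expressed with SQLite's `julianday()` function, as in this sketch. The `stock_trades` table name, all column names, and the sample rows are hypothetical, and dates are assumed to be stored as ISO 8601 text. The output is a list of candidate pairs for further review, nothing more.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stock_trades (bioguide_id TEXT, ticker TEXT, transaction_date TEXT);
CREATE TABLE crec_speakers (bioguide_id TEXT, speech_id INTEGER);
CREATE TABLE congressional_record (id INTEGER PRIMARY KEY, date TEXT);
INSERT INTO stock_trades VALUES ('X000001', 'XYZ', '2024-03-15');
INSERT INTO congressional_record VALUES (1, '2024-03-01'), (2, '2023-11-01');
INSERT INTO crec_speakers VALUES ('X000001', 1), ('X000001', 2);
""")

# Pair each trade with the same member's speeches dated within 30 days of it.
pairs = conn.execute("""
SELECT t.bioguide_id, t.ticker, t.transaction_date, r.date AS speech_date,
       CAST(julianday(t.transaction_date) - julianday(r.date) AS INTEGER) AS days_apart
FROM stock_trades AS t
JOIN crec_speakers AS s ON s.bioguide_id = t.bioguide_id
JOIN congressional_record AS r ON r.id = s.speech_id
WHERE ABS(julianday(t.transaction_date) - julianday(r.date)) <= 30
ORDER BY t.transaction_date, r.date
""").fetchall()
print(pairs)
```

A positive `days_apart` means the trade followed the speech, and a negative value means it preceded it.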
Update schedule
The 990 database updates as the IRS publishes new e-file batches, typically monthly. OpenRegs datasets are updated on varying schedules depending on source API availability and data volume. The current databases were built in March 2026 from all available source data as of that date.
Independence statement
DataDawn is an independent project with no institutional affiliations. It receives no funding from any nonprofit, foundation, government agency, or organization represented in its datasets. All data is sourced exclusively from public records filed with federal government agencies.
DataDawn does not endorse, evaluate, or rank any organization, legislator, or entity. The platform provides raw data and search tools. Interpretation and analysis are the responsibility of the user.
All source code, extraction pipelines, and database schemas are published on GitHub under a CC0 1.0 Universal (public domain) license.
Corrections and feedback
If you find a data quality issue, parsing error, or have questions about the methodology, you can reach DataDawn at info@datadawn.org.