This page documents the data sources, update frequencies, and processing methodology used to create Code for Democracy's data.
We believe that open data provided by the government is still the best place to start for any search. As such, these datasets are core to our platform:
Data from the FEC's bulk data download is indexed daily, along with financial reports and more detailed Schedule A data from the FEC API. We use the bulk data as the primary source for our campaign finance data because it is processed by the FEC before release, and therefore we consider it the cleanest source. However, in many cases, there may be a lag between when raw data is reported to the FEC and when the bulk data is available.
Lobbying activity and related contribution reports are continuously indexed from both the House and Senate websites. For the House, we ingest the data by paging through the front-end website. For the Senate, we ingest the data directly from their API. Because the House data is collected from the website rather than from an API, it may be less comprehensive than the Senate data.
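As an illustration of the paging approach described above, here is a minimal sketch of a generic pager. It is not our production code: `iter_pages`, `fetch_page`, and `max_pages` are hypothetical names, and the fetch function is passed in as an argument so the sketch makes no assumptions about the actual House or Senate endpoints.

```python
from typing import Callable, Iterator, List


def iter_pages(
    fetch_page: Callable[[int], List[dict]],
    start: int = 1,
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield records page by page until a page comes back empty.

    fetch_page is a hypothetical callable that takes a page number and
    returns that page's records. max_pages is a safety cap so a
    misbehaving source cannot cause an endless loop.
    """
    page = start
    while page < start + max_pages:
        records = fetch_page(page)
        if not records:
            # An empty page is treated as the end of the result set.
            break
        yield from records
        page += 1


# Stand-in for a real source: three pages of fake records, then nothing.
FAKE_PAGES = {
    1: [{"id": "a"}, {"id": "b"}],
    2: [{"id": "c"}],
    3: [{"id": "d"}],
}


def fake_fetch(page: int) -> List[dict]:
    return FAKE_PAGES.get(page, [])


all_ids = [record["id"] for record in iter_pages(fake_fetch)]
```

A pager like this only sees what the source exposes page by page, which is one reason a website-based ingest can miss records that an API would return.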
A subset of fields from IRS 990, IRS 990EZ, and IRS 990PF filings is continuously indexed from the AWS XML mirror. We use the provided index listings of available filings for each year to page through the individual XML filings.
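The two steps above (walk a yearly index listing, then pull a subset of fields out of each XML filing) can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual pipeline: the index keys (`object_id`, `form_type`), the XML tag names, and the sample data are all hypothetical and do not reflect the real index or filing schema.

```python
import xml.etree.ElementTree as ET

# Form types named in the text above.
WANTED_FORMS = {"990", "990EZ", "990PF"}


def select_filings(index_entries):
    """Yield index entries whose form type is one we ingest.

    The "form_type" key is a hypothetical field name.
    """
    for entry in index_entries:
        if entry.get("form_type") in WANTED_FORMS:
            yield entry


def extract_fields(xml_text, field_tags):
    """Return {tag: text} for the first element matching each wanted tag.

    XML namespaces are ignored by comparing local tag names only.
    """
    root = ET.fromstring(xml_text)
    found = {}
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name in field_tags and local_name not in found:
            found[local_name] = (elem.text or "").strip()
    return found


# Hypothetical sample data standing in for an index listing and a filing.
sample_index = [
    {"object_id": "A1", "form_type": "990"},
    {"object_id": "B2", "form_type": "990T"},
    {"object_id": "C3", "form_type": "990PF"},
]
sample_xml = (
    "<Return xmlns='urn:example'>"
    "<EIN>123456789</EIN><Name>Example Org</Name><Other>x</Other>"
    "</Return>"
)

selected_ids = [entry["object_id"] for entry in select_filings(sample_index)]
subset = extract_fields(sample_xml, {"EIN", "Name"})
```

Keeping only a named set of tags is what makes the result "a subset of fields": anything not listed in `field_tags` is skipped.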
In addition to traditional open data sources, we also ingest a variety of datasets that are helpful for understanding the types of narratives occurring in political discourse:
We also index all data related to "Issues, Elections or Politics" from the Facebook Ads Library on a continual basis. Our data comes from the Facebook API, and therefore it should be an exact mirror of the data available in the Ads Library. However, the universe of data available here is dependent on the accuracy of Facebook's own classification algorithms.
We continuously index all tweets from a core group of Twitter users who are relevant political candidates, commentators, activists, fact checkers, or journalists. This is our least comprehensive dataset and is subject to a multitude of potential misattribution and latency issues, so it should be used for exploratory purposes only.