Methodology

This page documents the data sources, update frequencies, and processing methodology used to create Code for Democracy's data.

You can also view the status of each dataset to better understand its coverage and update cadence.

Government Data

We believe that open data provided by the government is still the best place to start for any search. As such, these datasets are core to our platform:

Campaign Finance

Data from the FEC's bulk data download is indexed daily, along with financial reports and more detailed Schedule A data from the FEC API. We use the bulk data as the primary source for our campaign finance data because it is processed by the FEC before release, and therefore we consider it the cleanest source. However, there is often a lag between when raw data is reported to the FEC and when it appears in the bulk data.
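
As a rough illustration of what treating the bulk data as the primary source can look like, the sketch below merges records from both sources and keeps the bulk version whenever the same record appears in each. It is not taken from our pipeline: the function name `merge_prefer_bulk`, the `sub_id` key, and the `source` field are all hypothetical names chosen for the example.

```python
from typing import Dict, Iterable, List


def merge_prefer_bulk(
    bulk_rows: Iterable[dict],
    api_rows: Iterable[dict],
    key: str = "sub_id",  # hypothetical unique record identifier
) -> List[dict]:
    """Combine records from both sources, keeping the bulk version on conflict."""
    merged: Dict[str, dict] = {}
    # Load API rows first so that bulk rows overwrite them below.
    for row in api_rows:
        merged[row[key]] = {**row, "source": "api"}
    for row in bulk_rows:
        merged[row[key]] = {**row, "source": "bulk"}
    return list(merged.values())
```

Under this assumption, a record that has reached the API but not yet the bulk download is still kept (tagged `api`) and is replaced once the processed bulk version arrives.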

Lobbying Disclosures

Lobbying activity and related contribution reports are continuously indexed from both the House and Senate websites. For the House, we ingest the data by paging through the front-end website. For the Senate, we ingest the data directly from their API. Because paging through a website is more fragile than reading from an API, it is quite possible that our House data is less comprehensive than our Senate data.
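
The paging approach can be sketched roughly as follows. This is a generic illustration rather than our actual scraper: `fetch_page` is a hypothetical callable standing in for whatever requests and parses one page of results, and the stop-on-empty-page rule and `max_pages` limit are assumptions made for the example.

```python
from typing import Callable, Iterator, List


def iter_paged_results(
    fetch_page: Callable[[int], List[dict]],  # hypothetical: returns the rows on one page
    start_page: int = 1,
    max_pages: int = 10_000,  # safety limit so a misbehaving site cannot loop forever
) -> Iterator[dict]:
    """Yield rows page by page until a page comes back empty."""
    for page in range(start_page, start_page + max_pages):
        rows = fetch_page(page)
        if not rows:
            break
        yield from rows


# Usage with an in-memory stand-in for the website:
fake_site = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
rows = list(iter_paged_results(lambda page: fake_site.get(page, [])))
```

A loop like this stops at the first empty page, so any page that fails to render or parse cuts the run short. That is one way scraped data can end up less complete than data read from an API.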

Tax Documents

A subset of fields from IRS 990, IRS 990EZ, and IRS 990PF filings is continuously indexed from the AWS XML mirror. We use the provided index listings of available filings for each year to page through the individual XML filings.
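
A minimal sketch of this two-step process, under stated assumptions: each index entry is taken to be a dictionary with `FormType` and `URL` keys, and the field names passed to `extract_fields` are placeholders. Both function names are hypothetical and this is not the code our pipeline runs.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Iterator, List, Optional

WANTED_FORMS = {"990", "990EZ", "990PF"}


def filing_urls(index_entries: Iterable[dict]) -> Iterator[str]:
    """Yield the XML URL of every index entry whose form type we index."""
    for entry in index_entries:
        if entry.get("FormType") in WANTED_FORMS:  # assumed key name
            yield entry["URL"]  # assumed key name


def extract_fields(xml_text: str, field_names: List[str]) -> Dict[str, Optional[str]]:
    """Pull the first occurrence of each named element out of one filing."""
    root = ET.fromstring(xml_text)
    found: Dict[str, Optional[str]] = {name: None for name in field_names}
    for element in root.iter():
        # Drop any "{namespace}" prefix so lookups work on local tag names.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in found and found[local_name] is None:
            found[local_name] = (element.text or "").strip()
    return found
```

Fields that are absent from a given filing come back as `None`, which keeps the output shape the same across the three form types even though their schemas differ.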

Narrative Data

In addition to traditional open data sources, we also ingest a variety of datasets that are helpful for understanding the types of narratives occurring in political discourse:

News

News articles are indexed twice each day from the news sources rated by AllSides and Media Bias/Fact Check. Although we attempt to index all articles from each news source, in reality our coverage should be thought of as a collection of "front-page" articles.

Facebook Ads

We also index all data related to "Issues, Elections or Politics" from the Facebook Ads Library on a continual basis. Our data comes from the Facebook API, and therefore it should be an exact mirror of the data available in the Ads Library. However, the universe of data available here is dependent on the accuracy of Facebook's own classification algorithms.

Tweets

We continuously index all tweets from a core group of Twitter users who are relevant political candidates, commentators, activists, fact checkers, or journalists. This is our least comprehensive dataset and is subject to a multitude of potential misattribution and latency issues, so it should be used for exploratory purposes only.

Our data pipeline is open source! See the Data repository on GitHub for details on how we are ingesting and processing each data source.
