Validating 2022 voters in our survey data

Knowing who voted is critical to developing an accurate understanding of an election’s outcome. But self-reports of voting tend to be somewhat unreliable. Fortunately, validating respondents’ self-reported turnout against official voting records can improve the accuracy of surveys that ask about voting.

Pew Research Center has released an updated version of our 2022 post-election survey dataset that includes validated measures of turnout in the 2022, 2020, 2018 and 2016 U.S. general elections. The data package also includes two special weights for the validated turnout variables.

To validate 2022 turnout, we attempted to locate an official turnout record for each member of the Center’s American Trends Panel (ATP) – our nationally representative survey panel of adults – in at least one of three commercial databases. All 50 states and the District of Columbia compile these publicly available turnout records as part of their routine election administration. Commercial vendors then make the information available to political parties, campaigns and researchers.

This dataset is the basis for our July 2023 report about the characteristics of the 2022 electorate, including those who voted and those who did not. The dataset is available as an SPSS statistics file (with the file extension .sav) and is accompanied by a ReadMe.txt file with information about the computation of turnout variables. All major statistical software packages can read SPSS files.

We’ll discuss the measures and what you can do with the dataset in more detail below, and offer some suggestions for analyzing the data with free statistical software.

As a reminder, the Center releases nearly all of its raw survey datasets to the public. Users who register for a free account can download and manage datasets as often as desired.

It usually takes anywhere from a few months to more than a year after collection to release a dataset. This additional time allows us to fully analyze and report on the data, as well as clean and anonymize the files to protect respondents from being personally identified.

Defining validated voters

To validate 2022 election turnout among ATP members, we attempted to link panel members to a turnout record in at least one of three commercial voter files: one that serves conservative and Republican organizations and campaigns, one that serves progressive and Democratic organizations and campaigns, and one that is nonpartisan.

A member of the ATP is considered a validated voter for a given election if they:

Told us they voted, and
Were recorded as having voted in at least one of the three commercial voter files.

Those who said they did not vote in an election are considered nonvoters. Nonvoters also include anyone – regardless of their self-reported vote – for whom we could not locate a voting record in any of the three commercial voter files. We assumed those not represented in any voter file were not registered voters and therefore had not voted. Overall, 95% of panelists included in the analysis were matched to at least one of the three files.

(In Utah, residents can opt to keep their voter registration and vote history data private because of a 2018 law. Therefore, we could not assume that the absence of a voting record meant a Utah panelist is a nonvoter. Utah residents in the ATP are considered voters if they reported having voted when asked in the post-election survey.)
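The classification rules above can be sketched in a few lines of Python. This is only an illustration of the logic as described in this post, not the code the Center used. The function name, its arguments and the `"UT"` state code are hypothetical, and each entry in `file_vote_records` is assumed to be `True` (the file shows a vote), `False` (matched, but no vote recorded) or `None` (no match in that file).

```python
from typing import Optional, Sequence


def validated_turnout(
    self_reported_vote: bool,
    file_vote_records: Sequence[Optional[bool]],
    state: str,
    is_citizen: bool = True,
) -> Optional[int]:
    """Return 1 (validated voter), 0 (nonvoter) or None (coded as missing).

    All names here are hypothetical; the rules follow the prose description.
    """
    # Noncitizens are coded as missing on the turnout measures.
    if not is_citizen:
        return None
    # Anyone who said they did not vote is treated as a nonvoter.
    if not self_reported_vote:
        return 0
    # Utah exception: a missing record cannot be read as not voting,
    # so a self-report of voting is accepted.
    if state == "UT":
        return 1
    # Otherwise, at least one of the commercial files must show a vote.
    return 1 if any(record is True for record in file_vote_records) else 0
```

For example, `validated_turnout(True, [None, True, False], "OH")` would classify a panelist as a validated voter because one of the three files recorded a vote, while the same self-report with no record in any file would yield a nonvoter.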

Additional information about the voter file matching and verification process, as well as sample sizes for all elections reflected in the validated voter variables, can be found in the methodology section of the Center’s 2022 election report. Users should keep in mind that unweighted sample sizes may not exactly match those listed in the report because some exclusions occur when the weights are applied.

Our 2018 report provides a comprehensive introduction to commercial voter files and how they are used to study U.S. politics.

The dataset we are releasing today allows users to replicate or extend portions of the analysis presented in our July 2023 report on the 2022 election.

Note: Certain variables used in the published longitudinal analysis have been removed from the data to protect respondent privacy. Replication of that analysis with this dataset will produce results that differ slightly from those that were originally published.

What this dataset cannot be used for

This dataset cannot be used to replicate the analysis presented in the Center’s validated voter reports on the 2020, 2018 or 2016 elections. Like the 2022 election report, the earlier reports are based on post-election surveys conducted in the weeks after the elections and voter validations conducted several months later.

Although many of the respondents who took those post-election surveys are still in the panel and took the 2022 post-election survey, many others are no longer in the panel. To replicate findings from an earlier election, download the relevant post-election survey datasets; those files contain voter validation variables similar to those in the 2022 dataset.

We have also released an updated version of the Center’s 2020 post-election survey dataset that includes the special weight (WEIGHT_W78_VALIDATEDVOTE_REVISED) necessary to replicate the 2020 estimates shown in the 2022 report, along with an updated ReadMe.txt file with information about how to use it. None of the substantive findings from the 2020 report have changed. Refer to the 2022 methodology for more detail about why and how we made this change to our weighting approach for 2020.

Variables of interest

The dataset includes 10 new variables:

Two special weights to be used with analysis of the validated vote (one for analysis of only the 2022 vote, and another for analysis of turnout and vote choice over time across the 2018, 2020 and 2022 elections)
Measures of validated turnout for 2022, 2020, 2018 and 2016
Vote choice among panelists who voted in 2022, 2020, 2018 or 2016

Noncitizens (F_CITIZEN=2,99) are coded as missing for these measures of turnout and vote choice.

Special weights

The first special weight is WEIGHT_W117_VALIDATEDVOTE. This weight should be used for any analysis involving validated voters in the 2022 election alone. We refer to this weight as the “2022 special weight” below. It was used to produce the 2022 estimates shown in the detailed tables and the charts in Chapter 2 of the report.

The weight adjusts the sample on the large set of variables used in a typical wave of the ATP but includes additional parameters for turnout and vote choice in the 2022 and 2020 elections. It should be used in conjunction with any analysis that includes the variables VOTED2022 or VOTECHOICE2022 but does not require the other validated voter variables described below. This weight will produce the most precise estimates for turnout and vote choice in 2022 and should be used to replicate the vast majority of findings from our 2022 report.

The second special weight is WEIGHT_W78_W117_VALIDATEDVOTE. This weight should be used for any analysis that combines the voter turnout or vote choice variables from multiple elections. We refer to this weight as the “longitudinal special weight” below.

This weight adjusts the sample on the large set of variables used in a typical wave of the ATP but includes additional parameters for turnout and vote choice across the 2022, 2020 and 2018 elections. We do not recommend using this longitudinal weight for analysis focused only on the 2022 election, as the 2022 special weight described above will produce more precise estimates for 2022 turnout and vote choice.

For analysis that does not require identifying voters or nonvoters, WEIGHT_W117 should be used instead.

Note: The weight variable WEIGHT_W117_VOTE was used in earlier analyses before validated vote data was available. If you wish to use validated vote in your analysis, do not use this weight variable.
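One way to keep these rules straight in an analysis script is a small lookup helper, sketched here in Python. The weight and variable names come from the dataset as described in this post, but the function `recommended_weight` is hypothetical. Treating any use of a 2020, 2018 or 2016 validated variable as a reason to use the longitudinal weight is an interpretation of the guidance above, so check the ReadMe.txt file before relying on it.

```python
# Validated variables tied to the 2022 election only.
VARS_2022 = {"VOTED2022", "VOTECHOICE2022"}

# Validated variables from earlier elections (assumed to call for the
# longitudinal weight whenever they appear in an analysis).
VARS_EARLIER = {
    "VOTED2020", "VOTED2018", "VOTED2016",
    "VOTECHOICE2020", "VOTECHOICE2018", "VOTECHOICE2016",
}


def recommended_weight(analysis_vars):
    """Return the name of the weight suggested for a set of variables."""
    used = set(analysis_vars)
    if used & VARS_EARLIER:
        # Analysis draws on validated variables from other elections.
        return "WEIGHT_W78_W117_VALIDATEDVOTE"
    if used & VARS_2022:
        # Analysis of validated voters in 2022 alone.
        return "WEIGHT_W117_VALIDATEDVOTE"
    # No validated voter variables involved.
    return "WEIGHT_W117"
```

The helper never returns WEIGHT_W117_VOTE, which predates the validated vote data.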

The methodological report for the 2022 study describes the weighting process in greater detail.

Turnout variables

Validated turnout variables for the four elections are as follows:

Variable name	Variable label
`VOTED2022`	Validated turnout in 2022 general election
`VOTED2020`	Validated turnout in 2020 general election
`VOTED2018`	Validated turnout in 2018 general election
`VOTED2016`	Validated turnout in 2016 general election

These are dichotomous variables coded “1” for validated vote and “0” otherwise.

When the 2022 special weight is applied, voter turnout in 2022 matches the national turnout among the voting eligible population as documented by the U.S. Elections Project, based on ballots counted for the highest office in the election. When the longitudinal special weight is applied, voter turnout in each election matches the U.S. Elections Project’s national turnout numbers for the voting eligible population. The share of adults who were eligible to vote in each election is based on the 2021 American Community Survey, the latest available at the time.
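To make the role of the weight concrete, here is a minimal Python sketch of a weighted turnout estimate: the sum of the weights for validated voters divided by the sum of the weights for everyone with a nonmissing turnout value. The function `weighted_share` and the four toy rows are made up for illustration; real estimates require the full dataset.

```python
def weighted_share(rows, flag, weight):
    """Weighted share of rows where rows[flag] == 1, skipping missing values."""
    numerator = 0.0
    denominator = 0.0
    for row in rows:
        value, w = row.get(flag), row.get(weight)
        # Skip cases coded as missing (for example, noncitizens).
        if value is None or w is None:
            continue
        denominator += w
        if value == 1:
            numerator += w
    if denominator == 0:
        raise ValueError("no valid weighted cases")
    return numerator / denominator


# Toy rows with invented values, not drawn from the actual dataset.
toy_rows = [
    {"VOTED2022": 1, "WEIGHT_W117_VALIDATEDVOTE": 0.8},
    {"VOTED2022": 0, "WEIGHT_W117_VALIDATEDVOTE": 1.6},
    {"VOTED2022": 1, "WEIGHT_W117_VALIDATEDVOTE": 0.6},
    {"VOTED2022": None, "WEIGHT_W117_VALIDATEDVOTE": 1.0},
]

turnout_2022 = weighted_share(toy_rows, "VOTED2022", "WEIGHT_W117_VALIDATEDVOTE")
print(f"Weighted 2022 turnout in the toy data: {turnout_2022:.1%}")
```

In the toy data, two of the three valid cases voted, but the weighted turnout is below 50% because the nonvoter carries the largest weight.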

Vote choice variables

Vote choice variables for the four elections are as follows:

Variable name	Variable label
`VOTECHOICE2022`	Vote choice for U.S. House of Representatives in 2022 among validated voters
`VOTECHOICE2020`	Vote choice for president in 2020 among validated voters
`VOTECHOICE2018`	Vote choice for U.S. House of Representatives in 2018 among validated voters
`VOTECHOICE2016`	Vote choice for president in 2016 among validated voters

These variables are coded as “1” for the Republican candidate, “2” for the Democratic candidate and “3” for candidates of other parties.

When the 2022 special weight is applied, candidate choice for the 2022 election matches vote shares for each party’s candidate(s), as documented by the Cook Political Report as of April 4, 2023. When the longitudinal special weight is applied, candidate choice in each election matches vote shares for each party’s candidate(s).
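The same idea extends to vote choice. The Python sketch below tabulates weighted shares for codes 1, 2 and 3 among cases with a valid vote choice. The function name `weighted_vote_shares` and the toy rows are hypothetical and are not drawn from the dataset.

```python
from collections import defaultdict

# Codes as documented for the vote choice variables.
VOTE_LABELS = {1: "Republican", 2: "Democratic", 3: "Other"}


def weighted_vote_shares(rows, choice, weight):
    """Weighted share of each vote choice among cases with a valid code."""
    totals = defaultdict(float)
    for row in rows:
        code, w = row.get(choice), row.get(weight)
        # Nonvoters and other missing cases have no valid vote choice code.
        if code not in VOTE_LABELS or w is None:
            continue
        totals[VOTE_LABELS[code]] += w
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no valid weighted cases")
    return {label: total / grand_total for label, total in totals.items()}


# Toy rows with invented values, for illustration only.
toy_voters = [
    {"VOTECHOICE2022": 1, "WEIGHT_W117_VALIDATEDVOTE": 1.0},
    {"VOTECHOICE2022": 2, "WEIGHT_W117_VALIDATEDVOTE": 1.5},
    {"VOTECHOICE2022": 3, "WEIGHT_W117_VALIDATEDVOTE": 0.5},
    {"VOTECHOICE2022": 1, "WEIGHT_W117_VALIDATEDVOTE": 1.0},
    {"VOTECHOICE2022": None, "WEIGHT_W117_VALIDATEDVOTE": 2.0},
]

shares_2022 = weighted_vote_shares(
    toy_voters, "VOTECHOICE2022", "WEIGHT_W117_VALIDATEDVOTE"
)
```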

Tips for analyzing this data

The 2022 dataset described here is available for download as an SPSS file. Nearly any statistical program designed for the analysis of surveys can read the SPSS file. However, spreadsheet software like Microsoft Excel or Google Sheets may not be able to read the SPSS file and cannot apply the survey’s weights, which are critical for producing accurate estimates.

Statistical software packages like SPSS, SAS, Stata or R can tabulate the weighted data and reproduce the analyses found in our report. But users should ensure that their software package is properly accounting for the effect of weighting on the precision of the estimates. That is, estimates of the margin of error or the significance of differences between two groups in the sample will be incorrect unless the survey software can correctly account for the impact of the weighting on the variance of the estimates.
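To see why weighting matters for precision, consider a rough Python sketch based on Kish’s approximate design effect, n × Σw² / (Σw)². This is a common back-of-the-envelope approximation and not necessarily the variance estimation method used by the Center or by dedicated survey software, which typically relies on design-based variance estimation. The function names are hypothetical.

```python
import math


def kish_design_effect(weights):
    """Kish's approximate design effect due to unequal weighting."""
    n = len(weights)
    total = sum(weights)
    return n * sum(w * w for w in weights) / (total * total)


def approx_margin_of_error(p, weights, z=1.96):
    """Approximate margin of error for a weighted proportion p.

    The sample size is deflated by the Kish design effect, so unequal
    weights widen the margin relative to a simple random sample.
    """
    effective_n = len(weights) / kish_design_effect(weights)
    return z * math.sqrt(p * (1 - p) / effective_n)
```

With equal weights the design effect is 1 and the result reduces to the familiar simple random sample formula. The more the weights vary, the smaller the effective sample size and the wider the margin of error, which is why software that ignores the weights will overstate precision.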

Fortunately, the open-source statistical package R is free and can correctly handle survey data weighting. Our colleagues at the Center have developed some special packages within R and have written guides to the use of R and these packages:

A guide that covers the basics of using R to read and analyze Center data, including properly handling the weights to create an accurate margin of error and tests of significance
An introduction to pewmethods, a special R package written by the survey methodology team to simplify several tasks in working with survey data, and a guide for using the pewmethods package
An explanation of how to use the popular set of R packages known as the “tidyverse” to explore the Center’s survey data
