This study measured the effect of survey mode of administration on respondents’ answer choices across 27 different questions about political figures, policy preferences and core political values. Looking at such a broad range of questions provides a great deal of insight into the kinds of items that could be adversely affected by the choice of survey mode, but it also introduces complexities that need to be addressed in the analysis. Specifically, the more comparisons one makes, the greater the likelihood of finding statistically significant results not because of any real underlying difference but simply because of random variability. If the threshold for statistical significance is set at the conventional p=0.05 level, on average we would expect one in 20 comparisons to show up as significant even if there were no real underlying effect. A common approach to this problem of multiple comparisons is to adjust each test’s p-value upward (or, equivalently, to lower the significance threshold) to account for the number of tests being performed, which reduces the number of findings declared significant.
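To get a sense of the scale of the problem, here is a minimal back-of-the-envelope sketch in Python. It assumes the 27 tests are independent, which is a simplification; the study’s comparisons are unlikely to be fully independent.

```python
alpha = 0.05   # conventional significance threshold
n_tests = 27   # number of primary mode comparisons in this study

# Expected number of "significant" results from chance alone,
# if every null hypothesis were in fact true:
print(n_tests * alpha)              # 1.35

# Probability of at least one false positive across all 27 tests,
# assuming independence:
print(1 - (1 - alpha) ** n_tests)   # roughly 0.75
```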
For this report, researchers chose not to adjust for multiple comparisons when discussing the experimental findings. However, when such an adjustment is performed in this study using a technique known as the Benjamini-Hochberg procedure,¹ the number of significant differences among the 27 primary mode comparisons drops from four to one.²
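The report itself contains no code, but the Benjamini-Hochberg step-up procedure is simple enough to sketch. The Python function below is an illustrative implementation; the p-values in the usage example are hypothetical and are not the study’s actual results.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: flags which tests remain
    significant while controlling the false discovery rate at alpha."""
    m = len(p_values)
    # Rank the p-values from smallest to largest, remembering positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-indexed) with p_(k) <= (k / m) * alpha.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    # Reject every hypothesis whose p-value ranks at or below k.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_k:
            significant[idx] = True
    return significant

# Five hypothetical p-values: four fall below 0.05 unadjusted,
# but only two survive the correction at alpha = 0.05.
p = [0.30, 0.001, 0.045, 0.02, 0.04]
print(benjamini_hochberg(p))  # [False, True, False, True, False]
```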
Why, then, discuss unadjusted significance tests throughout this report? Although there are technical reasons to believe that such corrections are too conservative, the decision primarily has to do with the consequences of being wrong. In this case, it is preferable to err on the side of detecting potential sources of bias so that they can be subjected to additional research and scrutiny, even if some turn out to be overstated, than to let real problems slip by undetected.
With this in mind, it is important not to confuse “significant” with “true” and “nonsignificant” with “false.” Nonsignificant differences may represent effects that are real but small, requiring a larger sample size to measure with sufficient precision. Likewise, some results with p-values just under the 0.05 threshold could easily have fallen on the other side if random mode assignment or response had turned out slightly differently.