# Hypothesis testing in Crunch

Crunch allows users to test hypotheses and display significance levels for two-way tables. Crunch computes *P*-values for testing null hypotheses of “no difference” or “independence.” It is common (though now considered poor practice) to declare differences with *P*-values less than .05 "statistically significant." The *P*-value is the probability of observing a difference as large as or larger than the one observed in the sample data, under the assumption that there is, in fact, no difference in the population. It should not be confused with the probability that a difference actually exists (no difference is assumed in performing the calculation), nor is it an indication that a difference is “substantial” or “important.” Users are advised to consult the American Statistical Association's "Statement on *P*-values: Context, Process, and Purpose," *The American Statistician* (2016). [URL: https://doi.org/10.1080/00031305.2016.1154108]

Best practice is for users to focus on *effect size* and *margin of error* rather than statistical significance. With large amounts of data, even tiny differences can be statistically significant. With small amounts of data, large effects can be statistically significant, but the estimated effect sizes are unreliable (and, in many cases, not even in the correct direction). If an effect is large (as judged by a domain expert) and has a small margin of error, one can be confident that there is a real difference. Similarly, if an estimated effect is small and the margin of error is small, you can safely conclude that there is not an important effect. But when the margin of error is large, any mechanical rule (such as "*P* < .05") is almost guaranteed to lead to misleading conclusions. We urge extreme caution in the use of these tests.
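For intuition, a margin of error for the difference of two sample proportions can be sketched as follows. This is a normal-approximation illustration, not Crunch's internal computation; the numbers are the 45-64 vs. 30-44 Republican comparison from the example table in this document.

```python
import math

def diff_moe(p1, n1, p2, n2, z=1.96):
    """Approximate 95% margin of error for the difference of two
    independent sample proportions (normal approximation)."""
    return z * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# 30% of 450 respondents vs 21% of 293 respondents: the observed
# difference is 9 points, with a margin of error of about 6.3 points.
moe = diff_moe(0.30, 450, 0.21, 293)
```

Reporting the difference together with its margin of error conveys more than a bare significant/not-significant verdict.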

## How to export hypothesis tests in Crunch

When exporting a multitable, users can check the option “Column t-tests” in the “Export tab book…” dialog box, as shown below.

With this option checked, Crunch will compare column percentages in a two-way table. Each column is labelled with a letter, and hypothesis tests at the .05 significance level compare cells in the *same row* of the table. The percentages being compared are *column percentages*: the denominator (base) used to compute each percentage is the number of persons in that column.

## Example

For example, the table below is a cross-tabulation of Party ID (rows) and age group (columns). The cell percentages are *column percentages* — the percentage of persons in each column (age group) who report identifying with each party.

| Party ID | All | Under 30 (A) | 30-44 (B) | 45-64 (C) | 65+ (D) |
| --- | --- | --- | --- | --- | --- |
| Republican | 28% | 24% | 21% | 30% (B) | 35% (A B) |
| Independent | 33% | 29% | 39% (A D) | 33% | 29% |
| Democrat | 40% | 48% (C D) | 40% | 37% | 35% |
| Unweighted N | 1310 | 275 | 293 | 450 | 292 |

Let's focus on the comparisons that could be made using the first row. 24% of the 275 persons under age 30 are Republican, compared to 21% of those aged 30-44, 30% of those aged 45-64, and 35% of those aged 65+. The B next to 30% in the 45-64 column indicates that the difference between 30% (the estimated percentage of Republicans among those aged 45-64) and 21% (the estimated percentage among those aged 30-44) is "statistically significant." More precisely, the hypothesis that the population proportions of Republicans among 30-44 year olds and among 45-64 year olds are equal is rejected at the .05 significance level. On the other hand, the hypothesis of no difference between the proportions of Republicans aged 45-64 and under 30 is *not* rejected.

(Be very careful: *not rejecting* a hypothesis of “no difference” should not be interpreted as saying there is no difference. There are more Republicans among 45-64 year olds in the sample than among those under 30, but the sample sizes are small, so these are noisy estimates. On the other hand, rejecting the hypothesis of no difference at the .05 significance level is often only weak evidence that there really is a difference in the population. If you test hundreds of hypotheses with random data, you will find many "significant differences" that are just meaningless noise. This is why, in the Web application, we color cells according to their *P*-values, giving an indication of the strength of the evidence.)
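The multiple-testing point is easy to demonstrate with a small simulation (illustrative only, standard-library Python): draw repeated pairs of samples from the *same* population, so that every "significant difference" found is, by construction, a false positive.

```python
import math
import random

random.seed(1)

def two_prop_t(p1, n1, p2, n2):
    """t statistic for the difference of two sample proportions."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se if se > 0 else 0.0

# Both samples come from the SAME population (p = 0.3), so there is
# no real difference to find; any rejection is spurious.
n, trials, false_positives = 300, 1000, 0
for _ in range(trials):
    x1 = sum(random.random() < 0.3 for _ in range(n)) / n
    x2 = sum(random.random() < 0.3 for _ in range(n)) / n
    if abs(two_prop_t(x1, n, x2, n)) > 1.96:
        false_positives += 1

print(false_positives)  # roughly 5% of the 1000 trials
```

Roughly one in twenty comparisons is flagged "significant" even though no difference exists, which is exactly what a .05 significance level promises.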

These hypothesis tests are symmetric: if column C is significantly different from column B, then column B is significantly different from column C. To avoid cluttering the output, we label only the column with the higher percentage (so the first row of column C has a B below it, but nothing is shown underneath the first row of column B). The hypothesis tests are all “two-tailed” tests for the presence of a difference, not its direction.

A multitable is composed of a set of bivariate tables, and the tests presented in the multitable export to Excel are computed only within each bivariate subtable. For instance, the multitable shown below is composed of three bivariate subtables (party ID × age, party ID × race, and party ID × gender). The hypothesis tests compare the columns of each subtable; no test is performed comparing, say, whites to males.

## Differences between hypothesis testing in Excel exports and the web application

In the Web application, choosing "✳︎" from the popup display controller colors all of the cells to identify which ones are higher or lower than the full sample in the row or column. The coloring ranges from dark green (much higher) to dark red (much lower) according to the *P*-value of the difference from what would be expected if the row and column variables were independent. This is generally a better way to judge associations between variables than the pairwise tests shown in the Excel output: it shows the pattern of association visually and avoids distracting users with isolated, noisy cells in a table.

The new column t-tests between percentages in columns *i* and *j* are based on the following test statistic:

*t* = (p[i] − p[j]) / √[ (p[i](1 − p[i]) / n[i]) + (p[j](1 − p[j]) / n[j]) ]

where p[i] is the (weighted) proportion in the cell in column *i* and n[i] is the (unweighted) number of cases in column *i*. The test statistic is compared to a reference *t*-distribution with degrees of freedom equal to n[i] + n[j] − 2.
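This statistic can be sketched in a few lines of Python (an illustration, not Crunch's implementation), using the Republican row of the example table above. With degrees of freedom in the hundreds, the two-tailed .05 critical value of the *t*-distribution is close to the normal value of 1.96.

```python
import math

def column_t(p_i, n_i, p_j, n_j):
    """t statistic for comparing two column proportions:
    weighted proportions p, unweighted column bases n."""
    se = math.sqrt(p_i * (1 - p_i) / n_i + p_j * (1 - p_j) / n_j)
    return (p_i - p_j) / se

# Republican row: 45-64 (C: 30%, n=450) vs 30-44 (B: 21%, n=293)
t_cb = column_t(0.30, 450, 0.21, 293)   # about 2.80 -> rejected at .05
# Republican row: 45-64 (C) vs under 30 (A: 24%, n=275)
t_ca = column_t(0.30, 450, 0.24, 275)   # about 1.79 -> not rejected
```

The two results reproduce the labels in the example table: column C differs significantly from column B (|*t*| > 1.96) but not from column A.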