Technical Assumptions of Tests of Statistical Significance

Statistical tests are educated guesses. Based on incomplete data - that is, data from only a subset of the population - they seek to draw conclusions. The incompleteness of the data guarantees that statistical tests will sometimes lead to the wrong conclusion, particularly in situations where there is little data or the differences being examined are small (i.e., to use the jargon, the tests are particularly inaccurate when power is low). Nevertheless, statistical tests are in widespread use because, when conducted correctly, they are the most educated form of guess that is possible.

All statistical tests make a large number of assumptions. When these assumptions are not satisfied, the conclusions from statistical testing become less reliable. The more egregious the violation of the assumptions, the less accurate the conclusions.

Null hypothesis

All statistical tests require a null hypothesis (see Formal Hypothesis Testing). From time to time the null hypotheses that are used in statistical tests are not sensible, and the consequence is that the p-values are not meaningful. As an example, consider the table below, which is a special type of table known as a duplication matrix. This table has the same data shown in both the rows and the columns. Thus, the first column shows that 100% of people that consume Coke consume Coke (not surprisingly), 14% of people that consume Coke also consume Diet Coke, 25% of people that consume Coke also consume Coke Zero, etc.

[Figure: Duplication matrix, with the same brands shown in both the rows and the columns.]

The arrows on the table show the results of statistical tests. In all of these tests, the null hypothesis is independence between the rows and the columns (see Statistical Tests on Tables for a description of what this null hypothesis entails, although the description is non-technical and the word independence is not used). However, this assumption is clearly not appropriate: the same data is shown in the rows and the columns, so they cannot be considered independent in any meaningful sense, and a different null hypothesis is required.

To appreciate how the incorrect null hypothesis renders the significance tests meaningless, focus on the top-left cell. It suggests that the finding that 100% of people that consume Coke also consume Coke is significantly high. However, this is a logically necessary conclusion and thus cannot, in any sense, be considered significant (i.e., the only possible value is 100%, so the only sensible null hypothesis is that this value is 100%, and the test should not be significant). All the other tests in this table are also wrong. The ones on the main diagonal are wrong for the reason just discussed. The other tests are wrong because the assumption of independence is not sensible in duplication matrices, as buying of one brand is typically correlated with buying of other brands[1].
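
To make the problem concrete, the sketch below simulates correlated brand consumption data (all numbers and brand assignments are hypothetical) and builds a small duplication matrix. The diagonal is 100% by construction, so a null hypothesis of independence can never be sensible for those cells.

```python
# A minimal sketch of why independence tests fail on a duplication matrix.
# The data are simulated and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Simulate correlated brand consumption: a shared "base" term means that
# consuming one brand raises the chance of consuming the other, as is
# typical in repeat-buying data.
base = rng.random(n)
coke = (base + rng.random(n) > 0.9).astype(int)
diet_coke = (base + rng.random(n) > 1.3).astype(int)

consumption = np.column_stack([coke, diet_coke])
brands = ["Coke", "Diet Coke"]

# Duplication matrix: cell (i, j) = % of consumers of brand j who also
# consume brand i. The diagonal is 100% by definition, so a test whose
# null hypothesis is independence is meaningless there.
for i, row_brand in enumerate(brands):
    for j, col_brand in enumerate(brands):
        pct = 100 * consumption[consumption[:, j] == 1, i].mean()
        print(f"{row_brand} given {col_brand}: {pct:.0f}%")
```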

Alpha (significance level or cut-off)

The key technical output of a significance test is the p-value (see Formal Hypothesis Testing). This p-value is then compared to some pre-specified cut-off, which is usually called [math]\displaystyle{ \alpha }[/math] (the Greek letter alpha). For example, most studies use a cut-off of 0.05 and conclude that a test is significant if [math]\displaystyle{ p \le \alpha }[/math].

Having a standard rule such as this gives a veneer of rigor, as it is a transparent and non-subjective process. Unfortunately, when statistical tests are an input into real-world decision making it is generally not ideal to use such a simple process. Rather, it is better to take into account the costs and risks associated with an incorrect conclusion.

Consider a simple problem like a milk company deciding whether or not to change the color of its milk package from white to blue. A study may find that a small increase in sales results if the color change is made. However, the resulting p-value may be 0.06. Thus, using the 0.05 cut-off, the conclusion would be that color makes no difference. However, if it costs the company nothing to make the change, then there is no downside, and the company is better off making the change: perhaps the new packaging will have no impact, in which case nothing is lost, but there is also the possibility that it will produce a small increase in sales. Now suppose instead that the p-value is 0.04 but that the packaging change will cost the company millions of dollars. In that situation it is likely best to conclude that there is no significant effect, even though the p-value is below the 0.05 cut-off, as there remains a non-trivial chance that the company could spend millions of dollars for no gain (the better course of action is for the company to increase the sample size of the study and see whether a smaller p-value results).

The academic discipline of decision theory focuses on the question of how to weigh up risks and costs in such situations.
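
As an illustrative sketch of this idea, the expected payoff of a decision can be computed as the probability-weighted gain minus the cost. All probabilities and dollar amounts below are hypothetical inputs; in particular, the probability that the effect is real is not 1 minus the p-value and would in practice come from a Bayesian analysis or judgment.

```python
# A minimal sketch of a decision-theoretic calculation, with entirely
# hypothetical numbers.
def expected_value(prob_effect_real, gain_if_real, cost_of_change):
    """Expected payoff of making the change."""
    return prob_effect_real * gain_if_real - cost_of_change

# Scenario 1: cheap change (p = 0.06, but no downside) -> make the change.
print(expected_value(prob_effect_real=0.5, gain_if_real=200_000,
                     cost_of_change=0))          # 100000.0
# Scenario 2: expensive change (p = 0.04, millions at risk) -> do not.
print(expected_value(prob_effect_real=0.6, gain_if_real=500_000,
                     cost_of_change=2_000_000))  # -1700000.0
```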

Data collection process

Most statistical tests make an assumption known as simple random sampling (see Formal Hypothesis Testing for an example and definition).

Simple random sampling is a type of probability sampling, which is a catch-all term for samples where everybody in the population has the potential to be included in a study, the probability of each person being included is known, and the mechanism by which people are included is well understood. Simple random sampling is the simplest type of probability sample. There are many others, such as cluster random sampling and stratified random sampling. When these other forms of sampling are used, different formulas are needed to compute statistical significance (the standard formulas taught in introductory statistics courses all assume the data is from a simple random sample).[2] In general, if a test assumes simple random sampling but one of these other probability sampling methods better describes the sampling mechanism, the computed p-value will be smaller than the correctly-computed p-value, more results will be concluded to be significant, and the rate of false discoveries will increase (there are some situations where alternatives to simple random sampling can push p-values in the other direction, but the nature of commercial research makes this possibility rare enough to be ignored).
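
For example, a common first-order adjustment for cluster sampling is Kish's design effect, deff = 1 + (m - 1)rho, where m is the average cluster size and rho is the intra-cluster correlation. The sketch below, with a hypothetical cluster size and intra-cluster correlation, shows how the simple-random-sampling standard error understates the true uncertainty:

```python
# A minimal sketch of the design effect for a clustered sample.
# The cluster size and intra-cluster correlation are hypothetical.
import math

p = 0.40     # observed proportion
n = 1000     # total sample size
m = 20       # average cluster size (hypothetical)
rho = 0.05   # intra-cluster correlation (hypothetical)

se_srs = math.sqrt(p * (1 - p) / n)   # standard SRS formula
deff = 1 + (m - 1) * rho              # Kish's design effect
se_clustered = se_srs * math.sqrt(deff)

print(f"SRS standard error:  {se_srs:.4f}")
print(f"Cluster-adjusted SE: {se_clustered:.4f}")
# Using the SRS formula on clustered data understates the standard error,
# so the computed p-value is too small and false discoveries become more
# likely.
```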

Probability samples never occur in the real world. Only a tiny fraction of people are ever really available to participate in surveys; the rest are illiterate, in prison, unwilling to participate, too busy, not contactable, etc. Consequently, it is important to keep in mind that computed p-values are always rough approximations based upon implausible assumptions. However, without making these implausible assumptions there is no way of drawing any conclusion at all, so the orthodoxy is to make such untestable assumptions but to proceed with a degree of caution. Nevertheless, it does not follow that, because no sample is ever really a probability sample, all samples are equally useful. The further a sample is from being a probability sample, the more dangerous it is to treat it as one.

Statistical distributions

The mathematics used to compute the p-value involves various distributional assumptions. For example, t-tests assume that the data are independent and identically distributed draws from a normal distribution (or "i.i.d. normal" for short).
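
For instance, here is a minimal sketch of a one-sample t-test on simulated data (the data and null value are hypothetical). The p-value it reports is derived under the i.i.d. normal assumption:

```python
# A minimal sketch of a one-sample t-test; the p-value is exact only if
# the data really are i.i.d. draws from a normal distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=5.2, scale=1.0, size=40)  # simulated ratings

# Null hypothesis: the population mean is 5.0.
t_stat, p_value = stats.ttest_1samp(data, popmean=5.0)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```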

It is helpful to disentangle such assumptions into their two parts: the i.i.d. assumption and the specific distribution that is assumed.

The i.i.d. assumption

A read of the fine print of most statistical tests reveals assumptions relating to the data being independently and identically drawn from a distribution. This is generally just a different way of restating the assumption that the data is a simple random sample (discussed earlier on this page). Where this assumption is not satisfied there are alternatives, provided that the data can be viewed as something of a probability sample, but they are technical and not available in most programs used for survey analysis (see [3][4][5] for more information).

The specific distribution

Most statistical tests make explicit assumptions about the specific distributions. Statisticians broadly group tests into two groups: parametric tests, which make relatively strong assumptions about the distributions, and non-parametric tests, which make weaker assumptions.

Parametric tests

Parametric tests are derived from assumptions about the distribution of the data. Most commonly used parametric tests assume that the data is normal or binomial, but there are dozens of other distributions assumed by more exotic tests. Where data is assumed to be binomial, as occurs in many tests of proportions, the assumption is almost unbreakable provided that the i.i.d. assumption is met (when the i.i.d. assumption is met, the data is binomial by definition). In the case of the assumption of normality, however, the assumption can play a material role.

The most widely used tests in survey research - t-tests - assume that the data, or, in some cases, the residuals, are drawn from a normal distribution. It is rarely the case that survey data complies with this assumption; it is more common for the data to follow other distributions (e.g., NBD) and/or contain outliers. Where sample sizes are 'small', the failure of the data to meet the assumption can make the p-values misleading. Fortunately, due to a very helpful result known as the central limit theorem, with large sample sizes this assumption is usually not a problem. That is, many tests which assume normality - such as most z-tests and t-tests - can accurately compute the p-value even when the data is not remotely normal, provided that the sample size is large. Unfortunately, there is no good guidance as to how large a sample needs to be before the assumption of normality can be ignored. It is common to read guidelines suggesting that samples as small as 10, 20, 25 or 30 can be sufficiently large, but it is not so simple: in samples in the thousands, departures from normality can still make a difference (e.g., heavily skewed data). Perhaps the key thing to keep in mind is that with samples larger than 30 the difference between the computed p-value and the correct p-value is unlikely to be large. (That is, it is routine to treat samples of 30 and above as being sufficiently large to make the assumption of normality irrelevant.)
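
A simple way to see the central limit theorem at work is to simulate heavily skewed data under a true null hypothesis and count how often the t-test falsely rejects at the 0.05 level. In the sketch below (simulated exponential data, so far from normal), the type I error rate approaches 0.05 as the sample size grows:

```python
# A minimal sketch: estimate the t-test's type I error rate on skewed
# (exponential) data under a true null, for several sample sizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_mean = 1.0  # the mean of an Exponential(scale=1) distribution

for n in (10, 30, 1000):
    n_sims = 2000
    rejections = 0
    for _ in range(n_sims):
        sample = rng.exponential(scale=1.0, size=n)  # skewed, not normal
        _, p = stats.ttest_1samp(sample, popmean=true_mean)
        rejections += p <= 0.05
    # An exact test would reject about 5% of the time for every n; the
    # larger n is, the closer the t-test gets, thanks to the CLT.
    print(f"n = {n:4d}: type I error rate ~ {rejections / n_sims:.3f}")
```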

Nonparametric tests

Nonparametric tests make milder distributional assumptions. The most common assumption is that of finite expectations. This is a very technical assumption and is, in all likelihood, always satisfied in survey research.

Other assumptions tend to depend upon the specific test. For example, simpler non-parametric tests used for testing differences in means and medians, such as the Kruskal-Wallis test, often assume that the data contains no ties. Generally, where such assumptions are not met there are alternative variants of the tests which do not make them (but instead assume that the sample is large). Most software programs either default to these safer variants or switch to them when the no-ties assumption is not met.
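
As an illustration, the sketch below applies the Kruskal-Wallis test to simulated 5-point rating data, which is full of ties; SciPy's implementation applies a tie correction and a large-sample chi-squared approximation rather than assuming no ties:

```python
# A minimal sketch of a nonparametric test on tie-heavy rating data.
# The two groups are simulated and purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Simulated 5-point ratings for two hypothetical brands (many ties).
brand_a = rng.integers(1, 6, size=100)
brand_b = rng.integers(2, 6, size=100)

h_stat, p_value = stats.kruskal(brand_a, brand_b)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```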

Some non-parametric tests make even more exotic assumptions. The Wilcoxon Signed-Rank Test, for example, makes a symmetry assumption.

Other than the question of ties, it is routine for researchers to proceed as if nonparametric tests make no assumptions (other than i.i.d.). There is no evidence to suggest that this practice is routinely dangerous.

Sample size

It is common practice for commercial researchers to have rules of thumb regarding sample size. For example, various researchers do not test for statistical significance when samples are less than 30, 50 or 100. Such rules of thumb have no formal justification. All tests of statistical significance explicitly take the sample size into account, and many function quite adequately with very small sample sizes. The only times to be particularly concerned about sample size are when there is a problem regarding distributional assumptions:

  • As discussed earlier, assumptions regarding the specific distribution, such as whether it is normal or not, become key determinants of the accuracy of p-value computations when smaller sample sizes are used.
  • Deviations from simple random sampling (and, consequently, the i.i.d. assumption) can be particularly problematic with very small samples.

The key thing to keep in mind as regards sample size is that even if the sample size is small and the various assumptions are not met, it is still better to do a test with many violated assumptions than to do no test at all: when no test is conducted, a conclusion will still be reached about whether the difference is meaningful, so it is better to perform an inexact computation than to guess.

Number of comparisons

Traditional tests of significance assume that only a single test is conducted in a study. This is rarely a good assumption. When it is not satisfied, the p-values computed by the tests are smaller than they should be. See Multiple Comparisons (Post Hoc Testing) for more information.
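
A minimal sketch of the problem, and of the simplest remedy, the Bonferroni correction (the p-values below are hypothetical):

```python
# A minimal sketch of multiple-comparison inflation and the Bonferroni
# correction, using hypothetical p-values.
p_values = [0.003, 0.020, 0.041, 0.300, 0.800]
alpha = 0.05
m = len(p_values)

# Chance of at least one false discovery across m independent true nulls:
familywise_error = 1 - (1 - alpha) ** m
print(f"Family-wise error rate with {m} tests: {familywise_error:.2f}")

# Bonferroni: compare each p-value against alpha / m instead of alpha.
for p in p_values:
    verdict = "significant" if p <= alpha / m else "not significant"
    print(f"p = {p:.3f}: {verdict}")
```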

The absence of other information

All the commonly used statistical tests assume that the only information available for performing the test is the data used in the test itself. Where there is other information, such as other studies or theory, this assumption is incorrect and the resulting p-values are also incorrect. The nature of the error is that where the p-value leads to a conclusion that is contrary to the other evidence, the p-value is smaller than it should be, and vice versa. Bayesian statistics provides tools for incorporating other information into statistical testing.
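
As a sketch of the Bayesian approach, a test of a proportion can fold in prior studies via a conjugate Beta prior. All priors, data and thresholds below are hypothetical:

```python
# A minimal sketch of Bayesian updating for a proportion, with
# hypothetical prior and data.
from scipy import stats

# Prior from earlier studies: proportion centered around 0.30.
prior_a, prior_b = 30, 70        # Beta(30, 70) prior
successes, trials = 45, 100      # new study data

# Conjugacy: posterior is Beta(prior_a + successes, prior_b + failures).
posterior = stats.beta(prior_a + successes, prior_b + (trials - successes))

# Posterior probability that the proportion exceeds, say, 0.35.
print(f"P(proportion > 0.35 | data, prior) = {1 - posterior.cdf(0.35):.3f}")
```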

Additional assumptions of specific tests

Z-Tests of Proportions

The proportions being tested are greater than 0 and less than 1.

Z-Tests of Means

The standard deviation has been measured without error.

t-Tests of Proportions

  • The proportions being tested are greater than 0 and less than 1.
  • That the variance is estimated (this assumption is rarely correct but is trivial in its implications).

t-Tests with unequal variance

The variance of each group is greater than 0.

Dependent samples tests (e.g., Quantum, Survey Reporter)

Any missing data is Missing Completely At Random (MCAR).

ANOVA-based Multiple Comparison Procedures (e.g., Fisher LSD, Duncan, Tukey HSD, Newman-Keuls, Dunnett)

Data is from a normal distribution with equal variance in each group, or the sample size is large. Note that this assumption is not made by Bonferroni and False Discovery Rate corrections. (See Multiple Comparisons (Post Hoc Testing) for more about these methods.)
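
For reference, a minimal sketch of Tukey's HSD on simulated normal data with equal group variances (this uses scipy.stats.tukey_hsd, available in SciPy 1.8 and later; the groups are hypothetical):

```python
# A minimal sketch of Tukey's HSD on simulated data that satisfies the
# assumption of normality with equal variance in each group.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
group_a = rng.normal(10.0, 2.0, size=50)
group_b = rng.normal(10.5, 2.0, size=50)
group_c = rng.normal(12.0, 2.0, size=50)

result = stats.tukey_hsd(group_a, group_b, group_c)
print(result)  # pairwise mean differences with adjusted p-values
```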

References

  1. Ehrenberg, A. S. C. (1988). Repeat Buying: Facts, Theory and Applications. New York, Oxford University Press.
  2. Cochran, W. G. (1977). Sampling Techniques, Third Edition. New York, Wiley.
  3. White, H. (2001). Asymptotic Theory for Econometricians, Revised Edition. San Diego, Academic Press.
  4. Cochran, W. G. (1977). Sampling Techniques, Third Edition. New York, Wiley.
  5. Wooldridge, J. M. (2001). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA, MIT Press.