There are several common fallacies (errors of reasoning) related to the use of NHST (null hypothesis significance testing) statistics. One of them, the NHST subgroup fallacy, involves analyzing the association between a predictor and an outcome separately within different subgroups. If the association has a PUA (p under alpha) in one subgroup but not the other, it is concluded that the association is present only in the first subgroup. This conclusion is not warranted: to claim that an association differs between subgroups, one must test for an interaction effect directly. This is illustrated here by simulating a large number of studies. Inspired by this real paper making the error, we look at the effects of relationships to one's parents on some kind of bad outcome. For each simulated study, we split the sample in two by gender and examine the effects of the two parental predictors within each half.
The default parameters are designed to reflect typical studies of this sort. After examining the results under the defaults, try changing the parameters to see how this affects the results.
The plot below displays the proportion of simulated studies that found a PUA. As in many published studies, models are fit separately by gender. We also include a full model with interaction terms, which is the proper way of establishing that a predictor works differently for each gender. Note that the full model also includes a gender main effect, because including higher-order terms without their lower-order components can lead to odd or misleading results.
The plot below shows the proportion of studies where a predictor is PUA for one gender but not the other. This is directly related to the power of the design: when power is around 50%, such discrepancies happen about 50% of the time. Since there is no actual interaction effect under the default settings, this amounts to a false positive rate of about 50% for claims of gender differences.
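A rough back-of-the-envelope calculation illustrates why (assuming, for simplicity, that the two subgroup tests are independent and have equal power):

```r
# Illustrative only: if the test for a given predictor has power `p` within
# each gender, and the two subgroup tests are independent, the probability of
# a PUA in exactly one of the two subgroups is 2 * p * (1 - p).
p <- 0.5
2 * p * (1 - p)  # = 0.5, i.e. about 50% apparent "gender differences"
```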
The plot shows the distribution of effect sizes estimated by the models. Note that the models estimate effect sizes in logits, that is, as log odds ratios. For this reason, the numbers in the left menu bar do not correspond to the numbers in the plot, but relative comparisons are still possible.
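As a quick illustration of the scale (using a built-in toy dataset unrelated to the simulator), the coefficients of any fitted logistic model can be exponentiated to get odds ratios:

```r
# Toy example (mtcars data, unrelated to the simulator): coefficients from a
# logistic regression are on the logit scale, i.e. log odds ratios.
fit <- glm(am ~ wt, family = binomial, data = mtcars)
coef(fit)       # log odds ratios (logits)
exp(coef(fit))  # odds ratios
```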
Note that gender is coded with male as the reference class. This affects the sign of the interaction coefficients.
This tab contains details about how the simulation works and where one can learn more.
The simulator generates a sample for each replication. To generate a sample, a random vector of 'male' and 'female' strings is drawn, with 50% of each gender. Each person is then assigned either a good or bad relationship to each of their parents at random, each with 50% probability. After this, each person is assigned a probability of having the bad outcome: the base rate plus the effect of each cause that applies to them. In the default scenario, the base rate is 20% and each bad parental relationship adds another 20%, giving a range of 20-60%. If gender effects or interactions are present, these are added as well. Before an outcome is assigned to each person, a validity check is done to see whether any probability value falls below 0% or above 100%; if so, an error is returned. Otherwise, an outcome is generated for each person using the probability calculated before.
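A minimal sketch of this data-generating step in R (the function and variable names are illustrative, not the simulator's actual code):

```r
# Sketch of the data-generating step described above (illustrative names,
# not the simulator's actual code; gender and interaction effects omitted).
simulate_sample <- function(n = 200, base_rate = 0.20, effect = 0.20) {
  d <- data.frame(
    # male as the reference class, matching the simulator
    gender     = factor(sample(c("male", "female"), n, replace = TRUE),
                        levels = c("male", "female")),
    bad_mother = rbinom(n, 1, 0.5),
    bad_father = rbinom(n, 1, 0.5)
  )

  # Each person's probability of the bad outcome: base rate plus one
  # increment per bad parental relationship.
  p <- base_rate + effect * d$bad_mother + effect * d$bad_father

  # Validity check: probabilities must stay inside [0, 1].
  if (any(p < 0 | p > 1)) stop("Some outcome probabilities fall outside 0-100%.")

  d$bad_outcome <- rbinom(n, 1, p)
  d
}
```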
After generating the samples, three logistic regression models (using glm) are fitted for each sample: one for males only, one for females only, and one for both genders together. The first two models contain only the parental predictors, while the last also contains a gender predictor and two interaction terms.
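Assuming a data frame like the one sketched above, the three models could be fit roughly like this (again an illustration, not the simulator's exact code):

```r
d <- simulate_sample(n = 500)

# Subgroup models: parental predictors only, fit separately by gender.
fit_male   <- glm(bad_outcome ~ bad_mother + bad_father,
                  family = binomial, data = subset(d, gender == "male"))
fit_female <- glm(bad_outcome ~ bad_mother + bad_father,
                  family = binomial, data = subset(d, gender == "female"))

# Full model: gender main effect plus the two gender-by-parent interactions.
fit_full <- glm(bad_outcome ~ (bad_mother + bad_father) * gender,
                family = binomial, data = d)

summary(fit_full)  # the interaction rows test whether effects differ by gender
```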
Information is then extracted from the models and displayed in the plots.
The source code for this simulator can be found on GitHub.
Errors in reasoning related to the use of NHST are well known. In fact, many authors have called for the abolition of NHST entirely, advocating instead confidence intervals, meta-analysis, Bayesian statistics, or some combination of these.