Equivalence tests and effect sizes

1 Equivalence and non-inferiority tests

The goal is to show that there is no difference between two sets of measures. Usually one says that if there is no significant difference between them, then they are "equal" or "equivalent". As we discussed, this is wrong.

The p-value could be high simply because there is not enough data. Even if there is "enough" data (there is no theory on how much is enough), it is still wrong to claim equivalence from a high p-value alone.

When would you want to say that two sets are equivalent?

  • I have not seen this in computer science
  • this started in pharmacology - one wants to show that a generic medicine is as good as the brand-name one
  • this is desirable in the social sciences and education - group A performs as well as group B.

One needs to define a threshold of "irrelevance" or "practical equivalence" δ. Differences below this value are irrelevant: they do not matter and are of no practical consequence.

In Bayesian analysis this δ is called the ROPE (region of practical equivalence).

If you define the δ yourself, you must argue why you selected that value. In other areas, the community defines the δ.

1.1 Non-inferiority

A is not inferior to B, i.e., A is at least as good as B:

\(\mu(A) - \mu(B) > -\delta\)
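
A minimal sketch in base R (the vectors a and b and the margin delta are made-up placeholders, not values from the text):

  set.seed(42)
  a <- rnorm(30, mean = 10.0, sd = 2)   # measurements for A (new)
  b <- rnorm(30, mean = 10.1, sd = 2)   # measurements for B (reference)
  delta <- 1                            # non-inferiority margin
  ## H0: mu(A) - mu(B) <= -delta  vs  H1: mu(A) - mu(B) > -delta
  t.test(a, b, mu = -delta, alternative = "greater")$p.value
  ## small p-value => A is not inferior to B, within delta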

1.2 Equivalence

\(|\mu(A) - \mu(B)| < \delta\)

There are equivalence tests where the null hypothesis is that \(|\mu(A) - \mu(B)| \ge \delta\). If the p-value is low, then you can claim that A and B are equivalent.

As far as I know, there is only one parametric test: TOST (two one-sided t-tests).

In R:
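
A sketch in base R, reusing the placeholder a, b, and delta from the non-inferiority sketch above; equivalence is claimed only when both one-sided tests reject:

  ## H0a: mu(A) - mu(B) <= -delta, rejected if p1 is small
  p1 <- t.test(a, b, mu = -delta, alternative = "greater")$p.value
  ## H0b: mu(A) - mu(B) >= delta, rejected if p2 is small
  p2 <- t.test(a, b, mu = delta, alternative = "less")$p.value
  max(p1, p2)   # TOST p-value: small => A and B are equivalent

Packaged implementations also exist (for example, the TOSTER package).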

2 Effect size

A family of measures for the size of the difference between A and B

There are the non-standardized measures and the standardized (dimensionless) ones.

Standardized effect sizes can be compared across different experiments.

2.1 The d family

Cohen's d = \(\frac{\mu(A)-\mu(B)}{sd}\)

where sd is some measure of standard deviation of "both" A and B.

It measures how separated the distributions of A and B are.

[animation and figure: the separation of the two distributions]

  • sd = sd(A) = sd(B), if they are the same
  • pooled sd = \(\sqrt{\frac{(n_A - 1) var(A) + (n_B - 1) var(B)}{n_A + n_B - 2}}\), the square root of the mean of the variances of both sets, weighted by their degrees of freedom
  • sd = sd(B), when B is the "control" set. In this case the d is called Glass's delta
  • Hedges' g is a correction of Cohen's d for small data sets (see the sketch after this list).
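
A sketch of the d family in base R, reusing the placeholder vectors a and b from the sketches above (B plays the role of the control set):

  nA <- length(a); nB <- length(b)
  ## pooled sd: variances weighted by degrees of freedom
  sd_pooled <- sqrt(((nA - 1) * var(a) + (nB - 1) * var(b)) / (nA + nB - 2))
  cohen_d <- (mean(a) - mean(b)) / sd_pooled
  ## Glass's delta: standardize by the sd of the control set B
  glass_delta <- (mean(a) - mean(b)) / sd(b)
  ## Hedges' g: small-sample correction of Cohen's d
  hedges_g <- cohen_d * (1 - 3 / (4 * (nA + nB) - 9))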

2.2 Comparable across experiments

  • meta-analysis (discussed in the bibliographic search class) combines the effect sizes of different experiments into a single result
  • an effect size of 0.4 for an increase in programmer productivity can be compared to other interventions. Is this effect size large? Small?
  • meta-meta-analyses describe the range of effect sizes in different disciplines,

    e.g., Software Engineering and Education

  • there is a rule of thumb by Cohen, but every area should have its own table like this one:
    • d < 0.2: negligible
    • 0.2 ≤ d < 0.5: small
    • 0.5 ≤ d < 0.8: medium
    • d ≥ 0.8: large

2.3 The r family

r = correlation coefficient

Example:

SStotal = \(\sum_i (y_i - \bar{y})^2\), the total sum of squares (deviations from the mean \(\bar{y}\)).

For a regression \(\hat{y} = \alpha + \beta x\):

SSreg = \(\sum_i (\hat{y}_i - \bar{y})^2\)

SSreg is the part of the total variation that is due to the model, i.e., to the fact that we predict the value \(\hat{y}_i\) for each \(x_i\).

explained variance = \(r^2\) = SSreg/SStotal
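
A quick check in base R with synthetic data, verifying that SSreg/SStotal matches the \(r^2\) reported by lm:

  set.seed(1)
  x <- runif(50)
  y <- 2 + 3 * x + rnorm(50, sd = 0.5)
  fit <- lm(y ~ x)
  SStotal <- sum((y - mean(y))^2)
  SSreg <- sum((fitted(fit) - mean(y))^2)
  SSreg / SStotal            # explained variance
  summary(fit)$r.squared     # same value, computed by lm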

Other metrics, with corrections for many x's or many groups: eta squared, partial eta squared, omega squared, partial omega squared, Cohen's f.

2.4 Non-parametric measures of effect size

Common language effect size: the probability that a random sample from A will be higher than a random sample from B; 0.5 means no difference.
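
A sketch in base R, reusing the placeholder vectors a and b from above: pair every value of A with every value of B and compute the proportion of pairs in which the A value is higher, counting ties as half:

  cles <- mean(outer(a, b, ">")) + 0.5 * mean(outer(a, b, "=="))
  cles   # 0.5 means no difference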

There are many other measures of effect size (unfortunately), but there are formulas to convert from one to another (given other information on A and B), also in R.
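
One such conversion, as a sketch: the textbook approximation from d to r for two groups of the same size (for unequal sizes the constant 4 is replaced by a term that depends on the group sizes):

  d <- 0.5
  r <- d / sqrt(d^2 + 4)   # d -> r, equal group sizes
  r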

3 Confidence intervals on effect size

The effect size is a single number computed from the data, like the mean of a sample.

We can compute a confidence interval for the effect size. There are simpler methods for large amounts of data, and there are methods for small n based on "non-centrality" parameters. In R:
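
A sketch of the large-n approach in base R, reusing nA, nB, and cohen_d from the d family sketch above; the standard error formula is one common large-sample approximation (the small-n, non-centrality methods are implemented in packages such as MBESS):

  ## approximate standard error of Cohen's d, large samples
  se_d <- sqrt((nA + nB) / (nA * nB) + cohen_d^2 / (2 * (nA + nB)))
  cohen_d + c(-1.96, 1.96) * se_d   # approximate 95% CI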

If the confidence interval of the effect size includes zero, then, in the standard sense, the difference is not statistically significant. But reporting the interval is much more informative than just not publishing the paper.

4 Effect sizes for computer science/machine learning??

ATTENTION THIS IS MY SPECULATION

Accuracy is dimensionless.

But is a difference in accuracy an effect size? A gain of 0.07 does not mean the same thing at different baseline accuracies (going from 0.90 to 0.97 is not the same as going from 0.50 to 0.57).

Effect sizes for proportions (in the literature): odds ratio, log odds ratio, relative risk, Cohen's h. Accuracy is a proportion!
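
A sketch in base R (acc1 and acc2 are made-up accuracies, not results from any paper):

  acc1 <- 0.90
  acc2 <- 0.83
  odds_ratio <- (acc1 / (1 - acc1)) / (acc2 / (1 - acc2))
  log_odds_ratio <- log(odds_ratio)
  relative_risk <- acc1 / acc2
  ## Cohen's h: difference of arcsine-transformed proportions
  cohen_h <- 2 * asin(sqrt(acc1)) - 2 * asin(sqrt(acc2))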

As far as I know, only a few people treat a difference in accuracy as an effect size. I did, in a paper where I experimented with some methodological innovations regarding ML (I no longer like some of them).


Author: Jacques Wainer

Created: 2018-04-23 Mon 12:03
