Multiple comparisons

The t-test (paired and unpaired), the Wilcoxon tests (paired and unpaired), and the McNemar test only work for comparing 2 groups/sets of data.

If you have more than two groups, you have a multiple comparisons problem.
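
To see why: if you run k independent tests, each at level 0.05, the probability of at least one false positive (the family-wise error rate) grows as 1 - 0.95^k. A quick illustration in R:

    ## family-wise error rate for k independent tests at alpha = 0.05
    alpha <- 0.05
    k <- c(1, 3, 6, 10)          # 3 groups need 3 pairwise tests, 4 groups need 6, ...
    round(1 - (1 - alpha)^k, 3)  # 0.050 0.143 0.265 0.401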

1 Omnibus tests

Null hypothesis - all sets come from the same population.

If the null is rejected (p-value < 0.05), then it is not true that all sets came from the same distribution, but we do not know which one/ones are different and which ones are not significantly different.

  • Omnibus test for parametric non-paired data - ANOVA
  • for parametric paired data - repeated measures ANOVA
  • for non-parametric non-paired data - Kruskal-Wallis
  • for non-parametric paired data - Friedman test

In the statistical literature, paired data is also called a within-subjects design and non-paired data a between-subjects design.
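
A minimal sketch of the four omnibus tests in base R, on a made-up toy data set (the names d, y, g, and b are all hypothetical: data frame, measurement, group, and subject/block):

    ## toy data: 10 subjects (b), 3 groups/treatments (g), one measurement (y)
    d <- data.frame(y = rnorm(30),
                    g = factor(rep(c("t1", "t2", "t3"), each = 10)),
                    b = factor(rep(1:10, times = 3)))
    summary(aov(y ~ g, data = d))             # ANOVA (parametric, non-paired)
    summary(aov(y ~ g + Error(b), data = d))  # repeated measures ANOVA (parametric, paired)
    kruskal.test(y ~ g, data = d)             # Kruskal-Wallis (non-parametric, non-paired)
    friedman.test(y ~ g | b, data = d)        # Friedman (non-parametric, paired)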

2 Post hoc tests

Pairwise comparisons (using the appropriate 2-group tests) plus p-value corrections (because of the multiple comparisons).

There are many p-value correction methods:

  • Bonferroni
  • Holm, and others
  • FDR methods - control only the false discovery rate (the expected proportion of false positives among the rejections)

Use Holm.
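
In R, p.adjust applies the corrections to a vector of raw p-values, and the pairwise.* functions bundle the 2-group tests with a correction (reusing the toy data frame d from the sketch above):

    p.adjust(c(0.01, 0.02, 0.03, 0.04), method = "holm")      # 0.04 0.06 0.06 0.06
    pairwise.t.test(d$y, d$g, p.adjust.method = "holm")       # parametric, non-paired
    pairwise.wilcox.test(d$y, d$g, p.adjust.method = "holm")  # non-parametric, non-paired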

There are multiple comparison procedures that result in a critical difference - a difference between means or ranks below which the difference is not significant.

  • Tukey test / Honestly Significant Difference (HSD) / Tukey-Kramer test - for non-paired parametric comparisons
  • Nemenyi test - for paired non-parametric comparisons
  • others (paired parametric or non-paired non-parametric) I do not know
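
For the first case, Tukey's HSD is in base R (again on the toy data d from above; the Nemenyi test is available in the scmamp package mentioned below):

    fit <- aov(y ~ g, data = d)  # non-paired parametric omnibus fit
    TukeyHSD(fit)                # all pairwise mean differences with Tukey-adjusted CIs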

3 Comparing your algorithm with many others (paired)

Especially relevant for Machine Learning.

Friedman + pairwise comparisons with Holm correction.
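
A sketch in R, assuming a hypothetical matrix acc with one row per dataset, one column per algorithm, and your algorithm in column 1:

    ## acc: rows = datasets, columns = algorithms (made-up accuracies)
    set.seed(1)
    acc <- matrix(runif(10 * 4, 0.7, 0.95), nrow = 10,
                  dimnames = list(NULL, c("mine", "algA", "algB", "algC")))
    friedman.test(acc)  # omnibus: rows are blocks (datasets), columns are treatments
    ## paired Wilcoxon of "mine" against each competitor, then Holm correction
    p <- sapply(2:ncol(acc),
                function(j) wilcox.test(acc[, 1], acc[, j], paired = TRUE)$p.value)
    p.adjust(p, method = "holm")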

4 Comparing a set of different algorithms (paired)

Friedman + Nemenyi = Demsar procedure

Further extensions of the Demsar procedure are implemented in the scmamp package in R.
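
A sketch with scmamp on the hypothetical acc matrix above (function names as I recall them from the package documentation - check your installed version):

    ## install.packages("scmamp")  # once
    library(scmamp)
    friedmanTest(acc)               # omnibus over all algorithms
    nemenyiTest(acc, alpha = 0.05)  # critical difference for the mean ranks
    plotCD(acc, alpha = 0.05)       # Demsar-style critical difference diagram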

5 How to show that your classifier is better than the competition (especially for the ML students)

Comparing 2 groups - not a multiple comparison problem

Cross-validation - separate the data into training and test subsets:

  • holdout
  • k fold
  • bootstrap
  • and one can repeat each of them

Usual CV setups:

  • 80/20 or 70/30 holdout
  • 5 or 10-fold
  • 100 repetitions of bootstrap
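
A minimal sketch of a 5-fold split in base R (n is a hypothetical dataset size):

    n <- 100                                   # hypothetical number of examples
    folds <- sample(rep(1:5, length.out = n))  # random assignment of examples to folds
    for (k in 1:5) {
      test_idx  <- which(folds == k)           # fold k is the test set
      train_idx <- which(folds != k)           # the other 4 folds are the training set
      ## fit the classifier on train_idx, measure accuracy on test_idx
    }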

Let us assume a 5-fold.

  • 5 test subsets (one per fold)
  • you could use a McNemar test (paired 0/1 data) but I have never seen anyone do that (maybe because McNemar tests are less powerful - last exercise!)
  • you could measure the accuracy on the 5 test sets and compare the 5 numbers using Wilcoxon - but the number of data points is too low
  • you could repeat the 5-fold 10 times and have 50 numbers to compare - this is possibly considered cheating
  • there is a problem that the measures for each fold are not really independent - any two training sets share 3 of the 5 folds, so you cannot trust statistical tests that assume the data is independent
  • one paper (by Nadeau and Bengio) proposes the corrected/correlated t-test to deal with the non-independence/correlation of the data, but I do not know of any paper that has used it (sketch below)
  • another paper (by Dietterich) proposes 5 repetitions of a 2-fold CV (called 5x2cv) - you get 10 measures of accuracy, and the author claims this balances the correlation problem against having enough data points (sketch below)
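
A sketch of the corrected (correlated) t-test, assuming d is a hypothetical vector of per-fold accuracy differences between the two classifiers over 10 repetitions of a 5-fold CV:

    ## d: per-fold accuracy differences over r repetitions of k-fold CV (made-up numbers)
    r <- 10; k <- 5
    set.seed(2)
    d <- rnorm(r * k, mean = 0.01, sd = 0.02)
    n <- length(d)
    rho <- 1 / (k - 1)  # test/train size ratio n2/n1 for k-fold CV
    t_stat <- mean(d) / sqrt((1 / n + rho) * var(d))  # correlation-corrected t statistic
    2 * pt(-abs(t_stat), df = n - 1)                  # two-sided p-value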
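
And a sketch of the 5x2cv paired t-test, where p[i, j] holds the (hypothetical) accuracy difference between the two classifiers on fold j of replication i:

    ## p: 5 replications x 2 folds of accuracy differences (made-up numbers)
    set.seed(3)
    p <- matrix(rnorm(10, mean = 0.01, sd = 0.02), nrow = 5, ncol = 2)
    pbar <- rowMeans(p)                          # mean difference per replication
    s2 <- (p[, 1] - pbar)^2 + (p[, 2] - pbar)^2  # variance estimate per replication
    t_stat <- p[1, 1] / sqrt(mean(s2))           # 5x2cv t statistic, df = 5
    2 * pt(-abs(t_stat), df = 5)                 # two-sided p-value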

Author: Jacques Wainer

Created: 2018-04-16 Mon 08:56
