Artifact-based domain generalization of skin lesion models


Dataset bias prevents deep learning solutions from succeeding in the real world. Failure cases are abundant, particularly in the medical domain. Recent studies in out-of-distribution generalization have advanced considerably on well-controlled synthetic datasets, but these datasets do not represent medical imaging contexts. The problem is compounded by the difficulty of acquiring labeled medical data, which often leaves datasets small and limited to samples from a single center. In our work, we propose a pipeline that relies on artifacts to enable generalization evaluation and debiasing in the challenging context of skin lesion analysis. First, we partition the data into training and test sets with increasing levels of bias for better generalization assessment. Then, we create environments based on skin lesion artifacts to enable domain generalization methods. Finally, after robust training, we perform a test-time debiasing procedure that reduces spurious features in inference images. Our experiments show that the pipeline improves performance metrics in biased cases and, according to explanation methods, better avoids artifacts. Still, when evaluated on out-of-distribution data, these models did not prefer clinically meaningful features. Instead, performance improved only on test sets that present artifacts similar to those seen in training, suggesting that the models learned to ignore the known set of artifacts. Our results raise the concern that debiasing models with respect to a single aspect may not be enough for fair skin lesion analysis.
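
As a rough illustration of the environment-creation step, the minimal sketch below groups samples by the combination of artifacts annotated on each image, so that each group can serve as one training environment for a domain generalization method. The artifact names, the sample dictionary layout, and the `build_environments` helper are hypothetical and chosen only for this example; the actual artifact set and grouping rule used in the pipeline are not specified here.

```python
from collections import defaultdict

# Hypothetical artifact names, assumed for illustration only.
ARTIFACTS = ("hair", "ruler", "dark_corner")


def build_environments(samples):
    """Group sample ids by the combination of artifacts present in each image.

    Each sample is assumed to be a dict with an "id" and an "artifacts"
    mapping from artifact name to a boolean annotation.
    """
    envs = defaultdict(list)
    for sample in samples:
        # The environment key is the ordered tuple of artifacts that are present.
        key = tuple(a for a in ARTIFACTS if sample["artifacts"].get(a, False))
        envs[key].append(sample["id"])
    return dict(envs)


# Toy annotations, made up for this example.
samples = [
    {"id": "img_001", "artifacts": {"hair": True}},
    {"id": "img_002", "artifacts": {"hair": True, "ruler": True}},
    {"id": "img_003", "artifacts": {}},
    {"id": "img_004", "artifacts": {"hair": True}},
]

environments = build_environments(samples)
for key, ids in environments.items():
    print(key or ("no_artifact",), ids)
```

Keying environments on the full artifact combination keeps the grouping deterministic and easy to inspect. With many artifact types the number of combinations grows quickly, so a real setup would likely need to merge or drop sparsely populated groups.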

In: ISIC Skin Image Analysis Workshop at ECCV’22