Can someone clarify for me how the training and testing sets were constructed? One problem is that cancerous and benign skin are unbalanced in a representative population. How was this imbalance handled in testing? How was the testing set constructed? And so on.
For each of the 3 tests, the test sets were biopsy-confirmed: images were randomly selected, then blurry images were filtered out by a separate dermatologist. The ratios Benign:Malignant were 70:65, 97:33, and 40:71 respectively.
These close-to-even ratios make for a more powerful test of classification. I would assume that the fact that these test samples have biopsy data means that some dermatologist thought they might be malignant (unnecessary medical operations are unethical). This might bias the test sets towards samples that are difficult for humans to diagnose.
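To put the ratios above in context, here is a rough sketch of how the malignant share of each test set compares with a low-prevalence population. The 90% sensitivity and specificity and the 1% prevalence are made-up illustration values, not numbers from the paper, and the set labels are just placeholders.

```python
# Benign:malignant counts for the three test sets, as quoted above.
# The labels "test 1".."test 3" are placeholders, not names from the paper.
test_sets = {
    "test 1": (70, 65),
    "test 2": (97, 33),
    "test 3": (40, 71),
}


def malignant_fraction(benign, malignant):
    """Share of malignant cases in a set (its prevalence)."""
    return malignant / (benign + malignant)


def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# HYPOTHETICAL operating point, chosen only for illustration.
SENS, SPEC = 0.90, 0.90

for name, (benign, malignant) in test_sets.items():
    frac = malignant_fraction(benign, malignant)
    print(f"{name}: malignant fraction {frac:.2f}, "
          f"PPV {ppv(SENS, SPEC, frac):.2f}")

# HYPOTHETICAL screening prevalence of 1%, also only for illustration.
print(f"PPV at 1% prevalence: {ppv(SENS, SPEC, 0.01):.2f}")
```

Sensitivity and specificity do not depend on the class ratio, so a near-balanced test set measures them efficiently. The positive predictive value does depend on it, which is why results on these test sets would not carry over directly to a representative population.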
Splitting the task into binary classifications between specific tumor types also makes it easier than choosing among every possible tumor type (as a dermatologist does).
Still, the claims this paper makes are very promising. A lot of the training data was labeled by dermatologists, not by biopsy. Using more biopsy data could lead to even better classification, as well as improvements to the model.