Hacker News

There's no algorithmic bias without raw data bias in the first place.


Maybe? I've always found the ideas on algorithmic bias lacking formalization, and I'm not really sure what data bias or algorithmic bias really means. I think "data bias" means non-representative data and "algorithm bias" means classifier accuracy differing between two groups, but under that definition algorithm bias can exist without data bias.

Example: I might have a representative set of faces, of which perhaps 1% are wearing sunglasses. There's no particular reason to believe that a face detection classifier optimized over this set performs as well on sunglass-wearing faces as on those without, given the low frequency of sunglasses. (I might get a better ROC curve if I only consider faces with eyes visible.)

In a sense the data isn't biased, but there is an algorithm bias between the groups.
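A toy simulation of this point (all numbers and the scoring model here are hypothetical, not anything from the thread): a detector outputs a confidence score per image, sunglasses slightly depress the score for real faces, and we pick the single threshold that maximizes global accuracy. Per-group recall then diverges even though the data reflects the real-world 1% rate.

```python
import random

random.seed(0)

# Hypothetical detector: faces score around 0.7, non-faces around 0.3;
# sunglasses occlude the eyes and knock the score down for real faces.
def score(is_face, sunglasses):
    s = random.gauss(0.7 if is_face else 0.3, 0.1)
    if is_face and sunglasses:
        s -= 0.25
    return s

data = []
for _ in range(10000):
    is_face = random.random() < 0.5
    sunglasses = is_face and random.random() < 0.01  # ~1% of faces
    data.append((is_face, sunglasses, score(is_face, sunglasses)))

def recall(items, thresh):
    scores = [s for f, g, s in items if f]
    return sum(s >= thresh for s in scores) / len(scores)

# Choose the threshold that maximizes overall accuracy on the full,
# "representative" set.
best = max((t / 100 for t in range(100)),
           key=lambda t: sum((s >= t) == f for f, g, s in data))

plain = [(f, g, s) for f, g, s in data if f and not g]
shades = [(f, g, s) for f, g, s in data if f and g]
print(recall(plain, best), recall(shades, best))
```

The globally optimal threshold lands roughly between the face and non-face score distributions, so recall on the sunglasses subgroup comes out far lower than on the rest, with no non-representative data anywhere.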


In your example you can easily avoid this issue with k-fold sampling. It's not an algorithmic bias but rather a training data bias.


I'm not following why cross validation helps here. One group (sunglasses) is simply much rarer so my algorithm may globally perform better on the overall set by accepting lower recall on the rarer subgroup (allowing for higher precision on the much larger group).

This isn't a hypothetical example: many years back I improved the F-score of a face detector I was working on by adding eye detection and skin-color detection to the classification pipeline. I assume automated systems could produce similar effects, tanking recall on a small subgroup to buy better precision on the larger one.
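The arithmetic behind that tradeoff can be sketched with made-up counts (these numbers are illustrative only, not from the anecdote): adding an "eyes visible" gate drops most false alarms, which raises precision and the overall F1 score, even though it zeroes out recall on the sunglasses subgroup.

```python
# Hypothetical population: 990 plain faces, 10 sunglasses faces,
# plus 200 non-face windows the detector might fire on.

# Base detector: catches 950 plain faces, 8 sunglasses faces, 40 false alarms.
tp_plain, tp_shades, fp = 950, 8, 40
precision = (tp_plain + tp_shades) / (tp_plain + tp_shades + fp)
recall_all = (tp_plain + tp_shades) / 1000

# With an "eyes visible" gate: false alarms drop to 5, but every
# sunglasses detection is rejected too.
tp_plain2, tp_shades2, fp2 = 950, 0, 5
precision2 = (tp_plain2 + tp_shades2) / (tp_plain2 + tp_shades2 + fp2)
recall2 = tp_plain2 / 1000

def f1(p, r):
    return 2 * p * r / (p + r)

print(f1(precision, recall_all), f1(precision2, recall2))
```

Global F1 improves while recall on the minority subgroup goes to zero, which is exactly the effect an automated optimizer chasing a single aggregate metric could produce.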


> One group (sunglasses) is simply much rarer so my algorithm may globally perform better on the overall set by accepting lower recall on the rarer subgroup (allowing for higher precision on the much larger group).

I see what you mean. Well then my answer is still the same: if you have unbalanced input data, that is what causes the bias at the classifier level, not the classifier model/algorithm itself.


I'm still confused about the semantics here. My data in this case is broadly representative of the real-world input, so what's wrong?

This is where I'm confused on the general idea of "algorithmic bias". Any maximally optimized classifier over the population may perform worse on any minority group (broadly defined).

Representative data is necessary for the best overall performance (otherwise you will underperform in the real world), but it doesn't solve the problem of, say, relatively reduced recall on a minority group (which is what these posts often discuss).

In fact, given that groupings are arbitrary, I don't think this is a solvable problem.




