I'm on mobile so I can't find the link, but years ago there was a DARPA (iirc) program trying to solve this problem in the context of surveillance in a loud, crowded room. Their conclusion was that you need n+1 microphones in the room to cleanly separate all of the sound, where n is the number of sound sources. In their case that was the number of conversations going on in the room (assuming no other loud sources like music).
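To make the microphone-count point concrete, here's a rough toy sketch (mine, not from that program). It assumes the simplest possible model: each microphone hears a plain weighted sum of the sources, with no echo or delay, and the mixing weights are known. Neither holds in a real room, and the names `sources`, `mix_and_recover`, and `n_mics` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_sources = 3      # "conversations" in the room
n_samples = 1000

# Hypothetical sources: independent random signals, one row per source.
sources = rng.laplace(size=(n_sources, n_samples))

def mix_and_recover(n_mics):
    """Mix the sources onto n_mics microphones, then invert by least squares.

    Assumes the mixing matrix is known, which it is not in a real room.
    Returns the largest absolute error between true and recovered sources.
    """
    mixing = rng.normal(size=(n_mics, n_sources))    # gain of each source at each mic
    recordings = mixing @ sources                    # what the microphones pick up
    recovered = np.linalg.pinv(mixing) @ recordings  # least-squares un-mixing
    return float(np.max(np.abs(recovered - sources)))

err_enough = mix_and_recover(n_sources + 1)   # n+1 microphones
err_too_few = mix_and_recover(n_sources - 1)  # n-1 microphones

print(f"n+1 mics: max error {err_enough:.2e}")
print(f"n-1 mics: max error {err_too_few:.2e}")
```

With fewer microphones than sources the system is underdetermined, so the recovery error stays large no matter what you do. In this idealized noiseless setup n microphones would already be enough; extra ones only start to matter once you add noise and reverberation.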
I think it's totally doable, but you'd need many more microphones to deal with real-world noise. As MEMS microphone quality improves, this should eventually be possible with a combination of smartphone, headphones, and some other device, like something worn around your neck.
Apart from the dynamic range challenges for sensing, source separation is hard. There's been a pretty long line of research into the area - see "cocktail-party problem". AFAIK it's still a mostly unsolved problem.