Curious, and a little off-topic for the post (but related): is there a way to detect speakers with Whisper, or with a combination of models, similar to Descript?
No, speaker diarization is not part of Whisper. There are open source projects such as Kaldi [1], but they are hard to get running if you are not an expert in the area.
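If you do get a separate diarization tool running, the "combination of models" part is mostly timestamp matching: Whisper gives you timed text segments, the diarizer gives you timed speaker turns, and you label each segment with the speaker whose turn overlaps it most. Here is a rough sketch of that idea. The `turns` format, the speaker labels, and the `assign_speakers` helper are all made up for illustration, and the sample data is invented; the segment dicts only assume `start`, `end`, and `text` keys like the ones in Whisper's transcription output.

```python
from typing import Dict, List, Tuple

# Hypothetical input shapes (assumptions, not any library's API):
#   segments: dicts with "start", "end" (seconds) and "text"
#   turns:    (start, end, speaker_label) tuples from some diarization tool
Segment = Dict[str, object]
Turn = Tuple[float, float, str]


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: List[Segment], turns: List[Turn]) -> List[Segment]:
    """Label each segment with the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        best_speaker = "unknown"  # kept if no turn overlaps the segment
        best_overlap = 0.0
        for t_start, t_end, speaker in turns:
            ov = overlap(float(seg["start"]), float(seg["end"]), t_start, t_end)
            if ov > best_overlap:
                best_overlap = ov
                best_speaker = speaker
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


# Invented sample data, only to show the matching step.
segments = [
    {"start": 0.0, "end": 2.5, "text": "Hi, thanks for joining."},
    {"start": 2.5, "end": 6.0, "text": "Happy to be here."},
    {"start": 10.0, "end": 11.0, "text": "(background noise)"},
]
turns = [(0.0, 2.8, "SPEAKER_00"), (2.8, 6.5, "SPEAKER_01")]

labeled = assign_speakers(segments, turns)
for seg in labeled:
    print(f'[{seg["speaker"]}] {seg["text"]}')
```

This does nothing clever about overlapping speech or segments that straddle a speaker change; it just picks the largest overlap, which is probably good enough for a first pass on clean two-person audio.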