
This seems like a pretty useless study: they don't collect any results from human doctors, so there is nothing to compare their GPT-4V results against.


Instead of comparing against some "average doctor", they used a few doctors as the "source of truth":

> All images were evaluated by two senior surgical residents (K.R.A, H.S.) and a board-certified internal medicine physician (A.T.). ECGs and clinical photos of dermatologic conditions were additionally evaluated by a board-certified cardiac electrophysiologist (A.H.) and dermatologist (A.C.), respectively


I think the parent comment was referring to something else.

In the paper the tasks are only completed by GPT-4V. For a valid scientific investigation, there should be a control set completed by e.g. qualified doctors. When the panel of experts does their evaluation, they should rate both sets of responses so that the difference in score can be compared in the paper.
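
To make that concrete, here is a minimal sketch of the comparison I mean, assuming the panel rates both sets of responses on the same cases using the same scale. The scores below are made-up placeholders, not data from the paper, and `mean_paired_difference` is just a name I picked for illustration.

```python
from statistics import mean

# Hypothetical panel ratings (1-5 scale) for the same five cases.
# These numbers are invented for illustration; they are NOT from the paper.
gpt4v_scores = [4, 3, 5, 2, 4]
doctor_scores = [5, 4, 4, 4, 5]  # the control arm the paper lacks


def mean_paired_difference(treatment, control):
    """Mean of the per-case (treatment - control) rating differences."""
    if len(treatment) != len(control):
        raise ValueError("both arms must cover the same cases")
    return mean(t - c for t, c in zip(treatment, control))


diff = mean_paired_difference(gpt4v_scores, doctor_scores)
print(f"mean paired difference (GPT-4V - doctors): {diff:+.2f}")
```

A real study would also want the raters blinded to which arm produced each response, plus a proper significance test on the paired differences, but the point is that without the second list there is nothing to subtract.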


Agreed. Those are different evaluations, which is what I meant by "Instead of comparing against". The paper cannot conclude that "doctors are better/more correct".

It assumes that "here are 5 doctors who are always correct", then measures GPT's correctness against them.



