
I could be completely off base, but it feels to me like benchmaxxing is going on with SWE-bench.

Look at the results from Multi-SWE-bench - https://multi-swe-bench.github.io/#/

SWE-PolyBench - https://amazon-science.github.io/SWE-PolyBench/

Kotlin-bench - https://firebender.com/leaderboard



I kind of had the feeling LLMs would be better at Python than at other languages, but wow, the difference on Multi-SWE-bench is pretty crazy.


Maybe a lot of the difference we see between people's comments about how useful AI is for their coding is a function of what language they're using. Python coders may love it, Go coders not much at all.


Not sure what you mean by benchmaxxing, but we think there's still a lot of useful signal you can get from SWE-bench-style benchmarking.

We also have SWE-bench Multimodal which adds a twist I haven't seen elsewhere: https://www.swebench.com/multimodal.html


I mean that there is the possibility that SWE-bench is being specifically targeted in training, so the results may not reflect real-world performance.



