But a lot of what you pay humans $500k a year for is to work with enormous existing systems that an LLM cannot understand just yet. Still, optimizing small libraries and implementing fast functions is a huge addition to any programmer's toolbox.
Yes, that’s certainly true, and that’s why I selected that library in particular to try with it. The fact that it’s mathematical (so not many lines of code, but each line packs a lot of punch and requires careful thought to optimize) makes it a perfect test bed for this model in particular. For larger projects that are simpler, you’re probably better off with Claude 3.5 Sonnet, since it has double the context window.
Yes, but its reasoning ability is extremely poor in my experience with real-world programming tasks. I’m talking about stuff that Claude 3.5 Sonnet handles easily, and GPT-4o can also handle if it fits in its smaller context window, where Gemini 1.5 Pro just completely fails.
Bigger context is definitely helpful, but not if it comes at the expense of reasoning/analytical ability. I’m always a bit puzzled why people stress the importance of these “needle in a haystack” tests where the model has to find one specific thing in a huge document. That seems far less relevant to me in terms of usefulness in the real world.
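For context, the "needle in a haystack" tests being discussed are usually constructed mechanically: a distinctive fact is planted at some relative depth inside a long stretch of filler text, and the model is asked to retrieve it. A minimal sketch of how such a test prompt is generated (all names and the filler scheme here are illustrative, not from any particular benchmark):

```python
import random


def build_needle_prompt(needle: str, n_sentences: int = 600,
                        depth: float = 0.5, seed: int = 0) -> str:
    """Build a synthetic needle-in-a-haystack prompt.

    A single distinctive sentence (the needle) is inserted into a list
    of filler sentences at a chosen relative depth (0.0 = start,
    1.0 = end), then a retrieval question is appended.
    """
    rng = random.Random(seed)
    topics = ["the weather", "a commute", "lunch plans", "a meeting"]
    filler = [
        f"Note {i}: nothing important happened regarding {rng.choice(topics)}."
        for i in range(n_sentences)
    ]
    pos = int(len(filler) * depth)  # absolute insertion index
    filler.insert(pos, needle)
    context = " ".join(filler)
    return (f"{context}\n\n"
            f"Question: what is the secret passphrase mentioned in the "
            f"text above? Answer with just the passphrase.")


prompt = build_needle_prompt("The secret passphrase is 'banana-42'.",
                             depth=0.3)
```

Scoring is then just a substring check on the model's reply, which is why the test measures pure retrieval rather than any synthesis across the context.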
> I’m always a bit puzzled why people stress the importance of these “needle in a haystack” tests where the model has to find one specific thing in a huge document. That seems far less relevant to me in terms of usefulness in the real world.
How do you mean?
Half of writing code within a codebase is knowing what functions already exist in the codebase for you to call in your own code; and/or what code you'll have to change upstream and downstream of the code you're modifying within the same codebase — or even by forking your dependencies and changing them — to get what you want to happen, to happen.
And half of, say, writing a longform novel, is knowing all the promises you've made to the reader, the active Chekhov's guns, and all the other constraints you placed on yourself hundreds of pages or even several books ago, that just became relevant again as of this very sentence. Or, moreover, which of those details it's the proper time to make relevant again for maximum impact and proper first-in-last-out narrative bridging structure.
In both cases, these aren't really literal "needle in a haystack" stress-tests; they should properly be tests of the model's ability to perform some kind of "associational priority indexing" on the context, allowing it to build concepts into associational sub-networks and then make long-distance associations where the nodes are entire subnetworks. (Which isn't something we really see yet, in any model.)
Yes, agreed. I wasn’t trying to say it’s totally useless, but it’s not as helpful as synthesizing all that context intelligently; it’s more of a parlor trick. But that trick can be handy if you need something like that. Really, the main issue with Gemini is that it’s simply not very smart compared to the competition, and the big context doesn’t make up for that in the slightest.