That's a good call, I'll try to remember that for next time.

Imustaskforhelp · 2025-09-29T22:15:02 1759184102

I just wanted to say that I really liked this comment of yours, which showed professionalism and a willingness to learn from your mistakes and improve.

I definitely consider you to be an AI influencer, especially in Hacker News communities. I see other influencers double down or triple down on things when, in reality, people just wanted to help them in the first place.

With all of this in mind, I just wanted to say thanks. Your "generate me a pelican riding a bicycle" prompt has been a fun ride and is always going to be interesting, so thanks for that as well. I just wanted to share my gratitude with ya.

typpilol · 2025-09-29T22:25:40 1759184740

Have you thought about benchmarking models a month or two after release to see how each one compares with its day 1 release?

simonw · 2025-09-29T22:58:58 1759186738

For that to be useful I'd need to be running much better benchmarks - anything less than a few hundred numerically scored tasks would be unlikely to reliably identify differences.

An organization like Artificial Analysis would be a better fit for that kind of investigation: https://artificialanalysis.ai/
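
As a rough illustration of why a few hundred scored tasks matter, here is a minimal sketch of the sampling noise involved. It assumes each task is an independent pass/fail trial (a simplification of "numerically scored"), and the `detectable_gap` helper and the 50% baseline pass rate are hypothetical choices for the example, not anything taken from a real benchmark.

```python
import math


def detectable_gap(n_tasks: int, pass_rate: float = 0.5, z: float = 1.96) -> float:
    """Approximate smallest pass-rate gap that stands out from sampling noise.

    Assumes two benchmark runs (e.g. day 1 vs. two months later), each scoring
    n_tasks independent pass/fail tasks with roughly the same underlying
    pass_rate. Uses the normal approximation for a difference of two
    proportions; z=1.96 corresponds to about 95% confidence.
    """
    # Standard error of the difference between two independent proportions.
    se_diff = math.sqrt(2 * pass_rate * (1 - pass_rate) / n_tasks)
    return z * se_diff


if __name__ == "__main__":
    for n in (20, 100, 300, 1000):
        print(f"{n:>5} tasks: gap must exceed roughly {detectable_gap(n):.1%}")
```

Under these assumptions, a run of 100 tasks cannot reliably separate two results that differ by less than about 14 percentage points, and the threshold only shrinks with the square root of the task count.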

westurner · 2025-09-30T02:36:13 1759199773

Manually,

From https://news.ycombinator.com/item?id=40859434 :

> E.g promptfoo and chainforge have multi-LLM workflows.

> Promptfoo has a YAML configuration for prompts, providers,: https://www.promptfoo.dev/docs/configuration/guide/

openai/evals//docs/build-eval.md: https://github.com/openai/evals/blob/main/docs/build-eval.md

From https://news.ycombinator.com/item?id=45267271 :

> API facades like OpenLLM and model routers like OpenRouter have standard interfaces for many or most LLM inputs and outputs. Tools like Promptfoo, ChainForge, and LocalAI also all have abstractions over many models.

> What are the open standards for representing LLM inputs, and outputs?

> W3C PROV has prov:Entity, prov:Activity, and prov:Agent for modeling AI provenance: who or what did what when.

> LLM evals could be represented in W3C EARL Evaluation and Reporting Language

"Can Large Language Models Emulate Judicial Decision-Making? [Paper]" https://news.ycombinator.com/item?id=42927611

"California governor signs AI transparency bill into law" (2025) https://news.ycombinator.com/item?id=45418428 :

> https://sb53.info/

Is this the first of its sort?:

> CalCompute