ggnore7452's comments (Hacker News)

Too bad that only the smaller on-device models support native audio input.


IMO, for health-related stuff, or most general knowledge that doesn't require info newer than 2023, the internal knowledge of an LLM is so much better than the web-search-augmented version.


I’m fine with hiring people without degrees. But if Google still filters people with LeetCode-style coding questions, what’s the point of that in this day and age?


So immersive I actually hit Ctrl+W and closed the whole tab.


I’ve done a similar PDF → Markdown workflow.

For each page:

- Extract text as usual.

- Capture the whole page as an image (~200 DPI).

- Optionally extract images/graphs within the page and include them in the same LLM call.

- Optionally add a bit of context from neighboring pages.

Then wrap everything with a clear prompt (structured output + how you want graphs handled), and you’re set.

At this point, models like GPT-5-nano/mini or Gemini 2.5 Flash are cheap and strong enough to make this practical.

Yeah, it’s a bit like using a rocket launcher on a mosquito, but this is actually very easy to implement and quite flexible and powerful. It works across almost any format, Markdown is both AI- and human-friendly, and it's surprisingly maintainable.
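To make the per-page call concrete, here's a minimal sketch of assembling one request, assuming an OpenAI-style multimodal chat schema (the `build_page_request` helper, the prompt wording, and the model name are all my own placeholders, not anything from a specific SDK):

```python
import base64


def build_page_request(page_text, page_png_bytes, neighbor_context="",
                       model="gpt-5-mini"):
    """Assemble one chat-style request for a single PDF page.

    Hypothetical helper: the message schema mimics the common
    OpenAI-style multimodal format; adapt it to your provider.
    """
    image_b64 = base64.b64encode(page_png_bytes).decode("ascii")
    prompt = (
        "Convert this PDF page to clean Markdown.\n"
        "- Preserve headings, lists, and tables.\n"
        "- Describe graphs/figures in a blockquote.\n"
        f"Context from neighboring pages:\n{neighbor_context}"
    )
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "text", "text": f"Extracted text:\n{page_text}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }
```

You'd call this once per page (the extracted text, the ~200 DPI page render, and optional neighbor context all ride along in the same message) and parse the Markdown out of the response.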


>are cheap and strong enough to make this practical.

It all depends on the scale you need them at; with the API it's easy to generate millions of tokens without thinking about it.


You don't need full reasoning to get accurate results, so even with GPT-5 it's still pretty cheap for a one-time job, and it's easy to reason about costs. It's certainly cheaper if you have data where reliability is key, since classical OCR will undoubtedly require some manual data cleaning...
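As a back-of-envelope illustration of reasoning about costs (every number below, page count, tokens per page, and per-million-token prices, is a made-up assumption for the sketch, not actual pricing):

```python
# Rough cost estimate for a one-time PDF -> Markdown job.
# All figures below are illustrative assumptions.
pages = 10_000
tokens_in_per_page = 1_500   # extracted text + image tokens (assumed)
tokens_out_per_page = 800    # generated Markdown (assumed)
price_in_per_m = 0.25        # $ per 1M input tokens (assumed)
price_out_per_m = 2.00       # $ per 1M output tokens (assumed)

cost = (pages * tokens_in_per_page / 1e6) * price_in_per_m \
     + (pages * tokens_out_per_page / 1e6) * price_out_per_m
print(f"${cost:.2f}")
```

Even at ten thousand pages, the headline number stays in tens-of-dollars territory under these assumptions, which is why the approach is practical for one-off conversions.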

I can recommend the Mistral OCR API [1] if you have large jobs and don't want to think about it too much.

[1] https://mistral.ai/solutions/document-ai


In that case you should run a model locally, this one for example: https://huggingface.co/ds4sd/docling-models


I’ve been using LLMs for this kind of geo-guessing since Gemini 2.0. Even without access to internet search like o3, they perform surprisingly well.


Appreciate the question on hyperparameters for web search!

One of the main reasons I build these AI search tools from scratch is that I can fully control the depth and breadth (and also customize the loader for whatever data/sites I need). Currently, web search APIs aren't very transparent about which sites they only have snippets for rather than full text.

Having computer use + web search is definitely very powerful (essentially OpenAI's Deep Research).


How does this compare to the likes of Fish Audio? Wish they supported voice cloning with longer audio, though.

Haven’t looked into this space for a few months, but IIRC the previous SOTA was something like GPT VITS?


This is the clear SOTA at the moment, even better than ElevenLabs in a technical sense, because you can specify emotion, speed, etc.


Anyone tried this? Is it actually better overall than XGBoost/CatBoost?


Benchmark of tabpfn<2 compared to XGBoost, LightGBM, and CatBoost: https://x.com/FrankRHutter/status/1583410845307977733 and https://news.ycombinator.com/item?id=33486914


Yes, it actually is, but the limits on row and feature counts could be a hindrance.


If anything, I'd consider embeddings a bit overrated, or at least it's safer to underrate them.

They're not the silver bullet many initially hoped for, and they're not a complete replacement for simpler methods like BM25. They have only limited "semantic understanding" (and as people throw increasingly large chunks into embedding models, the meanings get even fuzzier).

Overly high expectations lead people to believe that embeddings will retrieve exactly what they mean, and with larger top-k values and LLMs that are exceptionally good at rationalizing responses, it can be hard to notice mismatches unless you examine the results closely.


Absolutely. Embeddings have been around a while and most people don’t realize it wasn’t until the e5 series of models from Microsoft that they even benchmarked as well as BM25 in retrieval scores, while being significantly more costly to compute.

I think sparse retrieval with cross-encoders doing reranking is still significantly better than embeddings. Embedding indexes are also difficult to scale, since HNSW consumes too much memory above a few million vectors and IVF-PQ has issues with recall.
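A toy sketch of that two-stage setup: a small pure-Python BM25 first stage picks candidates, then a second stage reranks them. The `rerank` function here is just a stub; in a real system each (query, document) pair would be scored jointly by a cross-encoder model:

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Score whitespace-tokenized docs against a query with plain BM25."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append((score, i))
    return sorted(scores, reverse=True)


def rerank(query, docs, candidates, top_k=10):
    """Stub for the cross-encoder stage: a real reranker would score each
    (query, doc) pair jointly; here we just keep the first-stage order."""
    return candidates[:top_k]


docs = ["sparse retrieval with bm25",
        "dense embeddings for semantic search",
        "cross encoders rerank candidate documents"]
first_stage = bm25_rank("bm25 sparse retrieval", docs)
top = rerank("bm25 sparse retrieval", docs, first_stage)
```

The point of the split is that the cheap lexical stage narrows millions of docs to a handful, so the expensive pairwise reranker only runs on that handful.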


Off-the-shelf embedding models definitely overpromise and underdeliver. In ten years I'd be very surprised if companies in any competitive domain weren't fine-tuning embedding models for search on their own data.


My startup (Atomic Canyon) developed embedding models for the nuclear energy space[0].

Let's just say that if you think off-the-shelf embedding models are going to work well with this kind of highly specialized content, you're going to have a rough time.

[0] - https://huggingface.co/atomic-canyon/fermi-1024


> they're not a complete replacement for simpler methods like BM25

There are embedding approaches that balance "semantic understanding" with BM25-ish.

They're still pretty obscure outside of the information retrieval space, but sparse embeddings[0] are the most widely used.

[0] - https://zilliz.com/learn/sparse-and-dense-embeddings
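A toy illustration of the mechanics: sparse vectors as term→weight dicts scored by dot product. In real learned-sparse models (SPLADE-style) a neural model produces the weights and can also add expansion terms; the weights below are hand-made purely for illustration:

```python
def sparse_dot(q, d):
    """Dot product of two sparse term -> weight vectors."""
    return sum(w * d.get(term, 0.0) for term, w in q.items())


# Hand-crafted sparse vectors (illustrative only); a learned model would
# also inject expansion terms with its own weights.
query = {"nuclear": 1.2, "reactor": 0.9}
doc_a = {"nuclear": 0.8, "reactor": 1.1, "safety": 0.5}
doc_b = {"solar": 1.0, "panel": 0.7}

score_a = sparse_dot(query, doc_a)  # overlapping terms -> positive score
score_b = sparse_dot(query, doc_b)  # no overlap -> zero
```

Because the vectors are mostly zeros, they can live in an inverted index like BM25 terms do, which is what gives these models their BM25-ish scalability while still carrying learned weights.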

