You are right that the concept of "safe" is nebulous, but the goal here is specifically to be XSS-safe [1]. Elements or properties that could allow scripts to execute are removed. This functionality lives in the user agent and prevents adding unsafe elements to the DOM itself, so it should be easier to get correct than a string-to-string sanitizer. The logic of "is the element currently being added to the DOM a <script>" is fundamentally easier to get right than "does this HTML string include a script tag".
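To make that last point concrete, here is a minimal sketch in Python. Python has no DOM, so the standard library's `html.parser` stands in for the user agent's parser; that substitution is an assumption of the example, and the helper names (`naive_string_check`, `node_check`, `UnsafeNodeFinder`) are made up for illustration. It is not a sanitizer and covers only two cases (script elements and `on*` event-handler attributes).

```python
from html.parser import HTMLParser


def naive_string_check(html):
    """String-level check: looks for a literal lowercase '<script' substring."""
    return "<script" in html


class UnsafeNodeFinder(HTMLParser):
    """Node-level check: inspects each parsed element and its attributes."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "script":
            self.findings.append("element:script")
        for name, _value in attrs:
            # Event-handler attributes such as onerror or onclick.
            if name.startswith("on"):
                self.findings.append(f"attribute:{name}")


def node_check(html):
    """Return a list of script-capable elements/attributes found in html."""
    finder = UnsafeNodeFinder()
    finder.feed(html)
    finder.close()
    return finder.findings
```

With this sketch, `naive_string_check('<ScRiPt>alert(1)</sCrIpT>')` misses the mixed-case tag, while `node_check` reports it, because the decision is made on a parsed element rather than on raw text. The in-browser API has a further advantage this example cannot show: it checks the very nodes the browser creates, so there is no second parser to disagree with.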
It's certainly an improvement over people trying to homebrew their own sanitisers. But that distinction of being XSS-safe is a potentially subtle one, and could end up being dangerous if people don't carefully consider whether XSS-safe is good enough when they're handling arbitrary user input like that.
It has also made me nervous for years that there's been no schema against which one can validate HTML. "You want to validate? Paste your URL into the online validation tool."
But for HTML snippets you can pretty much just check that tags follow a couple of simple rules between < and >, and that they're closed (or left unclosed) correctly.
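A rough sketch of that kind of snippet check, in Python with the standard library's `html.parser`. The names `BalanceChecker` and `check_snippet` are hypothetical, and the check is deliberately stricter than real HTML: it assumes every non-void element needs an explicit end tag, so it will flag markup that legitimately omits optional end tags such as `</p>` or `</li>`.

```python
from html.parser import HTMLParser

# Void elements never take an end tag.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class BalanceChecker(HTMLParser):
    """Tracks open elements on a stack and records mismatched end tags."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        # Only non-void elements need a matching end tag later.
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Treat <br/> like <br>; on non-void elements the slash is ignored.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            self.errors.append(f"void element closed: </{tag}>")
        elif not self.stack or self.stack[-1] != tag:
            self.errors.append(f"unexpected </{tag}>")
        else:
            self.stack.pop()


def check_snippet(html):
    """Return a list of problems; an empty list means the tags balance."""
    checker = BalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.errors + [f"unclosed <{tag}>" for tag in checker.stack]
```

For example, `check_snippet('<p>Hello <b>world</b><br></p>')` returns an empty list, while `check_snippet('<p><b>oops</p>')` reports the mismatched `</p>` and the elements left open.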
The languagemodels[1] package that I maintain might meet your needs.
My primary use case is education, as others and I use it for short student projects[2] related to LLMs, but there's nothing preventing this package from being used in other ways. It includes a basic in-process vector store[3].
It would be nice to see the Phind Instant weights released under a permissive license. It looks like it could be a useful tool in the local-only code model toolbox.
The speedup would not be that high in practice for folks already using speculative decoding[1]. ANPD is similar but uses a simpler and faster drafting approach. These two enhancements can't be meaningfully stacked. Here's how the paper describes it:
> ANPD dynamically generates draft outputs via an adaptive N-gram module using real-time statistics, after which the drafts are verified by the LLM. This characteristic is exactly the difference between ANPD and the previous speculative decoding methods.
ANPD does provide a more general-purpose solution to drafting that does not require training, loading, and running draft LLMs.
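For anyone who wants to see the draft-then-verify shape in code, here is a toy sketch in Python. It is not the paper's algorithm: it uses a fixed-order n-gram table built from the tokens seen so far, and `next_token` is a stand-in for the LLM that I'm assuming returns the greedy next token for a given prefix. A real implementation would verify all drafted positions in one batched forward pass rather than one call per token. All function names here are made up for illustration.

```python
def build_ngram_table(tokens, n=2):
    """Map each (n-1)-token context to the most recent token that followed it."""
    table = {}
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        table[context] = tokens[i + n - 1]
    return table


def draft(tokens, table, n=2, k=4):
    """Propose up to k tokens by repeatedly looking up the trailing context."""
    drafted = []
    seq = list(tokens)
    for _ in range(k):
        context = tuple(seq[len(seq) - (n - 1):])
        if context not in table:
            break  # no statistics for this context, stop drafting
        token = table[context]
        drafted.append(token)
        seq.append(token)
    return drafted


def verify(tokens, drafted, next_token):
    """Keep the draft prefix the model agrees with, plus one model token.

    next_token is a hypothetical stand-in for the LLM: it takes a token
    list and returns the model's next token.
    """
    accepted = []
    for token in drafted:
        expected = next_token(tokens + accepted)
        if expected != token:
            # First disagreement: take the model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Every drafted token matched; the model still contributes one more.
    accepted.append(next_token(tokens + accepted))
    return accepted
```

The output is always what the model alone would have produced; the draft only changes how many tokens get accepted per verification step. That is also why stacking this on top of a draft-LLM setup doesn't buy much: both are competing to fill the same drafting slot.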
You might be interested in "Text Embeddings Reveal (Almost) As Much As Text":
> We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes.
This included full model weights along with a detailed description of the dataset, training process, and ablations that led them to that architecture. T5 was state-of-the-art on many benchmarks when it was released, but it was of course quickly eclipsed by GPT-3.
It was common practice for Google (BERT, T5), Meta (BART), OpenAI (GPT-1, GPT-2) and others to release full training details and model weights. Following GPT-3, it became much more common for labs to not release full details or model weights.
[1] https://developer.mozilla.org/en-US/docs/Web/API/Element/set...