Investing in new vector database development vs enhancing existing databases (det.life)
98 points by mnming on Nov 25, 2023 | 60 comments


Has anyone seen a convincing argument for why you would want a dedicated vector database in place of a normal database with a good, fast vector index implementation?

The existing DB + vector index option seems so obvious to me that I'm worried I'm missing something.


The two main classes of approximate k-NN algorithms on high-dimensional data involve either (1) linear table scans (e.g., IVF / cell-probe indexes; brute-force exact indexing or the coarse quantizer for an IVF index) or (2) graph walking (HNSW, NN-Descent, CAGRA, etc.). Some other approaches, like LSH (depending on how it's represented as a data structure), fall in between the two.

For (1), all of the indexing features of a typical RDBMS are overkill for IVF, where the only indexing you need is by IVF bucket, and then you simply sequentially scan and evaluate distances for everything in those buckets.
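
To make that bucket-scan flow concrete, here's a minimal sketch using the Faiss Python API with toy data (the index string and nprobe value are just examples): train the coarse quantizer, assign vectors to buckets, then sequentially scan a few buckets per query.

    import numpy as np
    import faiss

    d = 128                                              # vector dimensionality
    xb = np.random.rand(100_000, d).astype("float32")    # toy database vectors
    xq = np.random.rand(5, d).astype("float32")          # toy query vectors

    index = faiss.index_factory(d, "IVF1024,Flat")  # 1024 buckets, raw vectors inside
    index.train(xb)      # k-means to place bucket centroids (the coarse quantizer)
    index.add(xb)        # assign each vector to its nearest bucket
    index.nprobe = 16    # sequentially scan the 16 closest buckets per query
    D, I = index.search(xq, 10)   # distances and ids of the 10 nearest neighbors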

However, if you want to perform filtering based on other criteria (e.g., filtering based on metadata associated with each vector), then multiple column indexes could be useful with an RDBMS, as you can more efficiently extract just the vectors that meet the criteria prior to a sequential scan (e.g., give me all vectors which have a date in the range between X and Y and whose name contains the string "red").
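
As a toy stand-in for that filter-then-scan pattern (made-up columns, plain NumPy): the metadata predicates narrow the candidate set, and only the survivors get their distances evaluated.

    import numpy as np

    rng = np.random.default_rng(0)
    vectors = rng.random((10_000, 128), dtype=np.float32)
    dates = rng.integers(20230101, 20231231, size=10_000)   # toy metadata column
    names = np.array(["red" if i % 3 == 0 else "blue" for i in range(10_000)])
    query = rng.random(128, dtype=np.float32)

    # WHERE date BETWEEN X AND Y AND name LIKE '%red%' -- column indexes do this part
    mask = (dates >= 20230601) & (dates <= 20230901) & (np.char.find(names, "red") >= 0)
    candidates = np.flatnonzero(mask)

    # Sequential distance scan over just the surviving vectors
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    top10 = candidates[np.argsort(dists)[:10]]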

For (2), these algorithms would be inefficient to express in a typical RDBMS. While you can easily encode a graph in an RDBMS, walking the graph sequentially based on criteria is another matter, because each time you'd have to hit the disk (or in-memory cache) to find the next row to look at, then repeat. Latency could be an issue unless it's all in memory, and if it's all in memory, an RDBMS-type representation is overkill versus simpler pointer chasing in memory.
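
For a sense of why this access pattern wants to live in memory, here's a rough sketch (hypothetical, plain Python) of the best-first walk that HNSW-style indexes perform: every iteration chases a handful of neighbor pointers, each of which would be an indexed lookup per hop in an RDBMS.

    import heapq
    import numpy as np

    def greedy_search(query, vectors, neighbors, entry, k=10, ef=50):
        """Best-first walk over a proximity graph (HNSW-style search, single layer).
        neighbors: dict of node id -> list of neighbor ids, built at index time."""
        dist = lambda i: float(np.linalg.norm(vectors[i] - query))
        visited = {entry}
        candidates = [(dist(entry), entry)]    # min-heap of nodes to expand
        results = [(-dist(entry), entry)]      # max-heap (negated) of best nodes found
        while candidates:
            d, node = heapq.heappop(candidates)
            if len(results) >= ef and d > -results[0][0]:
                break                          # no remaining candidate can improve results
            for nb in neighbors[node]:         # pointer chase: one hop per neighbor
                if nb in visited:
                    continue
                visited.add(nb)
                dnb = dist(nb)
                if len(results) < ef or dnb < -results[0][0]:
                    heapq.heappush(candidates, (dnb, nb))
                    heapq.heappush(results, (-dnb, nb))
                    if len(results) > ef:
                        heapq.heappop(results)
        return sorted((-d, i) for d, i in results)[:k]   # (distance, id), ascending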

I guess it also depends upon latency / throughput needs (caching in memory in front of a database could help, but a pure in-memory vector index would be faster), persistence / fault tolerance / replication needs (in-memory databases would need some kind of persistent store), or update needs (adding/removing vectors, though adding/removing vectors from an IVF or graph-based index becomes non-trivial if you add or remove too many). Traditional databases that have been around for a while might provide better guarantees on reliability.

(I'm the author of the GPU half of the Faiss library)


It's kind of a tradeoff. Performance is just one factor when choosing a vector database. In pgvecto.rs https://github.com/tensorchord/pgvecto.rs, we store the index separately from PostgreSQL's internal storage, unlike pgvector's approach. This enables us to get multi-threaded indexing, async indexing that doesn't block insertion, and faster search compared to pgvector.

I don't see any fundamental reason why the index in Postgres would be slower than a specialized vector database. The query pattern of the vector database is simply a point query using an index, similar to other queries in an OLTP system.

The only limitation I see is scalability. It's not easy to make PostgreSQL distributed, but solutions like Citus exist, making it still possible.

(I'm the author of pgvecto.rs)


Would a graph database (e.g. Neo4j) be any better?


I cannot think of a situation where neo4j does anything better than traditional relational databases


Graph databases, Neo4j included, handle many-to-many relationships better than any RDBMS.


In what way do they handle them better?


Very useful for me to read this (and look up some of the mentioned concepts). Thank you!


Thanks, that was really interesting!


There are good reasons to split embeddings off from your regular DB load. Embeddings can be quite large (kilobytes per item), and querying is CPU and IO intensive, so they are good candidates for scaling separately.

The embeddings are not primary data, they're derived from other content and can be recreated if necessary. The vectors will probably be fed or read in bulk by your ML pipeline at times.

I'd say you probably want the primary storage of your embeddings to be file storage like S3 or similar, and then feed them directly into a search engine like Vespa or Elasticsearch.


The only production case I've seen of someone switching to a vector database entirely in lieu of an RDBMS is when their raw data lives in parquet / arrow files and they need fast read-only searches in production. Then a periodic indexing daemon with some additional metadata bolted onto the vector definitions worked well and saved them from having to dip their toes into classic database optimization.

Every place where I'm deploying vector databases, though, I'm using them as an alternative index just for searches (indexing embedding/id only) and doing the full object value lookup in the RDBMS. It's easier to optimize their vector clustering / sort performance outside the constraints of postgres - which is perhaps why pgvector suffers more under higher load.
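
As a rough sketch of that pattern (collection, table, and column names are made up), the vector database returns ids only and the RDBMS hydrates the rest:

    from qdrant_client import QdrantClient
    import psycopg2

    query_embedding = [0.1] * 768   # stand-in for a real embedding

    qdrant = QdrantClient(host="localhost", port=6333)
    hits = qdrant.search(collection_name="docs", query_vector=query_embedding, limit=10)
    ids = [hit.id for hit in hits]   # the vector DB indexes embedding/id only

    conn = psycopg2.connect("dbname=app")
    with conn.cursor() as cur:       # full object values come from the RDBMS
        cur.execute("SELECT id, title, body FROM documents WHERE id = ANY(%s)", (ids,))
        rows = cur.fetchall()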


Initially I thought we needed a dedicated vector database, but as we tried to build even a simple gen-AI application we realized that we needed another "regular" database to build a complete application.

Instead, with DataStax's Vector Search, we designed a document-style API and corresponding clients that give you a vector-native experience to do CRUD of vectors and metadata, as well as CRUD of other data models. Here is one client ref for example https://docs.datastax.com/en/astra/astra-db-vector/clients/p...


txtai can use Postgres as a data store and combine that with a vector index like Faiss or Hnswlib.

In this case, you get the robustness of Postgres (which has been around a while) and Faiss (which is one of the most mature vector indexes).

https://neuml.hashnode.dev/external-database-integration
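
Roughly, a config for that combination looks like this (connection string and model name are placeholders; see the linked article for the details):

    from txtai.embeddings import Embeddings

    embeddings = Embeddings({
        "path": "sentence-transformers/all-MiniLM-L6-v2",              # placeholder model
        "content": "postgresql+psycopg2://user:pass@localhost/txtai",  # Postgres data store
        "backend": "faiss"                                             # vector index
    })

    embeddings.index([(0, "red sports car on a highway", None)])
    print(embeddings.search("fast vehicle", 1))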


txtai looks cool. Did you use it before in scenarios with all 4 components?

I’m curious about which use cases it enables.


In case it wasn't clear, I'm the main author of txtai.

I use txtai in my everyday work. While txtai is open source, I also support a number of consulting projects that are txtai focused. In most cases, I've used SQLite + Faiss for the vector database config paired with an LLM for RAG support.

This setup has worked well, and it's what I would start with until it no longer scales.


I guess that at a big enough scale it starts making sense. But not sure what “big” really is.

Pgvector seems to perform very poorly compared to Qdrant: https://nirantk.com/writing/pgvector-vs-qdrant/


I absolutely believe that at the moment - Qdrant is clearly a fantastic piece of technology.

But if I was using Qdrant myself I'd treat it like ElasticSearch - I'd denormalize some of my data into it, but I'd still do most of my work in the relational database and treat Qdrant as effectively an external index for my stuff.

Maybe I'm getting hung up on the word "database" here as indicating that you use that instead of an RDBMS, when actually everyone selling a vector database expects you to use it as effectively an external indexing mechanism instead.


I personally read it as "vector cache", because you can use it to, e.g., store a phrasing to repeat later.


I wish nirantk would add an addendum noting the pgvector issue was fixed, or people like me will have to put up the counter-response every single time: https://youtu.be/MDxEXKkxf2Q?si=eUhNtghbiLRJ7yHB


Very interesting. I didn’t know about it. Thanks!


Nah you nailed it. Beyond the tech even, I don't much understand why there's a cottage industry for it. Either you're touching people's personal documents and should be doing it locally and preferably in-memory, or you're touching people's personal documents and you should own where the data is stored.


why do companies use databases like elastic search when they could just use inverted indexes in postgres?

Performance, features, and scalability


I love Postgres, but you're comparing apples and oranges. You don't have the same level of query sophistication nor the same indexing quality you get from e.g. Elasticsearch. As a quick example, the other day I hit a bug in PostgreSQL because "searching" does not perform diacritics conversion. Also, stuff like stop words, similarity search and whatnot either requires dedicated indexes or is a downright "square peg, round hole" problem.


I believe they're being rhetorical, and agree that there are good reasons to use elasticsearch over postgres.


I use Elasticsearch along with PostgreSQL, but I don't think of Elasticsearch as a database. It's an external index that I denormalize some of my data into.


ES is a good parallel to a vector database.

run a search against ES => hydrate records from postgres

run a search against a vector database => hydrate LLM context from documents

Neither is the primary canonical datastore: Postgres is the source of truth for ES, and S3 is for documents.


So it just happens that there is no good implementation of full-text search in Postgres, but fundamentally nothing prevents implementing such a search, except horizontal scalability if it is needed.


The Postgres extension called "rum" is very performant on millions of PDFs.

https://github.com/postgrespro/rum


My case is tens of billions of short docs, and the default search is fine, except index creation is single-threaded, so it takes multiple days to build the index, and the GitHub page says rum is even slower.


Quick "ask HN": I'm currently working on a semantic search solution, and one of the challenges is to be able to query billions of embeddings easily (single-digit seconds). I've been testing different approaches with a small dataset (50-100 million embeddings, 512 or 768 dimensions), and all databases I've tried have somewhat severe issues with this volume of data (<100GB of data) on my local machine. I've tried milvus, chroma, clickhouse, pgvector, faiss and probably some others I don't recall right now. Any suggestions on additional databases to try out?



My team at Microsoft works on DiskANN, which was designed for this kind of use case.

https://github.com/microsoft/DiskANN


You can try txtai (https://github.com/neuml/txtai) with a Faiss backend.

This Faiss wiki article might help (https://github.com/facebookresearch/faiss/wiki/Indexing-1G-v...).

For example, a partial Faiss configuration with 4-bit PQ quantization and only using 5% of the data to train an IVF index is shown below.

    faiss={"components": "IVF,PQ384x4fs", "sample": 0.05}
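
Dropped into a txtai Embeddings config, that fragment would look roughly like this (the model name is a placeholder):

    from txtai.embeddings import Embeddings

    embeddings = Embeddings({
        "path": "sentence-transformers/all-MiniLM-L6-v2",
        "backend": "faiss",
        "faiss": {"components": "IVF,PQ384x4fs", "sample": 0.05}
    })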


I've used Elasticsearch and Pinecone before but never locally. Easy to scale, relatively low latency but can get pricey.


Why is this flagged? Is there a problem with the content of the article?

I don't see this reflected in the comments here


As of midnight Saturday Pacific time I do not see it flagged.


ANN-benchmarks [1] compares 30 OSS libraries. pgvector comes up dead last.

[1] https://ann-benchmarks.com/


If I read it correctly, the benchmark is from April 2023, before the pgvector improvements with hnsw. Based on [1], this could be an order of magnitude difference.

1: https://m.youtube.com/watch?si=eUhNtghbiLRJ7yHB&v=MDxEXKkxf2...


pgvector seems to be getting a lot of mindshare, but it's worth noting that pgvecto.rs also exists and has some features that pgvector doesn't, including multithreaded indexing.


pgvector is "only" at v0.5 but does look interesting to me. Any HNers tried it?


It's good enough and only going to get better. The postgres community only has to improve a vector index extension. These vector databases have to improve everything else (all of the really hard stuff) about a database. My money is on postgres.

EDIT>> Plus pgvector is enabled by default on AWS RDS databases. Huge bonus.


> It's good enough […]

«Good enough» is not a quantitative metric by any conceivable measure. Furthermore, people in surrounding threads have reported inadequate performance of the pgvector PgSQL extension (which, indubitably, can and will be fixed at some point).

An example of a useful quantitative metric would be: «compared to alternative vector processing/indexing engines/databases A, B and C, with a dataset comprising N inputs and a total size of S Mb/Gb/Tb, using the M, N and P use cases in the performance-testing scenario and pgvector as the performance baseline, the following results have been obtained: X, Y and Z» – do you have anything similar to that you could share? Not only would the comparison results be insightful, they could also feed into someone's decision-making process of whether X should be favoured over Y.

> […] and only going to get better.

Beyond doubt, it will. In the future. But people need to solve problems in the present.


There are other approaches that could work too. For example, txtai can use Postgres as a data store and combine that with a vector index like Faiss or Hnswlib.

In this case, you get the robustness of Postgres (which has been around a while) and Faiss (which is one of the most mature vector indexes).

https://neuml.hashnode.dev/external-database-integration


> These vector databases have to improve everything else

No they don't. Because they aren't trying to build general purpose databases.

And often trying to jam everything into one product just makes it worse not better.


User management, documentation, distribution, UIs, upgrades, high availability and clustering, backups, libraries, and the skills required to administer and therefore hire for.

I’ve worked in companies where we grabbed any tech we want, and I’ve worked in companies with one default DB unless you had an excellent reason. The latter meant a much simpler universe where entire teams knew how to fix things that broke at 2am. You just hire one type of DBA. It also meant dev teams really knew how to squeeze the one DB platform rather than learn a little about 10 database types.


I am using it successfully for doing semantic search in a database of ~1M video clips. The CLIP embeddings of key frames are stored in Postgres/pgvector and it has worked well so far.
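
For anyone curious, the query side of a setup like that is a one-liner in SQL; a rough sketch via psycopg2 (table and column names made up):

    import psycopg2

    # Assumes: CREATE EXTENSION vector;
    #          CREATE TABLE key_frames (clip_id bigint, embedding vector(512));
    conn = psycopg2.connect("dbname=videos")
    clip_query = [0.01] * 512   # stand-in for a real CLIP text/image embedding

    with conn.cursor() as cur:
        vec = "[" + ",".join(str(x) for x in clip_query) + "]"
        cur.execute(
            "SELECT clip_id FROM key_frames ORDER BY embedding <=> %s::vector LIMIT 20",
            (vec,),
        )   # <=> is pgvector's cosine distance operator
        clips = cur.fetchall()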


it's significantly slower than dedicated vector dbs at >100k vectors


It is not very fast (yet). On bigger sets it grinds to a halt.


Adding txtai to the list https://github.com/neuml/txtai

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows. txtai can satisfy most vector database use cases such as being a knowledge source for retrieval augmented generation (RAG).

txtai is independently developed (not VC-backed) and released under an Apache 2.0 license.


It is just a Python lib, right? Not a DB like Qdrant.


Yes, txtai is written in Python, as are others in that article (such as Chroma). We don't know what closed-source options behind an API service are written in.

I consider txtai to be a vector database, one that satisfies many of the vector database use cases.


Most devs would be better off upgrading their database and using vectors there. They need to change the schema, but it's way easier than introducing a new database.

A specialized vector database would be needed in just a few relatively rare use cases.


Just use PostgreSQL and pgvector


One from the Qdrant team here. We genuinely recommend starting with whatever you already have in your stack to prototype, or even to productionize, applications with vector search. One should probably not start a new project with a complex micro-service architecture from the beginning. The same way you can start with full-text search using just Postgres or whatever you use as the main DB. However, as the requirements grow higher and more complex, you should probably switch to a dedicated search engine to avoid the monolithic persistence anti-pattern. https://learn.microsoft.com/en-us/azure/architecture/antipat...

Regarding the dedicated solution, the difference is always in the details. That detail might matter significantly if you are looking for precision or working with large data. First of all, filtering a vector similarity search is quite tricky: it does not work well with pre- or post-filtering, because the graph's connectivity gets lost. That is why we introduced filterable HNSW at the very beginning. https://qdrant.tech/articles/filtrable-hnsw/ Here is another quite good explanation by James Briggs: https://www.pinecone.io/learn/vector-search-filtering/
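
From the client side, a filtered search looks roughly like this (collection and payload field names are made up); the filter is applied inside the HNSW traversal rather than before or after it:

    from qdrant_client import QdrantClient
    from qdrant_client.models import Filter, FieldCondition, MatchValue

    client = QdrantClient(host="localhost", port=6333)
    hits = client.search(
        collection_name="products",
        query_vector=[0.2] * 384,   # stand-in embedding
        query_filter=Filter(
            must=[FieldCondition(key="color", match=MatchValue(value="red"))]
        ),
        limit=5,
    )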

Going further, it is not only about being able to search; it is also about being able to scale. Optionally, use disk for cold data as more affordable storage and keep hot data in expensive RAM. Or use one of the built-in compression functionalities. The recently introduced Binary Quantization makes it possible to compress vector embeddings up to 32 times and speed up the search up to 40 times in parallel. This makes billion-scale vector search affordable not only for enterprise companies. https://qdrant.tech/articles/binary-quantization/

Native vector databases build all their features around vectors, and vectors are the first-class citizens in the architecture of the database core engine, not just another type of index support.

Vector search is not only about text search. See: https://qdrant.tech/articles/vector-similarity-beyond-search... It is not even only about search. Qdrant offers, for example, a dedicated Recommendation API, where you can submit negative and positive vector examples and get ready-made recommendation results back. https://qdrant.tech/articles/new-recommendation-api/ And the upcoming version of the engine will introduce even more functionality, like a new Discovery API.

Ultimately, you do not need a vector database if you are looking for simple vector search functionality. But you need one if you are looking to do more around it and it is the central functionality of your application. It is just like using a multi-tool to make something quickly versus using a dedicated instrument highly optimized for the use case. Thank you for your attention.


from the article: "Instead of investing in new vector database products, it would be better to focus on existing databases and explore how they can be enhanced by incorporating vector search functionalities to become more powerful."

Didn't read the rest but this looks like sound advice; same would go for graph DBs.


With due respect to the author, it's pretty bog-standard advice. Why "build" when it's cheaper to integrate a "buy"? There is a reason most development isn't in raw assembly anymore!


In a world enamored with the always-new, always-cutting-edge it bears repeating bog-standard advice, I guess.


This article is about why you shouldn't enter the vector database field, and it's reasonable.

But I want to comment on another thing I often hear: "You don't need a vector database - just use Postgres or Numpy, etc". As someone who moved to Pinecone from a Numpy-based solution, I have to disagree.

Using a hosted vector database is straightforward. Get an API key from Pinecone, send them your vectors, and then query it with new vectors. It's fast, supports metadata filtering, and scales horizontally.

On the other hand, setting up pgvector is a hassle (especially since none of the cloud vendors support it natively), and a Numpy-based solution, while great for a POC, quickly becomes a hassle when you try to append to it and scale it horizontally.

If you need a vector database, use a vector database. You won't regret it.


Postgres RDS has version 0.5 of the pgvector extension installed by default.

Adding vector support to our app was as easy as enabling the extension and creating a table with vector columns. No additional database required, and it's trivial to do mixed queries. Maybe Pinecone and friends have better scalability, but if you need basic vector support, you can do it easily on RDS.

See here for details:

https://aws.amazon.com/about-aws/whats-new/2023/10/amazon-rd...


Fair enough. I researched it earlier in the year and it wasn't available.

I stand by my point: we often hear that adding a new tool to your stack introduces unnecessary complexity, when more often not using the right tool is the path of most complexity.



