Something that's missing from this which I'm curious about is how far can Postgres search take you?
That is, what tends to be the "killer feature" that makes teams groan and set up Elasticsearch because you just can't do it in Postgres and your business needs it?
Having dealt with ES, I'd really like to avoid the operational burden if possible, but I wouldn't want to choose an intermediary solution without being able to say, "keep in mind we'll need to budget a 3-mo transition to ES once we need X, Y, or Z".
Used to work on Google Search, used ES extensively for a startup I founded (which was sort of quasi-search...it was built around feed ranking, where the query is constant and a stream of documents is constantly coming in), and have also used Postgres extensively in other companies.
The big problem with all the off-the-shelf search solutions (RDBMS full-text search, ES, Algolia) is that search ranking is a complicated and subtle problem, and frequently depends on signals that are not in the document itself. Google's big insight is that how other people talk about a website is more important than how the website talks about itself, and its ranking algorithm weights accordingly.
ES has the basic building blocks to construct such a ranking algorithm. In terms of fundamental infrastructure I found ES to be just as good as Google, and better in some ways. But its out-of-the-box ranking function sucks. Expect to put a domain expert just on search ranking and evaluation to get decent results, and they're going to have to delve pretty deeply into advanced features of ES to get there.
AFAICT Postgres search only lets you tweak the ranking algorithm by assigning different weights to fields, assuming that the final document score is a linear combination of individual fields. This is usually not what you want - it's pretty common to have non-linear terms from different signals.
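To make the linear-vs-non-linear point concrete, here's a toy Python sketch (all names, weights, and decay constants are invented for illustration, not any real engine's formula):

```python
import math

def linear_score(field_scores, weights):
    # What per-field weighting alone gives you:
    # document score = weighted sum of per-field text scores.
    return sum(w * s for w, s in zip(weights, field_scores))

def nonlinear_score(text_score, popularity, freshness_days):
    # The shape you often actually want: popularity is log-damped
    # (diminishing returns) and staleness decays the whole score
    # multiplicatively instead of being just another additive term.
    popularity_boost = math.log1p(popularity)
    freshness_decay = math.exp(-freshness_days / 30.0)
    return text_score * (1.0 + popularity_boost) * freshness_decay

# A hugely popular but year-old doc vs. a fresh, modestly popular one:
stale = nonlinear_score(text_score=1.0, popularity=10_000, freshness_days=365)
fresh = nonlinear_score(text_score=1.0, popularity=50, freshness_days=1)
```

No linear combination of those signals reproduces the multiplicative decay, which is roughly why field weights alone run out of road.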
So let’s say that you are building a search engine for performance car parts. There are going to be a bunch of technical terms you use there that are not necessarily going to stand out in the document itself but you know them to be important. For example, the amount of boost pressure a turbo can provide or the number of pistons in a brake caliper. Is there some structured way to specify the grammar which is used for such terms and treat those as important such that when a user puts in “19psi turbo” that they don’t get a bunch of results for just “turbo” which isn’t exactly what they want?
This subproblem is called "bigram detection", or more generally "n-gram detection" for phrases of > 2 words. I'm not going to give away Google's algorithms, but you should be able to look up public academic research with that keyword.
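One classic, public technique is scoring adjacent word pairs by pointwise mutual information (PMI). A hedged Python sketch (thresholds are arbitrary; real systems use far more sophisticated statistics):

```python
import math
from collections import Counter

def find_bigrams(tokens, min_pmi=3.0, min_count=2):
    # Score each adjacent word pair by PMI: log(P(a,b) / (P(a) * P(b))).
    # Pairs that co-occur far more often than chance ("boost pressure")
    # score high; incidental pairs don't.
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    bigrams = {}
    for (a, b), c in pairs.items():
        if c < min_count:
            continue
        pmi = math.log((c / n) / ((unigrams[a] / n) * (unigrams[b] / n)))
        if pmi >= min_pmi:
            bigrams[(a, b)] = pmi
    return bigrams

# Tiny corpus where "boost pressure" recurs amid one-off filler words:
corpus = (
    "big boost pressure turbo kit "
    + " ".join(f"w{i}" for i in range(40))
    + " boost pressure gauge "
    + " ".join(f"v{i}" for i in range(40))
    + " boost pressure sensor"
).split()
bigrams = find_bigrams(corpus)
```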
When I implemented search for my app, I built a hierarchy of the different attributes weighted by levels and scored search phrases for how closely they matched. This is for mobile phones. For example, searching for "Silver iphone 11 pro" would give a higher ranking to all iphone 11 pros that are Silver, then comes differently coloured 11 Pros, followed by 11 Pro Max and 11s, lastly all silver coloured phones.
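A minimal sketch of that kind of level-weighted attribute matching in Python (weights and attributes invented; the idea is just that each level dominates all levels below it):

```python
def attribute_score(query_attrs, phone, level_weights):
    # Each matched attribute contributes its level weight; weights are
    # spaced so a higher level always dominates all lower levels combined.
    score = 0.0
    for attr, weight in level_weights.items():
        if query_attrs.get(attr) is not None and query_attrs[attr] == phone.get(attr):
            score += weight
    return score

weights = {"model": 100.0, "variant": 10.0, "color": 1.0}
query = {"model": "iphone 11", "variant": "pro", "color": "silver"}

phones = [
    {"model": "iphone 11", "variant": "pro", "color": "silver"},
    {"model": "iphone 11", "variant": "pro", "color": "gold"},
    {"model": "iphone 11", "variant": "pro max", "color": "silver"},
    {"model": "galaxy s10", "variant": None, "color": "silver"},
]
ranked = sorted(phones, key=lambda p: attribute_score(query, p, weights), reverse=True)
```

With these weights, the silver 11 Pro comes first, then the gold 11 Pro, then the silver 11 Pro Max, and other silver phones last, matching the ordering described above.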
>search ranking is a complicated and subtle problem
Absolutely!
I don't think the saying "the devil is in the details" has ever been more appropriate than here.
Sure, most users agree on the extremes: the REALLY bad search result (putting in 'apple' and getting out LCD TVs [maybe the TVs are in the parent category 'Electronics' and you have category and popularity boosts set too high?]) and the REALLY good (results that you expected).
Search results are commonly evaluated with Recall (did ALL the documents that are relevant get returned?)
vs Precision (how many of the returned results are 'correct').
But that is just one of many, many metrics.
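For concreteness, the two metrics in a few lines of Python:

```python
def precision_recall(returned, relevant):
    # Precision: fraction of returned results that are relevant.
    # Recall: fraction of all relevant documents that were returned.
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 10 results returned; 4 of them are among the 8 relevant docs:
p, r = precision_recall(returned=range(10), relevant=range(6, 14))
```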
The biggest issues are non-tech ppl (like your boss or manager) walking in and "discussing" their pet-peeve search query, because in their mind, when they put in APPLE iPhone we should be returning ONLY Apple iPhones and NOT Apple iPhone accessories (or maybe we should be returning ONLY the "latest iPhone", not the model from 2 generations back).
I've commented this before: you can usually only shoot for an "average amount of happiness" (sounds like Arthur Schopenhauer? :P) for most users. Never "ok ppl, search works perfectly" for everyone.
As to the "practical matters": we found that building a "search-test-suite", where you put in the manager's pet peeves as well as any angry emails about search queries, pays off: whenever you do search tuning, it's easy to see any query regression. Oh, and of course this needs to be an automated search-test-suite.
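A minimal sketch of what such an automated suite can look like in Python (the `toy_search` stand-in and doc ids are made up):

```python
def run_search_suite(search_fn, suite):
    # Each case pins a query (a manager pet-peeve, an angry-email repro)
    # to doc ids that must / must not appear in the top-k results.
    failures = []
    for case in suite:
        top = search_fn(case["query"])[: case.get("k", 10)]
        missing = [doc for doc in case["must_contain"] if doc not in top]
        banned = [doc for doc in case.get("must_not_contain", []) if doc in top]
        if missing or banned:
            failures.append(
                {"query": case["query"], "missing": missing, "banned": banned}
            )
    return failures

suite = [
    {"query": "apple iphone", "k": 3,
     "must_contain": ["iphone-12"],
     "must_not_contain": ["iphone-case"]},
]

def toy_search(query):
    # Stand-in for the real engine; returns ranked doc ids.
    return ["iphone-12", "iphone-11", "iphone-case"]

failures = run_search_suite(toy_search, suite)
```

Run it on every ranking change and you catch the regression (here: the accessory leaking into the top 3) before the angry email does.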
Nice insightful comment. Can you give a bit additional insight related to when it makes sense to use ES vs Postgres? My takeaway from your comment is that unless your product depends on search as a central component that inbuilt search is good enough. Is that an incorrect takeaway?
That's probably a decent takeaway. If full-text search is a "nice to have" (a bonus feature for advanced users, for example) you can probably rely on built-in Postgres search. If search is a fundamental way that users interact with your product, you want to spend the time and money to get it right, and that probably means ElasticSearch and a custom ranking function.
> Google's big insight is that how other people talk about a website is more important than how the website talks about itself, and its ranking algorithm weights accordingly.
I'm having trouble believing that seeing how top results on opinionated keywords are all SEO spam of websites no one visits by themselves.
You don't remember what a game-changer Google was back in 2000 or so. Its results are pretty awful now, probably because of Goodhart's Law, but they were such a huge improvement over Ask Jeeves et al.
An interesting exercise that most reasonably competent programmers can do now is to setup their own websearch engine with Common Crawl, Elastic MapReduce, JSoup or Gumbo or equivalent HTML parser, and ElasticSearch. It costs on the order of high-hundreds to low-thousands of $$$ to process the ~2B webpages in the Common Crawl corpus (about half the size of the first tier of Google's index) and stand up an ElasticSearch cluster with it for a few days.
You too can build your very own search engine.
Unfortunately, the result quality is roughly what AltaVista was like in 1995. That is why people keep going back to Google.
I remember when Google came into being and how amazed I was. It was something totally different, and I switched and never looked back for many years.
In recent years, however, I tend to massively agree with your sentiment and experience. Every day I fail to find the things that are really helpful on pages 1-3, sadly.
There is an incredibly large difference between effective scoring for data that is being gamed and data that is not being gamed. Most people's full text data isn't being gamed quite like web search SEO.
There is also an incredibly large difference between effective scoring for data that has deep relationships and data that does not. Most people's full text data doesn't have deep relationships like web pages do with inbound links.
Might you or anyone else have some recommendations for books or other resources on large-scale search architecture that you think are worthy reads on the subject?
I would also be interested in hearing if you or anyone else might have any similar resources you could recommend on the subject of "search ranking".
"Managing gigabytes" is one of the standard introductory textbooks. It's quite dated now (published 1994, before Google was started, and hence why it says "gigabytes" rather than "exabytes"), but a lot of the fundamental concepts are the same. At least it'll give you a vocabulary and conceptual framework for thinking about search.
I’ve never really had a good idea of what Percolator is. I understand it to be a “streaming” indexer that replaced a “batch” indexer based on MapReduce. ES allows indexing individual documents (or small batches) vs. rebuilding the entire index; is that the connection you’re making?
Hi, I started an Elasticsearch hosting company, since sold, and have built products on PG's search and SQLite FTS search.
There are in my mind two reasons to not use PG's search.
1. Elasticsearch allows you to build sophisticated linguistic and feature scoring pipelines to optimize your search quality. This is not a typical use case in PG.
2. Your primary database is usually your scaling bottleneck even without adding a relatively expensive search workload into the mix. A full-text search tends to be around as expensive as a 5% table scan of the related table. Most DBAs don't like large scan workloads.
Do you see any particular reasons to use or not use ZomboDB [1]? It claims to let you use ElasticSearch from PG seamlessly, e.g. it manages coherency of which results ought to be returned according to the current transaction. (I've never quite ended up needing to use ES, but it's always seemed to me I'd be likely to need ZomboDB if I did.)
Though I have only a recreational interest in datastores, I would be pretty wary of a service that claims to strap a Consistent-Unavailable database (Postgres) to an Inconsistent-Available database (ElasticSearch) in a useful way.
Doing so would require ElasticSearch to reach consensus on every read/write, which would remove most of the point of a distributed cluster. Despite this, ZomboDB's documentation says "complex aggregate queries can be answered in parallel across your ElasticSearch cluster".
They also claim that transactions will abort if ElasticSearch runs into network trouble, but the ElasticSearch documentation notes that writes during network partitions don't wait for confirmation of success[1], so I'm not sure how they would be able to detect that.
> Doing so would require ElasticSearch to reach consensus on every read/write
ZomboDB only requires that ES have a view of its index that's consistent with the active Postgres transaction snapshot. ZDB handles this by ensuring that the ES index is fully refreshed after writes.
This doesn't necessarily make ZDB great for high-update loads, but that's not ZDB's target usage.
> They also claim that transactions will abort if ElasticSearch runs into network trouble...
I had to search my own repo to see where I make this claim. I don't. I do note that network failures between PG & ES will cause the active Postgres xact to abort. On top of that, any error that ES is capable of reporting back to the client will cause the PG xact to abort -- ensuring consistency between the two.
Because the ES index is properly refreshed as it relates to the active Postgres transaction, all of ES' aggregate search functions are capable of providing proper MVCC-correct results, using the parallelism provided by the ES cluster.
I don't have the time to detail everything that ZDB does to project Postgres xact snapshots on top of ES, but the above two points are the easy ones to solve.
ZomboDB scratches an itch in a way I find fascinating, though I have yet to do more with it than shoehorn it into a prototype that was a bit square-peg-in-a-round-hole. It is in my catalogue of technologies I hope to exploit someday. And I hope you're enjoying the work and making some $$$ too.
While I'm here might I ask, are you finding the hosted PostgreSQL services (AWS, Azure, etc.) growing or shrinking your market opportunities? Also, does it play nice with Citus?
> As such, any sort of failure either with ZomboDB itself, between Postgres and Elasticsearch (network layer), or within Elasticsearch will cause the operating Postgres transaction to ABORT. [1]
In my defense, there is a fairly important distinction between "any error that ES is capable of reporting back" and "any sort of failure within Elasticsearch".
That said, I trust that you're more familiar with consistency levels than I am, so I'll bow out here.
That seems like one for the philosophers. If you completely rewrite a Flash site in HTML5, but it looks the same and has the same URL, is it still the same site?
Regarding point 2:
Shouldn't you be moving your search queries from your transaction server to a separate analysis or read-replica server? OLTP copies to OLAP and suddenly you've separated these two problems.
I agree with reason 1, but reason 2 is an answer for, "should I use PG search in the same PG instance I already have", and that's a different discussion. You can set up a replica for that.
Disclaimer: I use Postgres full-text search in production and am very happy with it, although maintaining the various triggers and stored procs it requires becomes cumbersome whenever you have to write a migration that alters any of them (or that in fact touches any related field, as you may be required to drop and recreate all of the parts in order not to violate referential integrity).
It is certainly nice not having to worry about one additional dependency when deploying, though.
Regarding point 2, my implementation was to have a duplicate database server where all the search queries are sent. This ensured that search wouldn't slow down the main database, and most of the replication happened quickly enough that the search results were almost up to date.
I moved from ElasticSearch to PG FTS in production, and here are the things I had to give up:
1. PostgreSQL has a cap on column length, and the search index has to be stored in a column. The length of the column is indeterminate - it is storing every word in the document and where it's located, so a short document with very many unique words (numbers are treated as words too) can easily burst the cap. This means you have to truncate each document before indexing it, and pray that your cap is set low enough. You can use multiple columns but that slows down search and makes ranking a lot more complicated. I truncate documents at 4MB.
2. PostgreSQL supports custom dictionaries for specific languages, stemmers, and other nice tricks, but none of those are supported by AWS because the dictionary gets stored as a file on the filesystem (it's not a config setting). You can still have custom rules like whether or not numbers count as words.
My experience has been that sorting by relevance ranking is quite expensive. I looked into this a bit and found https://github.com/postgrespro/rum (and some earlier slide decks about it) that explains why the GIN index type can't support searching and ranking itself (meaning you need to do heap scans for ranking). This is especially problematic if your users routinely do searches that match a lot of documents and you only want to show the top X results.
Edit: if any of the Crunchy Data people are reading this: support for RUM indexes would be super cool to have in your managed service.
Actually, that's a great suggestion. We need to take a deeper look at the code itself and our ability to support the extension, but we'll definitely evaluate it in some of our upcoming roadmap planning for Crunchy Bridge (https://www.crunchybridge.com).
If I could have your other ear for a moment: support for any EU-based cloud provider would be super nice. Ever since the Privacy Shield fig leaf was removed, questions about whether we store any data under US jurisdiction have become a lot more frequent.
We're currently on the big 3 US ones and in the process of working on our 4th provider; our goal is very much to deliver the best Postgres experience, whether on bare metal/on-premise or in the cloud, self-hosted or fully managed.
Edit: Always feel free to reach out directly, I'm always happy to spend time with anyone that has questions and usually pretty easy to track me down.
This is the key one for us that makes Postgres a non-starter for FTS (We use postgres for everything else)
We begrudgingly use Solr instead (we started before ES was really a thing and haven't found a need to switch yet)
When you get more than about two different types of filters (e.g. types of filters could be 'tags', 'categories', 'geotags', 'media type', 'author' etc), the combinatorial explosion of Postgres queries required to provide facet counts gets unmanageable.
For example, when I do a query filtered by `?tag=abc&category=category1`, I need to do these queries:
- `... where tag = 'abc' and category_id = 1` (the current results)
- `count(*) ... where tag = 'def' and category_id = 1` (for each other tag present in the results)
- `count(*) ... where tag = 'abc' and category_id = 2` (for each other category present in the results)
- `count(*) ... where category_id = 1`
- `count(*) ... where tag = 'abc'`
- `count(*)`
There are certainly smarter ways to do this than lots of tiny queries, but all this complexity still ends up somewhere and isn't likely to be great for performance.
Whereas solr/elasticsearch have faceting built in and handle this with ease.
You could run this in a single query `count(*) .. group by (tag, category_id)` and then post process the results.
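For illustration, the client-side post-processing over the rows such a grouped query returns might look roughly like this (toy data, not a drop-in solution):

```python
from collections import Counter

def facet_counts(rows, active):
    # rows: (tag, category_id, count) tuples from one
    # `count(*) ... group by tag, category_id` query over the search
    # results. Derive the per-facet counts a UI needs: for each facet,
    # counts with the OTHER active filters still applied.
    tag_counts, cat_counts = Counter(), Counter()
    for tag, cat, n in rows:
        if cat == active["category_id"]:
            tag_counts[tag] += n          # vary tag, hold category fixed
        if tag == active["tag"]:
            cat_counts[cat] += n          # vary category, hold tag fixed
    return tag_counts, cat_counts

rows = [("abc", 1, 5), ("abc", 2, 2), ("def", 1, 3), ("def", 2, 7)]
tags, cats = facet_counts(rows, {"tag": "abc", "category_id": 1})
```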
Counting by groups is still relatively expensive in Postgres, though; I built an implementation which uses estimated counting (via the query planner) for larger buckets and exact counting only for smaller buckets, but my use case allowed for some degree of inconsistency.
This is exactly the issue I'm currently facing. We do a bunch of count queries to calculate facets, and I'm looking for something that can do this out of the box.
I’m glad I came to the same conclusion myself, either solr or elasticsearch might be the way to go. Starting this from scratch, which of the two would you recommend and why?
I myself am currently setting up Solr for a project, as my experience with ES on the DevOps side is not a happy one: always nodes/indexes having some kind of problem. At the same time, I have also worked with Solr for years before and never met any major issues. It just works and does the job well.
Unfortunately I don't know enough about elasticsearch to provide a useful comparison. I like the way ES queries are structured JSON instead of obscure compact querystring parameters, but that's not a good reason to choose one over the other :)
Postgres was not good at (for us):
- IDF and other corpus-based relevancy measures; had to hand-roll them
- thesaurus and misspelling support; again possible, of course, with preprocessing and by adding config files
- non-Latin-alphabet languages, e.g. Arabic: needed filesystem access (we used AWS RDS so couldn't do it) to add a dictionary-based stemmer/word breaker
We used ES or Solr for those cases. For English FTS with 100k documents, doing it in PG is super easy and one less dependency.
Semantic search using text embeddings. With Open Distro for Elasticsearch you can store your text embeddings and then perform a nearest-neighbor search[1] to find most similar documents using cosine similarity[2]. Elasticsearch (vanilla) will get this feature with 8.0.
If migrating to ES makes you groan you can use a managed service like Pinecone[3] (disclaimer: I work there) just for storing and searching through text embeddings in-memory through an API while keeping the rest of your data in PG.
It can't go that far. Postgres is limited by dictionaries, stemming/processing, query semantics (like fuzzy searching), and the biggest issue of all being a lack of modern relevance algorithms. It's good for limited scenarios where you just need more than a SQL LIKE statement, and chaining some functions together can get you decent results [1] without adding another datastore.
However, search tech is pretty mature, with Lucene at the core, and there are many better options [2], from in-process libraries to simple standalone servers to full distributed systems like Elastic. There are also other databases (relational like MemSQL, or document stores like MongoDB/RavenDB) that are adding search as native querying functions with most of the abilities of ES. If search is a core or complex part of your application (like patterns in raw image data or similarities in audio waveforms) then that's where ES will excel.
Looks like we get 37 results, of which 2 are true positives.
Looks like "your" and "own" are both contained in the english.stop stopwords list. So you could fix this by removing stopwords from your dictionary.
While disabling the stemmer is relatively easy (use the 'simple' language setting for your ts_query), altering the stopword dictionaries is more involved, and not easy to maintain or pass between developers/environments, and not at all easy to share between queries.
And so the most common suggestion is to use ILIKE.
I had some issues with this recently, as I couldn't get an FTS query to find something looking like a path or URL. As an example from your site: https://simonwillison.net/2020/Jan/6/sitemap-xml/ contains the exact text https://www.niche-museums.com/, but I cannot find a way to search for that phrase exactly (trying https://simonwillison.net/search/?q=%22www.niche-museums.com... works though). I tried both in my own psql setup and on your site, and it seems like exact phrase searching is limited to the language used, even if it would be an exact string match.
ILIKE itself is a linear scan. The only way to index it is with trigram indexes, which are very inefficient (and sometimes not usable) if you're searching document-length content.
Whether or not it works in your specific situation depends on your use case.
If I recall correctly, Postgres search doesn't scale well. Not sure where it falls apart but it isn't optimized in the same way something like Solr is.
I have a table with over a billion rows and most full-text searches still respond in around a few milliseconds. I think this will depend on a lot of factors, such as proper indexing, and filtering down the dataset as much as possible before performing the full-text ops. I've spent a considerable amount of time on optimizing these queries, thanks to tools like PgMustard [0]. Granted, I do still have a couple slow queries (1-10s query time), but that's likely due to very infrequent access i.e. cold cache.
I will say, if you use open source libraries like pg_search, you are unlikely to ever have performant full-text search. Most full-text queries need to be written by hand to actually utilize indexes, instead of the query-soup that these types of libraries output. (No offense to the maintainers -- it's just how it be when you create a "general" solution.)
Silly question: I'm using pg right now and most of my queries are something like this (in English):
Find me some results in my area that contain these categoryIds and are slotted to start between now and next 10 days.
Since it's already quite a filtered set of data, would that mean I should have few issues adding pg text search, because with correct indexing and all, it will usually be applied to a small set of data?
I'm not a DBA, so I can't say for certain simply due to a gap in my knowledge. But in my experience, it depends on a lot of factors. Sometimes pg will use an index before performing the search ops, other times a subquery is needed. Check out pgmustard and dig into your slow query plans. :)
You might be just fine adding an unindexed tsvector column, since you've already filtered down the results.
The GIN indexes for FTS don't really work in conjunction with other indices, which is why https://github.com/postgrespro/rum exists. Luckily, it sounds like you can use your existing indices to filter and let postgres scan for matches on the tsvector. The GIN tsvector indices are quite expensive to build, so don't add one if postgres can't make use of it!
I'm not well versed on modern pg for this use, but when I managed a Solr instance ~5 years ago, it was the ranking of the results that was the killer feature. Finding results fast most systems can do. Knowing which results to present is harder.
Our case was a domain specific knowledge base, with certain terms occurring often in many articles. Searching for a term could bring up thousands of results, but few of them were actually relevant to show in context of the search, they just happened to use the term.
It lays the groundwork by storing the needed metadata in the index; yes, it needs more work, and I wish there were more interest in doing it.
I really think this is a neglected area, and if PG were able to merge in TF-IDF and BM25 there would be little reason to use a separate search db/engine, and many advantages to it being integrated.
If you are looking to do semantic search (Cosine similarity) + filtering (SQL) on data that can be represented as vectors (audio, text, video, bio) I suggest, https://github.com/ankane/pgvector
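For anyone unfamiliar, a brute-force Python sketch of the semantics (pgvector's indexes exist to make this fast at scale; the `nearest` helper here is just the idea, not its API):

```python
import math

def cosine_similarity(a, b):
    # 1.0 = same direction, 0.0 = orthogonal, -1.0 = opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query_vec, docs, k=3):
    # Brute-force nearest-neighbor over (id, embedding) pairs; roughly
    # what `ORDER BY embedding <=> query LIMIT k` does semantically.
    return sorted(docs, key=lambda d: -cosine_similarity(query_vec, d[1]))[:k]

docs = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]
```

The SQL filtering part is then just a WHERE clause on the same table, which is the appeal of keeping the vectors in Postgres.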
From what I know, full text search in Postgres (and MySQL) does not have faceted search. So it only supports returning full text results from the entire index.
Actually, it is possible, but doing a search on a particular segment of rows is a very slow operation - say text search for all employees with name matching 'x', in organization id 'y'.
It is not able to utilise the index on organization id in this case, and it results in a full scan.
A few years ago we added yet-another part to our product and, whilst ES worked "okay", we got a bit weary of ES due to "some issues" (some bug in the architecture keeping things not perfect in sync, certain queries with "joins" of types taking long, demand on HW due to the size of database, no proper multi-node setup due to $$$ and time constraint, etc.; small things piling up over time).
Bright idea: let's see how far Postgres, which is our primary datastore, can take us!
Unfortunately, the feature never made it fully into production.
We thought that on paper, the basic requirements were ideal:
- although the table has multiple hundreds of millions of entries, natural segmentation by customer IDs made individual result sets much smaller
- no weighted search result needed: datetime based is perfect enough for this use-case, we thought it would be easy to come up with the "perfect index [tm]"
Alas, we didn't even get that far:
- we identified ("only") 2 columns necessary for the search => "yay, easy"
- one of those columns was multi-language; though we didn't have specific requirements and did not have to deal with language specific behaviour in ES, we had to decide on one for the TS vectorization (details elude me why "simple" wasn't appropriate for this one column; it was certainly for the other one)
- unsure which one (or both) we would need, we created both indices for one of the columns (the difference being the "language")
But once the second index was done (we had not even rolled out the feature in the app itself, which at this point was still an ever-changing MVP), we suddenly got hit by a lot of customer complaints that totally unrelated operations on this table (INSERTs and UPDATEs) had started to get slow: like 5-15 seconds slow, for something which usually takes tiny ms.
Backend developer eyes were wide open O_O
But since we knew that the second index had just finished, after checking the Postgres logs we decided to drop the FTS indices and, lo and behold, "performance problem solved".
Communication lines were very short back then (still are today, actually) and it was promptly decided we just cut the search functionality from this new part of the product and be done with it. This also solved the problem, basically (guess there's some "business lesson" to be learned here too, not just technical ones).
Since no one within the company counter argued this decision, we did not spend more time analyzing the details of the performance issue though I would have loved to dig into this and get an expert on board to dissect this.
--
A year or so later I had a bit of free time and analyzed one annoying, recurring slow-UPDATE-query problem on a completely different table, also involving FTS on a single column, also using a GIN index. That's when I stumbled over https://www.postgresql.org/docs/9.6/gin-implementation.html
> Updating a GIN index tends to be slow because of the intrinsic nature of inverted indexes: inserting or updating one heap row can cause many inserts into the index (one for each key extracted from the indexed item). As of PostgreSQL 8.4, GIN is capable of postponing much of this work by inserting new tuples into a temporary, unsorted list of pending entries. When the table is vacuumed or autoanalyzed, or when gin_clean_pending_list function is called, or if the pending list becomes larger than gin_pending_list_limit, the entries are moved to the main GIN data structure using the same bulk insert techniques used during initial index creation. This greatly improves GIN index update speed, even counting the additional vacuum overhead. Moreover the overhead work can be done by a background process instead of in foreground query processing.
In this particular case I was able to solve the occasional slow UPDATE queries with "FASTUPDATE=OFF" on that table and, thinking back to the other issue, it might have solved or minimized the impact there too.
Back to the original story: yep, this one table can have "peaks" of inserts, but it's far from "facebook scale" or whatever; basically 1.5k inserts/second was the absolute rare peak I measured, and usually it's in the <500 area. But I guess it was enough in this scenario to add latency within the database.
--
Casting my memory back further: I was always "pro" trying to minimize / get rid of ES after learning about http://rachbelaid.com/postgres-full-text-search-is-good-enou... even before we used any FTS feature. It also mentions the GIN/GiST issue but alas, in our case: ElasticSearch is good enough and, despite the troubles we've had with it, actually easier to reason about (so far).
This is anecdote, not proper feedback, since I wasn't directly involved in the topic.
My company relied on PG as its search engine and everything went well from POC to production. After a few years of production and new clients requiring volumes of data an order of magnitude above our comfort zone, things went south pretty fast.
Not many months later but many sweaty weeks of engineering after, we switched to ES and we're not looking back.
tl;dr: even with great DB engineers (which we had), I'd suggest that scale is a strong limiting factor for this feature.
Can you tell us about your scoring function? Selecting 40M results from a dataset of ~1B and returning the top 10 based on some trivial scoring function is easy, any reasonable search system will handle that. The problem is when you have to run your scoring function on all 40M matching docs to decide which 10 are the most relevant. It's even more of a problem when your scoring function captures some of the complexities of the real word, rather than something trivial like tf/idf or bm25.
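To illustrate the shape of the problem: even with bounded-memory top-k selection, the scoring function still has to touch every matching doc. A Python sketch:

```python
import heapq

def top_k(matching_doc_ids, score_fn, k=10):
    # score_fn still runs on every match; the bounded heap just keeps
    # memory at O(k) instead of sorting all 40M scored results.
    return heapq.nlargest(k, matching_doc_ids, key=score_fn)

# Toy scoring function: docs closest to id 42 are "most relevant".
best = top_k(range(100_000), lambda d: -abs(d - 42), k=3)
```

Swap the toy lambda for anything that joins in external signals and the per-match cost (not the final sort) is what dominates.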
Zulip's search is powered by this built-in Postgres full-text search feature, and it's been a fantastic experience. There's a few things I love about it:
* One can cheaply compose full-text search with other search operators by just doing normal joins on database indexes, which means we can cheaply and performantly support tons of useful operators (https://zulip.com/help/search-for-messages).
* We don't have to build a pipeline to synchronize data between the real database and the search database. Being a chat product, a lot of the things users search for are things that changed recently; so lag, races, and inconsistencies are important to avoid. With the Postgres full-text search, all one needs to do is commit database transactions as usual, and we know that all future searches will return correct results.
* We don't have to operate, manage, and scale a separate service just to support search. And neither do the thousands of self-hosted Zulip installations.
Responding to the "Scaling bottleneck" concerns in comments below, one can send search traffic (which is fundamentally read-only) to a replica, with much less complexity than a dedicated search service.
Doing fancy scoring pipelines is a good reason to use a specialized search service over the Postgres feature.
I should also mention that a weakness of Postgres full-text search is that it only supports doing stemming for one language. The excellent PGroonga extension (https://pgroonga.github.io/) supports search in all languages; it's a huge improvement especially for character-based languages like Japanese. We're planning to migrate Zulip to using it by default; right now it's available as an option.
All of this. It’s such a good operational experience that I will actively fight against the introduction of a dedicated search tool unless it’s absolutely necessary.
Nice – looks like the ~same approach recommended here of adding a generated `tsvector` column with a GIN index and querying it with `col @@ to_tsquery('english', query)`.
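For anyone who hasn't seen that pattern, a minimal sketch (hypothetical `posts` table; requires Postgres 12+ for generated columns):

```sql
-- Generated tsvector column kept in sync automatically by Postgres,
-- indexed with GIN for fast @@ lookups.
ALTER TABLE posts
  ADD COLUMN search_vec tsvector
  GENERATED ALWAYS AS (
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
  ) STORED;

CREATE INDEX posts_search_idx ON posts USING GIN (search_vec);

SELECT id, title
FROM posts
WHERE search_vec @@ to_tsquery('english', 'postgres & search');
```

The generated column means there is no trigger or application code to keep the index fresh; every committed write is immediately searchable.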
Off-topic, but curious: what are your use cases when searching all HN and Reddit comments? I'm at the beginning of this path — I've just crawled HN, but what to do with it is still a bit cloudy.
I built a search engine that quantified the expertise of authors of comments. Then I created what I called “expert rank” that allowed me to build a really good search engine.
Huh, just yesterday I blogged[0] about using FTS in SQLite[1] to search my PDF database. SQLite's full-text search is really excellent. The thing that tripped me up for a while was `GROUP BY` with the `snippet`/`highlight` function but that's the point of the blog post.
We've been using SQLite's FTS capabilities to index customer log files since 2017 or so. It's been a wonderful approach for us. Even if we move to our own in-house data store (event sourced log), we would still continue using SQLite for tracing because it brings so many of these sorts of benefits.
A very well written article about SQLite FTS5!
One question - it seems that your search result displays the matching page number; how did you do that? Because as far as I know, unlike FTS4, FTS5 has no `offsets` function.
I index each page individually. It's not in the article (because I also don't explain separating out the metadata) but I did weigh the options. I considered indexing page pairs, for example (and having every page indexed twice). But I figured that if search terms are broken across pages, they're likely to also be separated by text in headers & footers, page numbers, and footnotes. So in the end I decided to just index individual pages. The FTS table now has `id`, `page_number`, and `content` columns (where `id` is the foreign key to a table that stores the metadata).
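A rough sketch of that schema in SQLite (column and table names are my guesses, not the author's actual code):

```sql
-- Metadata lives in an ordinary table; one FTS5 row per PDF page.
CREATE TABLE documents (
  id     INTEGER PRIMARY KEY,
  title  TEXT,
  author TEXT
);

CREATE VIRTUAL TABLE pages USING fts5(
  id UNINDEXED,           -- foreign key back to documents
  page_number UNINDEXED,  -- carried along, not searched
  content
);

-- Matching pages with a highlighted excerpt, joined back to metadata.
-- snippet()'s second argument is the 0-based column index (2 = content).
SELECT documents.title, pages.page_number,
       snippet(pages, 2, '[', ']', '...', 10) AS excerpt
FROM pages
JOIN documents ON documents.id = pages.id
WHERE pages MATCH 'full text search'
ORDER BY pages.rank;
```

Since each page is its own FTS row, the page number comes back "for free" with every match — no `offsets` needed.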
I know Postgres and SQLite have mostly different purposes but FWIW, SQLite also has a surprisingly capable full-text search extension built right in: https://www.sqlite.org/fts5.html
It's very impressive, especially considering the SQLite version you're already using probably has it enabled already. I use it for a small site I run and it works fantastic. Little finicky with deletes and updates due to virtual tables in SQLite, but definitely impressive and has its uses.
> You could also look into enabling extensions such as unaccent (remove diacritic signs from lexemes) or pg_trgm (for fuzzy search).
Trigrams (pg_trgm) are practically needed for usable search when it comes to misspellings and compound words (e.g. a search for "down loads" won't return "downloads").
I also recommend using websearch_to_tsquery instead of using the cryptic syntax of to_tsquery.
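A quick sketch of both suggestions together (hypothetical `articles` table, assuming the `pg_trgm` extension is available):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- websearch_to_tsquery accepts Google-style input (quoted phrases,
-- OR, -exclusions) instead of to_tsquery's & | ! operators.
SELECT title
FROM articles
WHERE search_vec @@ websearch_to_tsquery('english', '"full text" search -elastic');

-- Trigram similarity catches what stemming can't, e.g. misspellings.
-- % is pg_trgm's similarity operator.
SELECT title
FROM articles
WHERE title % 'donwloads'
ORDER BY similarity(title, 'donwloads') DESC;
```

The two compose well: use the tsquery match as the primary filter and fall back to trigram similarity when it returns nothing.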
Trigrams are amazing. I was doing a sideproject where I wanted to allow for substring searching, and trigrams seemed to be the only way to do it (easily/well) in postgres. Gitlab did a great writeup on this a few years ago that really helped me understand it:
Keep wondering if RUM Indexes [1] will ever get merged for faster and better ranking (TF/IDF). Really would make PG a much more complete text search engine.
Postgres Full-Text search is a great way to get search running for a lot of standard web applications. I recently used just this in Elixir to set up a simple search by keyword. My only complaint was Ecto (Elixir's query builder library) doesn't have first class support for it and neither does Postgrex the lower level connector they use. Still, using fragments with sanitized SQL wasn't too messy at all.
We get really nice results with gist indexes (gist_trgm_ops) searching across multiple entity types to do top X queries. It’s very useful to be able to make a stab at a difficult-to-spell foreign football player’s name, possibly with lots of diacritics, and get quick results back. I’m always surprised when I find a search engine on any site that is so unkind as to make you spell things exactly.
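A minimal version of that setup (hypothetical `players` table; assumes `pg_trgm` is installed):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GiST trigram index: supports both the % similarity operator
-- and the <-> distance operator for "top X" ordering.
CREATE INDEX players_name_trgm_idx
  ON players USING GIST (name gist_trgm_ops);

-- Misspelled, diacritic-mangled input still finds the right player.
SELECT name
FROM players
WHERE name % 'Ibrahimovitch'
ORDER BY name <-> 'Ibrahimovitch'
LIMIT 10;
```

The `ORDER BY ... <-> ...` form is the useful part: GiST can walk the index in distance order, so the top-X query doesn't have to score every row.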
TBH I hadn't known you could do weighted ranking with Postgres search before.
Curious there's no mention of zombodb[0] though, which gives you the full power of elasticsearch from within postgres (with consistency, no less!). You have to be willing to tolerate slow writes, of course, so using postgres' built-in search functionality still makes sense for a lot of cases.
Zombo is definitely super interesting and we should probably add a bit in the post about it. Part of the goal here was that you can do a LOT with Postgres, without adding one more system to maintain. Zombo is great if you already have Elastic around and want Postgres as the primary interface — but what if you don't want to maintain Elastic at all?
My ideal, though, is always to start with Postgres, and then see if it can solve my problem. I would never claim Postgres is the best at everything it can do, but for most things it is good enough without having another system to maintain and wear a pager for.
Zombo does at least promise to handle "complex reindexing processes" for you (which IME can be very painful) but yeah, I assume you'd still have to deal with shard rebalancing, hardware issues, network failures or latency between postgres and elastic, etc etc.
The performance and cost implications of Zombo are more salient tradeoffs in my mind – if you want to index one of the main tables in your app, you'll have to wait for a network roundtrip and a multi-node write consensus on every update (~150ms or more[0]), you can't `CREATE INDEX CONCURRENTLY`, etc.
All that said, IMO the fact that Zombo exists makes it easier to pitch "hey lets just build search with postgres for now and if we ever need ES's features, we can easily port it to Zombo without rearchitecting our product".
My worst search experiences always come from the features applauded here. Word stemming and removing stop words is a big hurdle when you know what you are looking for but get flooded by noise because some part of the search string was ignored. Another issue is having to type out a full word before you get a hit in dynamic search boxes (looking at you Confluence).
I'd argue that isn't a problem with the feature, but a thoughtless implementation.
A good implementation will weigh verbatim results highest before considering the stop-word stripped or stemmed version. Configuring to_tsvector() to not strip stop words or using a stemming dictionary is, in my opinion, a little clunky in Postgres: You'll want to make a new [language] dictionary and then call to_tsvector() using your new dictionary as the first parameter.
After you've set up the dictionary globally, this would look something like:
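A minimal sketch of that setup (configuration and dictionary names are mine; adjust the token-type mappings to your content):

```sql
-- A "simple"-template dictionary: keeps stop words, does no stemming.
CREATE TEXT SEARCH DICTIONARY verbatim_dict (
  TEMPLATE = simple
);

-- A configuration cloned from 'english', with word tokens remapped
-- to the verbatim dictionary.
CREATE TEXT SEARCH CONFIGURATION verbatim (COPY = english);
ALTER TEXT SEARCH CONFIGURATION verbatim
  ALTER MAPPING FOR asciiword, word WITH verbatim_dict;

-- Verbatim: stop words kept, 'running' left unstemmed.
SELECT to_tsvector('verbatim', 'the running dogs');
-- Default: stop words stripped, stemmed lexemes.
SELECT to_tsvector('english', 'the running dogs');
```

You can then combine both, e.g. `setweight(to_tsvector('verbatim', doc), 'A') || setweight(to_tsvector('english', doc), 'B')`, so verbatim matches rank above stemmed ones.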
I think blaming Postgres for adding stemming/stop-word support because it can be [ab]used for a poor search user experience is like blaming a hammer for a poorly built home. It is just a tool, it can be used for good or evil.
PS - You can do a verbatim search without using to_tsvector(), but that cannot be easily passed into setweight() and you cannot use features like ts_rank().
It's just a project to learn all the things that are web.
It's mainly a database for a game now with most of the information sourced from the game (including its localization files).
I'm slowly transitioning from MariaDB to Postgres - again as a learning experience. There is cool stuff and there is annoying stuff to reproduce things like case-insensitive + ignore accents (utf8_general_ci) in Postgres.
I've looked into FTS and searching for missing dictionaries to support all the locales but Chinese is one of the harder ones.
One (dev) project here, we're up to 5 locales with a surprisingly small number of customers. The problem is that when your customers are global, all of a sudden a single customer can bring along multiple locales. I very much regret not taking localization far more seriously early in development, but we were blindsided by the interest outside the Anglosphere.
I used Postgres full-text search for mail log feature on my email forward app https://hanami.run
Essentially it allows arbitrary queries on from/to/subject/body. One thing that makes full-text search work great for me is that I don't need to sort or rank by relevance; I just show a list of emails that match the query, ordered by their id.
I also don't do pagination and counting; instead users have to load more pages, and the id of the last email is passed to the query as a point of comparison (where id < requests.get.before).
And with that strategy, full-text search works great for us, since we don't really want to bring in Elasticsearch when only about 20% of users use this feature.
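That keyset-pagination approach can be sketched like this (hypothetical `mail_log` schema, not the app's actual one):

```sql
-- No OFFSET, no COUNT(*): the client passes the smallest id it has
-- seen so far as the "before" cursor, and each page is one cheap
-- index-ordered scan.
SELECT id, sender, recipient, subject
FROM mail_log
WHERE search_vec @@ websearch_to_tsquery('english', 'invoice')
  AND id < 184231          -- cursor from the previous page
ORDER BY id DESC
LIMIT 50;
```

Skipping relevance ranking and total counts is what keeps this fast: Postgres can stop as soon as it has 50 matches instead of scoring everything.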
I've seen Elasticsearch set up for applications that would have equal benefit from just using the postgresql db's full-text search they already have access to.
The additional complexity is usually incurred when the data in postgresql changes, and those changes need to be mirrored up to Elasticsearch. Elasticsearch obviously has its uses, but for some cases, postgresql's built in full-text search can make more sense.
Is there an alternative to ES that scales nicely? I'm running the ELK stack for logging using AWS Elasticsearch. Logs have unpredictable traffic volume, and even an overprovisioned ES cluster gets clogged sometimes. I wonder if there's something more scalable than ES that also has a nice GUI like Kibana?
It's more a matter of configuring it right. I'd recommend trying out Elastic Cloud. It's a bit easier to deal with than Amazon's offering and much better supported. AWS has always been a bit hands-off on that front. Their opensearch project does not seem to break that pattern so far.
Also, with Elastic Cloud you get some access to useful features for logging (like life cycle management and data streams) that will help you scale the setup.
Kibana in recent iterations has actually improved quite a bit. The version you are getting from Amazon is probably a bit bare bones in comparison. One nice thing with Elastic is that going with the defaults gets you some useful dashboards out of the box if you use e.g. file or docker beats for collecting logs.
If you use life cycle management and data streams (which AWS doesn't have), you'd be able to control the sizes of your hot indices (i.e. the ones you write to). Basically keeping your hot indices small helps keeping things fast. If you have issues with app writes spiking, I'd use some queuing solution in between. Plenty of solutions for that.
The rest is just a matter of configuring things right in terms of number of shards and setting up properly. Basically, you get what you pay for in the end.
> If you use life cycle management and data streams (which AWS doesn't have), you'd be able to control the sizes of your hot indices (i.e. the ones you write to)
AWS has had lifecycle management for about a year now. And index management is not cluster scaling; it's good advice, but not what I asked about.
I don't know what features AWS provides but in general terms logs benefit a lot from compression, so if I were to set up this on my own I'd probably want to try something like a VDO or ZFS backed storage system as well as compressed transfers (perhaps in batch if that's required).
I have no idea, but I’m assuming that regardless of which one you use there will be disk level duplication that could be taken care of to alleviate uncontrolled growth somewhat.