Hacker News
Lucene: The Good Parts (2015) (parse.ly)
80 points by todsacerdoti on March 30, 2022 | 15 comments


Memory lane. Way back in the mists of time, circa 2004 or 5 I wanted to learn about search indexing. I was a pre-rails ruby head and translated a large portion of lucene's index code into ruby as a learning exercise. The result was abysmal and somewhat wonky, but I did learn a bunch. Both about lucene and ruby's FFI.

Reading and translating code is such a great way to internalize a concept that you're unfamiliar with, while getting a glimpse into someone else's mental model.

Lots of people tell you to read code, but it's hard to overstate the power of filtering a codebase through your brain and out your fingertips.


Neat! A bit over a decade ago, I had a similar experience with Lucene (not that I implemented it from scratch, but I certainly used it in a fairly unorthodox manner for the time). I had to implement some search functionality, and Elasticsearch was still in its infancy, so it was not necessarily the "obviously right choice" it has been more recently for this kind of document search job.

I implemented a multi-tenant search engine on top of Lucene using C# and Azure Blob Storage under the direction of my manager at the time. This was pretty cool, because I had actually learned about TF-IDF and search technologies in school, so it was nice to be using some of that knowledge. And there were a lot of problems to solve with regard to locking, index update coordination, etc. that, as we know, ES takes care of for us today. Anyway, the project was a success, launched, and backed a couple of products for a couple of years until it was decommissioned because outside forces had basically made it irrelevant.

That knowledge and experience seems to have ultimately led me down the path to becoming the resident ES "expert" at my current position.


Lucene was my introduction to concepts in document search such as TF-IDF and attribute-based information retrieval, back when I worked at an eCommerce company where these problems were our bread and butter. While Lucene was incredibly good at what it did, the concepts it exposed were so 'low-level' that I always felt higher-level "wrappers" around it, such as Solr and Elasticsearch, were much easier to get started with and scale up (and in many ways, more idiot-proof), even for someone who was not a novice in the field.
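For anyone unfamiliar with the TF-IDF idea mentioned above, here is a minimal sketch of the classic formulation (illustrative only; Lucene's actual scoring includes extra normalization and, in modern versions, defaults to BM25):

```python
# Toy TF-IDF scorer: tf * log(N / df), computed over a tiny corpus.
# This is the textbook formulation, not Lucene's exact formula.
import math
from collections import Counter

def tf_idf(docs):
    """For each document, score every term as tf * idf,
    where idf = log(N / df) over the corpus of N documents."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        scores.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return scores

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
scores = tf_idf(docs)
# "the" appears in every document, so its idf is log(3/3) = 0
```

Terms that appear everywhere ("the") score zero, while rarer terms ("brown") dominate, which is the intuition behind relevance ranking in Lucene and its wrappers.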

Lucene in Action was an incredible book though, esp. given the time when it came out, and somehow has remained quite relevant through the years too! Strong recommend!


"In 2004, Solr was created by Yonik Seeley at CNET Networks as an in-house project to add search capability for the company website."

I have no idea how I never knew that CNET created Solr! (Solr uses Lucene)


About ten years after creating Solr, Yonik Seeley joined Cloudera to work on integration between Apache Hadoop and Solr.

There's an interesting connection here: Doug Cutting, the Chief Architect at Cloudera, is best known as the creator of Apache Hadoop, but he also created Lucene. In fact, Hadoop originated from a Lucene subproject called Nutch, which aimed to build a scalable web crawler.


Wasn't the inspiration for Hadoop not just a web crawler use-case, but also Google's famous MapReduce paper?


Yes and no. The goal of the Nutch project was simply to create a web crawler, but it hit some scalability limits. Since Google had recently published two papers (MapReduce and Google Filesystem) that were quite relevant to scaling data processing and storage for a web crawler, Doug and Mike created an open source implementation of those ideas and redesigned the web crawler to use it.

The technology had many applications beyond a web crawler, of course, but that was the original use case.
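The two-phase model from those papers can be sketched in a few lines (a single-process toy, of course; the whole point of Hadoop was running these phases across many machines with fault tolerance):

```python
# Toy word count in the MapReduce style from Google's paper.
# Illustrative sketch only: real implementations distribute the map,
# shuffle, and reduce phases across a cluster.
from collections import defaultdict

def map_phase(doc):
    # Emit a (word, 1) pair for every word in the input document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Group all intermediate values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Combine the values for each key; here, sum the counts.
    return {key: sum(values) for key, values in grouped.items()}

docs = ["the quick fox", "the lazy dog"]
pairs = [p for doc in docs for p in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
# counts["the"] == 2
```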


Solr is a great tool; we use it at my job to index documents and have never had any problems with it. We chose it over Elasticsearch because it seemed simpler to set up and administer.


I didn’t know CNET created any internal tech.


Related:

Lucene: The Good Parts - https://news.ycombinator.com/item?id=9667378 - June 2015 (14 comments)

Lucene: The Good Parts - https://news.ycombinator.com/item?id=9198092 - March 2015 (16 comments)


> This meant Lucene was less concerned with things like MVCC, ACID, and 3-NF, and was instead concerned with much more practical concerns, like how to build a fast and humane interface for unstructured data.

I absolutely hate this attitude. Different use cases have different requirements. The author here appears to be dismissing any use case different than their own as not practical.


You can use Lucene and implement these types of features on top of it if they're important to you. I think what the author was trying to say is that the Lucene contributors decided to focus on a certain thing and leave other features up to people using the library.

Lucene gives you a lot of levers to control how it works and if you want to build a distributed, MVCC, ACID compliant datastore on top of it, you can. It's just not a concern of the library.


Totally agree. I have a hard time taking seriously a perspective on the merits of a data technology that is so dismissive towards concerns like MVCC, ACID, and normal forms. These have been foundational to data technologies for nearly 50 years at this point for a reason. To discard them as "impractical" indicates to me a severe immaturity of perspective.


> SQL was not then, and is still not now, a very good blob or document storage system. Yet, there seemed to be no alternative to SQL for durability, short of relying directly upon the filesystem.

Yep, because it's a query language.


(Author here.) I cleared this up a bit on a past thread. I was using "SQL" here as a shorthand for "SQL RDBMSes", not quite "SQL the language".

https://news.ycombinator.com/item?id=9670379

Also, SQL RDBMSes have changed in the intervening ~7 years. For example, PostgreSQL now has a native JSON data type[1] and many functions for working with JSON blob storage[2]. Lucene still handles unstructured text indexing and search use cases in a more direct way, however, and is still a magical analytical engine when wielded in the right way.

[1]: https://www.postgresql.org/docs/current/datatype-json.html

[2]: https://www.postgresql.org/docs/9.5/functions-json.html



