A SQLite extension for reading large files line-by-line (github.com/asg017)
157 points by polyrand on July 30, 2022 | hide | past | favorite | 29 comments


Hey, author here, happy to answer any questions! Also check out this notebook for a deeper dive into sqlite-lines, along with a slick WASM demonstration and more thoughts on the codebase itself https://observablehq.com/@asg017/introducing-sqlite-lines

I really dig SQLite, and I believe SQLite extensions will push it to another level. I rarely reach for Pandas or other "traditional" tools and query languages, and instead opt for plain ol' SQLite and other extensions. As a shameless plug, I recently started a blog series on SQLite and related tools and extensions if you want to learn more! Next week I'll be publishing more SQLite extensions for parsing HTML + making HTTP requests https://observablehq.com/@asg017/a-new-sqlite-blog-series

A few other SQLite extensions:

- xlite, for reading Excel files, in Rust https://github.com/x2bool/xlite

- sqlean, several small SQLite extensions in C https://github.com/nalgeon/sqlean

- mergestat, several SQLite extensions for developers (mainly GitHub's API) in Go https://github.com/mergestat/mergestat


I'm so interested to hear about the HTTP requests stuff.

I've played a tiny bit with SQLite Python extensions that perform HTTP requests, but I'm very unsure of the implications of this kind of thing.

What happens to the rest of the SQLite query engine if your custom function hits a page like https://httpbin.org/delay/10 with a deliberate 10 second delay before the request returns?

Maybe this is fine - your transaction takes 10 seconds longer but everything else continues to work as normal. I've not done the work to figure that out.
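A minimal sketch of the blocking behavior under discussion, using Python's stdlib sqlite3 module with a sleep standing in for the HTTP round-trip (`slow_fetch` is a made-up name for illustration). SQLite invokes user-defined functions synchronously, so the whole query stalls for the duration of the call:

```python
import sqlite3
import time

# A custom scalar function that blocks, standing in for a slow HTTP request.
# SQLite calls user functions synchronously, so the query stalls while it runs.
def slow_fetch(url):
    time.sleep(0.5)  # stand-in for a network round-trip
    return "response for " + url

conn = sqlite3.connect(":memory:")
conn.create_function("slow_fetch", 1, slow_fetch)

start = time.monotonic()
row = conn.execute("select slow_fetch('https://example.com')").fetchone()
elapsed = time.monotonic() - start

print(row[0])
print(elapsed >= 0.5)  # the query blocked for at least the full call
```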


Yup, if it hits https://httpbin.org/delay/10 then the entire script will delay for 10 seconds (unless you have a custom lower timeout setting). Which means you can only make HTTP requests serially, which sucks, but it's much simpler than coordinating with other tools

It'll come out in a week, but lmk if you want to play with it beforehand!


How will it affect other SQLite connections against the same database?

My guess is they'll be OK, but presumably any transactions will lock tables for the duration?


I think you're right, though I'm not an expert with multiple connections. I think if the DB is in WAL mode and you're only issuing read-only requests, then there will be no lock and long HTTP requests won't lock anything up. But if you issue a write request that locks tables, then other write requests will be blocked for the duration of the long requests.

But if you're in journal mode then there's not much you can do. Please correct me if I'm wrong!
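A quick sketch of the WAL behavior described above, using the stdlib sqlite3 module: while one connection holds an open write transaction (standing in for one stuck on a slow HTTP call), a second connection can still read the last committed snapshot without blocking. WAL mode requires a file-backed database, hence the temp file:

```python
import os
import sqlite3
import tempfile

# WAL mode needs a real file, not :memory:
path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(path, isolation_level=None)  # autocommit
writer.execute("pragma journal_mode=wal")
writer.execute("create table t(x)")
writer.execute("insert into t values (1)")

reader = sqlite3.connect(path)

# Writer opens a transaction and holds it open, as a slow request would.
writer.execute("begin immediate")
writer.execute("insert into t values (2)")

# Reader still sees the last committed snapshot, without blocking.
rows = reader.execute("select x from t").fetchall()
print(rows)  # the uncommitted insert of 2 is invisible

writer.execute("commit")
```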


How difficult do you think it would be to compile SpatiaLite (a big hairy C codebase) to WASM and load it as a dynamic module, like you've done with sqlite-lines?



Oh wow! I wonder how hard it would be to load that module into https://github.com/simonw/datasette-lite

Hah, turns out I have an issue for that already, which I'd forgotten about: https://github.com/simonw/datasette-lite/issues/30


I love this example from the docs:

    select
      line -> '$.id' as id,
      line -> '$.name' as name
    from lines_read("my-file.ndjson");
This is using the new -> JSON operator which was added to SQLite a few months ago.

The lines_read() thing there is a table-valued function, which means it returns a virtual table that you can query. This is a streaming operation which means it's safe to run it against a 100GB+ file without worrying about it sucking all of the data into memory at once.

Where this gets really useful is when you combine it with a create table statement:

    create table names as select
        line -> '$.id' as id,
        line -> '$.name' as name
      from lines_read("my-file.ndjson");
This should efficiently create a table with that exact subset of the data pulled from the newline-delimited JSON file.
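The same extraction can be tried from Python's stdlib sqlite3 module. Since lines_read() isn't loaded here, the NDJSON lines sit in an ordinary table standing in for it, and json_extract() is used as the long-standing spelling of the `->` operator (which needs SQLite 3.38+, released 2022-02-22):

```python
import sqlite3

# NDJSON lines held in a plain table, standing in for lines_read()
ndjson = [
    '{"id": 1, "name": "Alice"}',
    '{"id": 2, "name": "Bob"}',
]

conn = sqlite3.connect(":memory:")
conn.execute("create table lines(line text)")
conn.executemany("insert into lines values (?)", [(l,) for l in ndjson])

# json_extract() is the function form of the -> operator
rows = conn.execute("""
    select
      json_extract(line, '$.id')   as id,
      json_extract(line, '$.name') as name
    from lines
""").fetchall()
print(rows)
```

One subtlety worth knowing: `->` returns a JSON representation (so strings stay quoted), while `->>` and json_extract() return plain SQL values.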


> This is using the new -> JSON operator which was added to SQLite a few months ago.

Is this available in Postgresql? Or what would be the equivalent?


I believe the SQLite operators were directly inspired by PostgreSQL and MySQL. There's a detailed comparison here: https://sqlite.org/src/doc/json-enhancements/doc/json-enhanc...


I believe this https://sqlite.org/src/doc/json-in-core/doc/json-enhancement... is more up to date, as json_nextract is no longer in the core functions.


Postgres has similar JSON access operators, but slightly different syntax:

https://www.postgresql.org/docs/current/functions-json.html


Hey!

OctoSQL[0] author here, this is really impressive! I like this much more than the approach taken by other sqlite-based tools which first load stuff into SQLite and then let you query it.

On the other hand, it does have the downside that it doesn't automatically infer the schema of the input JSON, so you still have to manually parse the raw lines. Maybe it would be possible to surmount this by exposing a JSON-dedicated file reading function which also does schema inference (I'm not knowledgeable about SQLite internals)?
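The inference idea can be sketched outside the extension: peek at the first NDJSON line, take its keys as the schema, and generate the json_extract() select list mechanically. (`lines_json_read` doesn't exist; sqlite-lines only exposes raw lines, so this just builds the query text.)

```python
import json

# Infer column names from the first NDJSON line and generate the select list.
first_line = '{"id": 1, "name": "Alice", "age": 20}'
columns = list(json.loads(first_line).keys())

select_list = ",\n  ".join(
    f"json_extract(line, '$.{col}') as {col}" for col in columns
)
query = f"select\n  {select_list}\nfrom lines_read('students.json');"
print(query)
```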

One piece of feedback is with regard to the benchmarks: I think it would be worth it to add additional benchmarks which work on slightly more complex datasets than the one used here. I did a comparison of this vs OctoSQL on the Brazil dataset, and - as expected - sqlite-lines wiped the floor with it. However, then I ran the following queries on a slightly more complex dataset (the Amazon review dataset in this case, from SPyQL's benchmark notebook[1]):

  ~> time OCTOSQL_NO_TELEMETRY=1 ./octosql "SELECT COUNT(*), AVG(overall) 
                                            FROM books.json
                                            WHERE reviewerName = 'James'"
  +-------+-------------------+
  | count |    avg_overall    |
  +-------+-------------------+
  |  3010 | 4.402325581395349 |
  +-------+-------------------+
  real 0m49.805s
  user 0m32.169s
  sys 0m9.163s
  
  ~> time ./lines0-linux-amd64-sqlite3 :memory: "SELECT COUNT(*), AVG(json_extract(line, '$.overall'))
                                                 FROM lines_read('books.json') 
                                                 WHERE json_extract(line, '$.reviewerName') = 'James'"
  3010|4.40232558139535
  real 1m47.933s
  user 1m27.024s
  sys 0m11.559s
and as you can see, the results go in a very different direction.

But anyhow, congrats on the project, and I'm pumped to see what you come up with next!

[0]: https://github.com/cube2222/octosql

[1]: https://github.com/dcmoura/spyql/blob/master/notebooks/json_...


Hey thanks for sharing!

Re inferring the schema of input JSON: That would be slick! Though SQLite does have some limitations here with table-valued functions vs virtual tables. I won't go into the specifics, but something like this isn't possible in SQLite:

  select name, age from lines_json_read('students.json')
The "name" and "age" dynamic columns aren't possible when using the "table function" syntax, but something like this is possible using traditional "virtual table" syntax:

  create virtual table students using lines_json_read(filename="students.json");
  
  select name, age from students;
It's a small difference, but definitely possible! Parsing JSON in C is tricky, though, so I'd definitely accept contributions that figure it out.
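The fixed-column limitation can be seen with json_each(), a table-valued function built into SQLite: its columns (key, value, type, ...) are baked into the module definition, which is why a hypothetical lines_json_read() table function couldn't expose per-file dynamic columns like "name" and "age":

```python
import sqlite3

# json_each() is a built-in table-valued function; its columns are fixed
# by the module definition, regardless of what JSON you feed it.
conn = sqlite3.connect(":memory:")
rows = conn.execute(
    "select key, value from json_each(?)",
    ('{"name": "Alice", "age": 20}',),
).fetchall()
print(rows)
```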

And re benchmarks - thanks for sharing! Yeah, they're pretty basic, so I'd love to add more complex ones. With the books.json example, I think what's happening is that SQLite re-parses the JSON from scratch inside every json_extract() call - so each row gets parsed twice in that query. I also suspect the long strings in "reviewText" slow things down, but I can't be sure. Once I get some free time I'll add OctoSQL and this new books dataset to the benchmark suite
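One mitigation for the double-parse: json_extract() accepts multiple paths in a single call and returns the results as one JSON array, so the document is parsed once per row instead of once per extracted field.

```python
import json
import sqlite3

# json_extract() with several paths parses the document once and returns
# the extracted values packed into a single JSON array.
conn = sqlite3.connect(":memory:")
(combined,) = conn.execute(
    "select json_extract(?, '$.overall', '$.reviewerName')",
    ('{"overall": 5, "reviewerName": "James"}',),
).fetchone()
print(combined)
print(json.loads(combined))  # [5, 'James']
```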


Another option would be to create the virtual table used for the table-valued function using a schema to define the columns:

    CREATE VIRTUAL TABLE students_json_read USING lines_json_read(schema="students-schema.json");
Then, that one table could be used for multiple files:

    SELECT * FROM students_json_read('school1-students.json');
    SELECT * FROM students_json_read('school2-students.json');


I like the fact that author is extensively documenting the source code, following the sqlite spirit


That's interesting - is there any way to add "custom delimiter templates" (or alternative tools)? I have some files where each line contains different kinds of information, separated by their index on the row - so for example, the customer name is letter index 3 to letter index 10.


Hmm, so you have lines that have fields in fixed indexed positions? Maybe something like this:

  $ cat hn.txt 
  11alex    a
  12brian   b
  12craig   c
  $ cat test.sql
  .header on
  .mode box
  .load dist/lines0
  select
    substr(line, 1, 2) as id,
    substr(line, 3, 7) as name,
    substr(line, 11, 2) as age
  from lines_read('hn.txt')
  $ sqlite3x :memory: '.read test.sql'
  ┌────┬─────────┬─────┐
  │ id │  name   │ age │
  ├────┼─────────┼─────┤
  │ 11 │ alex    │ a   │
  │ 12 │ brian   │ b   │
  │ 12 │ craig   │ c   │
  └────┴─────────┴─────┘
Unfortunately SQLite's substr function is a bit awkward, but if each field has constant indices it may work
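The same substr() sketch can be checked from Python, with the fixed-width lines held in an ordinary table in place of lines_read(), and a trim() added to drop the padding:

```python
import sqlite3

# The same fixed-width lines as the sqlite3 shell example above
lines = ["11alex    a", "12brian   b", "12craig   c"]

conn = sqlite3.connect(":memory:")
conn.execute("create table raw(line text)")
conn.executemany("insert into raw values (?)", [(l,) for l in lines])

# substr(line, start, length) slices each fixed-width field;
# trim() strips the space padding
rows = conn.execute("""
    select
      substr(line, 1, 2)       as id,
      trim(substr(line, 3, 7)) as name,
      substr(line, 11, 2)      as age
    from raw
""").fetchall()
print(rows)
```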


These, I believe, are what are called fixed-width files.

If you use pandas, there is a reader for those `pd.read_fwf()`

I’m sure there are many more languages that support this, it’s just the only one where I had a need.


Exactly that, all banks here in Brazil are exporting transactions with this kind of format (it's called a CNAB file over here, and they are probably using COBOL systems). You have different kinds of CNAB files: CNAB 240, where you have lines with 240 characters, CNAB 400...

Inside those files you have multiple chunks and every line contains information about a transaction. So if you wanna get the transaction amount you would look for all characters between character n10 and n20 for example. (Those positions are fixed and follow the bank template).

We had success parsing it with Python/Node, but I think spinning up a SQLite instance, reading straight from the file, and doing some calculations would be much faster.


Just thinking out loud and assuming you only have this extension and want to do a quick poc: you could try to delegate the preliminary line parsing to awk (which is very powerful and works in a streaming fashion).

CNAB file -> awk -> select from line_read().

Awk could, for example, transform the fixed-width file into a delimited one (I believe this extension does not strictly require JSON-formatted inputs). You could also perform pre-filtering of the input data on the awk side (see the "/ /" syntax).

If this extension supports reading from stdin, you could build a traditional unix command pipe, otherwise you would have to glue together the two commands with a named pipe.

Depending on your sensibility, this could feel ugly or very elegant. I am confident it would be very fast, not memory bound, and can be made strict and correct.
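The preprocessing step can be sketched in Python (the awk version would be a one-liner along the same lines): slice each fixed-width record into fields and emit a delimited row that lines_read() or SQLite's CSV support could then consume. The field offsets here are made up for illustration, not a real CNAB layout:

```python
# Hypothetical (start, end) offsets for three fixed-width fields
fields = [(0, 10), (10, 20), (20, 28)]

def to_csv(line):
    # Slice each field out of the record and strip the space padding
    return ",".join(line[a:b].strip() for a, b in fields)

record = "0001234567JOHN DOE  20220730"
print(to_csv(record))  # 0001234567,JOHN DOE,20220730
```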


SQLite also has support for csv (even without this extension) so if you can process your cnab into csv that may be the easiest most supported way to go.


Very cool looking! I love using SQL for ad-hoc data analysis, my brain just groks the language so much more easily than other data query languages like python + pandas.


I've read that spatialite is not well maintained, is that true?


nice simple & reliable approach to a hard perf problem.


I have built my own database engine on top of a general-purpose data management system I invented. It has an application that will let you create a DB table from a file (JSON, JSON Lines, CSV, etc.). The application uses the Qt windowing framework, which includes a JSON parser, so I just used it.

It seems slower than I would like, so I wonder if your code could be used as a stand-alone library for parsing the files? My current implementation is able to parse about 25K JSON documents per second when each document has 20 KV pairs. I used the Chicago crime data available for download on their open data portal.

How fast will your code parse it? Here is an example of three lines:

[ { "ID" : "10224738", "Case Number" : "HY411648", "Date" : "2015-09-05T13:30:00", "Block" : "043XX S WOOD ST", "IUCR" : "0486", "Primary Type" : "BATTERY", "Description" : "DOMESTIC BATTERY SIMPLE", "Location Description" : "RESIDENCE", "Arrest" : "false", "Domestic" : "true", "Beat" : 924, "District" : 9, "Ward" : 12, "Community Area" : 61, "FBI Code" : "08B", "X Coordinate" : "1165074", "Y Coordinate" : "1875917", "Year" : 2015, "Latitude" : 41.815117, "Longitude" : -87.670000 }, { "ID" : "10224742", "Case Number" : "HY411435", "Date" : "2015-09-05T10:55:00", "Block" : "082XX S LOOMIS BLVD", "IUCR" : "0610", "Primary Type" : "BURGLARY", "Description" : "FORCIBLE ENTRY", "Location Description" : "RESIDENCE", "Arrest" : "false", "Domestic" : "false", "Beat" : 614, "District" : 6, "Ward" : 21, "Community Area" : 71, "FBI Code" : "05", "X Coordinate" : "1168430", "Y Coordinate" : "1850165", "Year" : 2015, "Latitude" : 41.744379, "Longitude" : -87.658431 }, { "ID" : "10224745", "Case Number" : "HY411654", "Date" : "2015-09-05T11:30:00", "Block" : "031XX W WASHINGTON BLVD", "IUCR" : "0320", "Primary Type" : "ROBBERY", "Description" : "STRONGARM - NO WEAPON", "Location Description" : "STREET", "Arrest" : "false", "Domestic" : "true", "Beat" : 1222, "District" : 12, "Ward" : 27, "Community Area" : 27, "FBI Code" : "03", "X Coordinate" : "1155536", "Y Coordinate" : "1900515", "Year" : 2015, "Latitude" : 41.882814, "Longitude" : -87.704326 } ]


Because you have small files, you may not benefit as much from this extension, as you'll probably be limited by OS-related issues of how fast you can open files. If you could cat them together then you might benefit from this. Also, this is for newline-delimited JSON, whereas yours looks like a single JSON blob. But ultimately, it'd only take a couple of hours to test this module with the sqlite command line application and get a rough benchmark, which is more reliable than my guess.

On a separate note, I'm not sure if that data is public, but it makes me a bit uncomfortable to see lat/long on crime data on a public forum.


The files are not that small. They can have many millions of rows in each. The example I posted was to show what each row would look like. I used it in a Json array for the example, but the file itself can be in Json Lines format.

The data set blocks off the last two digits of the street address for anonymity but it doesn't look like they did the same for lat/long. If you drill down to a single block (e.g. 055XX S ROCKWELL ST), you will see 37 different latitudes and 37 different longitudes.



