A SQLite extension for reading large files line-by-line (github.com/asg017)
157 points by polyrand on July 30, 2022 | hide | past | favorite | 29 comments


Hey, author here, happy to answer any questions! Also check out this notebook for a deeper dive into sqlite-lines, along with a slick WASM demonstration and more thoughts on the codebase itself https://observablehq.com/@asg017/introducing-sqlite-lines

I really dig SQLite, and I believe SQLite extensions will push it to another level. I rarely reach for Pandas or other "traditional" tools and query languages, and instead opt for plain ol' SQLite and other extensions. As a shameless plug, I recently started a blog series on SQLite and related tools and extensions if you want to learn more! Next week I'll be publishing more SQLite extensions for parsing HTML + making HTTP requests https://observablehq.com/@asg017/a-new-sqlite-blog-series

A few other SQLite extensions:

- xlite, for reading Excel files, in Rust https://github.com/x2bool/xlite

- sqlean, several small SQLite extensions in C https://github.com/nalgeon/sqlean

- mergestat, several SQLite extensions for developers (mainly GitHub's API) in Go https://github.com/mergestat/mergestat


I'm so interested to hear about the HTTP requests stuff.

I've played a tiny bit with SQLite Python extensions that perform HTTP requests, but I'm very unsure of the implications of this kind of thing.

What happens to the rest of the SQLite query engine if your custom function hits a page like https://httpbin.org/delay/10 with a deliberate 10 second delay before the request returns?

Maybe this is fine - your transaction takes 10 seconds longer but everything else continues to work as normal. I've not done the work to figure that out.
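A minimal sketch of the blocking behavior under discussion, using Python's stdlib sqlite3 module with a sleep standing in for the HTTP round-trip (`slow_fetch` is a made-up name for illustration). SQLite invokes user-defined functions synchronously, so the whole query stalls for the duration of the call:

```python
import sqlite3
import time

# A custom scalar function that blocks, standing in for a slow HTTP request.
# SQLite calls user functions synchronously, so the query stalls while it runs.
def slow_fetch(url):
    time.sleep(0.5)  # stand-in for a network round-trip
    return "response for " + url

conn = sqlite3.connect(":memory:")
conn.create_function("slow_fetch", 1, slow_fetch)

start = time.monotonic()
row = conn.execute("select slow_fetch('https://example.com')").fetchone()
elapsed = time.monotonic() - start

print(row[0])
print(elapsed >= 0.5)  # the query blocked for at least the full call
```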


Yup, if it hits https://httpbin.org/delay/10 then the entire script will delay for 10 seconds (unless you have a custom lower timeout setting). Which means you can only make HTTP requests serially, which sucks, but it's much simpler than coordinating with other tools

It'll come out in a week, but lmk if you want to play with it beforehand!


How will it affect other SQLite connections against the same database?

My guess is they'll be OK, but presumably any transactions will lock tables for the duration?


I think you're right, though I'm not an expert with multiple connections. I think if the DB is in WAL mode and you're only issuing read-only requests, then there will be no lock and long HTTP requests won't lock anything up. But if you issue a write request that locks tables, then other write requests will be blocked for the duration of the long requests.

But if you're in journal mode then there's not much you can do. Please correct me if I'm wrong!
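A quick sketch of the WAL behavior described above, using the stdlib sqlite3 module: while one connection holds an open write transaction (standing in for one stuck on a slow HTTP call), a second connection can still read the last committed snapshot without blocking. WAL mode requires a file-backed database, hence the temp file:

```python
import os
import sqlite3
import tempfile

# WAL mode needs a real file, not :memory:
path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(path, isolation_level=None)  # autocommit
writer.execute("pragma journal_mode=wal")
writer.execute("create table t(x)")
writer.execute("insert into t values (1)")

reader = sqlite3.connect(path)

# Writer opens a transaction and holds it open, as a slow request would.
writer.execute("begin immediate")
writer.execute("insert into t values (2)")

# Reader still sees the last committed snapshot, without blocking.
rows = reader.execute("select x from t").fetchall()
print(rows)  # the uncommitted insert of 2 is invisible

writer.execute("commit")
```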


How difficult do you think it would be to compile SpatiaLite (a big hairy C codebase) to WASM and load it as a dynamic module, like you've done with sqlite-lines?



Oh wow! I wonder how hard it would be to load that module into https://github.com/simonw/datasette-lite

Hah, turns out I have an issue for that already, which I'd forgotten about: https://github.com/simonw/datasette-lite/issues/30


I love this example from the docs:

    select
      line -> '$.id' as id,
      line -> '$.name' as name
    from lines_read("my-file.ndjson");
This is using the new -> JSON operator which was added to SQLite a few months ago.

The lines_read() thing there is a table-valued function, which means it returns a virtual table that you can query. This is a streaming operation which means it's safe to run it against a 100GB+ file without worrying about it sucking all of the data into memory at once.

Where this gets really useful is when you combine it with a create table statement:

    create table names as select
        line -> '$.id' as id,
        line -> '$.name' as name
      from lines_read("my-file.ndjson");
This should efficiently create a table with that exact subset of the data pulled from the newline-delimited JSON file.
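The same extraction can be tried from Python's stdlib sqlite3 module. Since lines_read() isn't loaded here, the NDJSON lines sit in an ordinary table standing in for it, and json_extract() is used as the long-standing spelling of the `->` operator (which needs SQLite 3.38+, released 2022-02-22):

```python
import sqlite3

# NDJSON lines held in a plain table, standing in for lines_read()
ndjson = [
    '{"id": 1, "name": "Alice"}',
    '{"id": 2, "name": "Bob"}',
]

conn = sqlite3.connect(":memory:")
conn.execute("create table lines(line text)")
conn.executemany("insert into lines values (?)", [(l,) for l in ndjson])

# json_extract() is the function form of the -> operator
rows = conn.execute("""
    select
      json_extract(line, '$.id')   as id,
      json_extract(line, '$.name') as name
    from lines
""").fetchall()
print(rows)
```

One subtlety worth knowing: `->` returns a JSON representation (so strings stay quoted), while `->>` and json_extract() return plain SQL values.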


> This is using the new -> JSON operator which was added to SQLite a few months ago.

Is this available in Postgresql? Or what would be the equivalent?


I believe the SQLite operators were directly inspired by PostgreSQL and MySQL. There's a detailed comparison here: https://sqlite.org/src/doc/json-enhancements/doc/json-enhanc...


I believe this https://sqlite.org/src/doc/json-in-core/doc/json-enhancement... is more up to date, as json_nextract is no longer in the core functions.


Postgres has similar JSON access operators, but slightly different syntax:

https://www.postgresql.org/docs/current/functions-json.html


Hey!

OctoSQL[0] author here, this is really impressive! I like this much more than the approach taken by other sqlite-based tools which first load stuff into SQLite and then let you query it.

On the other hand, it does have the downside that it doesn't automatically infer the schema of the input JSON, so you still have to manually parse the raw lines. Maybe it would be possible to surmount this by exposing a JSON-dedicated file reading function which also does schema inference (I'm not knowledgeable about SQLite internals)?
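The inference idea can be sketched outside the extension: peek at the first NDJSON line, take its keys as the schema, and generate the json_extract() select list mechanically. (`lines_json_read` doesn't exist; sqlite-lines only exposes raw lines, so this just builds the query text.)

```python
import json

# Infer column names from the first NDJSON line and generate the select list.
first_line = '{"id": 1, "name": "Alice", "age": 20}'
columns = list(json.loads(first_line).keys())

select_list = ",\n  ".join(
    f"json_extract(line, '$.{col}') as {col}" for col in columns
)
query = f"select\n  {select_list}\nfrom lines_read('students.json');"
print(query)
```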

One piece of feedback is with regard to the benchmarks: I think it would be worth it to add additional benchmarks which work on slightly more complex datasets than the one used here. I did a comparison of this vs OctoSQL on the Brazil dataset, and - as expected - sqlite-lines wiped the floor with it. However, then I ran the following queries on a slightly more complex dataset (the Amazon review dataset in this case, from SPyQL's benchmark notebook[1]):

  ~> time OCTOSQL_NO_TELEMETRY=1 ./octosql "SELECT COUNT(*), AVG(overall) 
                                            FROM books.json
                                            WHERE reviewerName = 'James'"
  +-------+-------------------+
  | count |    avg_overall    |
  +-------+-------------------+
  |  3010 | 4.402325581395349 |
  +-------+-------------------+
  real 0m49.805s
  user 0m32.169s
  sys 0m9.163s
  
  ~> time ./lines0-linux-amd64-sqlite3 :memory: "SELECT COUNT(*), AVG(json_extract(line, '$.overall'))
                                                 FROM lines_read('books.json') 
                                                 WHERE json_extract(line, '$.reviewerName') = 'James'"
  3010|4.40232558139535
  real 1m47.933s
  user 1m27.024s
  sys 0m11.559s
and as you can see, the results go in a very different direction.

But anyhow, congrats on the project, and I'm pumped to see what you come up with next!

[0]: https://github.com/cube2222/octosql

[1]: https://github.com/dcmoura/spyql/blob/master/notebooks/json_...


Hey thanks for sharing!

Re inferring the schema of input JSON: That would be slick! Though SQLite does have some limitations here with table-valued functions vs virtual tables. I won't go into the specifics, but something like this isn't possible in SQLite:

  select name, age from lines_json_read('students.json')
The "name" and "age" dynamic columns aren't possible when using the "table function" syntax, but something like this is possible using traditional "virtual table" syntax:

  create virtual table students using lines_json_read(filename="students.json");
  
  select name, age from students;
It's a small difference, but definitely possible! Parsing JSON in C is tricky, though, so I'd definitely accept contributions that figure it out.
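The fixed-column limitation can be seen with json_each(), a table-valued function built into SQLite: its columns (key, value, type, ...) are baked into the module definition, which is why a hypothetical lines_json_read() table function couldn't expose per-file dynamic columns like "name" and "age":

```python
import sqlite3

# json_each() is a built-in table-valued function; its columns are fixed
# by the module definition, regardless of what JSON you feed it.
conn = sqlite3.connect(":memory:")
rows = conn.execute(
    "select key, value from json_each(?)",
    ('{"name": "Alice", "age": 20}',),
).fetchall()
print(rows)
```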

And re benchmarks - thanks for sharing! Yeah, they're pretty basic, so I'd love to add more complex ones. With the books.json example, I think what's happening is that SQLite re-parses the JSON from scratch inside every json_extract() call - so each row gets parsed twice in that query. I also suspect the long strings in "reviewText" slow things down, but I can't be sure. Once I get some free time I'll add OctoSQL and this new books dataset to the benchmark suite
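One mitigation for the double-parse: json_extract() accepts multiple paths in a single call and returns the results as one JSON array, so the document is parsed once per row instead of once per extracted field.

```python
import json
import sqlite3

# json_extract() with several paths parses the document once and returns
# the extracted values packed into a single JSON array.
conn = sqlite3.connect(":memory:")
(combined,) = conn.execute(
    "select json_extract(?, '$.overall', '$.reviewerName')",
    ('{"overall": 5, "reviewerName": "James"}',),
).fetchone()
print(combined)
print(json.loads(combined))  # [5, 'James']
```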


Another option would be to create the virtual table used for the table-valued function using a schema to define the columns:

    CREATE VIRTUAL TABLE students_json_read USING lines_json_read(schema="students-schema.json");
Then, that one table could be used for multiple files:

    SELECT * FROM students_json_read('school1-students.json');
    SELECT * FROM students_json_read('school2-students.json');


I like the fact that author is extensively documenting the source code, following the sqlite spirit


That's interesting - is there any way to add "custom delimiter templates" (or alternative tools)? I have some files where each line contains different kinds of information, separated by their index on the row - so for example, the customer name is letter index 3 to letter index 10.


Hmm, so you have lines that have fields in fixed indexed positions? Maybe something like this:

  $ cat hn.txt 
  11alex    a
  12brian   b
  12craig   c
  $ cat test.sql
  .header on
  .mode box
  .load dist/lines0
  select
    substr(line, 1, 2) as id,
    substr(line, 3, 7) as name,
    substr(line, 11, 2) as age
  from lines_read('hn.txt')
  $ sqlite3x :memory: '.read test.sql'
  ┌────┬─────────┬─────┐
  │ id │  name   │ age │
  ├────┼─────────┼─────┤
  │ 11 │ alex    │ a   │
  │ 12 │ brian   │ b   │
  │ 12 │ craig   │ c   │
  └────┴─────────┴─────┘
Unfortunately SQLite's substr function is a bit awkward, but if each field has constant indices it may work
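The same substr() sketch can be checked from Python, with the fixed-width lines held in an ordinary table in place of lines_read(), and a trim() added to drop the padding:

```python
import sqlite3

# The same fixed-width lines as the sqlite3 shell example above
lines = ["11alex    a", "12brian   b", "12craig   c"]

conn = sqlite3.connect(":memory:")
conn.execute("create table raw(line text)")
conn.executemany("insert into raw values (?)", [(l,) for l in lines])

# substr(line, start, length) slices each fixed-width field;
# trim() strips the space padding
rows = conn.execute("""
    select
      substr(line, 1, 2)       as id,
      trim(substr(line, 3, 7)) as name,
      substr(line, 11, 2)      as age
    from raw
""").fetchall()
print(rows)
```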


These, I believe, are what are called fixed-width files.

If you use pandas, there is a reader for those `pd.read_fwf()`

I’m sure there are many more languages that support this, it’s just the only one where I had a need.


Exactly that, all banks here in Brazil are exporting transactions with this kind of format (it's called a CNAB file over here, and they are probably using COBOL systems). You have different kinds of CNAB files: CNAB 240, where you have lines with 240 characters, CNAB 400...

Inside those files you have multiple chunks and every line contains information about a transaction. So if you wanna get the transaction amount you would look for all characters between character n10 and n20 for example. (Those positions are fixed and follow the bank template).

We had success parsing it with Python/Node, but I think spinning up a SQLite instance, reading straight from the file, and doing some calculations would be much faster.


Just thinking out loud and assuming you only have this extension and want to do a quick poc: you could try to delegate the preliminary line parsing to awk (which is very powerful and works in a streaming fashion).

CNAB file -> awk -> select from line_read().

Awk could, for example, transform the fixed-width file into a delimited one (I believe this extension does not strictly require JSON-formatted inputs). You could also perform pre-filtering of the input data on the awk side (see the "/ /" syntax).

If this extension supports reading from stdin, you could build a traditional unix command pipe, otherwise you would have to glue together the two commands with a named pipe.

Depending on your sensibility, this could feel ugly or very elegant. I am confident it would be very fast, not memory bound, and can be made strict and correct.
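The preprocessing step can be sketched in Python (the awk version would be a one-liner along the same lines): slice each fixed-width record into fields and emit a delimited row that lines_read() or SQLite's CSV support could then consume. The field offsets here are made up for illustration, not a real CNAB layout:

```python
# Hypothetical (start, end) offsets for three fixed-width fields
fields = [(0, 10), (10, 20), (20, 28)]

def to_csv(line):
    # Slice each field out of the record and strip the space padding
    return ",".join(line[a:b].strip() for a, b in fields)

record = "0001234567JOHN DOE  20220730"
print(to_csv(record))  # 0001234567,JOHN DOE,20220730
```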


SQLite also has support for csv (even without this extension) so if you can process your cnab into csv that may be the easiest most supported way to go.


Very cool looking! I love using SQL for ad-hoc data analysis, my brain just groks the language so much more easily than other data query languages like python + pandas.


I've read that spatialite is not well maintained, is that true?


nice simple & reliable approach to a hard perf problem.


I have built my own database engine on top of a general-purpose data management system I invented. It has an application that will let you create a DB table from a file (JSON, JSON Lines, CSV, etc.). The application uses the Qt windowing framework, which includes a JSON parser, so I just used it.

It seems slower than I would like, so I wonder if your code could be used as a stand-alone library for parsing the files? My current implementation is able to parse about 25K JSON documents per second when each document has 20 KV pairs. I used the Chicago crime data available for download on their open data portal.

How fast will your code parse it? Here is an example of three lines:

[ { "ID" : "10224738", "Case Number" : "HY411648", "Date" : "2015-09-05T13:30:00", "Block" : "043XX S WOOD ST", "IUCR" : "0486", "Primary Type" : "BATTERY", "Description" : "DOMESTIC BATTERY SIMPLE", "Location Description" : "RESIDENCE", "Arrest" : "false", "Domestic" : "true", "Beat" : 924, "District" : 9, "Ward" : 12, "Community Area" : 61, "FBI Code" : "08B", "X Coordinate" : "1165074", "Y Coordinate" : "1875917", "Year" : 2015, "Latitude" : 41.815117, "Longitude" : -87.670000 }, { "ID" : "10224742", "Case Number" : "HY411435", "Date" : "2015-09-05T10:55:00", "Block" : "082XX S LOOMIS BLVD", "IUCR" : "0610", "Primary Type" : "BURGLARY", "Description" : "FORCIBLE ENTRY", "Location Description" : "RESIDENCE", "Arrest" : "false", "Domestic" : "false", "Beat" : 614, "District" : 6, "Ward" : 21, "Community Area" : 71, "FBI Code" : "05", "X Coordinate" : "1168430", "Y Coordinate" : "1850165", "Year" : 2015, "Latitude" : 41.744379, "Longitude" : -87.658431 }, { "ID" : "10224745", "Case Number" : "HY411654", "Date" : "2015-09-05T11:30:00", "Block" : "031XX W WASHINGTON BLVD", "IUCR" : "0320", "Primary Type" : "ROBBERY", "Description" : "STRONGARM - NO WEAPON", "Location Description" : "STREET", "Arrest" : "false", "Domestic" : "true", "Beat" : 1222, "District" : 12, "Ward" : 27, "Community Area" : 27, "FBI Code" : "03", "X Coordinate" : "1155536", "Y Coordinate" : "1900515", "Year" : 2015, "Latitude" : 41.882814, "Longitude" : -87.704326 } ]


Because you have small files, you may not benefit as much from this extension, as you'll probably be limited by OS-related issues of how fast you can open files. If you could cat them together then you might benefit from this. Also, this is for newline-delimited JSON, whereas yours looks like a single JSON blob. But ultimately, it'd only take a couple of hours to test this module with the sqlite command line application and get a rough benchmark, which is more reliable than my guess.

On a separate note, I'm not sure if that data is public, but it makes me a bit uncomfortable to see lat/long on crime data on a public forum.


The files are not that small. They can have many millions of rows in each. The example I posted was to show what each row would look like. I used it in a Json array for the example, but the file itself can be in Json Lines format.

The data set blocks off the last two digits of the street address for anonymity but it doesn't look like they did the same for lat/long. If you drill down to a single block (e.g. 055XX S ROCKWELL ST), you will see 37 different latitudes and 37 different longitudes.



