
Hey!

OctoSQL[0] author here, this is really impressive! I like this much more than the approach taken by other sqlite-based tools which first load stuff into SQLite and then let you query it.

On the other hand, it does have the downside that it doesn't automatically infer the schema of the input JSON, so you still have to manually parse the raw lines. Maybe it would be possible to surmount this by exposing a JSON-dedicated file-reading function which also does schema inference (I'm not knowledgeable about SQLite internals)?
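For illustration, here's a rough sketch (in Python, with a hypothetical helper name) of the kind of schema inference I mean, assuming newline-delimited JSON input:

```python
import json

# Hypothetical helper: infer SQLite column types from the first
# `sample` lines of a newline-delimited JSON file.
def infer_schema(path, sample=100):
    types = {}
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= sample:
                break
            for key, value in json.loads(line).items():
                t = {bool: "INTEGER", int: "INTEGER",
                     float: "REAL", str: "TEXT"}.get(type(value), "TEXT")
                prev = types.get(key)
                # Widen INTEGER to REAL if both appear for the same key.
                types[key] = "REAL" if {prev, t} == {"INTEGER", "REAL"} else t
    return types
```

A reader like that could declare the inferred columns up front instead of handing back raw lines.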

One piece of feedback is with regard to the benchmarks: I think it would be worth it to add additional benchmarks which work on slightly more complex datasets than the one used here. I did a comparison of this vs OctoSQL on the Brazil dataset, and - as expected - sqlite-lines wiped the floor with it. However, then I ran the following queries on a slightly more complex dataset (the Amazon review dataset in this case, from SPyQL's benchmark notebook[1]):

  ~> time OCTOSQL_NO_TELEMETRY=1 ./octosql "SELECT COUNT(*), AVG(overall) 
                                            FROM books.json
                                            WHERE reviewerName = 'James'"
  +-------+-------------------+
  | count |    avg_overall    |
  +-------+-------------------+
  |  3010 | 4.402325581395349 |
  +-------+-------------------+
  real 0m49.805s
  user 0m32.169s
  sys 0m9.163s
  
  ~> time ./lines0-linux-amd64-sqlite3 :memory: "SELECT COUNT(*), AVG(json_extract(line, '$.overall'))
                                                 FROM lines_read('books.json') 
                                                 WHERE json_extract(line, '$.reviewerName') = 'James'"
  3010|4.40232558139535
  real 1m47.933s
  user 1m27.024s
  sys 0m11.559s
and as you can see, the results go in a very different direction.

But anyhow, congrats on the project, and I'm pumped to see what you come up with next!

[0]: https://github.com/cube2222/octosql

[1]: https://github.com/dcmoura/spyql/blob/master/notebooks/json_...



Hey thanks for sharing!

Re inferring the schema of input JSON: That would be slick! Though SQLite does have some limitations here with table-valued functions vs. virtual tables. I won't go into the specifics, but something like this isn't possible in SQLite:

  select name, age from lines_json_read('students.json')
The "name" and "age" dynamic columns aren't possible when using the "table function" syntax, but something like this is possible using traditional "virtual table" syntax:

  create virtual table students using lines_json_read(filename="students.json");
  
  select name, age from students;
It's a small difference, but definitely possible! Parsing JSON in C is tricky, though, so I'd definitely accept contributions that figure it out.

And re benchmarks - thanks for sharing! Yeah, they're pretty basic, so I'd love to add more complex ones. With the books.json example, I think what's happening is that SQLite's JSON functions re-parse the JSON in every json_extract call - so that query parses each line twice. I also suspect that the long strings in "reviewText" might slow things down, but I can't be sure. Once I get some free time I'll add OctoSQL and this new books dataset to the benchmark suite.
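For what it's worth, one pattern that avoids the double parse is multi-path json_extract (it accepts several paths and returns a single JSON array per call), materialized into an intermediate table so each long line is parsed only once. A self-contained sketch with toy data standing in for books.json:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE reviews(line TEXT)")
data = [{"reviewerName": "James", "overall": 5.0},
        {"reviewerName": "Anna", "overall": 3.0},
        {"reviewerName": "James", "overall": 4.0}]
con.executemany("INSERT INTO reviews VALUES (?)",
                [(json.dumps(d),) for d in data])

# Multi-path json_extract parses each line once and emits a small
# JSON array; materializing into a temp table keeps SQLite from
# re-evaluating the expensive parse per reference.
con.execute("""
    CREATE TEMP TABLE parsed AS
    SELECT json_extract(line, '$.reviewerName', '$.overall') AS fields
    FROM reviews
""")
cnt, avg = con.execute("""
    SELECT COUNT(*), AVG(json_extract(fields, '$[1]'))
    FROM parsed
    WHERE json_extract(fields, '$[0]') = 'James'
""").fetchone()
print(cnt, avg)  # prints: 2 4.5
```

Whether that beats two plain json_extract calls in practice would need measuring on the real dataset.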


Another option would be to create the virtual table used for the table-valued function using a schema to define the columns:

    CREATE VIRTUAL TABLE students_json_read USING lines_json_read(schema="students-schema.json");
Then, that one table could be used for multiple files:

    SELECT * FROM students_json_read('school1-students.json');
    SELECT * FROM students_json_read('school2-students.json');
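In spirit, the schema file could just map column names to SQLite types. Here's a toy Python sketch of what such an extension might do internally - the file format and function names are my guesses, not anything the extension actually exposes:

```python
import json
import sqlite3

def create_from_schema(con, table, schema_path):
    """Create `table` with columns taken from a {name: type} JSON file."""
    with open(schema_path) as f:
        schema = json.load(f)
    cols = ", ".join(f'"{name}" {typ}' for name, typ in schema.items())
    con.execute(f'CREATE TABLE "{table}" ({cols})')
    return list(schema)

def load_ndjson(con, table, cols, data_path):
    """Insert one row per newline-delimited JSON line, picking out
    only the columns named in the schema."""
    placeholders = ", ".join("?" for _ in cols)
    with open(data_path) as f:
        con.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})',
            ([json.loads(line).get(c) for c in cols]
             for line in f if line.strip()))
```

The real extension would do this lazily per row in C, of course, but the schema-file idea maps onto it directly.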



