Simon already gave a great intro to my SQLite extensions, and I just want to point folks to Anton Zhiyanov's sqlean project, for pure-C extensions: https://github.com/nalgeon/sqlean
Also, some of these extensions are packaged as Datasette plugins, so you can run `datasette install datasette-sqlite-regex` or `datasette install datasette-sqlite-ulid` to add these extensions to your Datasette instances! https://docs.datasette.io/en/stable/plugins.html
Have you looked into cross-compiling those ARM targets using Docker and GitHub Actions? I'm not familiar with the toolchain you'd have to use, but other projects do that.
I've tried it! For Rust SQLite extensions, I'm able to use cross compilers for some extra ARM targets on Linux[0], I just haven't had the time to do it for all extensions yet. For macOS ARM builds, Cargo has the `aarch64-apple-darwin` target, which works but is a bit slower than expected at runtime. I've also tried Zig for cross-compiling C extensions to different platforms, which seemed to work OK, but I've heard of some perf issues cropping up with those.
Not yet! SpatiaLite is a very big and intimidating codebase that's hard to compile. I've been mulling the idea of creating several smaller GIS SQLite extensions, especially since SpatiaLite development is starting to go stale, but nothing to share for now.
A soon-to-be-released sqlite-loadable extension that's the fastest CSV parser for SQLite, and rivals DuckDB's parser at non-analytical queries: https://github.com/asg017/sqlite-xsv
The API for virtual tables is tricky, because you never consume the API from Rust yourself: SQLite does. However, as it is presented, I think the interface is unsafe.
In the sqlite-xsv VTab::open impl for XsvTable [0], you pass a &'vtab XsvTable to XsvCursor. But that means that if open is called again, to create a second cursor, a `&mut XsvTable` and a `&XsvTable` reference exist at the same time, which is unsafe. Note that sqlite-xsv doesn't actually keep the `&XsvTable`, so it's fine, but it serves as an example of where a safety problem could surface.
The same problem also applies to VTabWriteable::update. The fix is to make both of these methods receive an immutable reference and force implementors to use interior mutability. Note this isn't hypothetical, it's actually unsafe, and would appear if you had a unit test that implemented a writable virtual table backed by a rust data structure, and attempted to iterate and update at the same time.
I have an as-yet-unpublished code base that experiments with this. It's not published yet because it doesn't cover everything I want it to, and I'm trying to avoid publishing something where the API might have to break. My goals are explicitly safety first, performance second. https://github.com/CGamesPlay/sqlite3_ext
Please consider a GIS/spatial extension. SpatiaLite is pretty bad; for example, the KNN feature is deprecated and the replacement hasn't made it into a release yet.
Yeah, "don't make network requests from the database" is a common antipattern, but it doesn't quite apply to sqlite-http. It's mainly meant for data processing scripts that use SQL, typically on a local database file with pretty low risk. You probably wouldn't want/need sqlite-http in a SQLite DB that a live application uses, unless you use the no-network[0] option
Re pipelining: unless Go's net/http package does this magically under the hood, sqlite-http doesn't take advantage of it! If you think it would be a nice feature, file an issue and I'll see what can be done
How complex is your text processing? If I'm working with a small dataset, I typically just save the "raw" responses into a table and create several intermediate tables that extract the data I need. Something like:
create table raw_responses as select http_get('https://...') as response_body;
create table intermediate1 as select xxxx(response_body) as message from raw_responses;
create table final as select message from intermediate1;
(some CTEs[0] will be very useful here)
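That pipeline, runnable end-to-end with Python's built-in sqlite3 (hard-coded strings stand in for http_get's responses, since the extension isn't loaded here, and the `substr` extraction is just a made-up example of "the data I need"):

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Stand-in for `create table raw_responses as select http_get(...)`:
# the "response bodies" here are just hard-coded strings.
db.executescript("""
    create table raw_responses(response_body text);
    insert into raw_responses values ('hello: alpha'), ('hello: beta');

    -- intermediate table that extracts just the piece we care about
    create table intermediate1 as
      select substr(response_body, 8) as message from raw_responses;

    -- final, cleaned-up table
    create table final as select message from intermediate1;
""")

print(db.execute("select message from final order by message").fetchall())
# [('alpha',), ('beta',)]
```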
The "lambda" pattern described about half-way through this post might be useful as well, if you need to do some complex text processing.
Go's HTTP client doesn't support pipelining in HTTP/1.1. It will do keep-alive / socket re-use. HTTP/2+ has support for request multiplexing, which Go supports. Your code appears to create a new client on each request, so I'm not sure if the underlying transport will be re-used or if any multiplexing will take place. I would be surprised if it is.
On SSRF: There is a "no network"[0] option for the extension, if you don't trust SQL input. This is what the example Datasette server[1] uses
Also of note: this is mainly meant for local data processing; I doubt there's a need for sqlite-http in a SQLite DB used by regular applications. If you're just wrangling some data locally with SQLite and these extensions, I doubt there's much of a security risk
Not an extension, but I wanted to add that sqlite-utils is an incredible Python library for working with SQLite dbs. Great for squashing several lines of SQL into a one-liner.
totally not dumb - in general there isn't much upside. Maybe there's a use case in server-side rendering HTML where you already use SQLite somewhere, or in a CLI tool if you're using sqlite3. But I doubt sqlite-html will be as fast as other tools. The querying functions are much more useful than the generating ones, IMO
Re inferring the schema of input JSON: That would be slick! Though SQLite does have some limitations here with table-valued functions vs. virtual tables. I won't go into the specifics, but something like this isn't possible in SQLite:
select name, age from lines_json_read('students.json')
The "name" and "age" dynamic columns aren't possible when using the "table function" syntax, but something like this is possible using traditional "virtual table" syntax:
create virtual table students using lines_json_read(filename="students.json");
select name, age from students;
It's a small difference, but definitely possible! Though parsing JSON in C is tricky, I'd definitely accept contributions that figure it out.
And re benchmarks - thanks for sharing! Yeah, they're pretty basic, so would love to add more complex ones. With the books.json example, I think what's happening is that SQLite's JSON functions parse the JSON from scratch in each json_extract call - so it parses twice for each row in that query. I also suspect that the long strings in "reviewText" might slow things down, but can't be sure. Once I get some free time I'll add OctoSQL to the benchmark suite, along with this new books dataset
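To illustrate, here's a sketch with Python's built-in sqlite3 (assuming a SQLite build with the JSON functions, and a made-up one-row books table): json_extract also accepts multiple paths, so you can parse each document once per row instead of once per extracted column:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table books(body text)")
db.execute("""insert into books values ('{"title": "Dune", "reviewText": "great"}')""")

# Two calls: the JSON body is parsed twice per row
two = db.execute(
    "select json_extract(body, '$.title'), json_extract(body, '$.reviewText') from books"
).fetchone()

# One call with two paths: parsed once per row; the result
# comes back as a single JSON array
one = db.execute(
    "select json_extract(body, '$.title', '$.reviewText') from books"
).fetchone()

print(two)  # ('Dune', 'great')
print(one)  # ('["Dune","great"]',)
```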
Hmm, so you have lines that have fields in fixed indexed positions? Maybe something like this:
$ cat hn.txt
11alex    a
12brian   b
12craig   c
$ cat test.sql
.header on
.mode box
.load dist/lines0
select
  substr(line, 1, 2) as id,
  substr(line, 3, 7) as name,
  substr(line, 11, 2) as age
from lines_read('hn.txt')
$ sqlite3x :memory: '.read test.sql'
┌────┬─────────┬─────┐
│ id │ name    │ age │
├────┼─────────┼─────┤
│ 11 │ alex    │ a   │
│ 12 │ brian   │ b   │
│ 12 │ craig   │ c   │
└────┴─────────┴─────┘
Unfortunately SQLite's substr function is a bit awkward, but if each field has constant indices it may work
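The same trick runs anywhere SQLite does; here's a sketch with Python's built-in sqlite3 and no extension (a union-all subquery stands in for lines_read, and the field widths are made up for this sample data):

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Fixed-width layout for this sample: chars 1-2 = id, 3-9 = name, 10-11 = age
rows = db.execute("""
    select
      substr(line, 1, 2)       as id,
      trim(substr(line, 3, 7)) as name,
      substr(line, 10, 2)      as age
    from (select '11alex   30' as line
          union all
          select '12brian  25')
""").fetchall()

print(rows)  # [('11', 'alex', '30'), ('12', 'brian', '25')]
```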
Hey, author here, happy to answer any questions! Also check out this notebook for a deeper dive into sqlite-lines, along with a slick WASM demonstration and more thoughts on the codebase itself https://observablehq.com/@asg017/introducing-sqlite-lines
I really dig SQLite, and I believe SQLite extensions will push it to another level. I rarely reach for Pandas or other "traditional" tools and query languages, and instead opt for plain ol' SQLite and other extensions. As a shameless plug, I recently started a blog series on SQLite and related tools and extensions if you want to learn more! Next week I'll be publishing more SQLite extensions for parsing HTML + making HTTP requests https://observablehq.com/@asg017/a-new-sqlite-blog-series
I'm so interested to hear about the HTTP requests stuff.
I've played a tiny bit with SQLite Python extensions that perform HTTP requests, but I'm very unsure of the implications of this kind of thing.
What happens to the rest of the SQLite query engine if your custom function hits a page like https://httpbin.org/delay/10 with a deliberate 10 second delay before the request returns?
Maybe this is fine - your transaction takes 10 seconds longer but everything else continues to work as normal. I've not done the work to figure that out.
Yup, if it hits https://httpbin.org/delay/10 then the entire script will delay for 10 seconds (unless you have a custom lower timeout setting). Which means you can only make HTTP requests serially, which sucks, but it's much simpler than coordinating with other tools
It'll come out in a week, but lmk if you want to play with it beforehand!
I think you're right, though I'm not an expert with multiple connections. I think if the DB is in WAL mode and you're only issuing read-only requests, then there will be no lock and long HTTP requests won't lock anything up. But if you issue a write that takes the write lock, then other writes will be blocked for the duration of those long requests.
But if you're in rollback-journal mode, then there's not much you can do. Please correct me if I'm wrong!
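To see the WAL behavior concretely, here's a sketch with Python's built-in sqlite3 (using a temp file, since WAL needs a file-backed database): a reader on a second connection isn't blocked by a writer's open transaction; it just sees the last committed snapshot:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(path, isolation_level=None)  # autocommit; explicit BEGIN/COMMIT
mode = writer.execute("pragma journal_mode=wal").fetchall()[0][0]  # 'wal' on success
writer.execute("create table t(x)")
writer.execute("insert into t values (1)")

# Writer opens a transaction and makes an uncommitted change
writer.execute("begin immediate")
writer.execute("insert into t values (2)")

# A second connection can still read; it sees the committed snapshot
reader = sqlite3.connect(path)
before = reader.execute("select count(*) from t").fetchall()[0][0]
print(before)  # 1

writer.execute("commit")
after = reader.execute("select count(*) from t").fetchall()[0][0]
print(after)   # 2
```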
How difficult do you think it would be to compile SpatiaLite (a big hairy C codebase) to WASM and load it as a dynamic module, like you've done with sqlite-lines?