A Faster Lexer in Go (thegreenplace.net)
109 points by sophiewang on May 6, 2022 | hide | past | favorite | 17 comments


Free speedups are always appreciated, but when I benchmarked some not-optimized-at-all lexers, they were still so fast that I couldn't really justify optimizing them.

Unless you're parsing genuinely enormous amounts of code (or latency bound) I find that most compilers are far too slow in other areas to justify optimizing the lexer versus fixing architectural issues in (say) semantic analysis.

A lot of parts of compilers are non-local and involve lots of pointer chasing, which means avoiding repeated work is very important. In the D compiler there are a bunch of checks where we stop you from shooting yourself in the foot; I think generating all of these tree-walkers from a specification (so you can walk it all in one go) would probably be a decent speedup.


In this case, the speedup would probably be during LLVM's build itself: that's when TableGen inputs are parsed as part of codegen for big blobs of C++. Improvements to lexing speed here could actually produce some build time improvements, since TableGen's outputs tend to have a high dependent degree in the dependency graph.

(That being said, I have no idea if LLVM would merge a TableGen lexer written in Go.)


The question prior to any of that is how many lines of TableGen LLVM has.

According to a cloc run I just did:

  LLVM Table Gen

  files:     1041
  blank:     75684
  comment:   69272
  code:      480047
Probably enough to warrant a faster lexer.


One part of Go that really needs optimizing is the regexp library. It's still a pretty "basic" implementation and can't keep up with the optimized implementations of other languages...


I’ve seen at least a couple cases where lexer time dominated overall compile time.


For those not aware, a lot of us were introduced to this topic via the famous "Lexical Scanning in Go" talk by Rob Pike: https://www.youtube.com/watch?v=HxaD_trXwRE

I can highly recommend this video for anyone wanting to build something satisfying in Go as a fairly simple project.


I used some of this work for a Lisp I wrote[1] in Go -- it was really helpful, though it took a little work[2] to adapt it in a reusable way from the talk and from the Go template code. I also recommend studying the code and watching the helpful and clear video - an elegant approach to an old problem.

[1] https://github.com/eigenhombre/l1

[2] https://github.com/eigenhombre/lexutil


It is nice when articles by people I find informed and experienced end up using a technique or pattern I have used previously. The compiler for my 'toy' programming language was originally written in JavaScript, simply to see if I could do it, with an HTML-based 'REPL' for testing and simplified displays of the compiler's internal structures. JS is not fast in most cases, but I got a decent reduction in memory usage and a decent speedup by implementing something similar to the article's subslice API: instead of making new strings for each token/lexeme/node/IL element, I just stored an index into the source string and a length in the compiler's various passes.

Not really important revelations or anything, but as a person who does not do development professionally, it is nice to feel like I may be developing decent instincts for the craft.
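The subslice idea described above can be sketched in Go like this. (The type and method names here are hypothetical, just to illustrate the offset-plus-length technique; this is not the article's exact API.)

```go
package main

import "fmt"

// Token records a lexeme as a byte offset and length into the original
// source, rather than allocating a new string per token.
type Token struct {
	Kind  string
	Start int // byte offset into the source
	Len   int
}

// Lexeme returns the token's text as a subslice of src. In Go this
// shares src's backing array: no copy, no per-token allocation.
func (t Token) Lexeme(src string) string {
	return src[t.Start : t.Start+t.Len]
}

func main() {
	src := "let x = 42"
	tok := Token{Kind: "IDENT", Start: 4, Len: 1}
	fmt.Println(tok.Lexeme(src)) // prints "x"
}
```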


It's interesting how Go and Python have similar Achilles' heels, namely the way they handle strings (Python's print was highlighted as a similar performance problem in another recent post on HN). I wonder if there will ever be a solution to this issue (i.e. the marshalling/demarshalling of potential Unicode/runes to bytes and vice versa), or are we forever stuck in this tradeoff?


I think Go's "everything is UTF-8 unless you really need something different" approach is pretty sensible - it may not be the best for performance, but it saves you a lot of headaches...


Python printing flushes the buffer. That’s slow in every language.


Isn’t it just the standard line-buffering of stdout?

Print sends a newline by default, but I don't think it flushes explicitly (hence the flush keyword parameter).


The obvious compromise is ASCII for code and UTF-8 for data. Although personally I like the idea of applying the function man in a business suit levitating. At least recreationally.


Newbie question: what happens to the huge number of strings created? Byte sequences are cleaned up by the GC, but the strings take memory.


First, a string object is a small object with a pointer to the actual character sequence. The tokenizer produces the same number of these for all approaches. Second, in the new approach, that character sequence is part of the original input. All string objects point to a slice of the same "backing store", unlike the old approach where each string object had its own memory with a copy from a small part of the input.


Two questions: first, how cache-friendly is that backing store? And second, how is the backing store updated when the strings are garbage collected? At first sight it seems that a stack-allocated byte sequence can be easily removed, but a backing-store-allocated sequence is neither easily handled nor cache-friendly.

The general problem seems to be about dynamic dispatch of methods, which Go implements using fat pointers, so the author avoids the dynamic dispatch by modifying the code.


> how cache friendly is that backing store

It's the original input buffer. It's less fragmented than creating small copies for every token. The backing store stays alive as long as there's a single token pointing to it. It's a bit inefficient if there's a lot of whitespace, comments, etc. in it, or if you're only interested in a very small portion of the tokens, but it doesn't have the overhead of a memory block per string (which is at least 16 bytes). So it depends, I guess.

I don't think dynamic dispatch comes into play in Go. The strings are represented in the same way.



