Free speedups are always appreciated, but when benchmarking some not-optimized-at-all lexers, they were still so fast that I couldn't really justify optimizing them.
Unless you're parsing genuinely enormous amounts of code (or are latency-bound), I find that most compilers are far too slow in other areas to justify optimizing the lexer versus fixing architectural issues in (say) semantic analysis.
A lot of parts of compilers are non-local and involve lots of pointer chasing, which means avoiding repeated work is very important. In the D compiler there are a bunch of checks where we stop you from shooting yourself in the foot; I think generating all of these tree-walkers from a specification (so you can walk it all in one go) would probably be a decent speedup.
In this case, the speedup would probably be during LLVM's build itself: that's when TableGen inputs are parsed as part of codegen for big blobs of C++. Improvements to lexing speed here could actually produce some build time improvements, since TableGen's outputs tend to have a high dependent degree in the dependency graph.
(That being said, I have no idea if LLVM would merge a TableGen lexer written in Go.)
One part of Go that really needs optimizing is the regexp library. It's still a pretty "basic" implementation and can't keep up with the optimized implementations of other languages...
I used some of this work for a Lisp I wrote[1] in Go -- it was really helpful, though it took a little work[2] to adapt it in a reusable way from the talk and from the Go template code. I also recommend studying the code and watching the helpful and clear video - an elegant approach to an old problem.
It is nice when articles by people I find informed and experienced end up using some technique or pattern I have used previously. The compiler for my ‘toy’ programming language was originally written in JavaScript, simply to see if I could do it and have an HTML-based ‘REPL’ for testing, with options for displaying simplified compiler internals as well. JS is not fast in most cases, but I got a decent improvement in both memory usage and speed by implementing something similar to the article’s subslice API: instead of making a new string for each token/lexeme/node/IL element, the compiler’s various passes just store an index into the source string and a length.
Not really important revelations or anything, but as a person who does not do development professionally, it is nice to feel like I may be developing decent instincts for the craft.
It's interesting how Go and Python have similar Achilles' heels, namely the way they handle strings (Python's print was highlighted as a similar performance problem in another post recently on HN). I wonder if there will ever be a solution to this issue (i.e. the marshalling/demarshalling of potential Unicode/runes to bytes and vice versa), or are we forever stuck in this tradeoff?
I think Go's "everything is UTF-8 unless you really need something different" approach is pretty sensible - it may not be the best for performance, but it saves you a lot of headaches...
The obvious compromise is ASCII for code and UTF-8 for data. Although personally I like the idea of applying the function man in a business suit levitating. At least recreationally.
First, a string object is a small object with a pointer to the actual character sequence. The tokenizer produces the same number of these for all approaches. Second, in the new approach, that character sequence is part of the original input. All string objects point to a slice of the same "backing store", unlike the old approach where each string object had its own memory with a copy from a small part of the input.
Two questions: first, how cache-friendly is that backing store? And second, how is the backing store updated when the strings are garbage-collected?
At first sight it seems that a stack-allocated byte sequence can be easily reclaimed, but a backing-store-allocated sequence is neither easily handled nor cache-friendly.
The general problem seems to be about dynamic dispatch of methods (in Go, via fat pointers), so the author avoids the dynamic dispatch by modifying the code.
It's the original input buffer. It's less fragmented than creating small copies for every token. The backing store stays alive as long as there's a single token pointing to it. It's a bit inefficient if there's a lot of whitespace, comment, etc. in it, or if you're only interested in a very small portion of the tokens, but it doesn't have the overhead of a memory block per string (which is at least 16 bytes). So it depends, I guess.
I don't think dynamic dispatch comes into play in go. The strings are represented in the same way.