I must have missed the first part, but this is a topic that always interests me, especially when starting from fundamentals.
Writing a lexer in C is conceptually simple but non-trivial in practice, mainly because of the pointer arithmetic and string manipulation involved. If the author goes on to write a fully customizable LR parser or something of the sort, in C or another language, it might be useful to take a look at the source code for an LR(1) and SLR parser generator here: https://github.com/gregtour/duck-lang/tree/master/parser-gen...
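To make the pointer-arithmetic point concrete, here is a minimal sketch of a hand-written tokenizer in C. It is not taken from the duck-lang sources; the token kinds, the `Token` struct and `next_token` are hypothetical names made up for illustration, and it assumes a NUL-terminated ASCII input buffer.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds, for illustration only. */
typedef enum { TOK_EOF, TOK_IDENT, TOK_NUMBER, TOK_PUNCT } TokenKind;

typedef struct {
    TokenKind   kind;
    const char *start;   /* points into the source buffer; nothing is copied */
    size_t      length;  /* number of characters in the token */
} Token;

/* Scan one token starting at *cursor and advance *cursor past it. */
Token next_token(const char **cursor)
{
    const char *p;
    Token tok;

    assert(cursor != NULL && *cursor != NULL);
    p = *cursor;

    /* Skip leading whitespace. The cast avoids undefined behaviour
       when plain char is signed and the value is negative. */
    while (isspace((unsigned char)*p))
        p++;

    tok.start = p;
    if (*p == '\0') {
        tok.kind = TOK_EOF;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        tok.kind = TOK_IDENT;
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
    } else if (isdigit((unsigned char)*p)) {
        tok.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p))
            p++;
    } else {
        tok.kind = TOK_PUNCT;   /* any other single character */
        p++;
    }

    /* Pointer subtraction gives the token length. */
    tok.length = (size_t)(p - tok.start);
    *cursor = p;
    return tok;
}
```

Calling `next_token(&cur)` repeatedly with `cur` pointing at `"x1 = 42;"` yields an identifier, a punctuation token, a number, another punctuation token, and then `TOK_EOF`. Returning a pointer and length instead of copying the text sidesteps most of the string handling, at the cost of tokens only staying valid while the source buffer does.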
I may fork this branch from the main duck programming language trunk because it could be useful to other programming languages.
My study of programming languages has concentrated on frontends. One drawback of my parser generator is that building complete canonical LR parsers for an arbitrary deterministic context-free grammar can be slow, and the resulting tables can be quite large. However, there must be ways to improve the code base to provide better features and performance. Also, once you have a parse table, the parser itself really is fast.
The benefit of a ground-up approach like this is not only a complete understanding of all of the technology involved but also complete control.
Although I haven't used tools like Lex or Yacc (or their free counterparts, Flex and Bison) in practice, I dislike the idea of generating code in macro form, or of breaking the paradigms of C and C++ to accommodate auto-generated code. For me, it is much easier to have a program that takes an input, like a BNF grammar, and produces an output, the parse table, which can then be applied to create a syntax tree from source code. Code that operates on data and data structures makes more sense to me than code generated around templates or macros.
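As a rough illustration of "code that operates on data", here is a sketch of a table-driven shift-reduce driver in C. It is not the format the duck-lang parser generator emits; the tables were worked out by hand (SLR) for the toy grammar `S -> ( S ) | x`, and all names (`action`, `goto_S`, `rule_len`, `classify`, `parse`) are hypothetical. It only accepts or rejects input; building a syntax tree would mean pushing nodes alongside the states.

```c
#include <assert.h>
#include <stddef.h>

/* Terminals of the toy grammar:  S -> ( S )  |  x  */
enum { T_LPAREN, T_RPAREN, T_X, T_END, T_COUNT };

/* Table encoding (an assumption of this sketch):
   n > 0  shift and go to state n
   n < 0  reduce by rule -n
   ACC    accept
   ERR    syntax error */
enum { ERR = 0, ACC = 99 };
enum { MAX_DEPTH = 64 };

static const int action[6][T_COUNT] = {
    /*          (    )    x    $   */
    /* 0 */ {   2, ERR,   3, ERR },
    /* 1 */ { ERR, ERR, ERR, ACC },
    /* 2 */ {   2, ERR,   3, ERR },
    /* 3 */ { ERR,  -2, ERR,  -2 },  /* S -> x .     */
    /* 4 */ { ERR,   5, ERR, ERR },
    /* 5 */ { ERR,  -1, ERR,  -1 },  /* S -> ( S ) . */
};

/* Goto table for the only nonterminal, S; -1 means no transition. */
static const int goto_S[6] = { 1, -1, 4, -1, -1, -1 };

/* Right-hand-side length per rule: rule 1 is ( S ), rule 2 is x. */
static const int rule_len[3] = { 0, 3, 1 };

static int classify(char c)
{
    switch (c) {
    case '(':  return T_LPAREN;
    case ')':  return T_RPAREN;
    case 'x':  return T_X;
    case '\0': return T_END;
    default:   return -1;            /* unknown character */
    }
}

/* Returns 1 if input is a sentence of the grammar, 0 otherwise.
   Nesting deeper than the fixed stack allows is also rejected. */
int parse(const char *input)
{
    int stack[MAX_DEPTH];
    int top = 0;

    assert(input != NULL);
    stack[0] = 0;                    /* start state */

    for (;;) {
        int term = classify(*input);
        int act;

        if (term < 0)
            return 0;
        act = action[stack[top]][term];

        if (act == ACC)
            return 1;
        if (act == ERR)
            return 0;

        if (act > 0) {               /* shift: push state, consume input */
            if (top + 1 >= MAX_DEPTH)
                return 0;
            stack[++top] = act;
            input++;
        } else {                     /* reduce: pop the RHS, then goto on S */
            int next;
            top -= rule_len[-act];
            next = goto_S[stack[top]];
            if (next < 0)
                return 0;
            stack[++top] = next;
        }
    }
}
```

With these tables, `parse("((x))")` accepts and `parse("(x")` rejects. Swapping in a different grammar means swapping the arrays, not the driver loop, which is the appeal of keeping the parser as data.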
Having control over the data structures in use is helpful because the programmer then knows exactly where all of the data is going and where it is being stored. That is somewhat useful when designing a programming language, and it is something you lose when using someone else's libraries.
IMO, using macros to generate the parser has some advantages. For example, you have the freedom to generate the syntax tree however you like, adding extra "line number" fields or translating some syntactic sugar. If your language has proper support for macros, it's also really pleasant overall. I wrote a bottom-up parser in Racket once and it was really nice: compiling the parser is a piece of cake, and you can define your own macros to automate list-parsing and some other boring things.