
You don't need one; Lua is another example where no AST is ever generated. In some sense, though, the resulting bytecode closely corresponds to the AST that would have been generated.


Genuinely asking, as parsing without an AST is something I've never seen explained: how do you go from source code to bytecode without an AST?

Isn't the bytecode just a flattened representation of an AST obtained by some sort of tree traversal?

This seems to imply an AST is involved in the generation of the bytecode.


Have you used any parser generator like yacc/bison? They have "actions", which are arbitrary code that will be executed when some grammar production is detected. For example, `expr ::= expr mulop expr { some code }` will execute `some code` when a multiplicative expression is detected, where the intermediate results from the two `expr`s and `mulop` are available to that code. This concept of actions generally applies to all sorts of parsers, not just generated ones.

Those actions would typically allocate and build (partial) ASTs, but you can do anything with them. You can, for example, directly evaluate the subexpression if your grammar is simple enough. Likewise, bytecode can be generated on the fly; the only real concern is forward references (say, a jump whose target hasn't been emitted yet), which have to be backpatched after the whole expression block gets generated, but otherwise you don't have to build any tree-like structure here. (Most practical parsers only need a stack to function.)
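
To make that concrete, here's a toy sketch of the idea (mine, in Python, with invented names; not what yacc generates): a hand-rolled parser whose "actions" evaluate each subexpression on the spot instead of building any tree.

  import re

  def tokenize(src):
      # toy lexer: integers, plus, and minus only
      return re.findall(r'\d+|[+\-]', src)

  def parse_expr(toks):
      # "action" for `term ::= NUMBER`: evaluate immediately
      value = int(toks.pop(0))
      while toks and toks[0] in '+-':
          op = toks.pop(0)
          rhs = int(toks.pop(0))
          # "action" for `expr ::= expr addop term`: fold, no AST
          value = value + rhs if op == '+' else value - rhs
      return value

  print(parse_expr(tokenize("1+2-3")))  # => 0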


Consider that if the bytecode is just a flattened representation of the tree, then instead of building the tree and traversing it, you can process the source directly, in the same order the traversal would have visited it.

E.g. imagine you're in a leaf production for some simple grammar for an operator that can only take integers, w/all error handling ignored:

     parse_fooexpr() {
          left = parse_int()
          emit_int(left)
          expect_token(FOO_OP)
          right = parse_int()
          emit_int(right)
          emit_foo_op()
     }
Whether emit_int outputs a "push <someint>" VM instruction to a bytecode buffer or constructs an IntNode and pushes it onto an argument stack, and whether emit_foo_op emits a "foo_op" bytecode instruction or pops the argument stack and constructs a FooOpNode from the popped values, is an implementation detail as far as the parser is concerned.
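
Here's a runnable rendition of that point (my Python sketch, names invented): the same parse function drives either a bytecode backend or an AST backend, and it never knows which.

  class BytecodeBackend:
      def __init__(self): self.code = []
      def emit_int(self, n): self.code.append(('PUSH', n))
      def emit_foo_op(self): self.code.append(('FOO_OP',))

  class AstBackend:
      def __init__(self): self.stack = []
      def emit_int(self, n): self.stack.append(('IntNode', n))
      def emit_foo_op(self):
          right, left = self.stack.pop(), self.stack.pop()
          self.stack.append(('FooOpNode', left, right))

  def parse_fooexpr(tokens, backend):
      backend.emit_int(int(next(tokens)))   # left = parse_int()
      assert next(tokens) == 'FOO'          # expect_token(FOO_OP)
      backend.emit_int(int(next(tokens)))   # right = parse_int()
      backend.emit_foo_op()

  bc, ast = BytecodeBackend(), AstBackend()
  parse_fooexpr(iter(['1', 'FOO', '2']), bc)
  parse_fooexpr(iter(['1', 'FOO', '2']), ast)
  print(bc.code)    # [('PUSH', 1), ('PUSH', 2), ('FOO_OP',)]
  print(ast.stack)  # [('FooOpNode', ('IntNode', 1), ('IntNode', 2))]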

Wirth's compilers usually didn't use an AST, and his book on compiler construction contains a thorough walkthrough of generating code as you parse, if you want to see it explained in a lot more detail and w/tricks to generate better code than the most naive approaches will.

The difficulty with this approach comes when you want to apply optimisations, or when it's inconvenient to generate the code in parse order because the language tries to be smart about what feels better to write, but you can get pretty far with this method.


Another way to think about this is to imagine you are working in a lazy-by-default language like Haskell. The code might seem to build up an AST and then evaluate it, but (disregarding parse errors) the AST is never materialized in memory all at once: the tree might only have one level, with its children represented by thunks that continue to parse more of the source code. (If you are unfamiliar with Haskell's laziness, imagine that the so-called tree has children that are functions producing the next level of the tree.) Of course, you will need a carefully designed bytecode in order to generate bytecode from an AST whose children are not yet known.
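
In strict-language terms, a rough analogy (my sketch, not actual Haskell machinery): a node whose children are zero-argument functions that only parse more of the input when called.

  def parse(tokens, i=0):
      # node = (label, thunk); calling the thunk parses the next level
      if i >= len(tokens):
          return None
      return (tokens[i], lambda: parse(tokens, i + 1))

  root = parse(['a', 'b', 'c'])
  label, force = root
  print(label)       # 'a' exists now...
  print(force()[0])  # ...'b' is only parsed when the thunk is forced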


The second half of Crafting Interpreters [0] does this: rather than building an AST, the compiler outputs bytecode directly. Essentially, the process of creating the AST is already very similar to the tree traversal you'd do to unpack it (recursive descent parsing is a traversal of the tree, just in the call stack rather than over a real data structure), so if you don't need the AST you can skip it and just start outputting bytecode as if you were already walking the tree.

[0] http://craftinginterpreters.com/


Take a look at the evergreen Let's Build a Compiler, by Jack Crenshaw

https://compilers.iecc.com/crenshaw/

Chapters 2 and 3, on expression parsing, should give you the idea. Instead of building the AST, you build code. So you're cutting out the middleman (the AST).


It's as if they generated an AST and then immediately converted it to bytecode, all at once.

The key restriction is that you must be able to generate the bytecode in a single pass over the source code. For example, the compilation of a function at the top of the file must not depend on declarations further down the file.


Depending on the form the dependencies take, you can handle that too. E.g. if it's "just" about knowing addresses, then building up a list of relocation records where you patch in addresses as they become known works just fine. Of course, the more work like that you have to do, the less you save over just building an AST anyway.
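
A sketch of that relocation-record idea (hypothetical names, in Python): emit a placeholder for a call to a not-yet-seen function, remember where it sits, and patch it once the address is known.

  code = []          # the output "bytecode"
  relocations = []   # (index_into_code, symbol_name) pairs to patch later
  symbols = {}       # name -> address, filled in as definitions are seen

  def emit_call(name):
      if name in symbols:
          code.append(('CALL', symbols[name]))
      else:
          relocations.append((len(code), name))
          code.append(('CALL', None))    # placeholder to patch

  def define_function(name):
      symbols[name] = len(code)          # function starts at current position

  emit_call('f')        # forward reference: 'f' isn't defined yet
  define_function('f')  # now we learn its address
  for index, name in relocations:
      code[index] = ('CALL', symbols[name])
  print(code)  # [('CALL', 1)]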


Parse trees and ASTs are slightly different. A parse tree can be implicit: a recursive descent parser can explore the parse tree without there ever being an explicit parse tree data structure.


The way you go from source to target code without an AST is that the syntax-directed translation step in your implementation (the part that would otherwise build the AST) just builds the output code instead. The extra traversal is skipped, replaced by the parser's own traversal of the raw syntax.

E.g. pseudo-Yacc rules for compiling the while loop in a C-like notation.

  while_loop : WHILE '(' expr ')' statement
               {
                   let back = get_label();
                   let fwd = get_label();
                   let expr = $3; // code for expr recursively generated
                   let stmt = $5; // code for statement, recursively generated
                   $$ = make_code(stmt.reg(),
                                  `$back:\n`
                                  `${expr.code()}\n`
                   `BF ${expr.reg()}, $fwd\n`  // branch if false
                                  `${stmt.code()}\n`
                                  `JMP $back\n`
                                  `$fwd:\n`)                                
               }
               ;

Every code fragment coming out of a rule has .code() and .reg(): the generated code, which is just a string, and the output register where it leaves its value. Such representational details are decided by the compiler writer.

The while statement produces no value, so we just borrow the statement's .reg() as a dummy in the call to make_code; our rule is that every code fragment has an output register, whether it produces a value or not.

When the LALR(1) parser reduces this while loop rule to the while_loop grammar symbol, the expr and the statement have already been processed, so the rule action has ready access to the code objects for them. We just synthesize a new code object, grabbing a pair of labels that we need for the forward and back jumps.

I'm assuming we have a vaguely JS-like programming language being used for the grammar rule actions, in which we have template strings with interpolation, and adjacent strings get merged into one. The bytecode assembly is line oriented, so we have \n breaks.

One possible kind of expression is a simple integer constant, INTEGER:

  expr : INTEGER
         {
           let reg = allocate_reg();
           let val = $1;
           $$ = make_code(reg,
                          `LOAD $reg, #$val\n`)
         }
         ;
One possible statement is an empty statement, denoted by empty curly braces:

  statement : '{' '}'
              {
                $$ = make_code(R0,  // dedicated zero register
                               ""); // no-code solution
              }
              ;
So then when we have while (1) { } we might get R1 allocated in the expr rule, like this:

  LOAD R1, #1\n   ; output register is R1
then in the while loop, things get put together like this:

  L0:\n           ; back label
  LOAD R1, #1\n   ; code for expr
  BF R1, L1\n     ; branch if false to L1
                  ; no code came from empty statement
  JMP L0\n        ; back branch
  L1:\n           ; fwd label


See also: DOM versus streaming.



