So far we have already briefly covered the basics of lexers.
On the previous article, our proto-compiler learned how to deal with expressions such as "42+0".
However, if you played around with it, you might have reached a limitation with it:
If we just added a multiplication operator, then "1+2*3" would result in 9, instead of the intended 7.
This makes sense considering how we implemented our calculator. It just continues executing operations without concern for precedence.
This could get even more complex if we consider cases such as "3*(2+1) == 4*2". So we will have to teach our calculator how to deal with precedence.
The simplest approach I know of (which happens to be widely used by compilers) is using
trees to represent the
document (by document, I mean the entire file or string that's meant to be interpreted/compiled).
A bit about treesA tree is a very specific data structure. The concept of trees is something like this: We have a parent(the
root), which has children. Each children
may, in turn, have other children, and so on. Each element of the tree, regarding of whether it has children or not, is known as a
node.
If a
node has no children, then it is called a
leaf. Two nodes with the same parent are known as
siblings. Each node can have multiple children, but
only a single parent!
I could try to draw a tree here, but blogspot(and HTML in general) dislikes spaces so drawing trees in a visually appealing way is hard. Instead, I will link to a wikipedia image and hope it doesn't get changed: http://en.wikipedia.org/wiki/File:Binary_tree_structure.svg
Assuming it isn't changed, that image represents a tree. The
root node is "Encyclopaedia". Both "Science" and "Culture" and
children or the root node. The "Science", "Art" and "Craft" nodes are leaves. "Art" and "Craft" are children of "Culture".
How can trees solve the precedence problem?Our calculator is unable to identify operator precedence because it has
no hierarchical information. Lexers simply export a list, a bunch of tokens: 1 + 2 * 3, for example.
Trees, however, can be way more specific.
Let's say I want to represent the above expression as a tree.
I would choose to make "+" the root node. This node would have two children: "1" (a leaf), and "*", which in turn would have two children (both leaves): "2" and "3".
However, for "( 1 + 2 ) * 3", the root node would be "*", and "+" would be a child of "*".
When discussing trees in compilers, it is common to hear names such as abstract syntax trees (AST). Although different terms might have different meaning, this basic concept is shared by several implementations (as usual with computers, there are usually multiple solutions for the same problem).
If we call the root the "upper-most node" and each children to be "below" its respective parent (and at the same level as its siblings), then we could add precedence to our calculator simply by telling it to always execute (or calculate) from the bottom-up. This is actually very simple to do. One possible way is to use recursion.
public int calculateSum(SumNode sumNode) {
return calculate(sumNode.left) + calculate(sumNode.right);
}
Where calculate itself would call calculateSum if the provided node was a sum. This is necessary to solve "(1 + 2)*3+4" and would, in fact, implement recursion naturally, since each calculation is delayed until the result of its child nodes is known.
How to create trees from a list of tokens?Now that we know what trees are, we need to generate one from the list of tokens our tree generated.
To solve that problem we will need a
parser. There are many ways to write parsers. I will only describe one method in detail.
The easy but messy way (my personal favorite)There is a very simple way to implement, but if your language is complex enough(many precedence levels), it will easily take a few hundreds - possibly thousands - lines of code to implement. This approach starts small but grows very quickly.
To implement this method, we start with the leaves. In this case, the number constants.
Here we are not really going to create a tree in memory. Instead, we are going to calculate the result as it would be generated (following the same method, the same order, but exporting a value instead of the tree).
We'll start with number parsing (numbers are our leaves):
public int parseNumber(out bool successful) {
Token t;
//If next token is not number
if (!((t = getNextToken()) is NumberConstant)) {
successful = false;
return -1;//Any value will work here, since it'll be just discarded
}
//We need a way to discard "bad" tokens,
//so we say which ones we keep in the next few lines.
//Of course, saying which ones do *not* matter would work too.
spendToken();
return ((NumberConstant) t).numberValue;
}
Now, to implement parenthesis expressions, we are going to add another method:
public int parseParenthesisExpression(out bool successful) {
if (!((t = getNextToken()) is ParenthesisSymbol)) {
successful = false;
return -1;//Any value will work here, since it'll be just discarded
}
spendToken();
int exprResult = parseExpression();
spendToken(); //Spend the ')' token.
return exprResult;
}
Now that we have the two basic units of our calculator, we are going to need a way to unite them
public int parseUnit() {
bool success;
int result = parseNumber(out success);
if (success) return result;
result = parseParenthesis(out success);
if (success) return result;
throw new ParserException("Syntax error");
}
Every tree will have at some point our "units". Always. Of course, parenthesis in turn calculate other things, but they still have precedence over what's outside them. More on why we consider the parenthesis grouping to be a unit will come later.
Now that we have the units, we can go to the next level of precedence. If we consider only two operators, "+" and "*", then "*" has priority. So we're going to start with that.
public int parseMultiplication() {
int left = parseUnit();
Token t = getNextToken();
if (!(t is MultiplicationSymbol)) return left;
//More on why this happens later
spendToken();
return left * parseExpression();
}Now, I believe some explanations must be done here. The reason why we check whether something is a multiplication symbol and return the unit if it isn't is because we want to
propagate values.
In other words, the sum operation will never directly interface with parseUnit. It will instead access parseMultiplication, even if there are no multiplications.
For a parseSum, one would replace the "int left = parseUnit();" by "int left = parseMultiplication();". Because of our little extra return in the parseMultiplication() earlier, this will work even for expressions like "1+2", which have no multiplications at all.
Finally, our parseExpression() would simply be whatever has the least precedence("+", in this case).
It is worth noting two limitations in this currently stub implementation
- It does not consider the end of the token list, and therefore it will always crash. Still, this limitation can be solved with only a few lines of code;
- It does not support multiple operators in the same precedence levels
yet. However, this is very simple to add to the current system. Just make parseSum() and parseMinus(), for example, the
same function, rather than two separate ones.
So why are parenthesis units?To prevent infinite recursion/stack overflows, we must make sure that our implementation does not just keep looping from the bottom to the top. That's why we didn't make parseMultiplication() call parseExpression() directly for the left operand.
If we did not add parenthesis as an unit, then "(1 + 2)" would be supported, but "(1) + (2)" would not.
Any alternative implementations?This method can get a little messy.
LLVM's Kaleidoscope has another one which I haven't had the opportunity to try so far, so I don't know how the two compare in terms of complexity, lines of code and readability.
Well, that closes today's session. This is still very far from a working compiler, but the concepts here will be required for future lessons. So until the next time!