Thursday, May 20, 2010

Ambiguity with assignment

I just collided with an annoying ambiguity in my own language.
On one hand, one of the types of right lists is the assignment right list, which is a left list, an assignment symbol and another right list, like this:
x, y = y, x;
This is very useful(for example, the code above could be used for variable swapping.
However, not everything in PineDL is a list.
For instance, array accessing is made with a simple expression, like this:
a[b + c]
And, of course, we want assignment expressions like we have in C:
a[b = c]
Which will set b to the same value as c and access that index.

Of course, now we have two types of assignment: list assignment and expression assignment.
This wouldn't be a problem if expressions and lists were isolated, but they are not.
Consider the following code in the context of a right list:
a, b = c, d
Is that a list assignment(as in left list = right list, which is a valid statement as we saw before), or is it a plain right list, with three elements, where the second is an assignment?
a, (b = c), d
This is a problem. And one we must solve.

There are a few possible solutions to this ambiguity:
1. Prevent list assignment - Undesirable because the swap use case and much of the language is made around this convenient operation;
2. Prevent value assignment - Not bad because it can be emulated anyway using the list compacter operator, although inconvenient
a[(b = c)] instead of a[b=c]
3. Allow both, but specifically mention how to get around this ambiguity. Kind of like the dangling else problem in C, which is clearly defined.

I'm going with the third option. I'm going to define that, when possible, list assignment is preferred to expression assignment.
So:
a, b = c, d;
//List assignment

3, b = c, d;
//Expression assignment, because 3, b is not a valid left list

a, b = c, 3;
//List assignment, because c, 3 is a valid right list

a, b = c, d = e, f;
//Two list assignments

a, b = 3, c = d, e;
//List assignment, where the right operand is a list that
// includes an expression assignment
//Same end result as:
//a, b = 3, (c = d), e;

a, b = c, d = e, 3;
//Two list assignments, because e, 3 is a valid right list.
I believe this solves the problem. Of course, you can still force PineDL to emulate the other meaning, like this:
a, (b = c), d;
Of course, b = c is a list assignment itself(remember this is the list compacter, so it applies to a right list), but nobody has to know, right? For all purposes, this behaves like expected.
In fact, you can write
a, (b = c), d = e;
This is because (b = c) means automatically that it can't be a left list.
Note that (b = c) is itself a list assignment(although it behaves like an expression assignment for all practical purposes), d = e is a real expression assignment.

Thursday, May 13, 2010

How to make a programming language - Part III

So far we have already briefly covered the basics of lexers.
On the previous article, our proto-compiler learned how to deal with expressions such as "42+0".
However, if you played around with it, you might have reached a limitation with it:
If we just added a multiplication operator, then "1+2*3" would result in 9, instead of the intended 7.
This makes sense considering how we implemented our calculator. It just continues executing operations without concern for precedence.

This could get even more complex if we consider cases such as "3*(2+1) == 4*2". So we will have to teach our calculator how to deal with precedence.
The simplest approach I know of (which happens to be widely used by compilers) is using trees to represent the document (by document, I mean the entire file or string that's meant to be interpreted/compiled).

A bit about trees
A tree is a very specific data structure. The concept of trees is something like this: We have a parent(the root), which has children. Each children may, in turn, have other children, and so on. Each element of the tree, regarding of whether it has children or not, is known as a node.
If a node has no children, then it is called a leaf. Two nodes with the same parent are known as siblings. Each node can have multiple children, but only a single parent!

I could try to draw a tree here, but blogspot(and HTML in general) dislikes spaces so drawing trees in a visually appealing way is hard. Instead, I will link to a wikipedia image and hope it doesn't get changed: http://en.wikipedia.org/wiki/File:Binary_tree_structure.svg

Assuming it isn't changed, that image represents a tree. The root node is "Encyclopaedia". Both "Science" and "Culture" and children or the root node. The "Science", "Art" and "Craft" nodes are leaves. "Art" and "Craft" are children of "Culture".

How can trees solve the precedence problem?
Our calculator is unable to identify operator precedence because it has no hierarchical information. Lexers simply export a list, a bunch of tokens: 1 + 2 * 3, for example.
Trees, however, can be way more specific.
Let's say I want to represent the above expression as a tree.
I would choose to make "+" the root node. This node would have two children: "1" (a leaf), and "*", which in turn would have two children (both leaves): "2" and "3".
However, for "( 1 + 2 ) * 3", the root node would be "*", and "+" would be a child of "*".
When discussing trees in compilers, it is common to hear names such as abstract syntax trees (AST). Although different terms might have different meaning, this basic concept is shared by several implementations (as usual with computers, there are usually multiple solutions for the same problem).

If we call the root the "upper-most node" and each children to be "below" its respective parent (and at the same level as its siblings), then we could add precedence to our calculator simply by telling it to always execute (or calculate) from the bottom-up. This is actually very simple to do. One possible way is to use recursion.
public int calculateSum(SumNode sumNode) {
  return calculate(sumNode.left) + calculate(sumNode.right);
}
Where calculate itself would call calculateSum if the provided node was a sum. This is necessary to solve "(1 + 2)*3+4" and would, in fact, implement recursion naturally, since each calculation is delayed until the result of its child nodes is known.

How to create trees from a list of tokens?
Now that we know what trees are, we need to generate one from the list of tokens our tree generated.
To solve that problem we will need a parser. There are many ways to write parsers. I will only describe one method in detail.

The easy but messy way (my personal favorite)
There is a very simple way to implement, but if your language is complex enough(many precedence levels), it will easily take a few hundreds - possibly thousands - lines of code to implement. This approach starts small but grows very quickly.

To implement this method, we start with the leaves. In this case, the number constants.
Here we are not really going to create a tree in memory. Instead, we are going to calculate the result as it would be generated (following the same method, the same order, but exporting a value instead of the tree).
We'll start with number parsing (numbers are our leaves):
public int parseNumber(out bool successful) {
  Token t;
  //If next token is not number
  if (!((t = getNextToken()) is NumberConstant)) {
    successful = false;
    return -1;//Any value will work here, since it'll be just discarded
  }

  //We need a way to discard "bad" tokens,
  //so we say which ones we keep in the next few lines.
  //Of course, saying which ones do *not* matter would work too.
  spendToken();

  return ((NumberConstant) t).numberValue;
}
Now, to implement parenthesis expressions, we are going to add another method:
public int parseParenthesisExpression(out bool successful) {
  if (!((t = getNextToken()) is ParenthesisSymbol)) {
    successful = false;
    return -1;//Any value will work here, since it'll be just discarded
  }

  spendToken();

  int exprResult = parseExpression();

  spendToken(); //Spend the ')' token.

  return exprResult;
}
Now that we have the two basic units of our calculator, we are going to need a way to unite them
public int parseUnit() {
  bool success;
  int result = parseNumber(out success);
  if (success) return result;

  result = parseParenthesis(out success);
  if (success) return result;

  throw new ParserException("Syntax error");
}
Every tree will have at some point our "units". Always. Of course, parenthesis in turn calculate other things, but they still have precedence over what's outside them. More on why we consider the parenthesis grouping to be a unit will come later.

Now that we have the units, we can go to the next level of precedence. If we consider only two operators, "+" and "*", then "*" has priority. So we're going to start with that.
public int parseMultiplication() {

  int left = parseUnit();
  Token t = getNextToken();

  if (!(t is MultiplicationSymbol)) return left;
  //More on why this happens later

  spendToken();

  return left * parseExpression();
}
Now, I believe some explanations must be done here. The reason why we check whether something is a multiplication symbol and return the unit if it isn't is because we want to propagate values.
In other words, the sum operation will never directly interface with parseUnit. It will instead access parseMultiplication, even if there are no multiplications.
For a parseSum, one would replace the "int left = parseUnit();" by "int left = parseMultiplication();". Because of our little extra return in the parseMultiplication() earlier, this will work even for expressions like "1+2", which have no multiplications at all.

Finally, our parseExpression() would simply be whatever has the least precedence("+", in this case).

It is worth noting two limitations in this currently stub implementation
- It does not consider the end of the token list, and therefore it will always crash. Still, this limitation can be solved with only a few lines of code;
- It does not support multiple operators in the same precedence levels yet. However, this is very simple to add to the current system. Just make parseSum() and parseMinus(), for example, the same function, rather than two separate ones.

So why are parenthesis units?

To prevent infinite recursion/stack overflows, we must make sure that our implementation does not just keep looping from the bottom to the top. That's why we didn't make parseMultiplication() call parseExpression() directly for the left operand.
If we did not add parenthesis as an unit, then "(1 + 2)" would be supported, but "(1) + (2)" would not.

Any alternative implementations?

This method can get a little messy. LLVM's Kaleidoscope has another one which I haven't had the opportunity to try so far, so I don't know how the two compare in terms of complexity, lines of code and readability.

Well, that closes today's session. This is still very far from a working compiler, but the concepts here will be required for future lessons. So until the next time!