I am proud to announce the second alpha release of PineDL.
This release is mostly incremental, but it is already a significant improvement over the last one.
Here are some of the changes:
1. The parser is now Antlr-based. I am hoping this will help maintainability and simplify development of the compiler.
2. Variable declarations
3. "While" loops
4. Function expressions
5. Postfix Increment expressions
Regarding variable declarations, the following syntax is used:
var x, y, z = 1, 2, 3;
Function expressions take the following syntax:
function(args) { return 123; }
Note that function expressions MUST have a block as their body; function() return 1; is not supported.
Finally, postfix increment/decrement expressions are currently the only way to change the value of a variable. Prefix increment/decrement is not yet supported.
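Putting the new features together, a small snippet might look like this. Note that this is my own sketch: the release notes only show the declaration and function-expression forms explicitly, so the while-loop condition syntax and the < operator are assumptions on my part:

```
var i, f = 0, function() { return 123; };
while (i < 10) {
    i++; // postfix increment: currently the only way to change a variable
}
```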
You can download the binary or (if you're curious) the source code. If you find bugs, feel free to report them to the bug tracker.
Saturday, December 25, 2010
Wednesday, December 8, 2010
PineDL Alpha 1
Today, I have released an initial preview version of PineDL.
Note that when I say "initial preview", I mean that it is horribly incomplete and not particularly useful.
Notably, there are no variable declarations nor loops, so you'll have to use good old function arguments and functional programming style, except without lambdas. And no data structures besides immutable strings.
That said, both the binaries and source code are available.
I would commit the source code to the repositories, but Google Code is down for maintenance.
Source Code
Binary
Feel free to report bugs on the tracker.
Monday, December 6, 2010
System.Reflection.Emit and PineDL
Even if one implements a lexer and a parser, that alone is not enough to build a programming language.
The lexer is a tool that receives text and outputs a list of tokens.
The parser is a tool that receives a list of tokens and exports a tree.
What we need now is a tool that receives a tree and either runs it or exports some format that can be run.
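As a rough illustration of the first stage (not PineDL's actual lexer, which is written in C#; the names here are mine), a toy lexer for arithmetic expressions can be just a few lines:

```python
import re

def lex(text):
    """Split an arithmetic expression into a flat list of tokens."""
    token_pattern = re.compile(r"\d+|[+*()]")
    return token_pattern.findall(text)

print(lex("1+2*3"))  # ['1', '+', '2', '*', '3']
```

The output is still just a flat list; turning it into something hierarchical is the parser's job.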
There are many ways to achieve this. One possible way is to use LLVM. LLVM is a C/C++ library that allows JIT and AOT compiling, as well as executing the result.
LLVM is a relatively low-level API. Although it allows the programmer to abstract away from concepts such as executable file formats, it is basically a portable assembler.
LLVM does not come with garbage collection, object-oriented programming support or exceptions. Instead, LLVM provides the tools to create and integrate these features (all entirely optional).
Contrast this with virtual machines like JVM and .NET.
Both Java and .NET use an intermediate format which comes with the features mentioned above built-in. In fact, not only are they built-in, they are mandatory.
These are high-level virtual machines.
By "high-level", I do not mean "better" or "worse". Just different.
They are also not entirely incompatible. For instance, Mono has an LLVM backend for .NET JIT compilation.
While LLVM grants much programming freedom, it comes at a cost: additional development complexity. This may or may not be acceptable.
Among the many libraries .NET provides is System.Reflection.Emit. Like LLVM, this namespace provides support for creating, executing and exporting .NET assemblies at runtime.
For .NET language development, it is ideal.
System.Reflection.Emit integrates extremely well into the .NET stack. To generate a language's standard library, one just creates that assembly in a .NET language (like C#), and then uses the normal Reflection APIs combined with Emit to achieve the desired output.
I will not cover System.Reflection.Emit usage here (at least not now), but I will discuss how it impacts language development.
First, like LLVM, there is very little API difference between executing and exporting a dynamically generated assembly.
Second, MSIL is more programmer-friendly than LLVM's intermediate language in many ways. In particular, object-oriented programming support is built-in, so emitting a call is just this:
il.Emit(OpCodes.Ldloc, var1); // push the receiver onto the stack
il.Emit(OpCodes.Ldstr, "xyz"); // push the string argument
il.Emit(OpCodes.Callvirt, typeof(Foo).GetMethod("bar")); // virtual call to Foo.bar
il.Emit(OpCodes.Stloc, var2); // store the return value
This is roughly equivalent to:
var2 = var1.bar("xyz");
where var1's type extends or implements Foo and bar is a virtual method.
Note that Foo could be a core API type. System.Reflection.Emit appears to be a first-class .NET citizen. There is also almost no difference between using some API method and using a dynamically generated method, as MethodBuilder extends MethodInfo and can be used in the Emit APIs in any place where MethodInfo could be used.
For PineDL, this is very practical. I just create the core language classes in C#, and then seamlessly integrate them into the PineDL application, all without having to concern myself with details like garbage collection or the internal representation of virtual methods and interfaces. It just works, which is a big plus.
PineDL in particular is a hard language to "quickly" design a garbage collector for, since there is a greater-than-usual risk of cyclic references, making options such as simple reference counting and Boehm's conservative GC poor choices.
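To illustrate why cycles defeat naive reference counting (in Python here, purely as an analogy; PineDL itself leans on the .NET GC for this), a two-object cycle keeps both objects alive after the last external reference is gone, and only a dedicated cycle detector can reclaim them:

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.other = None

gc.disable()  # keep the run deterministic: no automatic cycle collection

a, b = Node(), Node()
a.other, b.other = b, a      # create a reference cycle
probe = weakref.ref(a)       # lets us observe whether 'a' is still alive

del a, b                     # refcounts never reach zero: the cycle keeps both alive
assert probe() is not None   # plain reference counting is stuck here

gc.collect()                 # CPython's cycle detector breaks the cycle
assert probe() is None       # now reclaimed
gc.enable()
```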
Right now, I have a very simple working PineDL prototype, written entirely in C#, which I have not yet committed to the PineDL repositories.
It is also incomplete, so I will not be releasing binaries or source code at this point. I want to at least have variable declarations, loops and the most useful operations working before releasing something. In other words, I want to have something to show before showing it.
The code is currently divided into a few projects: Pine.Lexer, Pine.Parser, Pine.Core, Pine.CodeGen and Pine.IO. There is also a small testing project.
Pine.IO is part of the PineDL standard API, although it has very few functions right now.
I'm hoping to have something to show before the end of the year.
Saturday, July 10, 2010
Regarding function ambiguity
Yet another ambiguity can be seen in PineDL: function types vs. function declarations.
Consider the following (valid) expression:
function() return 0;
The above is a function that takes no arguments and returns 0.
Now, consider this in context:
var foo = function() return 0;;
Note that the double ';' is needed (the first ends the function statement, the second ends the variable declaration).
The 'function' keyword is, however, a type as well.
const mytype = function;
Being a type constant, operations can be applied to function, including calling:
const x = function();
Now, consider this piece of code:
x = function();;
What can this be?
1. Function constant calling, assigning the result to variable x, and finally an empty statement.
2. Function declaration, assigning the function to x.
But which?
Of course, some cases are obviously one of the two:
{var x = function();} //Obviously 1
var x = function(){}; //Obviously 2
But the problem is in edge cases.
There are a few solutions:
1. Demand functions to be declared with {}
2. Prefer one of the cases (problematic)
3. Demand functions to be declared with {}, but allow functions to be declared using the \foo() syntax, and allow those to have no {}. In addition, make sure \ is not a valid type constant.
I haven't decided which yet.
Saturday, July 3, 2010
Getting LLVM to work with MinGW
Today, I tried to make LLVM work on Windows with MinGW.
When trying to use LLVM, it is very clear that it isn't a "pure" Windows project. For instance, plain MinGW won't work. While I expected to have to install mingw32-make, I also had to install MSYS, which I particularly hate.
After downloading the source code, there were a few problems I had to face:
1. "Could not rename: File exists"
c:\MinGW\bin\ranlib.exe: unable to rename 'c:/Users/HP530/Downloads/llvm-2.7/llvm-2.7/llvm-2.7/Release/lib/libLLVMX86CodeGen.a'; reason: File exists
make[3]: *** [/c/Users/HP530/Downloads/llvm-2.7/llvm-2.7/llvm-2.7/Release/lib/libLLVMX86CodeGen.a] Error 1
These errors were annoying. To solve them, I first had to go to the Release/lib folder and see if the indicated file existed. If so, I had to delete it and try again. If not, just running make again would make it work.
2. Figuring out the correct command order when compiling a sample "hello world!" project. Here is what I used:
g++ `llvm-config --cxxflags` -o hworld.o -c hworld.cpp
g++ `llvm-config --ldflags` -o hworld.exe hworld.o `llvm-config --libs` -limagehlp -lpsapi
The -lpsapi and -limagehlp weren't obvious, even when looking at LLVM's documentation. I happened to find them mentioned in a mailing list while googling for the problem.
3. Threading
For this one, Stack Overflow helped me. Lots of undefined references to __imp__pthreads functions. To solve it, I had to run ./configure with --disable-threads.
Now that I've got this out of the way, I hope I'll finally be able to learn LLVM.
References:
http://stackoverflow.com/questions/2129263/how-to-build-llvm-using-gcc-4-on-windows
http://osdir.com/ml/compilers.llvm.devel/2005-09/msg00067.html
http://llvm.org/docs/CommandGuide/html/llvm-config.html
Thursday, May 20, 2010
Ambiguity with assignment
I just collided with an annoying ambiguity in my own language.
On one hand, one of the types of right lists is the assignment right list, which is a left list, an assignment symbol and another right list, like this:
x, y = y, x;
This is very useful (for example, the code above could be used for variable swapping).
However, not everything in PineDL is a list.
For instance, array accessing is made with a simple expression, like this:
a[b + c]
And, of course, we want assignment expressions like we have in C:
a[b = c]
This will set b to the same value as c and access that index.
Of course, now we have two types of assignment: list assignment and expression assignment.
This wouldn't be a problem if expressions and lists were isolated, but they are not.
Consider the following code in the context of a right list:
a, b = c, d
Is that a list assignment (as in left list = right list, which is a valid statement as we saw before), or is it a plain right list with three elements, where the second is an assignment?
a, (b = c), d
This is a problem, and one we must solve.
There are a few possible solutions to this ambiguity:
1. Prevent list assignment - Undesirable because the swap use case and much of the language is made around this convenient operation;
2. Prevent value assignment - Not bad because it can be emulated anyway using the list compacter operator, although inconvenient
a[(b = c)] instead of a[b=c]
3. Allow both, but specifically mention how to get around this ambiguity. Kind of like the dangling else problem in C, which is clearly defined.
I'm going with the third option. I'm going to define that, when possible, list assignment is preferred to expression assignment.
So:
a, b = c, d; //List assignment
I believe this solves the problem. Of course, you can still force PineDL to emulate the other meaning, like this:
3, b = c, d;
//Expression assignment, because 3, b is not a valid left list
a, b = c, 3;
//List assignment, because c, 3 is a valid right list
a, b = c, d = e, f;
//Two list assignments
a, b = 3, c = d, e;
//List assignment, where the right operand is a list that
// includes an expression assignment
//Same end result as:
//a, b = 3, (c = d), e;
a, b = c, d = e, 3;
//Two list assignments, because e, 3 is a valid right list.
a, (b = c), d;
Of course, b = c is a list assignment itself (remember this is the list compacter, so it applies to a right list), but nobody has to know, right? For all purposes, this behaves as expected.
In fact, you can write
a, (b = c), d = e;
This works because (b = c) automatically means it cannot be a left list.
Note that while (b = c) is itself a list assignment (although it behaves like an expression assignment for all practical purposes), d = e is a real expression assignment.
Thursday, May 13, 2010
How to make a programming language - Part III
So far we have already briefly covered the basics of lexers.
In the previous article, our proto-compiler learned how to deal with expressions such as "42+0".
However, if you played around with it, you might have hit a limitation:
If we just added a multiplication operator, then "1+2*3" would result in 9, instead of the intended 7.
This makes sense considering how we implemented our calculator. It just continues executing operations without concern for precedence.
This could get even more complex if we consider cases such as "3*(2+1) == 4*2". So we will have to teach our calculator how to deal with precedence.
The simplest approach I know of (which happens to be widely used by compilers) is using trees to represent the document (by document, I mean the entire file or string that's meant to be interpreted/compiled).
A bit about trees
A tree is a very specific data structure. The concept is something like this: we have a parent (the root), which has children. Each child may, in turn, have other children, and so on. Each element of the tree, regardless of whether it has children or not, is known as a node.
If a node has no children, then it is called a leaf. Two nodes with the same parent are known as siblings. Each node can have multiple children, but only a single parent!
I could try to draw a tree here, but Blogspot (and HTML in general) dislikes spaces, so drawing trees in a visually appealing way is hard. Instead, I will link to a Wikipedia image and hope it doesn't get changed: http://en.wikipedia.org/wiki/File:Binary_tree_structure.svg
Assuming it isn't changed, that image represents a tree. The root node is "Encyclopaedia". Both "Science" and "Culture" are children of the root node. The "Science", "Art" and "Craft" nodes are leaves. "Art" and "Craft" are children of "Culture".
How can trees solve the precedence problem?
Our calculator is unable to identify operator precedence because it has no hierarchical information. Lexers simply export a list, a bunch of tokens: 1 + 2 * 3, for example.
Trees, however, can be way more specific.
Let's say I want to represent the above expression as a tree.
I would choose to make "+" the root node. This node would have two children: "1" (a leaf), and "*", which in turn would have two children (both leaves): "2" and "3".
However, for "( 1 + 2 ) * 3", the root node would be "*", and "+" would be a child of "*".
When discussing trees in compilers, it is common to hear names such as abstract syntax trees (AST). Although different terms might have different meaning, this basic concept is shared by several implementations (as usual with computers, there are usually multiple solutions for the same problem).
If we call the root the "upper-most node" and each child "below" its respective parent (and at the same level as its siblings), then we could add precedence to our calculator simply by telling it to always execute (or calculate) from the bottom up. This is actually very simple to do. One possible way is to use recursion.
public int calculateSum(SumNode sumNode) {
return calculate(sumNode.left) + calculate(sumNode.right);
}
Where calculate itself would call calculateSum if the provided node was a sum. This is necessary to solve "(1 + 2)*3+4" and would, in fact, implement recursion naturally, since each calculation is delayed until the result of its child nodes is known.
How to create trees from a list of tokens?
Now that we know what trees are, we need to generate one from the list of tokens our lexer generated.
To solve that problem we will need a parser. There are many ways to write parsers. I will only describe one method in detail.
The easy but messy way (my personal favorite)
There is a very simple way to implement this, but if your language is complex enough (many precedence levels), it will easily take a few hundred, possibly a few thousand, lines of code. This approach starts small but grows very quickly.
To implement this method, we start with the leaves. In this case, the number constants.
Here we are not really going to create a tree in memory. Instead, we are going to calculate the result as it would be generated (following the same method, the same order, but exporting a value instead of the tree).
We'll start with number parsing (numbers are our leaves):
public int parseNumber(out bool successful) {
    Token t;
    //If the next token is not a number
    if (!((t = getNextToken()) is NumberConstant)) {
        successful = false;
        return -1; //Any value will work here, since it'll just be discarded
    }
    //We need a way to consume tokens we have accepted,
    //so spendToken() advances past the current one.
    spendToken();
    successful = true;
    return ((NumberConstant) t).numberValue;
}
Now, to implement parenthesis expressions, we are going to add another method:
public int parseParenthesisExpression(out bool successful) {
    Token t;
    //If the next token is not a parenthesis
    if (!((t = getNextToken()) is ParenthesisSymbol)) {
        successful = false;
        return -1; //Any value will work here, since it'll just be discarded
    }
    spendToken(); //Spend the '(' token
    int exprResult = parseExpression();
    spendToken(); //Spend the ')' token
    successful = true;
    return exprResult;
}
Now that we have the two basic units of our calculator, we are going to need a way to unite them:
public int parseUnit() {
    bool success;
    int result = parseNumber(out success);
    if (success) return result;
    result = parseParenthesisExpression(out success);
    if (success) return result;
    throw new ParserException("Syntax error");
}
Every tree will, at some point, contain our "units". Always. Of course, parentheses in turn contain other expressions, but those still have precedence over what's outside them. More on why we consider the parenthesis grouping to be a unit will come later.
Now that we have the units, we can go to the next level of precedence. If we consider only two operators, "+" and "*", then "*" has priority. So we're going to start with that.
public int parseMultiplication() {
    int left = parseUnit();
    Token t = getNextToken();
    //More on why this early return happens later
    if (!(t is MultiplicationSymbol)) return left;
    spendToken();
    //Recurse at the same precedence level, so "2*3+4" is not parsed as "2*(3+4)"
    return left * parseMultiplication();
}
Now, I believe some explanation is due here. The reason why we check whether something is a multiplication symbol, and return the unit if it isn't, is that we want to propagate values.
In other words, the sum operation will never directly interface with parseUnit. It will instead access parseMultiplication, even if there are no multiplications.
For a parseSum(), one would replace "int left = parseUnit();" with "int left = parseMultiplication();". Thanks to the little extra return in parseMultiplication() above, this works even for expressions like "1+2", which contain no multiplications at all.
Finally, our parseExpression() would simply call whatever has the least precedence ("+", in this case).
It is worth noting two limitations of this stub implementation:
- It never checks for the end of the token list, so it will crash once the input runs out. Still, this can be fixed with only a few lines of code;
- It does not yet support multiple operators at the same precedence level. However, this is very simple to add: just make parseSum() and parseMinus(), for example, a single function rather than two separate ones.
So why are parentheses units?
To prevent infinite recursion and stack overflows, we must make sure our implementation does not simply loop from the bottom of the grammar back to the top. That is why we did not make parseMultiplication() call parseExpression() directly for its left operand.
If we did not treat parentheses as a unit, then "(1 + 2)" would be supported, but "(1) + (2)" would not.
Any alternative implementations?
This method can get a little messy. LLVM's Kaleidoscope tutorial uses a different one, which I haven't had the opportunity to try so far, so I don't know how the two compare in terms of complexity, lines of code and readability.
Well, that closes today's session. We are still very far from a working compiler, but the concepts covered here will be needed in future lessons. So until next time!
Monday, April 26, 2010
How to make a programming language - Part II
At this point, we've seen what computers understand and what they don't.
So, at this point, our real challenge is getting the computer to understand our language. But how do you do that?
There are two main ways for this:
1. Compile the language code to machine code;
2. Write an application that reads and executes the code directly(interpreters).
There are, in fact, many intermediate ways - mixes of the two mentioned above.
The good news is, they share several concepts, so we can discuss them generically.
But first, what is text?
Text, for computers, is merely a bunch of bytes. Thanks to useful standards such as ASCII or UTF-8, it is possible to know what those bytes mean.
But even then, these standards only allow us to understand what characters there are in a text.
In the code
1+2
the computer sees only 3 characters.
Of course, we can go through those characters, and keep track of a result.
So, in pseudo-code, we could write:
int execute_calculation() {
    const int PLUS = 0;
    const int MINUS = 1;
    int result = 0;
    int mode = PLUS;
    /*
    Some of the functions used here do not really exist,
    but we could implement or replace them with little effort.
    */
    while (!end_of_file_reached()) {
        int nextChar = read_next_char();
        if (isdigit(nextChar)) {
            if (mode == PLUS) result += nextChar - '0';
            else result -= nextChar - '0';
        }
        if (nextChar == '+') mode = PLUS;
        if (nextChar == '-') mode = MINUS;
    }
    return result;
}
The code above would probably work. However, as soon as we feed it more complex input, such as:
12+34-5
it will fail miserably. We can, of course, improve the code to handle these situations, but that will increase its complexity.
Wouldn't it be nice if the computer could just look at "12" and read it as one number, rather than two characters?
Enter the world of lexers. These have one job: to take a document and output a sequence of units, which I will call tokens.
It is worth noting that there are several ways to implement lexers, but I will only implement one here.
First of all, we could define our Token structure or class.
class Token {
public:
    int type;
    int number;
    char symbol;
};
Here, "type" records what kind of token we have: is it a number (12) or a symbol (+)?
If it is a number, then our number field will store it. Otherwise, our symbol will store the '+' or '-' character.
#include <vector>
using namespace std;
vector<Token> ReadTokens() {
vector<Token> tokens;
while (!end_of_file_reached()) {
int ch = read_next_char();
if (is_digit(ch)) {
int number = ch - '0';
//Once again, I warn that next_char_is_digit() is not a real C function
//And it would have to be implemented.
while (next_char_is_digit())
number = number * 10 + read_next_char() - '0';
//If we have 1 and find 2, then we multiply 1 by 10 and sum 2(to get 12)
Token t;
t.type = 'N'; //Number
t.number = number;
tokens.push_back(t);
}
else if (ch == ' ') ; //Ignore white spaces
else {
Token t;
t.type = 'S'; //Symbol
t.symbol = ch;
tokens.push_back(t);
}
}
return tokens;
}
At this point, we have a vector of "tokens", very easy to understand.
As a great side effect, we also managed to throw away unwanted spaces.
We could now just try to rewrite the first function of this post to use tokens instead of characters, and we'd have a much more reliable simple calculator.
Shall we try?
int execute_calculation() {
    int result = 0;
    const int PLUS = 0;
    const int MINUS = 1;
    int mode = PLUS;
    vector<Token> tokens = ReadTokens();
    for (vector<Token>::size_type i = 0; i < tokens.size(); ++i) {
        if (tokens[i].type == 'N') {
            if (mode == PLUS) result += tokens[i].number;
            else result -= tokens[i].number;
        } else {
            if (tokens[i].symbol == '+') mode = PLUS;
            else mode = MINUS;
        }
    }
    return result;
}
So now we have a much better working calculator (at least it can deal with numbers that have multiple digits).
Of course, we don't want a calculator - we want a language - but this little calculator example is critical. Many (if not most) compilers and interpreters rely on lexers, and lexers greatly simplify otherwise ridiculously complex tasks.
This will be all for today.
Friday, April 23, 2010
How to make a programming language - Part I
How do you make a programming language?
I was recently asked this question, after telling some people I was making one. We programmers regularly use programming languages without thinking that they, too, had to be invented some day.
There are two ways to read the question above, though. The first is how to design a language. The second is how to implement it. Imagining stuff appears easy - “it'll be something awesome, and it'll be pink” - but then when we move to implementation, problems surface. Will it all be pink? What happens when we're out of pink paint? Is it acceptable to mix red and white?
To design a language without caring about implementation is a recipe for disaster. At best, time is wasted before programmers notice the flaws. At worst, the resources spent on a doomed project might seriously put organizations at risk. Problems will appear, so it is best to be ready.
The best way to be ready is to always remember that useful languages must be implemented one day, and for that we have to know how languages are implemented. So, inevitably, we arrive at the second question.
So, how are existing languages implemented? C, Java, Assembly, PHP... How does the computer's CPU understand those languages?
The real answer: it doesn't.
I am serious. It doesn't.
No, it isn't the GPU. No, not the motherboard either. And don't even think of suggesting the hard drive or the RAM.
The computer's CPU gives no special treatment to C nor Java. It doesn't even give special treatment to Assembly. It understands none of those languages. This concept is very important.
There is one thing, and only one thing, the CPU understands: machine code.
Because each type of Assembly language is designed for a specific type of machine code (in turn designed for a specific type of CPU: x86 assembly for x86, MIPS assembly for MIPS, etc.), we may be under the illusion that the computer understands Assembly, when it does not.
But if the computer loves machine code so much, what's the big deal with Assembly, C and other languages? The truth is that CPUs love machine code, but programmers... not so much.
Machine code is typically a binary format, so reading it is very hard for humans. A single wrong bit is enough to make the application fail silently (the program runs, but returns incorrect output) or crash outright. This means we are forced to deal with the complexity of the machine code format on top of the complexity of the problem we are trying to solve.
To simplify this, Assembly was invented. Assemblers take a textual (rather than binary) file, with the instructions in a human-readable format (MOV EAX, 5 instead of a bunch of 0s and 1s or hexadecimal numbers), and translate it to the binary, native equivalent. This makes assemblers very valuable, as they allow working very close to the machine while the program remains readable.
But Assembly is still too low-level. That is, programmers still have to struggle and spend too much time thinking about how to solve even the simplest of problems, when there are truly horribly complex problems out there to solve. Clearly, Assembly is not enough. And so languages have been appearing over the last few decades.
All this to say that, for a computer to understand a language – any language – it must be taught first. And the lesson must be available in something the computer understands, such as machine code [1].
There are two main ways to teach a computer: with a compiler and with an interpreter. With compilers, we convert the code of a given programming language to native code, which can then be executed. With interpreters, we execute the code directly. Compilation is usually much faster than interpretation, so traditionally interpreted languages such as JavaScript now take advantage of compilation through a trick known as Just-In-Time compilation. There is also the possibility of compiling to interpreted bytecodes (intermediate formats).
I might discuss JIT and bytecode eventually, but it's currently not a priority for me, since it is possible to make languages without those concepts.
However, many concepts are shared across the multiple approaches I mentioned before. In a future blog post, I will discuss these concepts, and hopefully demystify programming languages.
---
[1] Programming languages are often implemented in languages such as C, which are themselves converted to machine code. It is possible to use other methods that do not involve translating to machine code but, if we look at what those methods are in turn made of, we will find machine code sooner or later.
Friday, April 16, 2010
Compiler progress - Possible Bytecode API
I have been working on the compiler, and I expect it to soon reach the point where it is good enough for me to start the next phase: bytecode generation.
At this point, the lexer is also essentially finished, even though it still lacks features such as \xFFFF sequences in strings.
My goal here is not to have them both complete yet, but rather to go forward until it becomes possible to have some useful *tests*.
With a tree available, I can check the document to see if it is valid, while I keep generating code at the same time.
To show what I have in mind for a compiler API, what's better than an example?
//Let's consider a sum operation was found
Variable left = parseExpression(sum.left,
    bytecodeManager, tmpVariableManager);
Variable right = parseExpression(sum.right,
    bytecodeManager, tmpVariableManager);
Variable output = tmpVariableManager.fetchNew();
bytecodeManager.appendInstruction(
    new AddInstruction(output, left, right));
left.releaseIfTemporary();
right.releaseIfTemporary();
return output;
For a while statement, I could do something like:
//Let's consider a while statement was found
Label begin = bytecodeGenerator.newLabel();
bytecodeGenerator.assignLabelToNextInstruction(begin);
Variable cond = parseExpression(whileStmt.expr);
Label end = bytecodeGenerator.newLabel();
bytecodeGenerator.appendInstruction(new UnlessInstruction(cond, end));
parseStatement(whileStmt.body, bytecodeManager, tmpVariableManager);
bytecodeGenerator.appendInstruction(new UnconditionalJump(begin));
bytecodeGenerator.assignLabelToNextInstruction(end);
I'm not 100% sure about this API and it'll probably change, but I really like it.
First we turn a sequence of characters into tokens, then the sequence of tokens into a tree, and now the tree into a list of commands, just so that later I can turn that list of commands into a bytecode binary. That's essentially all a compiler needs to do.
Of course, the compiler is nothing without the virtual machine, so I'll have to look into that too.
Sunday, April 11, 2010
Hello, World!
This is the first PineDL blog post, so I suppose I should start by explaining what PineDL is.
Very briefly, PineDL is an interpreted, dynamically typed, object-oriented, imperative programming language.
It was inspired by several programming languages, although the influences of Java, C#, JavaScript and Lua should be the most noticeable.
And what good is a language without a "Hello, World!" program?
function main() {
//This is a comment
console.writeline("Hello, World!");
}
As you can see, it is very similar to other languages.
But of course, each new language must answer the same question: "Why do we need yet another programming language?"
If you look into existing dynamically typed programming languages, you will notice that none of them offers exactly what PineDL intends to have.
Perhaps the syntax doesn't match the good old {;} pattern, or maybe full object-oriented programming with access modifiers isn't available (JavaScript, for instance).
PineDL intends to pick the "good parts" of many languages to make a better one.
Note, however, that PineDL brings some interesting concepts that haven't been mainstream so far, including multiple (even named!) return values and a "protected" access modifier that doesn't suck!
PineDL needs essentially two components:
- The compiler - pinec;
- The virtual machine - pinevm.
Neither is complete yet.
Feel free to get the Mercurial source code and to read PineDL's wiki on the google code page:
http://code.google.com/p/pinedl/