Monday, April 26, 2010

How to make a programming language - Part II

At this point, we've seen what computers understand and what they don't.
So our real challenge now is getting the computer to understand our language. But how do you do that?
There are two main ways to do this:
1. Compile the language code to machine code;
2. Write an application that reads and executes the code directly (an interpreter).

There are, in fact, many intermediate ways - mixes of the two mentioned above.
The good news is, they share several concepts, so we can discuss them generically.

But first, what is text?
Text, for computers, is merely a bunch of bytes. Thanks to useful standards such as ASCII or UTF-8, it is possible to know what those bytes mean.
But even then, these standards only allow us to understand what characters there are in a text.
In the code,

1+2

The computer sees only 3 characters.
Of course, we can go through those characters, and keep track of a result.
So, in pseudo-code, we could write:

int execute_calculation() {
  const int PLUS = 0;
  const int MINUS = 1;
  int result = 0;
  int mode = PLUS;
  /*
  Some of the functions used here do not really exist,
  but we could implement or replace them with little effort.
  */
  while (!end_of_file_reached()) {
    int nextChar = read_next_char();
    if (isdigit(nextChar)) {
      if (mode == PLUS) result += nextChar - '0';
      else result -= nextChar - '0';
    }
    if (nextChar == '+') mode = PLUS;
    if (nextChar == '-') mode = MINUS;
  }
  return result;
}


The code above would probably work; however, as soon as we feed it more complex code, such as:

12+34-5

It will fail miserably. We can, of course, improve the code to handle these situations, but that would increase its complexity.
Wouldn't it be nice if the computer could just look at "12" and read it as a number, rather than as two characters?

Enter the world of lexers. These have one job: to take a document and output a sequence of units - which I will call tokens.
It is worth noting that there are several ways to implement lexers, but I will only implement one here.

First of all, we could define our Token structure or class.

class Token {
public:
  int type;

  int number;
  char symbol;
};


Here, our "type" tells us what kind of token we have - is it a number (12) or a symbol (+)?
If it is a number, then our number field will store it. Otherwise, symbol will store the '+' or '-' character.

using namespace std;

vector<Token> ReadTokens() {
  vector<Token> tokens;
  while (!end_of_file_reached()) {
    int ch = read_next_char();
    if (isdigit(ch)) {
      int number = ch - '0';
      //Once again, I warn that next_char_is_digit() is not a real C function
      //and it would have to be implemented.
      while (next_char_is_digit())
        number = number * 10 + read_next_char() - '0';
      //If we have 1 and find 2, we multiply 1 by 10 and add 2 (to get 12)
      Token t;
      t.type = 'N'; //Number
      t.number = number;
      tokens.push_back(t);
    }
    else if (ch == ' ') ; //Ignore white space
    else {
      Token t;
      t.type = 'S'; //Symbol
      t.symbol = ch;
      tokens.push_back(t);
    }
  }
  return tokens;
}



At this point, we have a vector of "tokens", which is much easier to work with.
As a nice side effect, we also managed to throw away unwanted spaces.
We can now rewrite the first function of this post to use tokens instead of characters, and we'll have a much more reliable simple calculator.
Shall we try?


int execute_calculation() {
  int result = 0;
  const int PLUS = 0;
  const int MINUS = 1;
  int mode = PLUS;
  vector<Token> tokens = ReadTokens();
  for (vector<Token>::size_type i = 0; i < tokens.size(); ++i) {
    if (tokens[i].type == 'N') {
      if (mode == PLUS) result += tokens[i].number;
      else result -= tokens[i].number;
    } else {
      if (tokens[i].symbol == '+') mode = PLUS;
      else mode = MINUS;
    }
  }
  return result;
}


So now we have a much better working calculator (at least it can deal with numbers with multiple digits).
Of course, we don't want a calculator - we want a language - but this little calculator example is critical. Many (if not most) compilers and interpreters rely on lexers, and lexers greatly simplify otherwise ridiculously complex tasks.

This will be all for today.

Friday, April 23, 2010

How to make a programming language - Part I

How do you make a programming language?

I was recently asked this question after telling some people I was making one. We programmers use programming languages regularly, without thinking that they, too, had to be invented some day.

There are two ways to read the question above, though. The first is how to design a language. The second is how to implement it. Imagining stuff appears easy - “it'll be something awesome, and it'll be pink” - but then when we move to implementation, problems surface. Will it all be pink? What happens when we're out of pink paint? Is it acceptable to mix red and white?

To design a language without caring about implementation is a recipe for disaster. At best, time is wasted before programmers notice the flaws. At worst, the resources spent on a doomed project might seriously put organizations at risk. Problems will appear, so it is best to be ready.

The best way to be ready is to always remember that useful languages must be implemented one day, and for that we have to know how languages are implemented. So, inevitably, we arrive at the second question.

So, how are existent languages implemented? C, Java, Assembly, PHP... How does the computer's CPU understand those languages?

The real answer: it doesn't.
I am serious. It doesn't.
No, it isn't the GPU. No, not the motherboard either. And don't even think of suggesting the hard drive or the RAM.

The computer's CPU gives no special treatment to C or Java. It doesn't even give special treatment to Assembly. It understands none of those languages. This concept is very important.

There is one thing, and only one thing, the CPU understands: machine code.

Because each type of Assembly language is specifically designed for one type of machine code (in turn designed for one type of CPU: x86 assembly for x86, MIPS assembly for MIPS, etc.), we might be under the illusion that Assembly is understood by the computer when it is not.

But if the computer loves machine code so much, what's the big deal with Assembly, C and other languages? The truth is that CPUs love machine code, but programmers... not so much.

Machine code is typically a binary format, so reading it is very hard for humans. A single wrong bit is enough to make an application fail silently (executing the program but returning incorrect output, possibly crashing). This means we're forced to deal with the complexity of the machine code format in addition to the complexity of the problem we're trying to solve.
To simplify this, Assembly was invented. Assemblers take a textual (instead of binary) file, with the instructions in a human-readable format (MOV EAX, 5 instead of a bunch of 0s and 1s or hexadecimal numbers), and translate it into its binary, native equivalent. This makes assemblers very valuable, as they allow working very close to the machine while keeping the program readable.

But Assembly is still too low-level. That is, programmers still have to struggle and spend too much time thinking about how to solve even the simplest of problems, when there are truly horribly complex problems out there to solve. Clearly, Assembly is not enough. And so languages have been appearing over the last few decades.

All this to say that, for a computer to understand a language – any language – it must be taught first. And the lesson must be available in something that the computer understands, such as the machine code [1].

There are two main ways to teach a computer: with a compiler and with an interpreter. With compilers, we convert the code of a given programming language to native code, which can then be executed. With interpreters, we execute the code directly. Compiled code usually runs much faster than interpreted code, so traditionally interpreted languages such as JavaScript now take advantage of compilation through a technique known as Just-In-Time compilation. There is also the possibility of compiling to interpreted bytecodes (intermediate formats).

I might discuss JIT and bytecode eventually, but it's currently not a priority for me, since it is possible to make languages without those concepts.

However, many concepts are shared across the multiple approaches I mentioned before. In a future blog post, I will discuss these concepts, and hopefully demystify programming languages.

---

[1] Programming languages are often implemented in languages such as C, which are converted to machine code anyway. It is possible to use other methods that do not involve translating to machine code but, if we look at what those methods are themselves implemented in, we will find machine code sooner or later.

Friday, April 16, 2010

Compiler progress - Possible Bytecode API

I have worked on the compiler and I expect it to reach the point where it is good enough for me to start the next phase: bytecode generation.

At this point, the lexer is also essentially finished, even though it still lacks features such as \xFFFF sequences in strings.

My goal here is not to have them both complete yet, but rather to push forward until it becomes possible to have some useful *tests*.
With a tree available, I can check whether the document is valid while generating code at the same time.

To show what I have in mind for a compiler API, what's better than an example?
//Let's consider a sum operation was found
Variable left = parseExpression(sum.left,
    bytecodeManager, tmpVariableManager);
Variable right = parseExpression(sum.right,
    bytecodeManager, tmpVariableManager);
Variable output = tmpVariableManager.fetchNew();
bytecodeManager.appendInstruction(
    new AddInstruction(output, left, right));
left.releaseIfTemporary();
right.releaseIfTemporary();
return output;
For a while statement, I could do something like:
//Let's consider a while statement was found
Label begin = bytecodeManager.newLabel();
bytecodeManager.assignLabelToNextInstruction(begin);
Variable cond = parseExpression(whileStmt.expr,
    bytecodeManager, tmpVariableManager);
Label end = bytecodeManager.newLabel();
bytecodeManager.appendInstruction(new UnlessInstruction(cond, end));
parseStatement(whileStmt.body, bytecodeManager, tmpVariableManager);
bytecodeManager.appendInstruction(new UnconditionalJump(begin));
bytecodeManager.assignLabelToNextInstruction(end);
I'm not 100% sure about this API and it'll probably change, but I really like it.
First, a sequence of characters is turned into tokens; then the sequence of tokens is turned into a tree; now the tree is turned into a list of commands, just so that later I can turn that list of commands into a bytecode binary. That's essentially all a compiler needs to do.
Of course, the compiler is nothing without the virtual machine, so I'll have to look into that too.

Sunday, April 11, 2010

Hello, World!

This is the first PineDL blog post, so I suppose I should start by explaining what PineDL is.
Very briefly, PineDL is an interpreted, dynamically typed, object oriented, imperative programming language.
It was inspired by several programming languages, although the influences of Java, C#, JavaScript and Lua should be the most noticeable.
And what good is a language without a "Hello, World!" program?
function main() {
  //This is a comment
  console.writeline("Hello, World!");
}
As you can see, it is very similar to other languages.
But of course, each new language must answer the same question: "Why do we need yet another programming language?"

If you look into existing dynamically typed programming languages, you will notice none of them offers exactly what PineDL intends to have.
Perhaps the syntax doesn't match the good old {;} pattern, or maybe full object-oriented programming with access modifiers isn't available (JavaScript, for instance).
PineDL intends to pick the "good parts" of many languages to make a better one.
Note, however, that PineDL brings some interesting new concepts that haven't been mainstream so far, including multiple (even named!) returns and a "protected" access modifier that doesn't suck!

PineDL needs essentially two components:
- The compiler - pinec;
- The virtual machine - pinevm.

Neither is complete yet.
Feel free to get the Mercurial source code and to read PineDL's wiki on the google code page:
http://code.google.com/p/pinedl/