Monday, April 26, 2010

How to make a programming language - Part II

At this point, we've seen what computers understand and what they don't.
So, at this point, our real challenge is getting the computer to understand our language. But how do you do that?
There are two main ways for this:
1. Compile the language code to machine code;
2. Write an application that reads and executes the code directly(interpreters).

There are, in fact, many intermediate ways - mixes of the two mentioned above.
The good news is, they share several concepts, so we can discuss them generically.

But first, what is text?
Text, for computers, is merely a bunch of bytes. Thanks to useful standards such as ASCII or UTF-8, it is possible to know what those bytes mean.
But even then, these standards only allow us to understand what characters there are in a text.
In the code,

1+2

The computer sees only 3 characters.
Of course, we can go through those characters, and keep track of a result.
So, in pseudo-code, we could write:

int execute_calculation() {
  const int PLUS = 0;
  const int MINUS = 1;
  int result = 0;
  int mode = PLUS;
/*
Some of the functions used here do not really exist
But we could implement or replace them with little effort.
*/
  while(!end_of_file_reached()) {
    int nextChar = read_next_char();
    if (isdigit(nextChar))
      if (plus) result += nextChar - '0';
      else result -= nextChar - '0';
    if (nextChar == '+') mode = PLUS;
    if (nextChar == '-') mode = MINUS;
  }
  return result;
}


The code above would probably work, however as soon as we add more complex code, such as:

12+34-5

It will fail miserably. We can, of course, improve the code to handle these situations, but that'll increase the complexity of our code.
Wouldn't be nice if the computer could just look at "12" and read it as a number, rather than two characters?

Enter the world of lexers. These have one job: To pick a document and output a bunch of units - which I will call tokens.
It is worth noting that there are several ways to implement lexers, but I will only implement one here.

First of all, we could define our Token structure or class.

class Token {
public:
  int type;

  int number;
  char symbol;
}


Here, our "type" means what kind of token do we have - is it a number(12) or a symbol(+)?
If it is a number, then our number field will store it. Otherwise, our symbol will store the '+' or '-' character.

using namespace std;

vector<Token> ReadTokens() {
  vector<Token> tokens;
  while (!end_of_file_reached()) {
    int ch = read_next_char();
    if (is_digit(ch)) {
      int number = ch - '0';
//Once again, I warn that next_char_is_digit() is not a real C function
//And it would have to be implemented.
      while (next_char_is_digit())
        number = number * 10 + read_next_char() - '0';
       //If we have 1 and find 2, then we multiply 1 by 10 and sum 2(to get 12)
      Token t;
      t.type = 'N'; //Number
      t.number = number;
      tokens.push_back(t);
    }
    else if (ch == ' ') ; //Ignore white spaces
    else {
      int op = ch;
      Token t;
      t.type = 'S'; //Symbol
      t.symbol = ch;
      tokens.push_back(t);
    }
  }
  return tokens;
}



At this point, we have a vector of "tokens", very easy to understand.
As a great side effect, we also managed to throw away unwanted spaces.
We could now just try to rewrite the first function of this post to use tokens instead of characters, and we'd have a much more reliable simple calculator.
Shall we try?


int execute_calculation() {
  int result = 0;
  const int PLUS = 0;
  const int MINUS = 1;
  int mode = PLUS;
  vector<Token> tokens = ReadTokens();
  for (vector<Token>::size_type i = 0; i < tokens.size(); ++i) {
    if (tokens[i].type == 'N')
      if (mode == PLUS) result += tokens[i].number;
      else result -= tokens[i].number;
    else
      if (tokens[i].symbol == '+') mode = PLUS;
      else mode = MINUS;
  }
  return result;
}


So now we have a much better working calculator(at least it can deal with numbers with multiple digits).
Of course, we don't want a calculator - we want a language, but this little calculator example is critical. Many(if not most) compilers and interpreters rely on lexers and lexers greatly simplify otherwise ridiculously complex tasks.

This will be all for today.

No comments:

Post a Comment