How do you make a programming language?
I was recently asked this question after telling some people I was making one. We programmers use programming languages every day without thinking that they, too, had to be invented at some point.
There are two ways to read the question above, though. The first is how to design a language. The second is how to implement it. Imagining stuff appears easy - “it'll be something awesome, and it'll be pink” - but then when we move to implementation, problems surface. Will it all be pink? What happens when we're out of pink paint? Is it acceptable to mix red and white?
To design a language without caring about implementation is a recipe for disaster. At best, time is wasted before programmers notice the flaws. At worst, the resources spent on a doomed project might put organizations at serious risk. Problems will appear, so it is best to be ready.
The best way to be ready is to always remember that useful languages must be implemented one day, and for that we have to know how languages are implemented. So, inevitably, we arrive at the second question.
So, how are existing languages implemented? C, Java, Assembly, PHP... How does the computer's CPU understand those languages?
The real answer: it doesn't.
I am serious. It doesn't.
No, it isn't the GPU. No, not the motherboard either. And don't even think of suggesting the hard drive or the RAM.
The computer's CPU gives no special treatment to C or Java. It doesn't even give special treatment to Assembly. It understands none of those languages. This concept is very important.
There is one thing, and only one thing, the CPU understands: machine code.
Because each Assembly language is designed specifically for one kind of machine code (which is in turn designed for one kind of CPU: x86 assembly for x86, MIPS assembly for MIPS, etc.), we might be under the illusion that the computer understands Assembly when it does not.
But if the computer loves machine code so much, what's the big deal with Assembly, C and other languages? The truth is that CPUs love machine code but programmers... not so much...
Machine code is typically a binary format, so reading it is very hard for humans. As little as one wrong bit is enough to make the application fail silently (the program runs but returns incorrect output) or crash. This means we're forced to deal with the complexity of the machine code format in addition to the complexity of the problem we're trying to solve.
To simplify this, Assembly was invented. Assemblers take a textual (instead of binary) file, with the instructions in a human-readable format (MOV EAX, 5 instead of a bunch of 0s and 1s or hexadecimal numbers), and translate it to the equivalent native binary. This makes assemblers very valuable, as they allow working very close to the machine while keeping the program readable.
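To make the translation step concrete, here is a minimal sketch of a toy assembler in Python. It is only an illustration, not how real assemblers are written: it understands a single instruction form (MOV of a constant into a 32-bit register), and the names assemble_mov and REGISTERS are made up for this example. The encoding it emits is the usual x86 one for this instruction: opcode 0xB8 plus the register number, followed by the constant as four little-endian bytes.

```python
import struct

# Register numbers used by the x86 "MOV r32, imm32" encoding.
# Only four registers are listed to keep the sketch short.
REGISTERS = {"EAX": 0, "ECX": 1, "EDX": 2, "EBX": 3}


def assemble_mov(line):
    """Translate one 'MOV <register>, <number>' line into machine code bytes."""
    mnemonic, operands = line.strip().split(None, 1)
    if mnemonic.upper() != "MOV":
        raise ValueError("this toy assembler only knows MOV")
    register, value = [part.strip() for part in operands.split(",")]
    # The opcode is 0xB8 plus the number of the destination register.
    opcode = 0xB8 + REGISTERS[register.upper()]
    # "<I" packs the constant as a 4-byte little-endian unsigned integer.
    # int(value, 0) accepts both decimal ("5") and hexadecimal ("0x10") text.
    return bytes([opcode]) + struct.pack("<I", int(value, 0))


print(assemble_mov("MOV EAX, 5").hex(" "))
```

Running it prints b8 05 00 00 00, the five bytes the CPU reads in place of the text MOV EAX, 5.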
But Assembly is still too low-level. That is, programmers still have to struggle and spend too much time thinking about how to solve even the simplest of problems, when there are truly horribly complex problems out there to solve. Clearly, Assembly is not enough. And so higher-level languages have been appearing over the last few decades.
All this to say that for a computer to understand a language (any language), it must be taught first. And the lesson must be available in something that the computer understands, such as machine code [1].
There are two main ways to teach a computer: with a compiler and with an interpreter. With a compiler, we convert the code of a given programming language to native code, which can then be executed. With an interpreter, we execute the code directly, without first producing native code. Compiled code usually runs much faster than interpreted code, so traditionally interpreted languages such as JavaScript now take advantage of compilation through a technique known as Just-In-Time (JIT) compilation. It is also possible to compile to bytecode (an intermediate format), which is then interpreted.
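The difference between the two approaches can be sketched in a few lines of Python. Everything here is hypothetical and chosen for brevity: the "language" is just nested tuples, the compiler targets a made-up stack machine rather than real native code, and run_stack_code stands in for whatever executes the compiled output (a CPU for native code, or a bytecode interpreter).

```python
# A program in a made-up toy language: 2 + (3 * 4), written as nested tuples.
PROGRAM = ("add", 2, ("mul", 3, 4))


def interpret(node):
    """Interpreter: walk the program and compute the result directly."""
    if isinstance(node, int):
        return node
    op, left, right = node
    a, b = interpret(left), interpret(right)
    return a + b if op == "add" else a * b


def compile_to_stack_code(node):
    """Compiler: translate the program into instructions, without running it."""
    if isinstance(node, int):
        return [("PUSH", node)]
    op, left, right = node
    # Emit code for both operands first, then the operation that combines them.
    return (compile_to_stack_code(left)
            + compile_to_stack_code(right)
            + [(op.upper(), None)])


def run_stack_code(code):
    """Stand-in for the machine that executes the compiler's output."""
    stack = []
    for instruction, argument in code:
        if instruction == "PUSH":
            stack.append(argument)
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if instruction == "ADD" else a * b)
    return stack.pop()


print(interpret(PROGRAM))          # interpreted result
code = compile_to_stack_code(PROGRAM)
print(code)                        # the translated program
print(run_stack_code(code))        # result of running the translation
```

Both paths compute 14. The interpreter does the work while reading the program, whereas the compiler only produces instructions, which something else executes afterwards.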
I might discuss JIT and bytecode eventually, but it's currently not a priority for me, since it is possible to make languages without those concepts.
However, many concepts are shared across the multiple approaches I mentioned before. In a future blog post, I will discuss these concepts, and hopefully demystify programming languages.
---
[1] Programming languages are often implemented in languages such as C, which are converted to machine code anyway. It is possible to use other methods that do not involve translating to machine code but, if we look at what those methods are themselves built on, we will find machine code sooner or later.