How can a compiler compile itself?

Compilation Problem Overview


I am reading about CoffeeScript on its website, http://coffeescript.org/, which says:

> The CoffeeScript compiler is itself written in CoffeeScript

How can a compiler compile itself, or what does this statement mean?

Compilation Solutions


Solution 1 - Compilation

The first version of a compiler can't be built from source code written in the language it compiles, because no compiler for that language exists yet; your confusion is understandable. A later version of the compiler with more language features (with its source rewritten in the first version of the new language) can be built by the first compiler. That version can then compile the next compiler, and so on. Here's an example:

  1. The first CoffeeScript compiler is written in Ruby, producing version 1 of CoffeeScript
  2. The source code of the CS compiler is rewritten in CoffeeScript 1
  3. The original CS compiler compiles the new code (written in CS 1) into version 2 of the compiler
  4. Changes are made to the compiler source code to add new language features
  5. The second CS compiler (the first one written in CS) compiles the revised new source code into version 3 of the compiler
  6. Repeat steps 4 and 5 for each iteration

Note: I'm not sure exactly how CoffeeScript versions are numbered; the version numbers above are only illustrative.

This process is usually called bootstrapping. Another example of a bootstrapping compiler is rustc, the compiler for the Rust language.
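
The rule behind the steps above is that each new compiler source may only be written with features that the previous compiler binary already accepts. Here is a minimal toy model of that rule in C. Everything in it (the feature flags, the struct names, the compile function) is hypothetical and invented for illustration; it does not describe how the real CoffeeScript compiler was built.

```c
#include <stdbool.h>

/* Toy feature flags for an imaginary language (hypothetical names). */
enum { FEAT_CORE = 1, FEAT_NEW = 2 };

/* Compiler source code: which features its own code is written with,
 * and which features it knows how to translate. */
struct source {
    unsigned uses;
    unsigned implements;
};

/* A built, runnable compiler: all that matters here is what it accepts. */
struct binary {
    unsigned accepts;
};

/* Run compiler `cc` on `src`; on success store the result in `*out`.
 * Fails when `src` is written with a feature that `cc` cannot translate. */
bool compile(struct binary cc, struct source src, struct binary *out)
{
    if ((src.uses & ~cc.accepts) != 0)
        return false;
    out->accepts = src.implements;
    return true;
}
```

Start with a binary that accepts only FEAT_CORE (standing in for the compiler built from the Ruby source). It can build a source that uses FEAT_CORE and implements FEAT_CORE | FEAT_NEW, and the resulting binary can then build a source that itself uses FEAT_NEW. Feeding that last source directly to the first binary fails, which is why steps 4 and 5 have to alternate.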

Solution 2 - Compilation

In the paper Reflections on Trusting Trust, Ken Thompson, one of the originators of Unix, writes a fascinating (and easily readable) overview of how the C compiler compiles itself. Similar concepts can be applied to CoffeeScript or any other language.

The idea of a compiler that compiles its own code is vaguely similar to a quine: source code that, when executed, produces as output the original source code. Here is one example of a CoffeeScript quine. Thompson gave this example of a C quine:

char s[] = {
	'\t',
	'0',
	'\n',
	'}',
	';',
	'\n',
	'\n',
	'/',
	'*',
	'\n',
    … 213 lines omitted …
	0
};

/*
 * The string s is a representation of the body
 * of this program from '0'
 * to the end.
 */

main()
{
	int i;
	
	printf("char\ts[] = {\n");
	for(i = 0; s[i]; i++)
		printf("\t%d,\n", s[i]);
	printf("%s", s);
}

Next, you might wonder how the compiler is taught that an escape sequence like '\n' represents ASCII code 10. The answer is that somewhere in the C compiler, there is a routine that interprets character literals, containing some conditions like this to recognize backslash sequences:

c = next();
if (c != '\\') return c;        /* A normal character */
c = next();
if (c == '\\') return '\\';     /* Two backslashes in the code means one backslash */
if (c == 'r')  return '\r';     /* '\r' is a carriage return */

So, we can add one condition to the code above…

if (c == 'n')  return 10;       /* '\n' is a newline */

… to produce a compiler that knows that '\n' represents ASCII 10. Interestingly, that compiler, and all subsequent compilers compiled by it, "know" that mapping, so in the next generation of the source code, you can change that last line into

if (c == 'n')  return '\n';

… and it will do the right thing! The 10 comes from the compiler, and no longer needs to be explicitly defined in the compiler's source code (see footnote 1).

That is one example of a C language feature that was implemented in C code. Now, repeat that process for every single language feature, and you have a "self-hosting" compiler: a C compiler that is written in C.
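
To see the fragments above as one piece, here is a small self-contained sketch of such a routine. The function name decode_char_literal and its pointer-based interface are my own assumptions for illustration; they are not taken from Thompson's paper or from any real C compiler. Note that the 'n' case contains no literal 10: the value is supplied by whichever compiler builds this code.

```c
/*
 * Hypothetical helper: decodes one character-literal body (the text
 * between the single quotes) starting at *p, and advances *p past it.
 * Assumes the input is well formed.
 */
int decode_char_literal(const char **p)
{
    int c = (unsigned char)*(*p)++;
    if (c != '\\')
        return c;               /* a normal character */
    c = (unsigned char)*(*p)++;
    switch (c) {
    case '\\': return '\\';     /* two backslashes mean one backslash */
    case 'r':  return '\r';     /* value comes from the compiler building this file */
    case 'n':  return '\n';     /* likewise: no explicit 10 appears here */
    default:   return c;        /* unknown escapes pass through (a simplification) */
    }
}
```

Called on the two characters backslash and n, this returns 10 when built by any compiler that uses ASCII, even though the number 10 never appears in the source.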


1 The plot twist described in the paper is that since the compiler can be "taught" facts like this, it can also be mis-taught to generate trojaned executables in a way that is difficult to detect, and such an act of sabotage can persist in all compilers produced by the tainted compiler.

Solution 3 - Compilation

You have already gotten a very good answer, but I want to offer you a different perspective that will hopefully be enlightening. Let's first establish two facts that we can both agree on:

  1. The CoffeeScript compiler is a program which can compile programs written in CoffeeScript.
  2. The CoffeeScript compiler is a program written in CoffeeScript.

I'm sure you can agree that both #1 and #2 are true. Now, look at the two statements. Do you see now that it is completely normal for the CoffeeScript compiler to be able to compile the CoffeeScript compiler?

The compiler doesn't care what it compiles. As long as it's a program written in CoffeeScript, it can compile it. And the CoffeeScript compiler itself just happens to be such a program. The CoffeeScript compiler doesn't care that it's the CoffeeScript compiler itself it is compiling. All it sees is some CoffeeScript code. Period.

> How can a compiler compile itself, or what does this statement mean?

Yes, that's exactly what that statement means, and I hope you can see now how that statement is true.

Solution 4 - Compilation

> How can a compiler compile itself, or what does this statement mean?

It means exactly that. First of all, some things to consider. There are four objects we need to look at:

  • The source code of any arbitrary CoffeeScript program
  • The (generated) assembly of any arbitrary CoffeeScript program (for CoffeeScript, the generated output is actually JavaScript)
  • The source code of the CoffeeScript compiler
  • The (generated) assembly of the CoffeeScript compiler

Now, it should be obvious that you can use the generated assembly - the executable - of the CoffeeScript compiler to compile any arbitrary CoffeeScript program, and generate the assembly for that program.

Now, the CoffeeScript compiler itself is just an arbitrary CoffeeScript program, and thus, it can be compiled by the CoffeeScript compiler.

It seems that your confusion stems from the fact that when you create your own new language, you don't yet have a compiler that you can use to compile your compiler. This surely looks like a chicken-and-egg problem, right?

Enter the process called bootstrapping.

  1. You write a compiler in an already existing language (in the case of CoffeeScript, the original compiler was written in Ruby) that can compile a subset of the new language.
  2. You write a compiler that can compile a subset of the new language in the new language itself. You can only use language features the compiler from the step above can compile.
  3. You use the compiler from step 1 to compile the compiler from step 2. This leaves you with an assembly that was originally written in a subset of the new language, and that is able to compile a subset of the new language.

Now you need to add new features. Say you have only implemented while-loops, but also want for-loops. This isn't a problem, since you can rewrite any for-loop as a while-loop. This means you can only use while-loops in the source code of your compiler, since the assembly you have at hand can only compile those. But you can write functions inside your compiler that parse and compile for-loops. Then you use the assembly you already have to compile the new compiler version. And now you have an assembly of a compiler that can also parse and compile for-loops! You can now go back to the source file of your compiler, and rewrite any while-loops you don't want into for-loops.
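
As a small illustration of the rewrite, here is the same loop written both ways. C is used only because it has both loop forms; the function names are made up for this sketch. The initialisation moves in front of the loop and the step moves to the end of the body. A loop body containing continue would need more care, so treat this as the simple case only.

```c
/* Sum of 0 .. n-1 written with a for-loop: the new feature. */
int sum_with_for(int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += i;
    return total;
}

/* The same loop using only while: the form the old compiler accepts. */
int sum_with_while(int n)
{
    int total = 0;
    int i = 0;              /* init clause, moved before the loop */
    while (i < n) {         /* condition clause, unchanged */
        total += i;
        i++;                /* step clause, moved to the end of the body */
    }
    return total;
}
```

Both functions return the same value for every n, which is the property the compiler author relies on when writing the compiler with while-loops first and switching to for-loops later.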

Rinse and repeat until all language features that are desired can be compiled with the compiler.

The while-loops and for-loops were obviously only examples; this works for any new language feature you want. And then you are in the situation CoffeeScript is in now: the compiler compiles itself.

There is much literature out there. Reflections on Trusting Trust is a classic everyone interested in that topic should read at least once.

Solution 5 - Compilation

A small but important clarification

Here the term compiler glosses over the fact that there are two files involved. One is an executable which takes as input files written in CoffeeScript and produces as its output file another executable, a linkable object file, or a shared library. The other is a CoffeeScript source file which just happens to describe the procedure for compiling CoffeeScript.

You apply the first file to the second, producing a third which is capable of performing the same act of compilation as the first (possibly more, if the second file defines features not implemented by the first), and so may replace the first if you so desire.

Solution 6 - Compilation

  1. The CoffeeScript compiler was first written in Ruby.
  2. The CoffeeScript compiler was then re-written in CoffeeScript.

Since the Ruby version of the CoffeeScript compiler already existed, it was used to create the CoffeeScript version of the CoffeeScript compiler.

This is known as a self-hosting compiler.

It's extremely common, and usually results from an author's desire to use their own language to maintain that language's growth.

Solution 7 - Compilation

It's not a matter of compilers here, but a matter of expressiveness of the language, since a compiler is just a program written in some language.

When we say that "a language is written/implemented" we actually mean that a compiler or interpreter for that language is implemented. There are programming languages in which you can write programs that implement the language (are compilers/interpreters for the same language). These languages are called universal languages.

In order to be able to understand this, think about a metal lathe. It is a tool used to shape metal. It is possible, using just that tool, to create another, identical tool, by creating its parts. Thus, that tool is a universal machine. Of course, the first one was created using other means (other tools), and was probably of lower quality. But the first one was used to build new ones with higher precision.

A 3D printer is almost a universal machine. You can print almost the whole 3D printer using a 3D printer (the exception is the tip that melts the plastic).

Solution 8 - Compilation

Proof by induction

Inductive step

The (n+1)th version of the compiler is written in X.

Thus it can be compiled by the nth version of the compiler (also written in X).

Base case

But the first version of the compiler written in X must be compiled by a compiler for X that is written in a language other than X. This step is called bootstrapping the compiler.

Solution 9 - Compilation

Compilers take a high-level specification and turn it into a low-level implementation, such as can be executed on hardware. Therefore there is no relationship between the format of the specification and the actual execution besides the semantics of the language being targeted.

Cross-compilers produce code for a system other than the one they run on; cross-language compilers compile one language specification into another language specification.

Basically, compiling is just translation, usually from a higher-level language to a lower-level one, but there are many variants.

Bootstrapping compilers are the most confusing, of course, because they compile the language they are written in. Don't forget the initial step in bootstrapping, which requires at least a minimal existing version that is executable. Many bootstrapped compilers support a minimal set of language features first and add more complex features later, as long as each new feature can be expressed using the previous ones. If that were not the case, that part of the "compiler" would have to be developed in another language beforehand.

Solution 10 - Compilation

While other answers cover all the main points, I feel it would be remiss not to include what may be the most impressive example known of a compiler which was bootstrapped from its own source code.

Decades ago, a man named Doug McIlroy wanted to build a compiler for a new language called TMG. Using paper and pen, he wrote out source code for a simple TMG compiler... in the TMG language itself.

Now, if only he had a TMG interpreter, he could use it to run his TMG compiler on its own source code, and then he would have a runnable, machine-language version of it. But... he did have a TMG interpreter already! It was a slow one, but since the input was small, it would be fast enough.

Doug ran the source code on that paper on the TMG interpreter behind his eye sockets, feeding it the very same source as its input file. As the compiler worked, he could see the tokens being read from the input file, the call stack growing and shrinking as it entered and exited subprocedures, the symbol table growing... and when the compiler started emitting assembly language statements to its "output file", Doug picked up his pen and wrote them down on another piece of paper.

After the compiler finished execution and exited successfully, Doug brought the resulting hand-written assembly listings to a computer terminal, typed them in, and his assembler converted them into a working compiler binary.

So this is another practical (???) way to "use a compiler to compile itself": Have a working language implementation in hardware, even if the "hardware" is wet and squishy and powered by peanut butter sandwiches!

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | AlexanderRD | View Question on Stackoverflow |
| Solution 1 - Compilation | Ben N | View Answer on Stackoverflow |
| Solution 2 - Compilation | 200_success | View Answer on Stackoverflow |
| Solution 3 - Compilation | Jörg W Mittag | View Answer on Stackoverflow |
| Solution 4 - Compilation | Polygnome | View Answer on Stackoverflow |
| Solution 5 - Compilation | PMar | View Answer on Stackoverflow |
| Solution 6 - Compilation | Trevor Hickey | View Answer on Stackoverflow |
| Solution 7 - Compilation | Paul92 | View Answer on Stackoverflow |
| Solution 8 - Compilation | Guy Argo | View Answer on Stackoverflow |
| Solution 9 - Compilation | jcb | View Answer on Stackoverflow |
| Solution 10 - Compilation | Alex D | View Answer on Stackoverflow |