Do programming language compilers first translate to assembly or directly to machine code?

Assembly, Gcc, Compilation, Compiler Construction

Assembly Problem Overview


I'm primarily interested in popular and widely used compilers, such as gcc. But if things are done differently with different compilers, I'd like to know that, too.

Taking gcc as an example, does it compile a short program written in C directly to machine code, or does it first translate it to human-readable assembly and only then use an (in-built?) assembler to translate the assembly program into binary machine code, that is, a series of instructions to the CPU?

Is using assembly code to create a binary executable a significantly expensive operation? Or is it a relatively simple and quick thing to do?

(Let's assume we're dealing with only the x86 family of processors, and all programs are written for Linux.)

Assembly Solutions


Solution 1 - Assembly

gcc actually produces assembler and assembles it using the as assembler. Not all compilers do this - the MS compilers produce object code directly, though you can make them generate assembler output. Translating assembler to object code is a pretty simple process, at least compared with compilation.
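To give a feel for why assembling is simple compared with compilation, here is a minimal sketch of a toy assembler in Python. It is not how the as assembler is implemented; the function names and the tiny instruction subset are assumptions made for illustration. The opcode bytes themselves (0x90 for nop, 0xC3 for ret, 0xF4 for hlt, 0xB8 followed by a 32-bit immediate for mov eax) are real 32-bit x86 encodings.

```python
import struct

# Hypothetical toy assembler: a table lookup plus operand encoding.
# Only this tiny subset of x86 (Intel-style syntax) is handled.
SIMPLE_OPCODES = {"nop": b"\x90", "ret": b"\xc3", "hlt": b"\xf4"}


def assemble_line(line: str) -> bytes:
    """Translate one line of (very restricted) assembly into machine code bytes."""
    text = line.split(";")[0].strip().lower()  # drop ';' comments
    if not text:
        return b""
    if text in SIMPLE_OPCODES:
        return SIMPLE_OPCODES[text]
    mnemonic, _, operands = text.partition(" ")
    if mnemonic == "mov":
        dest, src = [part.strip() for part in operands.split(",")]
        if dest == "eax":
            # mov eax, imm32: opcode 0xB8, then the value as 4 little-endian bytes
            return b"\xb8" + struct.pack("<I", int(src, 0) & 0xFFFFFFFF)
    raise ValueError(f"unsupported instruction: {line!r}")


def assemble(source: str) -> bytes:
    """Assemble a whole program by concatenating the bytes of each line."""
    return b"".join(assemble_line(line) for line in source.splitlines())


if __name__ == "__main__":
    print(assemble("mov eax, 42\nret").hex())  # prints b82a000000c3
```

A real assembler also handles labels, relocations, and object file formats, but the core remains a fairly direct mapping from mnemonics to bytes.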

Some compilers produce other high-level language code as their output - for example, cfront, the first C++ compiler produced C as its output which was then compiled by a C compiler.

Note that neither direct compilation nor assembly actually produces an executable. That is done by the linker, which takes the various object code files produced by compilation/assembly, resolves all the names they contain and produces the final executable binary.
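The name resolution step can be sketched roughly as follows. This is a toy model in Python, not a real linker: the dictionary layout for an "object file" (size, defines, references) and the function name link are assumptions made up for this illustration.

```python
# Hypothetical object-file model: each object has a code size, the symbols it
# defines (name -> offset within its own code), and the names it references.
def link(objects):
    """Lay objects out back to back and resolve every referenced name to an address."""
    base = 0
    symbol_table = {}
    for obj in objects:
        for name, offset in obj["defines"].items():
            if name in symbol_table:
                raise ValueError(f"duplicate symbol: {name}")
            symbol_table[name] = base + offset  # final address of the symbol
        base += obj["size"]  # the next object starts where this one ends

    resolved = {}
    for obj in objects:
        for name in obj["references"]:
            if name not in symbol_table:
                raise ValueError(f"undefined reference to {name!r}")
            resolved[name] = symbol_table[name]
    return resolved


if __name__ == "__main__":
    main_o = {"size": 32, "defines": {"main": 0}, "references": ["helper"]}
    helper_o = {"size": 16, "defines": {"helper": 4}, "references": []}
    print(link([main_o, helper_o]))  # prints {'helper': 36}
```

The "undefined reference" error here mirrors the kind of message a linker reports when no object file defines a name that another one uses.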

Solution 2 - Assembly

Almost all compilers, including gcc, produce assembly code because it is easier, both for producing output and for debugging the compiler. The major exceptions are usually just-in-time compilers or interactive compilers, whose authors don't want the performance overhead or the hassle of forking a whole process to run the assembler. Some interesting examples include

  • Standard ML of New Jersey, which runs interactively and compiles every expression on the fly.

  • The tinycc compiler, which is designed to be fast enough to compile, load, and run a C script in well under 100 milliseconds, and therefore doesn't want the overhead of calling the assembler and linker.

What these cases have in common is a desire for "instantaneous" response. Assemblers and linkers are plenty fast, but not quite good enough for interactive response. Yet.

There is also a large family of languages, such as Smalltalk, Java, and Lua, which compile to bytecode, not assembly code, but whose implementations may later translate that bytecode directly to machine code without benefit of an assembler.
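As a small illustration of compiling to bytecode rather than assembly, the sketch below uses CPython, which is not one of the languages named above and is used here only as a convenient stand-in. The built-in compile() and the dis module are part of the standard library; the sample function add is made up for the example. The exact bytecode differs between Python versions, so no output is shown.

```python
import dis

source = "def add(a, b):\n    return a + b\n"

# compile() turns source text into a code object holding bytecode, not machine code.
module_code = compile(source, "<example>", "exec")
namespace = {}
exec(module_code, namespace)
add = namespace["add"]

# co_code is the raw bytecode of the function body, as a bytes object.
raw = add.__code__.co_code

if __name__ == "__main__":
    print(type(raw).__name__, len(raw))
    dis.dis(add)  # human-readable listing of the bytecode instructions
```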

(Footnote: in the early 1990s, Mary Fernandez and I wrote the New Jersey Machine Code Toolkit, for which the code is online, which generates C libraries that compiler writers can use to bypass the standard assembler and linker. Mary used it to roughly double the speed of her optimizing linker when generating a.out. If you don't write to disk, speedups are even greater...)

Solution 3 - Assembly

According to chapter 2 of Introduction to Reverse Engineering Software (by Mike Perry and Nasko Oskov), both gcc and cl.exe (the back end compiler for MSVC++) can output the assembly that each compiler produces: gcc with the -S switch, and cl.exe with the /FA switch.

You can also run gcc in verbose mode (gcc -v) to get a list of commands that it executes to see what it's doing behind the scenes.

Solution 4 - Assembly

GCC compiles to assembler. Some other compilers don't. For example, LLVM-GCC compiles to LLVM assembly or LLVM bytecode, which is then compiled to machine code. Almost all compilers have some sort of internal representation: LLVM-GCC uses LLVM, and, IIRC, GCC uses something called GIMPLE.

Solution 5 - Assembly

Compilers, in general, parse the source code into an Abstract Syntax Tree (an AST), then into some intermediate language. Only then, usually after some optimizations, do they emit the target language.
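The parse, optimize, emit sequence can be sketched in a few lines. This is only an illustration under stated assumptions: Python's own ast module stands in for the front end, constant folding stands in for "some optimizations", and the target is a made-up stack machine whose instruction names (PUSH, LOAD, ADD, SUB, MUL) are hypothetical.

```python
import ast
import operator

FOLDABLE = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}
OPNAMES = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL"}


def fold(node):
    """Optimization pass: replace constant subexpressions with their value."""
    if isinstance(node, ast.BinOp):
        left, right = fold(node.left), fold(node.right)
        op = type(node.op)
        if (op in FOLDABLE and isinstance(left, ast.Constant)
                and isinstance(right, ast.Constant)):
            return ast.Constant(FOLDABLE[op](left.value, right.value))
        return ast.BinOp(left, node.op, right)
    return node


def emit(node, out):
    """Code generation: walk the tree and emit stack-machine instructions."""
    if isinstance(node, ast.Constant):
        out.append(f"PUSH {node.value}")
    elif isinstance(node, ast.Name):
        out.append(f"LOAD {node.id}")
    elif isinstance(node, ast.BinOp) and type(node.op) in OPNAMES:
        emit(node.left, out)
        emit(node.right, out)
        out.append(OPNAMES[type(node.op)])
    else:
        raise ValueError("unsupported expression")
    return out


def compile_expr(text):
    tree = ast.parse(text, mode="eval").body  # front end: source -> AST
    return emit(fold(tree), [])               # optimize, then emit target code


if __name__ == "__main__":
    print(compile_expr("x + 2 * 3"))  # prints ['LOAD x', 'PUSH 6', 'ADD']
```

Note how 2 * 3 never reaches the output: the optimization pass reduced it to 6 before any target code was emitted.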

As for gcc, it can compile to a wide variety of targets. I don't know if for x86 it compiles to assembly first, but I did give you some insight into compilers, and you asked for that too.

Solution 6 - Assembly

None of the answers clarifies the fact that an ASSEMBLER is the first layer of abstraction between BINARY CODE and MACHINE DEPENDENT SYMBOLIC CODE. A compiler is the second layer of abstraction between MACHINE DEPENDENT SYMBOLIC CODE and MACHINE INDEPENDENT SYMBOLIC CODE.

If a compiler directly converts code to binary code, by definition, it will be called assembler and not a compiler.

It is more appropriate to say that a compiler uses INTERMEDIATE CODE which may or may not be assembly language e.g. Java uses byte code as intermediate code and byte code is assembler for java virtual machine (JVM).

EDIT: You may wonder why an assembler always produces machine-dependent code and why a compiler is capable of producing machine-independent code. The answer is very simple. An assembler is a direct mapping to machine code, and therefore the assembly language it accepts is always machine dependent. By contrast, we can write more than one version of a compiler for different machines. So to run our code independently of the machine, we must compile the same code with the compiler version written for that machine.

Solution 7 - Assembly

Some of the above answers confused me because in some answers GCC (GNU Compiler Collection) is mentioned as a single tool, but it's a suite of tools such as the GNU Assembler (also known as GAS), linker, compiler and debugger, which are used together to produce an executable. And yes, GCC doesn't directly convert the C source file to machine code.

It does that in 4 steps:

  1. Pre-processing - Removing comments, expanding C macros, etc.
  2. Compilation - Source to assembly (done by the compiler)
  3. Assembling - Assembly to machine code (done by the assembler)
  4. Linking - By default, linking standard functions dynamically to shared libraries (done by the linker)
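The four steps above can be driven one at a time with gcc's -E, -S and -c options. The sketch below only builds the command lines; the helper names and the file stem hello are assumptions for the example, and running the commands requires gcc and a hello.c file to be present.

```python
import subprocess


def gcc_stage_commands(stem):
    """Return one gcc invocation per stage for the hypothetical file <stem>.c."""
    return [
        ["gcc", "-E", f"{stem}.c", "-o", f"{stem}.i"],  # 1. preprocess only
        ["gcc", "-S", f"{stem}.i", "-o", f"{stem}.s"],  # 2. compile to assembly
        ["gcc", "-c", f"{stem}.s", "-o", f"{stem}.o"],  # 3. assemble to object code
        ["gcc", f"{stem}.o", "-o", stem],               # 4. link into an executable
    ]


def run_stages(stem):
    """Run each stage in order; needs gcc installed and <stem>.c on disk."""
    for command in gcc_stage_commands(stem):
        print("$", " ".join(command))
        subprocess.run(command, check=True)


if __name__ == "__main__":
    for command in gcc_stage_commands("hello"):
        print(" ".join(command))
```

After stage 2, the .s file is the human-readable assembly that the question asks about, and it can be opened in any text editor.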

Solution 8 - Assembly

You'd probably be interested in listening to this podcast: Internals of GCC

Solution 9 - Assembly

In most multi-pass compilers, assembly language is generated during the code generation steps. This allows you to write the lexer, syntax and semantic phases once and then generate executable code using a single assembler back end. This is used a lot in cross compilers, such as C compilers that generate code for a range of different CPUs.

Just about every compiler has some form of this, whether it's an implicit or explicit step.

Solution 10 - Assembly

There are many phases of compilation. In abstract, there is the front end that reads the source code, breaks it up into tokens and finally into a parse tree.

The back end is responsible for first generating sequential code such as three-address code, e.g.:

code:

x = y + z + w

into:

reg1 = y + z
x = reg1 + w

Then it optimizes that code, translates it into assembly, and finally into machine language. All steps are layered carefully so that, when needed, one of them can be replaced.
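The lowering of x = y + z + w shown above can be reproduced with a short sketch. It borrows Python's ast module as a stand-in parser, which is an assumption of this example and not how a C compiler's front end works; the function name to_three_address is hypothetical, and the reg1, reg2 naming follows the example above.

```python
import ast

OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def to_three_address(statement):
    """Lower `target = expression` into a list of three-address instructions."""
    assign = ast.parse(statement).body[0]
    if not (isinstance(assign, ast.Assign)
            and isinstance(assign.targets[0], ast.Name)):
        raise ValueError("expected a simple assignment")
    code = []
    counter = 0

    def binop_parts(node):
        """Reduce both operands of a binary operation to simple names."""
        if not (isinstance(node, ast.BinOp) and type(node.op) in OPS):
            raise ValueError("unsupported expression")
        return operand(node.left), OPS[type(node.op)], operand(node.right)

    def operand(node):
        """Return a name for node, emitting a temporary if it is compound."""
        nonlocal counter
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return str(node.value)
        left, op, right = binop_parts(node)
        counter += 1
        temp = f"reg{counter}"
        code.append(f"{temp} = {left} {op} {right}")
        return temp

    target = assign.targets[0].id
    if isinstance(assign.value, ast.BinOp):
        left, op, right = binop_parts(assign.value)
        code.append(f"{target} = {left} {op} {right}")
    else:
        code.append(f"{target} = {operand(assign.value)}")
    return code


if __name__ == "__main__":
    for line in to_three_address("x = y + z + w"):
        print(line)
    # prints:
    # reg1 = y + z
    # x = reg1 + w
```

Each emitted instruction has at most one operator and two operands, which is what makes this form convenient for the later optimization and assembly-generation steps.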

Solution 11 - Assembly

Visual C++ has a switch to output assembly code, so I think it generates assembly code before outputting machine code.

Solution 12 - Assembly

Java compilers compile to Java bytecode (a binary format), which is then run by a virtual machine (the JVM).

Whilst this may seem slow, it can be faster because the JVM can take advantage of later CPU instructions and new optimizations. A C++ compiler won't do this: you have to target the instruction set at compile time.

Solution 13 - Assembly

Although not all compilers convert the source code into an intermediate-level code, several compilers do use such a bridge to take the source code to machine-level code.

Solution 14 - Assembly

A listing file is a compiler-generated text file that contains the assembly language code produced by the compiler. Most compilers support the generation of listing files during the compilation process. For some compilers, such as GCC, this is a standard part of the compilation process because the compiler doesn't directly generate an object file, but instead generates an assembly language file which is then processed by an assembler. In such compilers, requesting a listing file simply means that the compiler must not delete it after the assembler is done with it. In other compilers (such as the Microsoft or Intel compilers), a listing file is an optional feature that must be enabled through the command line.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Mr Mister | View Question on Stackoverflow
Solution 1 - Assembly | anon | View Answer on Stackoverflow
Solution 2 - Assembly | Norman Ramsey | View Answer on Stackoverflow
Solution 3 - Assembly | Bill the Lizard | View Answer on Stackoverflow
Solution 4 - Assembly | Zifre | View Answer on Stackoverflow
Solution 5 - Assembly | Asaf R | View Answer on Stackoverflow
Solution 6 - Assembly | Bubba Yakoza | View Answer on Stackoverflow
Solution 7 - Assembly | Gaurav Purswani | View Answer on Stackoverflow
Solution 8 - Assembly | Paul Hollingsworth | View Answer on Stackoverflow
Solution 9 - Assembly | MikeJ | View Answer on Stackoverflow
Solution 10 - Assembly | jack | View Answer on Stackoverflow
Solution 11 - Assembly | friol | View Answer on Stackoverflow
Solution 12 - Assembly | Fortyrunner | View Answer on Stackoverflow
Solution 13 - Assembly | Shahid pakistan | View Answer on Stackoverflow
Solution 14 - Assembly | Shivam suhane | View Answer on Stackoverflow