Object files vs Library files and why?

C++

C++ Problem Overview


I understand the basics of compilation: source files are compiled to object files, which the linker then links into executables. These object files contain the compiled definitions from the source files.

So my questions are:


  • Why do we have a separate file format for libraries? .a, .lib, .dll...
  • I am probably mistaken, but it seems to me like .o files themselves are kind of the same thing as libraries?
  • Couldn't someone give you their .o implementations of a certain declaration (.h), and you could swap those in and have them linked to produce an executable that performs the same functions, but using different operations?

C++ Solutions


Solution 1 - C++

Historically, an object file gets linked either completely or not at all into an executable (nowadays there are exceptions, as function-level linking and whole-program optimization are becoming more popular), so if one function of an object file is used, the executable receives all of them.

To keep executables small and free of dead code, the standard library is split into many small object files (typically on the order of hundreds). Having hundreds of small files is very undesirable for efficiency reasons: opening many files is inefficient, and every file has some slack (unused disk space at the end of the file). This is why object files get grouped into libraries, which is kind of like a ZIP file with no compression. At link time, the whole library is read, and every object file from that library that resolves symbols that were still unresolved when the linker started reading the library (and, recursively, any object files those need) is included in the output. This likely means that the whole library has to be in memory at once to resolve dependencies recursively. As the amount of memory was quite limited, the linker only loads one library at a time, so a library mentioned later on the linker command line cannot use functions from a library mentioned earlier on the command line.

To improve performance (loading a whole library takes some time, especially from slow media like floppy disks), libraries often contain an index that tells the linker which object files provide which symbols. Indexes are created by tools like ranlib or by the library management tool itself (Borland's tlib has a switch to generate the index). As soon as there is an index, libraries are definitely more efficient to link than single object files, even if all object files are in the disk cache and loading files from the disk cache is free.

You are completely right that you can replace .o or .a files while keeping the header files, and change what the functions do (or how they do it). This is used by the LGPL license, which requires the author of a program that uses an LGPL-licensed library to give the user the possibility of replacing that library with a patched, improved or alternative implementation. Shipping the object files of your own application (possibly grouped into library files) is enough to give the user the required freedom; there is no need to ship the source code (as with the GPL).
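
As a minimal sketch of that idea (the file and function names are made up for this example), the same header can be backed by two interchangeable object files:

// transform.h - the fixed interface
int transform(int x);

// transform_fast.cpp - one implementation, compiled to transform_fast.o
#include "transform.h"
int transform(int x) { return x << 1; }   // multiply by two using a shift

// transform_plain.cpp - a drop-in replacement with the same interface
#include "transform.h"
int transform(int x) { return x + x; }    // same result, different operations

Linking the rest of the program against either object file (or a library containing it) yields an executable with the same observable behaviour; only one of the two may be linked in, of course, to respect the one definition rule.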

If two sets of libraries (or object files) can be used successfully with the same header files, they are said to be ABI compatible, where ABI means Application Binary Interface. This is narrower than merely having two sets of libraries (or object files), each accompanied by its own headers, with the guarantee that you can use each library as long as you use the headers for that specific library. That would be called API compatibility, where API means Application Programming Interface. As an example of the difference, look at the following three header files:

File 1:

typedef struct {
    int a;
    int __undocumented_member;
    int b;
} magic_data;
magic_data* calculate(int);

File 2:

struct __tag_magic_data {
    int a;
    int __padding;
    int b;
};
typedef __tag_magic_data magic_data;
magic_data* calculate(const int);

File 3:

typedef struct {
    int a;
    int b;
    int c;
} magic_data;
magic_data* do_calculate(int, void*);
#define calculate(x) do_calculate(x, 0)

The first two files are not identical, but they provide exchangeable definitions that (as far as I can tell) do not violate the "one definition rule", so a library providing File 1 as its header can also be used with File 2 as the header. On the other hand, File 3 provides a very similar interface to the programmer (which might be identical in everything the library author promises to users of the library), but code compiled with File 3 fails to link against a library designed to be used with File 1 or File 2, as a library designed for File 3 would not export calculate, but only do_calculate. Also, the structure has a different member layout, so using File 1 or File 2 instead of File 3 will not access b correctly. The libraries providing File 1 and File 2 are ABI compatible, but all three libraries are API compatible (assuming that c and the more capable function do_calculate do not count towards that API).
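
To make the difference concrete, here is a hypothetical caller (not from the original answer; assume the header above is saved as magic.h):

// caller.cpp - compiled against the header from File 1 or File 2
#include "magic.h"
int read_b() {
    magic_data* d = calculate(42);
    return d->b;      // with the File 1/2 layout, b is the third int in the struct
}

An object file built from this source links fine against a library built for File 1 or File 2: both export the symbol calculate and place b at the same offset. Against a library built for File 3 it fails to link (that library exports only do_calculate), and even if the symbol problem were somehow worked around, d->b would read the storage of c instead of b because the member layout differs.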

For dynamic libraries (.dll, .so), things are completely different: they started appearing on systems where multiple (application) programs can be loaded at the same time (which is not the case on DOS, but is the case on Windows). It is wasteful to have the same implementation of a library function in memory multiple times, so it is loaded only once and multiple applications use it. For dynamic libraries, the code of the referenced function is not included in the executable file; only a reference to the function inside the dynamic library is included (for Windows NE/PE executables, it is specified which DLL has to provide which function; for Unix .so files, only the function names and a set of libraries are specified). The operating system contains a loader, also known as the dynamic linker, that resolves these references and loads dynamic libraries if they are not already in memory when a program is started.

Solution 2 - C++

OK, let's start at the beginning.

A programmer (you) creates some source files, .cpp and .h. The difference between those two kinds of files is just a convention:

  • .cpp are meant to be compiled
  • .h are meant to be included in other source files

but nothing (except the fear of creating something unmaintainable) forbids you from including .cpp files in other .cpp files.

In the early days of C (the ancestor of C++), .h files only contained declarations of functions, structures (without methods in C!) and constants. You could also have macros (#define), but apart from that, no code should be in a .h file.

In C++ you must also put the implementation of template classes in the .h file, because C++ uses templates and not generics like Java: each instantiation of a template is a different class, so the compiler needs to see the full definition wherever the template is used.
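
A minimal illustration (not from the original answer): the template's body has to be visible in every translation unit that instantiates it, so it normally lives in the header.

// box.h - hypothetical example: the definition stays in the header
#ifndef BOX_H
#define BOX_H

template <typename T>
class Box {
public:
    explicit Box(T value) : value_(value) {}
    T get() const { return value_; }   // body visible to every file that includes box.h
private:
    T value_;
};

#endif

Each use such as Box<int> or Box<std::string> makes the compiler generate a separate class, which it can only do if it sees the full definition at the point of instantiation.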

Now for the answer to your question:

Each .cpp file is a compilation unit. The compiler will:

  • in the preprocessing phase, process all #include and #define directives to (internally) generate the full source code
  • compile it to object format (generally .o or .obj), as sketched below
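
A small sketch of those two steps (the file names are made up for this example):

// version.h - only declarations and a macro
#define PROGRAM_NAME "demo"
int build_number();

// version.cpp - one compilation unit
#include "version.h"                  // the preprocessor pastes version.h here
int build_number() { return 42; }     // PROGRAM_NAME would be expanded wherever it is used

Compiling version.cpp on its own (for example with g++ -c version.cpp) runs both steps and produces version.o; no executable is created yet.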

This object file contains:

  • relocatable code (that is, addresses in the code or of variables are relative to exported symbols)
  • exported symbols: the symbols that can be used from other compilation units (functions, classes, global variables)
  • imported symbols: the symbols used in this compilation unit but defined in other compilation units

Then (let's forget libraries for now) the linker will take all the compilation units together and resolve the symbols to create an executable file.
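
A tiny two-file sketch (hypothetical names) of how exported and imported symbols meet at link time:

// greet.cpp - compiled to greet.o, which exports the symbol for greet()
#include <iostream>
void greet() { std::cout << "hello\n"; }

// main.cpp - compiled to main.o, which imports greet()
void greet();                   // declaration, typically pulled in from a header
int main() { greet(); }         // the call is an unresolved reference in main.o

The linker combines greet.o and main.o, matches main.o's unresolved reference to greet() with the definition exported by greet.o, and produces the executable.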

One step further with static libraries.

A static library (generally .a or .lib) is more or less a bunch of object files put together. It exists to avoid having to individually list every object file that you need (those whose exported symbols you use). Linking a library that contains the object files you use and linking those object files themselves is exactly the same. Simply adding -lc, -lm or -lX11 is shorter than adding hundreds of .o files. But at least on Unix-like systems, a static library is an archive and you can extract the individual object files if you want to.

Dynamic libraries are completely different. A dynamic library should be seen as a special executable file. They are generally built with the same linker that creates normal executables (but with different options). But instead of simply declaring an entry point (on Windows a .dll file does declare an entry point that can be used for initializing the .dll), they declare a list of exported (and imported) symbols. At runtime, there are system calls that allow you to get the addresses of those symbols and use them almost normally. But in fact, when you call a routine in a dynamically loaded library, the code resides outside of what the loader initially loads from your own executable file. Generally, resolving all the used symbols of a dynamic library is done either at load time directly by the loader (on Unix-like systems) or through import libraries on Windows.
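
On Unix-like systems, those runtime calls are dlopen/dlsym. Here is a minimal sketch, assuming a library libplugin.so that exports a C function named plugin_entry (both names are made up for the example):

#include <dlfcn.h>     // POSIX dynamic loading API
#include <iostream>

int main() {
    // load the dynamic library at runtime
    void* handle = dlopen("./libplugin.so", RTLD_LAZY);
    if (!handle) { std::cerr << dlerror() << '\n'; return 1; }

    // look up the address of an exported symbol by name
    using entry_fn = int (*)(int);
    entry_fn entry = reinterpret_cast<entry_fn>(dlsym(handle, "plugin_entry"));
    if (!entry) { std::cerr << dlerror() << '\n'; dlclose(handle); return 1; }

    std::cout << entry(21) << '\n';   // call through the resolved address
    dlclose(handle);                  // drop the reference when done
}

The more common case, though, is the implicit one described above: the program is linked against the dynamic library (or its import library on Windows) and the loader resolves the symbols before main() runs; explicit dlopen is only needed for plugin-style loading. (On older glibc you also have to link with -ldl.)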

And now a look back at the include files. Neither good old K&R C nor the most recent C++ has a notion of a module that you import, as in, for example, Java or C#. In those languages, when you import a module, you get both the declarations of its exported symbols and an indication that it will later be linked in. But in C++ (and the same in C) you have to do this in two separate steps:

  • first, declare the functions or classes - this is done by including a .h file in your source, so that the compiler knows what they are
  • next, link the object module, static library or dynamic library to actually get access to the code

Solution 3 - C++

Object files contain definitions of functions, static variables used by those functions, and other information output by the compiler. This is in a form that can be connected up by the linker (linking the points where functions are called with the entry points of those functions, for example).

Library files are typically packaged to contain one or more object files (and therefore all the information in them). This offers the advantage that it is easier to distribute a single library than a bunch of object files (e.g. when handing compiled objects to another developer to use in their programs), and it also makes linking simpler (the linker needs to be directed to access fewer files, which makes it easier to write scripts that do the linking). Also, there are typically small performance benefits for the linker - opening one large library file and interpreting its contents is more efficient than opening and interpreting the contents of lots of small object files, particularly if the linker needs to do multiple passes over them. There is also a small advantage that, depending on how hard drives are formatted and managed, a few large files consume less disk space than a lot of smaller ones.

It is often worth packaging object files into libraries because that is an operation that can be done once, and the benefits are realised numerous times (every time the library is used by the linker to produce the executable).

Since humans comprehend source code better - and therefore have more chance of getting it working right - when it is in small chunks, most large projects consist of a significant number of (relatively) small source files that get compiled to objects. Assembling the object files into libraries - in one step - gives all the benefits I mentioned above, while allowing humans to manage their source code in a way that makes sense to humans rather than to linkers.

That said, it is a developer choice to use libraries. The linker doesn't care, and it can take more effort to set up a library and use it than to link together lots of object files. So there is nothing stopping the developer employing a mix of object files and libraries (except for the obvious need to avoid duplication of functions and other things in multiple objects or libraries, which causes the link process to fail). It is, after all, the job of a developer to work out a strategy for managing the building and distribution of their software.

There are actually (at least) two types of libraries.

Statically linked libraries are used by the linker to build an executable, and compiled code from them is copied by the linker into the executable. Examples are .lib files under Windows and .a files under Unix. The libraries themselves (typically) do not need to be distributed separately with the program executable, because the needed parts are already IN the executable.

Dynamically linked libraries are loaded into the program at run time. Two advantages are that the executable file is smaller (because it doesn't contain the content of the object files or static libraries) and that multiple executables can share each dynamically linked library (i.e. it is only necessary to distribute/install the libraries once, and all executables which use those libraries will work). Offsetting this is that installation of programs becomes more complicated (the executables will not run if the dynamically linked libraries cannot be found, so installation processes must cope with the potential need to install the libraries at least once).

Another advantage is that dynamic libraries can be updated without having to change the executable - for example, to fix a flaw in one of the functions contained in the library, and thereby fix the behaviour of all programs which use that library without changing the executables. Offsetting this is that a program which relies on a recent version of a library may malfunction if only an older version of the library is found when it runs. This gives rise to maintenance concerns with libraries (known by various names, such as DLL hell), particularly when programs rely on multiple dynamically linked libraries. Examples of dynamically linked libraries include DLLs under Windows and .so files under Unix. Facilities provided by operating systems are often installed - with the operating system - in the form of dynamically linked libraries, which allows all programs (when correctly built) to use the operating system services.

Programs can be developed to use a mix of static and dynamic libraries as well - again at the discretion of the developer. A static library might also be linked into the program, and take care of all the book-keeping associated with using a dynamically loaded library.

Solution 4 - C++

What you describe is how static linking works.

> Why do we have a separate implementation for a library? .a .lib, .dll...

.dll files are dynamically linked - the linking happens when you run the program. Depending on how you use the library, the function addresses are resolved just after the program starts, or as late as possible.

.so files are the same idea, but on Linux.

.a files, traditionally used on Linux (and also with MinGW), are library archives, which behave basically like enhanced object files:

  • they are linked statically.
  • you can pack multiple object files inside single library archive.
  • the symbol names are indexed.

.lib files are used by the Microsoft linker in Visual Studio.

> Couldn't someone give you their .o implementations of a certain declaration (.h) and you could replace that in and have it linked to become an executable that performs the same functions, but using different operations?

Yes! With dynamic libraries, you can go even further: you can replace the library without recompiling, sometimes even without restarting the program.

A practical example is Wine - it provides an open-source and portable implementation of the WinAPI.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Francisco Aguilera | View Question on Stackoverflow
Solution 1 - C++ | Michael Karcher | View Answer on Stackoverflow
Solution 2 - C++ | Serge Ballesta | View Answer on Stackoverflow
Solution 3 - C++ | Peter | View Answer on Stackoverflow
Solution 4 - C++ | milleniumbug | View Answer on Stackoverflow