Libraries are collections of precompiled functions that have been written to be reusable. Typically, they consist of sets of related functions to perform a common task.
Static Libraries - The simplest library is a static library. A static library is a type of archive. An archive is a single file holding a collection of other files in a format that makes it possible to retrieve the original individual files (called members of the archive). A static library is an archive whose members are object files.
Disadvantages of Static Libraries - One disadvantage of static libraries is that many programs may use objects from the same library and, as a result, there may be many copies of the same objects. Multiple copies of the same objects consume a large amount of valuable storage resources (RAM, ROM, disk space, etc.). Also, when a static library is updated, all programs that use the library must be recompiled in order to take advantage of the updated logic. Shared libraries can overcome both of these disadvantages.
There is a third, non-technical problem with the use of static libraries: The licensing for certain software packages (including GPL-licensed software) may compromise the legal status of intellectual property if software containing that property is linked directly with such software packages.
Shared Libraries - A shared library consumes exactly one instance of storage resource (one image on disk or flash, no more than one read-only image in RAM). The shared library's read-only segment (in particular, its .text section) can be shared among all processes, while its writable sections (such as .data and .bss) are allocated separately for each executing process. This writable segment is also referred to as the object's Static Data Segment. It is this static data segment that creates most of the complexity for the implementation of shared libraries.
The MMU or Memory Management Unit is a hardware solution that can simplify the implementation of the shared library. The MMU sits, logically, between the CPU and the physical memory. It "translates" virtual memory addresses used by the CPU into physical memory addresses. This simplifies addressing in a complex system because the software executing on the CPU can "believe" that a memory resource resides at some location, X (the virtual address), and the MMU can back this up with real memory at some other location, Y (the physical address).
Process-Specific Mappings - Linux, using the MMU, can make these memory mappings unique for each process. This is done by simply storing the correct MMU settings for each process and restoring these MMU settings each time that the process executes. In this way, the same virtual address X can be mapped to a different physical address Y in each process.
The Shared Library implementation is simplified by the MMU. Using an MMU, each process can always "assume" that it "knows" the address of a shared library (using a virtual address) and the MMU can be programmed to map to the correct physical address that holds the shared library code. Each shared library can assume a virtual address for the location of its writable static data segment and the MMU can be programmed to map to a physical address that is only used by that process.
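The per-process mapping described above can be illustrated with a toy model. This is only a sketch: the 4 KiB page size, the 16-page address space, and the names `process_map` and `translate` are assumptions made for illustration and do not describe any real MMU or the Linux page tables.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                        /* assumed 4 KiB pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)
#define NUM_PAGES  16u                        /* tiny toy address space */

/* Hypothetical per-process "MMU settings": one physical frame per virtual page. */
struct process_map {
    uint32_t frame[NUM_PAGES];
};

/* Translate a virtual address using the mappings of one process. */
uint32_t translate(const struct process_map *map, uint32_t vaddr)
{
    uint32_t page   = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & PAGE_MASK;

    assert(page < NUM_PAGES);
    return (map->frame[page] << PAGE_SHIFT) | offset;
}
```

In this model, two processes that store the same frame number for a library's text page share one physical copy of the code, while different frame numbers for the data page give each process its own static data segment at the same virtual address.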
Position Independence - For a variety of reasons, the implementation of shared libraries is more complicated than this simple story. For one thing, processes and shared libraries cannot assume fixed virtual addresses for resources in this way. It would not be possible to develop and support a large, complex computing environment if every process resource had to reside at a fixed virtual address. Rather, resources must be able to reside at any location within the process's virtual address space; they must support position independence.
Position Independence and ARM - For the ARM processor, position independence is supported for the read-only .text segment and for the writable static data segment using two different mechanisms:
Dynamic Linking - Another capability that modern systems provide is dynamic linking. With dynamic linking, the linker produces objects that are only partially linked; the dynamic linker executes when the process is loaded and it completes the linking at run-time. Some dynamic linking is required to bind processes and shared libraries at run-time, but the full dynamic linking capability goes beyond this so that any symbolic reference can be replaced at load time.
The GOT - To manage these capabilities, the GNU toolchain generates a global offset table (GOT). The GOT holds the resolved virtual addresses of symbols. Each GOT entry is accessed via a fixed offset into the table; the address at each entry is filled in by the dynamic linker as relocations are completed at load time. Thus, if the software is linked so that a certain, fixed offset into the GOT is known to correspond to the entry for a certain symbol, it can obtain the final virtual address of the symbol at run-time by dereferencing the GOT at that offset.
The GOT resides at the beginning of the writable static data segment. As a consequence, it can be addressed easily using the static base register.
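The fixed-offset lookup can be sketched in plain C. This is a toy model, not what the GNU toolchain emits: the slot names, the `got` array, and `toy_load_time_relocate` are hypothetical, and a real GOT is built by the linker and filled in by the dynamic linker rather than by program code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical GOT slot numbers, fixed when the code is linked. */
enum { GOT_SLOT_COUNTER = 0, GOT_SLOT_LIMIT = 1, GOT_NUM_SLOTS };

/* The table itself: one resolved address per symbol. */
static void *got[GOT_NUM_SLOTS];

/* Stand-ins for data symbols exported by some other module. */
static int counter = 5;
static int limit = 10;

/* What a dynamic linker would do at load time: fill in each slot. */
void toy_load_time_relocate(void)
{
    got[GOT_SLOT_COUNTER] = &counter;
    got[GOT_SLOT_LIMIT] = &limit;
}

/* Code that only knows slot numbers, never absolute addresses. */
int counter_below_limit(void)
{
    int *c = got[GOT_SLOT_COUNTER];
    int *l = got[GOT_SLOT_LIMIT];

    assert(c != NULL && l != NULL);
    return *c < *l;
}
```

The point of the sketch is that `counter_below_limit` stays valid wherever `counter` and `limit` end up in memory, because only the table contents change.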
Special GOT Entries - In addition to holding the virtual address of symbols at fixed offsets, the GOT provides three special entries in the first three words of the GOT. The first entry of the GOT points to a resolver function. This function can determine the virtual address of a function that was not known when the process was loaded. This is used to support late binding.
The PLT - It is clear how dereferencing offsets into the GOT can be used to bind to symbolic references to data. But how about function calls? For function calls, the Procedure Linkage Table (PLT) is used. The PLT provides intervening logic between the C function call and the dynamically linked target of a function call. In this sense, it is a "thunk" layer since it replaces the original function call, looks up the resolved function address in the GOT, and forwards the function call.
Special PLT Entries - The first PLT entry is special and works with the first GOT entry: It is the code that vectors unresolved function references to the resolver function.
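The interplay between a PLT entry, its GOT slot, and the resolver can be modeled with function pointers. This is a minimal sketch under stated assumptions: `plt_double`, `lazy_stub`, `toy_resolve`, and `lib_double` are hypothetical names, and real PLT entries are short machine-code sequences rather than C functions.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*func_t)(int);

/* The real target, living in the "shared library". */
int lib_double(int x) { return 2 * x; }

static int resolver_calls = 0;
int lazy_stub(int x);

/* One GOT slot for the function; it initially points at the lazy stub. */
static func_t got_slot = lazy_stub;

/* Stand-in for the resolver: look up the symbol and patch the GOT slot. */
func_t toy_resolve(void)
{
    resolver_calls++;
    got_slot = lib_double;
    return got_slot;
}

/* Counterpart of the special first PLT entry: vector to the resolver, then forward. */
int lazy_stub(int x)
{
    return toy_resolve()(x);
}

/* Counterpart of an ordinary PLT entry: forward the call through the GOT slot. */
int plt_double(int x)
{
    assert(got_slot != NULL);
    return got_slot(x);
}
```

In this model only the first call to `plt_double` reaches the resolver; later calls go straight through the patched slot, which is the late binding behavior described above.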
No MMU? Shared library functionality can also be provided without an MMU, but the implementation will be more complex and may have some limitations. For the case of the ARM processor, paragraph 5.6 of The ARM-THUMB Procedure Call Standard, ARM document number SWS ESPC 0002 A-05, provides a technical approach to shared libraries that does not depend on the existence of an MMU. The general characteristics of this approach are summarized below:
If a shared library is bound with the library index N, then the Nth entry in the Process Data Table will contain the actual static data segment for the shared library.
This special "thunk" layer is provided on the exporter side of the shared library interface: the look-up would have to be performed in every shared library at the beginning of every global function that could possibly be exported by a shared library.
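A rough sketch of the exporter-side look-up follows. The structure layouts, the index value, and the explicit `pdt` parameter are assumptions made for illustration; in the scheme described by the ARM document the table would be located through a register rather than passed as an argument.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LIBS 4

/* Hypothetical layout of one library's writable static data. */
struct lib_data {
    int call_count;
};

/* Per-process table; entry N refers to the data segment of library index N. */
struct process_data_table {
    void *segment[MAX_LIBS];
};

/* Library index assigned when this library was bound (illustrative value). */
enum { MY_LIB_INDEX = 2 };

/* Exported function: it must first find this process's copy of its static data. */
int lib_count_call(struct process_data_table *pdt)
{
    struct lib_data *sb = pdt->segment[MY_LIB_INDEX];

    assert(sb != NULL);
    return ++sb->call_count;
}
```

Every exported function pays for this look-up on every call, which is the cost that motivates moving the work to the importer side.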
A special, static linker examines the partially linked object: If undefined external references to imported functions exist in the partially linked object, then a "thunk" layer is generated on the importer side to manage the outbound call. This is analogous in concept to the traditional PLT.
The thinking here is that cross-module function calls are less frequent than within-module function calls. Therefore, infrequent importer-side logic should perform better than pervasive exporter-side logic.
Within the linker-generated, importer-side "thunk" logic is a data structure that contains not only the (physical) address of the imported function but also the SB register value for the object that contains the imported function. This data structure is initialized by the dynamic linker when the process is loaded.
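The data structure and the thunk that uses it might look roughly like the following. This is a sketch only: `import_record`, `call_import`, and the global `sb_register` variable are hypothetical stand-ins, and restoring the caller's SB value after the call is an assumption about how such a thunk would behave, not something stated above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical writable static data of one module. */
struct module_data {
    int value;
};

/* Toy stand-in for the static base (SB) register. */
static struct module_data *sb_register;

/* Import record filled in by the dynamic linker at load time. */
struct import_record {
    int (*function)(void);          /* address of the imported function */
    struct module_data *callee_sb;  /* SB value of the exporting module */
};

/* Exported function: relies on SB already pointing at its own data. */
int lib_get_value(void)
{
    return sb_register->value;
}

/* Importer-side thunk: switch SB, call, then restore the caller's SB. */
int call_import(const struct import_record *imp)
{
    struct module_data *caller_sb = sb_register;
    int result;

    assert(imp->function != NULL);
    sb_register = imp->callee_sb;
    result = imp->function();
    sb_register = caller_sb;
    return result;
}
```

Because the SB switch happens in the thunk, the exported function itself needs no extra prologue, and calls within a module bypass the thunk entirely.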
The primary optimizations involve the use of the extended FLT binary file format:
Filesystem Dependencies - In order to have true shared libraries, there must be a single instance of the .text section, regardless of how many instances of a module are instantiated. This shared library implementation simply uses the Linux function mmap() to map the file to memory, so its performance depends on the correct behavior of mmap(). If the system supports an XIP file system, ideally mmap() would return the address referencing the memory-mapped file. If mmap() copies to memory, good shared library performance can be achieved if mmap() assures that no more than one instance of the file is ever copied to memory. See uClinux, File Mapping, and Shared Libraries for further information.
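For reference, mapping a file read-only with mmap() looks roughly like this on a POSIX system. The helper name `map_text_readonly` is hypothetical, and the choice of `PROT_READ` with `MAP_SHARED` is an assumption for illustration; the flags actually used by the loader are not stated here.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. Returns NULL on failure; stores the size on success. */
const void *map_text_readonly(const char *path, size_t *size_out)
{
    struct stat st;
    void *addr;
    int fd;

    assert(path != NULL && size_out != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (addr == MAP_FAILED)
        return NULL;

    *size_out = (size_t)st.st_size;
    return addr;
}
```

Whether two callers mapping the same file end up sharing one physical copy is decided by the kernel and the file system, not by this code, which is exactly the dependency described above.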
Kernel Signal Issues - When a signal handler is installed (via signal(), sigaction(), etc.), a pointer to the signal handling function must be provided. The value of the PIC base register that is used at the time that the signal is delivered must be exactly the same as the value of the PIC base register at the time that the signal handler was installed. This requires a small change to the kernel signal handling logic: to retain the caller's PIC base register contents when the signal handler is installed, and to restore that PIC base register value when the signal is delivered.
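The save-and-restore idea can be modeled in user-space C. This is a toy illustration, not the kernel change itself: `toy_install`, `toy_deliver`, the `pic_base` variable, and the table of actions are all hypothetical, and restoring the interrupted context's value after the handler returns is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NSIG 32

typedef void (*toy_handler_t)(int);

/* Toy stand-in for the PIC base register of the running module. */
static void *pic_base;

/* What the kernel would remember per signal: handler plus installer's PIC base. */
static struct {
    toy_handler_t handler;
    void *saved_pic_base;
} toy_actions[TOY_NSIG];

/* Example handler: records which PIC base it observed. */
static void *seen_pic_base;
void toy_handler(int signo) { (void)signo; seen_pic_base = pic_base; }

void toy_install(int signo, toy_handler_t handler)
{
    assert(signo >= 0 && signo < TOY_NSIG);
    toy_actions[signo].handler = handler;
    toy_actions[signo].saved_pic_base = pic_base;  /* capture at install time */
}

void toy_deliver(int signo)
{
    void *interrupted = pic_base;

    assert(signo >= 0 && signo < TOY_NSIG);
    if (toy_actions[signo].handler == NULL)
        return;
    pic_base = toy_actions[signo].saved_pic_base;  /* restore for the handler */
    toy_actions[signo].handler(signo);
    pic_base = interrupted;                        /* resume interrupted code */
}
```

Without the saved value, a signal arriving while library code is running would invoke the handler with the library's PIC base, and the handler's accesses to its own static data would go to the wrong segment.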
GOT-less Implementations - The present implementation was targeted for the embedded ARM7 application. Historically, ARM7 developers use the compiler switch -mno-got to suppress generation of a GOT. GOT binding is not supported by the FLT format (nor by the xFLT format). For the purposes of shared library support, the important feature provided by the GOT is the ability to redirect data references: The GOT forces a layer of indirection into each variable access. In comparison, the ELF dynamic loader can relocate the GOT indirection layer at load time so that modules and global variables are properly imported to, and exported from, shared libraries.
Of course, it is this indirection that the embedded ARM7 developers wish to avoid: It adds to memory usage and degrades memory performance. The xFLAT approach to shared libraries follows this optimized usage and does not support dynamic, load-time binding of the GOT. As a consequence, global variables may not be shared between modules.
A workaround for this limitation is provided that is completely transparent to the user: Accessor functions replace the global variable references. For example, the global variable errno is exported from libc. This solution provides two things:
The end result is that the programmer still appears to have a global variable called errno and no special programming is required.
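The accessor-function technique can be sketched as follows. All names are hypothetical (`toy_errno`, `toy_errno_location`, `toy_failing_call`) and were chosen to avoid clashing with the real errno; the error code 34 is arbitrary. The sketch shows the general pattern, not the actual libc source.

```c
#include <assert.h>

/* --- library side --- */
static int lib_errno_storage;                      /* private to the library */
int *toy_errno_location(void) { return &lib_errno_storage; }

/* --- header seen by the program --- */
#define toy_errno (*toy_errno_location())

/* --- program side: reads and writes look like a plain variable --- */
int toy_failing_call(void)
{
    toy_errno = 34;   /* arbitrary error code for illustration */
    return -1;
}

void toy_demo(void)
{
    toy_errno = 0;
    assert(toy_failing_call() == -1);
    assert(toy_errno == 34);
}
```

Because the macro expands to a dereferenced function call, only a function crosses the module boundary, and functions can be imported and exported even though global variables cannot.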
Toolchain Module Issues - The operating context of the xFLT system differs from the traditional FLT operating context in one way: Traditional FLT processes consist of one contiguous memory allocation (or virtual address space) for the entire FLT process -- for both the read-only code segment and the writable, static data segment. In xFLT, by contrast, there are always two discontinuous address regions, one for the read-only code segment and one for the writable, static data segment.
As a consequence, the FLT-style compiler may attempt to resolve the address of text segment references by adding the static base register to the text segment offset! If your compiler behaves in this manner, a small patch is required. An example of a patch that makes the required compiler change is provided under xflat/gcc.
Debugger Module Issues - The GNU debugger includes logic to recognize when it has stepped into a shared library thunk routine. A thunk routine is a small section of code that lies between the calling code and the shared library code itself. The XFLAT shared library thunk routines will probably not be recognized by an unmodified GNU debugger. As a result, your debugger may have problems stepping into or over shared library calls without this modification.
Compressed Flat - xFLT does not support a counterpart to the zipped FLT format. In zipped FLT, the binary is retained in a compressed form in the filesystem and decompressed to RAM when the binary is loaded.
Zipped FLT makes sense because the whole flat binary is unzipped and copied to RAM in one big, contiguous memory allocation -- nothing gets shared. xFLT, on the other hand, uses mmap to map the file into two pieces: a mapped, shared text address space and an allocated, private data space; this is how it produces the shared text segment.
A simpler (and probably better) solution than zipped xFLT might be to use a compressed file system. The file system would then be responsible for unzipping and copying the file to RAM. The file system would also be responsible for assuring that mmap continues to behave correctly: that there is a single instance of the text segment. You would also get the benefit of having all files compressed.