Shared Libraries without an MMU

Shared Libraries
Static vs. Shared Libraries
Shared Libraries and MMUs
Shared Libraries, Position Independence, and the GOT
Shared Libraries without an MMU
The arm.org Approach
XFLAT - An Alternative Approach to Shared Libraries without an MMU
System Dependencies

Shared Libraries

Static vs. Shared Libraries

Libraries are collections of precompiled functions that have been written to be reusable. Typically, they consist of sets of related functions to perform a common task.

Static Libraries - The simplest library is a static library. A static library is a type of archive. An archive is a single file holding a collection of other files in a format that makes it possible to retrieve the original individual files (called members of the archive). A static library is an archive whose members are object files.

Disadvantages of Static Libraries - One disadvantage of static libraries is that many programs may use objects from the same library and, as a result, there may be many copies of the same objects. Multiple copies of the same objects consume a large amount of valuable storage resources (RAM, ROM, disk space, etc.). Also, when a static library is updated, all programs that use the library must be recompiled in order to take advantage of the updated logic. Shared libraries can overcome both of these disadvantages.

There is a third, non-technical problem with the use of static libraries: the licensing of certain software packages (including GPL-licensed software) may compromise the legal status of intellectual property if software containing that property is linked directly with such packages.

Shared Libraries - A shared library consumes exactly one instance of storage resource (one image on disk or flash, no more than one read-only image in RAM). The shared library's read-only segment (in particular, its .text section) can be shared among all processes; while its write-able sections (such as .data and .bss) can be allocated uniquely for each executing process. This write-able segment is also referred to as the object's Static Data Segment. It is this static data segment that creates most of the complexity for the implementation of shared libraries.


Shared Libraries and MMUs

The MMU or Memory Management Unit is a hardware solution that can simplify the implementation of the shared library. The MMU sits, logically, between the CPU and the physical memory. It "translates" virtual memory addresses used by the CPU into physical memory addresses. This simplifies addressing in a complex system because the software executing on the CPU can "believe" that a memory resource resides at some location, X (the virtual address), and the MMU can back this up with real memory at some other location, Y (the physical address).

Process-Specific Mappings - Linux, using the MMU, can make these memory mappings unique for each process. This is done by simply storing the correct MMU settings for each process and restoring these MMU settings each time that the process executes. In this way, the same virtual address X can be mapped to a different physical address Y in each process.

The Shared Library implementation is simplified by the MMU. Using an MMU, each process can always "assume" that it "knows" the address of a shared library (using a virtual address) and the MMU can be programmed to map to the correct physical address that holds the shared library code. Each shared library can assume a virtual address for the location of its writable static data segment and the MMU can be programmed to map to a physical address that is only used by that process.
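
The mapping described above can be illustrated with a toy model. This is only a sketch, not how any real MMU or kernel is programmed: the page size, the table layout, and all names (struct process, translate, setup) are assumptions made for illustration. Both processes use the same virtual pages, but the library data page resolves to a different physical frame in each.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of an MMU: one page table per process (all names hypothetical). */
#define PAGE_SIZE 4096u
#define NUM_PAGES 8u

struct process {
    uint32_t page_table[NUM_PAGES];   /* virtual page number -> physical frame */
};

/* Translate a virtual address using the page table of the running process. */
static uint32_t translate(const struct process *p, uint32_t vaddr)
{
    uint32_t vpn    = vaddr / PAGE_SIZE;
    uint32_t offset = vaddr % PAGE_SIZE;

    assert(vpn < NUM_PAGES);
    return p->page_table[vpn] * PAGE_SIZE + offset;
}

/* Two processes: shared library text, private library data. */
static void setup(struct process *a, struct process *b)
{
    for (uint32_t i = 0; i < NUM_PAGES; i++) {
        a->page_table[i] = 0;
        b->page_table[i] = 0;
    }
    a->page_table[2] = 10;  b->page_table[2] = 10;  /* library .text: same frame */
    a->page_table[3] = 20;  b->page_table[3] = 21;  /* library data: private frames */
}
```

Restoring "the MMU settings" on a context switch corresponds, in this model, to choosing which struct process is passed to translate().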


Shared Libraries, Position Independence, and the GOT

Position Independence - For a variety of reasons, the implementation of shared libraries is more complicated than this simple story. For one thing, processes and shared libraries cannot assume fixed virtual addresses for resources in this way. It would not be possible to develop and support a large, complex computing environment if every process resource had to reside at a fixed virtual address. Rather, resources must be able to reside at any location within the process's virtual address space; they must support position independence.

Position Independence and ARM - For the ARM processor, position independence is supported for the read-only .text segment and for the writable static data segment using two different mechanisms:

PC-Relative Addressing - Within the .text segment, every memory address can be determined relative to the current memory location. No matter where the .text is located within the virtual address space, this relationship is the same.
Static Base Register - When position independence is enabled, the ARM compiler also allocates a register, the Static Base Register or SB Register, to hold the base address of the allocated, write-able data segment. The compiler then generates data references that add a fixed offset to the base address held in the SB register, which makes the data references position independent.
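
The static base mechanism can be sketched in portable C. This is an illustration under stated assumptions, not compiler output: the global variable sb stands in for the SB register, and struct static_data, data_ref, and bump_counter are hypothetical names. The point is that the code only knows a fixed offset; the base it is added to is supplied at run time.

```c
#include <assert.h>
#include <stddef.h>

/* Layout of a module's writable static data, fixed at link time (hypothetical). */
struct static_data {
    int counter;
    int limit;
};

/* Stand-in for the SB register; on ARM this would be a dedicated CPU register. */
static unsigned char *sb;

/* A data reference is "SB + fixed offset", never an absolute address. */
static int *data_ref(size_t offset)
{
    assert(sb != NULL);
    return (int *)(void *)(sb + offset);
}

/* The same code works on whichever data segment SB currently points to. */
static int bump_counter(void)
{
    int *counter = data_ref(offsetof(struct static_data, counter));
    return ++*counter;
}
```

Pointing sb at a different struct static_data instance makes bump_counter() operate on that instance without any change to the code, which is the property that lets one copy of the text serve several data segments.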

Dynamic Linking - Another capability that modern systems provide is dynamic linking. With dynamic linking, the linker produces objects that are only partially linked; the dynamic linker executes when the process is loaded and it completes the linking at run-time. Some dynamic linking is required to bind processes and shared libraries at run-time, but the full dynamic linking capability goes beyond this so that any symbolic reference can be replaced at load time.

The GOT - To manage these capabilities, the GNU toolchain generates a global offset table (GOT). The GOT holds the resolved virtual addresses of symbols. Each GOT entry is accessed via a fixed offset into the table; the address in each entry is filled in by the dynamic linker as relocations are completed at load time. Thus, if the software is linked so that a certain, fixed offset into the GOT corresponds to the entry for a certain symbol, it can obtain the final virtual address of the symbol at runtime by dereferencing the GOT at that offset.

The GOT resides at the beginning of the writable static data segment. As a consequence, it can be addressed easily using the static base register.
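
A minimal model of this table, assuming two imported variables and using hypothetical names (got, dynamic_link, read_via_got), might look like the following. It is a sketch of the idea only; a real GOT is laid out by the linker and filled in by the dynamic linker from relocation records.

```c
#include <assert.h>
#include <stddef.h>

/* Link-time constants: each imported symbol has a fixed slot (hypothetical). */
enum { GOT_SLOT_COUNT = 0, GOT_SLOT_LIMIT = 1, GOT_NSLOTS = 2 };

/* The GOT lives in the writable static data segment; slots start unresolved. */
static void *got[GOT_NSLOTS];

/* Symbols whose final addresses are only known at load time. */
static int lib_count = 7;
static int lib_limit = 42;

/* Stand-in for the dynamic linker: complete the relocations by filling slots. */
static void dynamic_link(void)
{
    got[GOT_SLOT_COUNT] = &lib_count;
    got[GOT_SLOT_LIMIT] = &lib_limit;
}

/* Compiled code reaches a symbol by dereferencing its fixed GOT slot. */
static int read_via_got(int slot)
{
    assert(slot >= 0 && slot < GOT_NSLOTS && got[slot] != NULL);
    return *(int *)got[slot];
}
```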

Special GOT Entries - In addition to holding the virtual address of symbols at fixed offsets, the GOT reserves its first three words for special entries. One of these reserved entries points to a resolver function. This function can determine the virtual address of a function that was not known when the process was loaded. This is used to support late binding.

The PLT - It is clear how dereferencing offsets into the GOT can be used to bind symbolic references to data. But what about function calls? For function calls, the Procedure Linkage Table (PLT) is used. The PLT provides intervening logic between the C function call and the dynamically linked target of the call. In this sense, it is a "thunk" layer since it replaces the original function call, looks up the resolved function address in the GOT, and forwards the function call.

Special PLT Entries - The first PLT entry is special and works with the resolver entry in the GOT: it is the code that vectors unresolved function references to the resolver function.
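
Late binding through the PLT can be modelled with function pointers. The sketch below is an analogy in plain C, not real PLT code: got_square plays the role of one GOT slot, plt_square the role of a PLT stub, and plt0_resolve_square combines the special first PLT entry with the resolver. All of these names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*func_t)(int);

static int resolver_calls;                  /* how often late binding ran */

static int real_square(int x) { return x * x; }

static int plt0_resolve_square(int x);      /* forward declaration */

/* GOT slot for square(): initially points at the resolving stub. */
static func_t got_square = plt0_resolve_square;

/* Plays the role of the special first PLT entry plus the resolver function. */
static int plt0_resolve_square(int x)
{
    resolver_calls++;
    got_square = real_square;               /* patch the GOT slot */
    return got_square(x);                   /* forward the original call */
}

/* PLT stub: the code a caller actually branches to. */
static int plt_square(int x)
{
    assert(got_square != NULL);
    return got_square(x);
}
```

Only the first call through plt_square() reaches the resolver; later calls go straight to the resolved address stored in the slot.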


Shared Libraries without an MMU

The arm.org Approach

No MMU? Shared library functionality can also be provided without an MMU, but the implementation will be more complex and may have some limitations. For the case of the ARM processor, paragraph 5.6 of The ARM-THUMB Procedure Call Standard, ARM document number SWS ESPC 0002 A-05, provides a technical approach to shared libraries that does not depend on the existence of an MMU. The general characteristics of this approach are summarized below:

Single Static Data Segment - This proposal suggests that there be a single static data segment. This static data segment is used by all objects -- by the program as well as its shared libraries.
Special Static Data Segment Format - However, this static data segment is NOT the one that is described above. Rather, it would consist of (1) the static data segment of the program prefaced with (2) a special header. This special header would consist of four pointers into a process data table. They would point to entries 0, 32, 64, and 96 of the process data table respectively. The static linker would produce this special static data segment at link time; the dynamic linker would fill in the process data table information at load time.
Magic Library Indices - Each shared library would have a unique identity represented by a small integer (the library index) that would be bound into the library code at link time, presumably via a command line argument to the linker.

If a shared library is bound with the library index N, then the Nth entry in the Process Data Table will contain the actual static data segment for the shared library.
Per-Function GOT Lookup - Using the SB register, each function would perform a lookup in the common static data segment to find the reference to its private static data segment. This lookup would:
- Index to the block of 4 entries at the beginning of the single, common static data segment and fetch the address of the correct region of the process data table. This would be performed using the SB register (which itself would have to be fetched?) and the library index bound to the shared library by the linker.
- Index into that region of the process data table to fetch a reference to the private static data segment for the shared library.
This look-up amounts to a special "thunk" layer on the exporter side of the shared library interface: it would have to be performed in every shared library at the beginning of every global function that could possibly be exported by a shared library.
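
The two-step look-up above can be sketched as follows. This is an interpretation of the summary, not code from the ARM document: it assumes that the library index N selects header pointer N / 32 and entry N % 32 within that region, and every name (process_data_table, struct common_segment, lookup_library_data) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PDT_ENTRIES 128
#define PDT_REGION  32    /* header pointers land on entries 0, 32, 64, 96 */

/* Process data table: entry N holds the static data base of library N. */
static void *process_data_table[PDT_ENTRIES];

/* Common static data segment: 4-pointer header, then the program's data. */
struct common_segment {
    void **region[4];
    int    program_data[16];
};

/* Stand-in for the dynamic linker filling in the header at load time. */
static void init_common_segment(struct common_segment *seg)
{
    for (int i = 0; i < 4; i++)
        seg->region[i] = &process_data_table[i * PDT_REGION];
}

/* The per-function look-up; sb models the value of the SB register. */
static void *lookup_library_data(const struct common_segment *sb,
                                 unsigned lib_index)
{
    assert(lib_index < PDT_ENTRIES);
    void **region = sb->region[lib_index / PDT_REGION];  /* step 1 */
    return region[lib_index % PDT_REGION];               /* step 2 */
}
```

Under this reading, every exported function pays for two dependent loads before it can touch its own static data.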


XFLAT - An Alternative Approach to Shared Libraries without an MMU

An Alternative Approach - XFLAT takes a different approach, with the following characteristics:

Importer Side "Thunk." All of the special logic to handle cross-module function calls is placed on the client/importer side of the interface. This logic is generated only for function references that are imported by the object.

A special, static linker examines the partially linked object: If undefined external references to imported functions exist in the partially linked object, then a "thunk" layer is generated on the importer side to manage the outbound call. This is analogous in concept to the traditional PLT.

The thinking here is that cross-module function calls are less frequent than calls within a module. Therefore, infrequent, importer-side logic should perform better than pervasive exporter-side logic.
Separate Static Data Segments - Each object, whether program or shared library, has its own, separate, unrelated static data segment. This static data segment is created and initialized by the dynamic linker.
No Special Formatting of the static data segments is required.
Separate Static Base Registers - Each linked object has its own SB register value to address its own static data segment. It is the job of the importer-side "thunk" logic to modify the SB register when the cross-module call is made so that it will be set correctly for the imported function.

Within the linker-generated, importer-side "thunk" logic is a data structure that contains not only the (physical) address of the imported function but also the SB register value for the object that contains the imported function. This data structure is initialized by the dynamic linker when the process is loaded.
Symmetry - There is no difference between the way that the "thunk" is generated between objects: The same "thunk" is generated for functions imported by programs as for functions imported by shared libraries.
Optimizations for the Embedded Application - The approach adopted herein has been simplified and optimized for embedded use, both in the time needed to load and bind shared libraries and in the footprint of the shared library objects.

The primary optimizations involve the use of the extended FLT binary file format:
Extended FLT Binary Format - The FLT (flat) binary format has been extended to support shared libraries; this extended flat format is referred to as xFLT (xflat). Using the xflat binary format yields better performance, at the cost of some of the more obscure features of other dynamic loaders (like ELF): caching, loader environment variables, and "weak" symbol support. In addition, some other common dynamic linking features are not needed since xflat objects are fully bound at load time (except for binding of specific imported and exported functions). See also the discussion of GOT-less Implementations below.
Shared Text Segments - xflat uses mmap to map the file into a shared address space; this is how it produces the shared text segment. Even if shared libraries are not used, the text segments for programs are still shared: if multiple instances of a program are running, they run from the same text space. With an XIP flash file system, xflat binaries operate directly from the flash memory with no use of RAM for text. This is a potentially big RAM savings.
Minimal Toolchain Impact - The technical approach does not (normally) require modification to the GNU toolchain -- gcc, libc, the binutils, etc. Rather, outboard tools generate the importer-side "thunk." This greatly reduces the complexity and increases the portability of the solution. See Toolchain Module Issues below.
Function Pointers - The most significant downside of this technical approach is in the use of function pointers to provide callbacks from shared libraries: when the callback is made, the call will bypass the importer-side "thunk" logic and the callback will fail. In this implementation, certain macros and loader APIs are provided to handle such callbacks across module boundaries. However, these callbacks cannot be performed transparently: modifications must be made to both callee and caller source code to use callbacks through function pointers.
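
The importer-side thunk and its per-import data structure can be modelled in plain C. This is a sketch under stated assumptions rather than the code the XFLAT tools generate: the variable sb stands in for the SB register, and struct import_record, thunk_lib_next_id, and the other names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the SB register. */
static int *sb;

/* Static data segments of two separately linked modules. */
static int program_data[1] = { 1 };
static int library_data[1] = { 100 };

/* Exported library function: reaches its static data through SB. */
static int lib_next_id(void)
{
    assert(sb != NULL);
    return ++sb[0];
}

/* Per-import record, filled in by the dynamic linker at load time. */
struct import_record {
    int (*func)(void);    /* address of the imported function */
    int  *callee_sb;      /* SB value of the module that exports it */
};

static struct import_record import_lib_next_id = { lib_next_id, library_data };

/* Importer-side thunk: switch SB, forward the call, restore SB. */
static int thunk_lib_next_id(void)
{
    int *caller_sb = sb;
    int  result;

    sb = import_lib_next_id.callee_sb;
    result = import_lib_next_id.func();
    sb = caller_sb;
    return result;
}
```

With sb pointing at program_data, a call through thunk_lib_next_id() updates library_data as intended. Calling lib_next_id() directly through a raw function pointer skips the SB switch, so the function operates on the caller's data segment instead; this is the failure mode behind the function pointer limitation described above.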


System Dependencies

Filesystem Dependencies - In order to have true shared libraries, there must be a single instance of the .text section, regardless of how many instances of a module are instantiated. This shared library implementation simply uses the Linux function mmap() to map the file to memory, so its performance depends on the correct behavior of mmap(). If the system supports an XIP file system, ideally mmap() would return the address referencing the memory-mapped file. If mmap() copies to memory, good shared library performance can still be achieved if mmap() assures that no more than one instance of the file is ever copied to memory. See uClinux, File Mapping, and Shared Libraries for further information.
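
As a rough illustration of the kind of mapping involved, the following sketch maps the start of a file read-only and shared using the standard POSIX calls. It is not the XFLAT loader: map_text is a hypothetical helper, and it assumes for simplicity that the text occupies the first text_size bytes of the file. A real loader would presumably also request execute permission.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the first text_size bytes of a module file read-only and shared.
 * Returns NULL on failure. (Hypothetical helper, for illustration only.)
 */
static const void *map_text(const char *path, size_t text_size)
{
    int   fd;
    void *text;

    assert(path != NULL && text_size > 0);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    text = mmap(NULL, text_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping remains valid after close */

    return text == MAP_FAILED ? NULL : text;
}
```

Whether two such mappings of the same file end up backed by one copy in memory, or by the flash itself on an XIP file system, is decided by the kernel and file system, which is exactly the dependency described above.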

Kernel Signal Issues - When a signal handler is installed (via signal(), sigaction(), etc.), a pointer to the signal handling function must be provided. The value of the PIC base register at the time that the signal is delivered must be exactly the same as the value of the PIC base register at the time that the signal handler was installed. This requires a small change to the kernel signal handling logic: to retain the caller's PIC base register contents when the signal handler is installed, and to restore that PIC base register value when the signal is delivered.
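
The required kernel behavior can be sketched as a small user-space model. This is not the actual kernel patch: pic_base stands in for the PIC base register, and model_signal, model_deliver, and struct sig_slot are hypothetical names used only to show the save-on-install, restore-on-delivery idea.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SIGNALS 32

typedef void (*handler_t)(int);

/* Stand-in for the PIC base register. */
static void *pic_base;

/* What the modified kernel would record for each signal. */
struct sig_slot {
    handler_t handler;
    void     *pic_base;    /* register contents at install time */
};

static struct sig_slot sig_table[MAX_SIGNALS];

/* Install: remember the caller's PIC base along with the handler. */
static void model_signal(int signo, handler_t handler)
{
    assert(signo >= 0 && signo < MAX_SIGNALS);
    sig_table[signo].handler  = handler;
    sig_table[signo].pic_base = pic_base;
}

/* Deliver: restore the saved PIC base around the handler call. */
static void model_deliver(int signo)
{
    void *interrupted;

    assert(signo >= 0 && signo < MAX_SIGNALS);
    if (sig_table[signo].handler == NULL)
        return;

    interrupted = pic_base;
    pic_base = sig_table[signo].pic_base;
    sig_table[signo].handler(signo);
    pic_base = interrupted;
}

/* Example handler: records which PIC base it observed. */
static void *seen_pic_base;

static void on_signal(int signo)
{
    (void)signo;
    seen_pic_base = pic_base;
}
```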

GOT-less Implementations - The present implementation was targeted for the embedded ARM7 application. Historically, ARM7 developers use the compiler switch -mno-got to suppress generation of a GOT. GOT binding is not supported by the FLT format (nor by the xFLT format). For the purposes of shared library support, the important feature provided by the GOT is the ability to redirect data references: the GOT forces a layer of indirection into each variable access. The ELF dynamic loader, by comparison, can relocate this GOT indirection layer at load time so that modules and global variables are properly imported to, and exported from, shared libraries.

Of course, it is this indirection that the embedded ARM7 developers wish to avoid: It adds to memory usage and degrades memory performance. The xFLAT approach to shared libraries follows this optimized usage and does not support dynamic, load-time binding of the GOT. As a consequence, global variables may not be shared between modules.

A workaround for this limitation is provided that is completely transparent to the user: accessor functions replace the global variable references. For example, the global variable errno is exported from libc. This solution provides two things:

An accessor function built into the shared library version of libc that exports a pointer to libc's errno, and
A header file structure that re-defines the errno variable so that references to errno will use the accessor function.

The end result is that the programmer still appears to have a global variable called errno and no special programming is required.
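
The accessor pattern can be sketched as follows. The names are hypothetical and chosen so that they do not collide with the real errno from <errno.h>: lib_errno_location is the accessor exported by the library, and the macro lib_errno is what the header would present to the programmer as if it were a plain variable.

```c
#include <assert.h>

/* --- inside the shared library (hypothetical names throughout) --- */
static int lib_errno_storage;             /* the library's private variable */

int *lib_errno_location(void)             /* exported accessor function */
{
    return &lib_errno_storage;
}

/* Library code sets the variable directly when something fails. */
static int lib_fail(void)
{
    lib_errno_storage = 22;
    return -1;
}

/* --- in the header file seen by programs --- */
#define lib_errno (*lib_errno_location())

/* --- program code: reads and writes what looks like a global variable --- */
static int program_demo(void)
{
    int saved = 0;

    if (lib_fail() < 0)
        saved = lib_errno;                /* read goes through the accessor */
    lib_errno = 0;                        /* so does the write */
    return saved;
}
```

Because the macro expands to a dereferenced function call, it is usable on both sides of an assignment, which is why existing source code does not need to change.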

Toolchain Module Issues - The operating context of the xFLT system differs from the traditional FLT operating context in one way: Traditional FLT processes consist of one contiguous memory allocation (or virtual address space) for the entire FLT process -- for both the read-only code segment and the writable, static data segment. In xFLT, by contrast, there are always two discontinuous address regions, one for the read-only code segment and one for the writable, static data segment.

As a consequence, the FLT-style compiler may attempt to resolve the address of text segment references by adding the static base register to the text segment offset! If your compiler behaves in this manner, a small patch is required. An example of a patch that makes the required compiler change is provided under xflat/gcc.

Debugger Module Issues - The GNU debugger includes logic to recognize when it has stepped into a shared library thunk routine. A thunk routine is a small section of code that lies between the calling code and the shared library code itself. The XFLAT shared library thunk routines will probably not be recognized by an unmodified GNU debugger. As a result, your debugger may have problems stepping into or over shared library calls without this modification.

Compressed Flat - Zipped FLT makes sense because the whole flat binary is unzipped and copied to RAM in one big, contiguous memory allocation -- nothing gets shared. xFLT, on the other hand, uses mmap to map the file into two pieces: a mapped, shared text address space and an allocated, private data space; this is how it produces the shared text segment.

A simpler (and probably better) solution than zipped xFLT might be to use a compressed file system. The file system would then be responsible for unzipping and copying the file to RAM. It would also be responsible for assuring that mmap continues to behave correctly: that there is a single instance of the text segment. And you would get the benefit of having all files compressed.
