This report details the study made to determine the maximum number of threads that can be successfully spawned using libpthread under Xflat.
When an application is built to use shared libraries through Xflat, a special thunk layer is created to enable calls between the application and library functions. More detail on Xflat and the thunk layer can be found here.
A simple example illustrates the use of the thunk layer. Say we have an application, App1, that needs to call printf, which is provided by the C library uClibc. The application uses the thunk layer to save all the local App1 register values, load the correct uClibc register values, and then jump to the corresponding function. Figure 1 illustrates this:
To call the printf function of libc, Xflat uses a thunk frame to save the App1 registers. This allows the Xflat thunk layer to replace them with the appropriate register values for executing the libc printf function, which then runs with those values. The libc library does not need a thunk layer, since it executes the kernel system calls directly.
In normal operation the application uses the Xflat thunk layer only when calling functions contained in shared libraries. Each external library call uses one of the frames in the thunk layer; the default build has 6 frames available. Since a normal library call (like printf) blocks until the function completes, the frame is released upon completion of the call and is reused by the next external call. This holds true only for single-threaded applications, where only one path of execution is followed per process.
In multi-threaded applications, several threads may call the same shared library function, or different shared library functions, simultaneously; each such call requires one thunk layer frame. Since these calls are asynchronous, a maximum of 6 calls (one call per frame) can be made simultaneously from the application to shared libraries. (This limit can be increased when the BSP is built.)
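The frame accounting described above can be sketched as a toy model. This is not Xflat code: the type `frame_pool` and the functions `pool_init`, `pool_acquire`, `pool_release`, and `simulate_concurrent_calls` are hypothetical names invented for illustration, and the model is single-threaded on purpose. It only shows why a blocking call in a single-threaded program keeps reusing one frame, while calls that are all in flight at once run out after the sixth.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_FRAMES 6  /* default frame count quoted in this report */

/* Hypothetical model of a fixed pool of thunk frames (not Xflat code). */
typedef struct {
    bool in_use[NUM_FRAMES];
} frame_pool;

/* Mark every frame as free. */
void pool_init(frame_pool *p) {
    for (int i = 0; i < NUM_FRAMES; i++)
        p->in_use[i] = false;
}

/* Claim a free frame for one library call.
 * Returns the frame index, or -1 if every frame is busy. */
int pool_acquire(frame_pool *p) {
    for (int i = 0; i < NUM_FRAMES; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = true;
            return i;
        }
    }
    return -1;
}

/* Return a frame to the pool when the library call completes. */
void pool_release(frame_pool *p, int idx) {
    assert(idx >= 0 && idx < NUM_FRAMES);
    p->in_use[idx] = false;
}

/* Start `calls` library calls that have not returned yet and
 * count how many of them obtained a frame. */
int simulate_concurrent_calls(frame_pool *p, int calls) {
    int granted = 0;
    for (int i = 0; i < calls; i++) {
        if (pool_acquire(p) >= 0)
            granted++;
    }
    return granted;
}
```

In this model, acquiring and releasing in sequence always hands back frame 0, which mirrors the single-threaded case. Asking for 8 calls in flight at once grants only 6, which mirrors the multi-threaded limit.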
The question "What is the maximum number of threads I can spawn using Xflat?" arises once we see that the number of available frames limits the number of simultaneous calls that can be made to a shared library. The answer has two versions: the theoretical and the practical.
To create a thread we call the pthread_create function from the libpthread library. This call uses one thunk frame, but when the call returns the frame is released and can be reused. So it would appear that, theoretically, we can continue to spawn threads until we hit the kernel maximum of 1024 per process. This assumption is only partially true, and in practice it proves false. Once a new thread has been spawned, it goes on to execute its startup routine. If this routine uses any external shared library function, such as printf, it has to call the shared uClibc library through the Xflat thunk layer, which requires a thunk frame. This means that if the threads do not call any external functions, or if the number of simultaneous external calls never exhausts the available frames, the theoretical maximum is still the kernel's 1024 threads per process. But since the timing of each thread's external calls is asynchronous, this theoretical limit cannot be guaranteed.
The practical, or real, maximum number of threads that can be spawned per process has to satisfy the worst-case timing scenario. This scenario occurs when all the threads make an external call to a shared library simultaneously, or when each makes a call that blocks long enough for all the other threads to make an external call of their own, such as sleep(120). At that point every thread holds a thunk frame to reach the appropriate shared library, so the number of runnable threads matches the number of thunk frames. For the default build of 6 thunk frames, the practical maximum number of threads is 6.
It is important to remember that the parent process is itself a thread. So if the maximum number of safe threads is 6, this means 5 child threads plus the parent thread. If you want to target a build that allows N child threads, you need to build the application with N+1 thunk frames. (For example, for 20 safe runnable child threads, build the application with at least 21 thunk frames.)
It is important to understand that, while experimentally you may be able to run more threads than the maximum given here, it is not safe to assume that this will always be the case. When a thread fails to allocate a thunk frame, it fails silently and the thread is destroyed. It is therefore very dangerous to exceed this limit.
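Because a thread that cannot get a thunk frame disappears without an error, one way to notice the problem during testing is to have each thread report in and compare the tally with the number of threads requested. The sketch below is a minimal illustration using standard POSIX threads and C11 atomics. The names `worker` and `spawn_and_count` and the cap of 64 threads are hypothetical choices for this example, and how pthread_join behaves for a thread destroyed in this way is not covered by this report, so treat the join step as an assumption.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_TEST_THREADS 64  /* arbitrary cap for this sketch */

/* Number of threads that reached the end of their start routine.
 * An atomic is used so that reporting in needs no library call. */
static atomic_int tally;

/* Start routine: do the thread's work, then report in. */
static void *worker(void *arg) {
    (void)arg;
    /* ... calls into shared libraries would go here ... */
    atomic_fetch_add(&tally, 1);
    return NULL;
}

/* Spawn n threads, wait for them, and return how many reported in.
 * A result smaller than n means some threads never finished. */
int spawn_and_count(int n) {
    pthread_t tid[MAX_TEST_THREADS];
    int created = 0;

    assert(n > 0 && n <= MAX_TEST_THREADS);
    atomic_store(&tally, 0);

    for (int i = 0; i < n; i++) {
        /* pthread_create returns 0 on success. */
        if (pthread_create(&tid[created], NULL, worker, NULL) == 0)
            created++;
    }
    for (int i = 0; i < created; i++)
        pthread_join(tid[i], NULL);

    return atomic_load(&tally);
}
```

Calling `spawn_and_count(5)` on a build with enough thunk frames should return 5; a smaller value would point to threads that were lost along the way.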
The use of the Xflat thunk layer increases the memory requirements of an application. Each thunk frame occupies 24 bytes. The extra memory is required when the application or library is loaded, so the compiled size does not increase, but the RAM requirement grows by 24 bytes per thunk frame. This is why the default build uses 6 frames per process, which adds only 144 bytes to the memory requirement; 1024 frames would require 24 KB more RAM per process, which is considerable.
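The arithmetic above can be checked with a small helper. This is only a sketch: the function name `thunk_ram_bytes` and the macro `THUNK_FRAME_BYTES` are hypothetical, and the 24-byte frame size is the figure quoted in this report.

```c
#include <assert.h>
#include <stddef.h>

#define THUNK_FRAME_BYTES 24  /* frame size quoted in this report */

/* Extra RAM, in bytes, that `frames` thunk frames add at load time. */
size_t thunk_ram_bytes(size_t frames) {
    /* Guard against overflow of the multiplication. */
    assert(frames <= (size_t)-1 / THUNK_FRAME_BYTES);
    return frames * THUNK_FRAME_BYTES;
}
```

With these assumptions, 6 frames cost 144 bytes and 1024 frames cost 24576 bytes, which is the 24 KB mentioned above.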
Note: To set the number of thunk frames of the compiled application, pass -W1,-p,<#_frames> to the Xflat linker, xflat-ld.
To use libpthread as a shared library with Xflat, libpthread itself needs a thunk layer so that it can call functions located in the uClibc library. For an application there are now two levels of indirection through Xflat, as represented in Figure 2:
For each thread that calls a libpthread function which in turn requires a libc function, a path is created from the application to libc through the thunk layers of both the application and libpthread. This means that even if the application is compiled with 100 thunk frames, it will not necessarily be able to run 100 threads that simultaneously access libc through libpthread, because the number of thunk frames available in libpthread determines this maximum. It would appear that libpthread needs to be compiled with at least 100 thunk frames, but more thunk frames than that are required.
libpthread uses a Master thread and a Master_Event thread, which are spawned within libpthread at different times to provide different services to the other threads. This adds 2 more thunk frames.
There is a special case with the functions pthread_cond_wait() and pthread_cond_timedwait(): these functions start an extra event helper thread. This means that for every thread that calls pthread_cond_timedwait(), two threads are created inside libpthread, so if these threads are to be able to call functions in libc, twice as many thunk frames must be available per thread serviced.
For the worst-case scenario, in which all threads call an external function simultaneously, libpthread needs 2 frames for the Master and Master_Event threads plus 2*N frames for the N threads that it services.
This gives us the following formula for calculating the number of frames needed for a given process to spawn N threads successfully, even in the worst-case scenario:
Minimum Frames Needed:
|Component|Threads|Minimum thunk frames|
|Application|C = number of child threads to spawn|C + 1|
|libpthread|S = number of threads to service|2*S + 2|
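From this table we can determine that the minimum number of frames needed by libpthread to service an application that creates N child threads is (taking S = N, as in the worst-case rule stated before the table):
Frames(libpthread) = 2*N + 2
And the maximum number of threads using libpthread when it is built with F frames is (as long as the application is compiled with at least N+1 thunk frames):
N = (F - 2) / 2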
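The sizing rules from the table can be written as small helper functions. This is a sketch under the report's stated worst-case assumptions; the function names `app_min_frames`, `libpthread_min_frames`, and `libpthread_max_threads` are hypothetical and do not belong to Xflat or libpthread.

```c
#include <assert.h>

/* Frames the application needs for `children` child threads:
 * one per child, plus one for the parent thread. */
int app_min_frames(int children) {
    assert(children >= 0);
    return children + 1;
}

/* Frames libpthread needs to service `serviced` threads in the
 * worst case: two per serviced thread, plus two for the Master
 * and Master_Event threads. */
int libpthread_min_frames(int serviced) {
    assert(serviced >= 0);
    return 2 * serviced + 2;
}

/* Largest number of threads libpthread can safely service when it
 * is built with `frames` thunk frames (rounded down). */
int libpthread_max_threads(int frames) {
    assert(frames >= 0);
    if (frames < 2)
        return 0;
    return (frames - 2) / 2;
}
```

Under these rules, 20 child threads call for 21 application frames, servicing 100 threads calls for 202 libpthread frames, and a libpthread built with the default 6 frames covers only 2 serviced threads in the worst case.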