Better Parallelization And Scalability

Obviously, I have more than enough FATE-related work to keep my free time filled for the foreseeable future. But that doesn’t stop me from coming up with more ideas for completely revising the underlying architecture. And it’s always good to hash these ideas out on this blog since it: 1) helps me clarify the issues; 2) allows other people to jump in with improvements and alternatives; 3) allows me to put as much thought into these ideas as possible. Let’s face it– whatever design decisions I make for FATE are the ones the team tends to be stuck with for a long time.


Parallel FATE

People who dig into FATE and the various commands it executes in order to build and test FFmpeg often ask why I only perform single-threaded builds, i.e., why not build with ‘make -j2’ on a dual-core machine? It would be faster, right? Well, yes, but only for the build phase. The test phase (which usually takes longer) is still highly serial (though the ‘make test’ regression suite can also be parallelized). A pragmatic reason I have for not wanting to multi-thread the build is that the stdout/stderr text lines can easily get jumbled, which makes it more difficult to diagnose failures.

I do, however, put both cores of the main dual-core FATE machine to use– I run 2 separate installations of the FATE script, thus divvying the labor by having each core handle roughly half of the configurations. Thus far, one installation runs the x86_64 configs while the other is a 32-bit chroot environment running the x86_32 configs.

Can I come up with a better parallelization system? I think I might be able to. And to what end? Taking this whole operation to the next level, where “next level” is defined loosely as getting a few hundred more tests into the database while perhaps upgrading to a faster machine with more than 2 cores, one that is responsible for more than just native machine builds. Also, I am experimenting with moving the PowerPC builds to a faster (x86) machine for building. Better support for cross compiling and remote testing is driving some of this refactoring.

So there are 2 parts to the build/test cycle: build and test (try to keep up). The assumption is that building requires the use of 1 CPU while testing also requires the use of 1 CPU. While the build phase will always require 1 CPU on the machine running the build, the test phase does not need 1 CPU on the same machine if it’s supposed to test on another machine. Instead, the testing phase in that case will require 1 CPU on a particular remote machine. The reason I bring this up is that there is no reason to tie up 1 CPU on the building machine if the testing phase is being carried out on a separate machine.
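To illustrate that accounting, here is a tiny Python sketch of what a job record might look like; the field names (build_host, test_host) are hypothetical and not part of the actual FATE script.

    from dataclasses import dataclass

    # Hypothetical job record: building always consumes a CPU on the build
    # machine, while testing consumes a CPU on whichever machine runs the tests.
    @dataclass
    class Job:
        config: str       # e.g. "x86_64-gcc" or "ppc-cross"
        phase: str        # "build" or "test"
        build_host: str   # machine whose CPU the build phase occupies
        test_host: str    # machine whose CPU the test phase occupies

        def occupies_cpu_on(self, host):
            """A job only counts against the machine actually doing the work."""
            return host == (self.build_host if self.phase == "build" else self.test_host)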

I am thinking of revising the FATE script to have a multi-threaded or multi-process (still deciding which) operating model, one that is even more ambitious than the one I was brainstorming during the last big architectural change (I scrapped that idea at the time because it seemed unnecessary). In this new model, a single installation of the FATE script is responsible for a lot of configurations and is smart about divvying the load amongst the available CPUs.

As a concrete example of this idea in operation, revision N is committed to the FFmpeg tree. A running FATE script detects this per its usual polling method and loads up its job queue with build jobs. 26 build jobs, in fact. This x86_64 machine has 2 CPU threads. So pull the first 2 build jobs out of the queue and start building them. When one of the build jobs completes, check its status. If the build succeeded, then put a new job in the queue indicating that a test phase should proceed with the newly built binaries. Then start with the next job in the queue.
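A rough Python sketch of that cycle follows; build_config() and test_config() are just placeholders for the real build and test logic, and the config list is trimmed down from the actual 26.

    import queue
    import threading

    CONFIGS = ["x86_64-gcc", "x86_32-gcc"]      # ... 26 entries in practice
    NUM_CPUS = 2                                # dual-core x86_64 machine
    jobs = queue.Queue()

    def build_config(config):
        return True   # placeholder for configuring and building this configuration

    def test_config(config):
        pass          # placeholder for running the tests on the new binaries

    def worker():
        while True:
            phase, config = jobs.get()
            if phase == "build":
                if build_config(config):          # build succeeded?
                    jobs.put(("test", config))    # ...then queue up its test phase
            else:
                test_config(config)
            jobs.task_done()                      # move on to the next job in the queue

    for config in CONFIGS:
        jobs.put(("build", config))
    for _ in range(NUM_CPUS):
        threading.Thread(target=worker, daemon=True).start()
    jobs.join()                                   # wait until every build and test is done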

In this model, however, there is still the matter of testing the 6 configurations cross compiled for PowerPC. Each of those 6 test phases should not occupy a job slot for the x86_64 CPU. This implies that there should be separate job queues for each of the CPUs that this FATE installation is responsible for. So the 6 build jobs for the 6 PowerPC configurations go into the x86_64 job queue but — upon successful completion — the build jobs trigger new testing jobs to be entered into the PowerPC job queue. Of course, the main x86_64 machine will be processing the PowerPC job queue concurrently with the x86_32/64 job queue. But running the test phase of a PowerPC configuration won’t count as a job for the x86_64 machine.
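In code, that routing might look something like the following; the queue names and the on_build_finished() hook are illustrative only.

    import queue

    # One queue per machine that this FATE installation drives. The x86_64
    # queue is drained by the local worker threads; the ppc queue is drained
    # by a thread that talks to the remote PowerPC box and therefore does not
    # occupy a local CPU slot.
    queues = {
        "x86_64": queue.Queue(),
        "ppc": queue.Queue(),
    }

    def on_build_finished(config, success, test_target):
        """Route the follow-up test job to the machine that actually runs the tests."""
        if success:
            queues[test_target].put(("test", config))

    # A cross-compiled PowerPC configuration builds locally on x86_64,
    # but its test job lands in the PowerPC queue:
    on_build_finished("ppc-cross-gcc", True, "ppc")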

Right now, the main FATE machine is a meager 2.13 GHz Core 2 Duo. But I can envision upgrading it to a faster quad-core. Scalability is what I’m driving at here.

So far, so good. But there are a few implementation details. The PowerPC configurations will naturally require more time for testing than the x86 configurations. Thus, when there is a new revision available, the PowerPC build jobs should be queued up first so that they can get built first and move into the PowerPC job queue for earlier testing.
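One simple way to express that ordering is a priority queue keyed on expected test time, so the slow targets get built first; the minute figures below are made up purely for illustration.

    import queue

    # Longest expected test phase first; the numbers are invented for the example.
    EXPECTED_TEST_MINUTES = {
        "ppc-cross-gcc": 90,
        "x86_64-gcc": 20,
        "x86_32-gcc": 25,
    }

    build_queue = queue.PriorityQueue()
    for config, minutes in EXPECTED_TEST_MINUTES.items():
        build_queue.put((-minutes, config))    # negate so the slowest pops first

    while not build_queue.empty():
        _, config = build_queue.get()
        print("building", config)              # ppc-cross-gcc comes out first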

How to manage directories for this? Each configuration’s build would go into its own directory, which would clean itself up after the fact. Ideally, I want to stay with one build tree here. This is important because if the source directory path changes at all, the ccache is invalidated, which really slows down this whole operation. But that implies that the FATE installation is not allowed to move on from a particular revision until all (26 in this case) configurations have been tested. What if the 6 PowerPC configs lag behind the 20 x86 configurations? Vice versa? Much more likely, what if there is a network or machine problem and the script can’t connect to the PowerPC machine? That blocks the whole operation.

It seems that there are times when the script needs to intelligently decide whether to leave certain configurations behind so that the entire set of configurations does not get too far behind. This implies separate source trees. In order to mitigate the ccache impact, experiment with fooling ccache using relative paths. Copy each iteration of the source tree to, e.g., ‘source/svn12345/ffmpeg’ or ‘source/svn12346/ffmpeg’, change directory into that source directory, create the individual build directories beneath that tree, and configure with a relative path. Perform reference counting on the tree to track how many configurations are using it and trash the whole thing when the last configuration is complete.
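A sketch of that bookkeeping, with hypothetical paths and none of the actual checkout or build commands: each revision gets its own tree under source/, every configuration holds a reference to it, and the last one out cleans up.

    import os
    import shutil
    import threading

    class SourceTree:
        """One copy of the FFmpeg source per revision, shared by all configurations."""

        def __init__(self, revision, nconfigs):
            self.root = os.path.join("source", "svn%d" % revision)
            self.path = os.path.join(self.root, "ffmpeg")   # e.g. source/svn12345/ffmpeg
            self.refs = nconfigs                            # 26 configurations in this example
            self.lock = threading.Lock()
            # (the real script would copy the checkout into self.path here; each
            # configuration then builds in its own directory beneath the tree,
            # configuring with a relative path so ccache stays happy)

        def release(self):
            """Called by each configuration once its build/test cycle is finished."""
            with self.lock:
                self.refs -= 1
                if self.refs == 0:
                    shutil.rmtree(self.root, ignore_errors=True)   # last one out: trash the tree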

So I think I’m going with a multithreaded approach for this instead of a multiprocess approach. Multithreading seems to lend itself best to this task. My biggest misgiving was SQLite’s grave warning against threads in general, implying that it can be dangerous to use SQLite in a multithreaded program. My best understanding, however, is that trouble can arise if a script opens an SQLite handle in one thread and tries to use that handle in another thread. I don’t plan to do that. Each thread will manage its own handles. This shouldn’t be any different from a single-threaded program managing multiple SQLite handles for whatever reason. I don’t think there is any restriction on such usage.
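For the record, the usage I have in mind looks roughly like this: every thread opens its own connection and never hands it to another thread. The table layout below is invented for the example, not the real FATE schema.

    import sqlite3
    import threading

    def log_result(config, status):
        # Runs entirely inside whichever thread called it, using its own handle.
        conn = sqlite3.connect("fate-example.sqlite")
        with conn:
            conn.execute("CREATE TABLE IF NOT EXISTS results (config TEXT, status INTEGER)")
            conn.execute("INSERT INTO results VALUES (?, ?)", (config, status))
        conn.close()

    threads = [
        threading.Thread(target=log_result, args=("x86_64-gcc", 0)),
        threading.Thread(target=log_result, args=("ppc-cross-gcc", 1)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()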

And finally, it will make sense to move the results logging phase into a separate thread. That way, when the server starts acting up (as I and various other testers have experienced), it won’t block the entire testing operation.
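Something like the following, where submit_to_server() stands in for whatever upload step the real script performs: the workers just drop results on a queue and a dedicated thread deals with the (possibly slow) server.

    import queue
    import threading

    results = queue.Queue()

    def submit_to_server(record):
        pass   # placeholder for the upload that occasionally hangs or fails

    def logger():
        while True:
            record = results.get()
            submit_to_server(record)   # may block for a long time; workers never notice
            results.task_done()

    threading.Thread(target=logger, daemon=True).start()

    # Build/test workers just enqueue their results and keep going:
    results.put({"config": "x86_64-gcc", "revision": 12345, "status": 0})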

Feedback always welcome, though not necessarily always incorporated. I am still considering how the periodic task infrastructure will fit into this.