My day job is in Beowulf clustering (I work for Advanced Clustering Technologies in Kansas City). It's pretty cool because it draws on every conceivable computer-related skill that one might have: programming, scripting, architecture, data center design, Linux kernel hacking, user interfaces, message passing, compiler optimizations, BLAS, semi-high-level math, hardware ... the list goes on and on. It's challenging enough without having hardware failures get in our way.

So, we developed an automated way to find and report hardware problems using a combination of the kernel's EDAC and MCE facilities, lm_sensors, S.M.A.R.T., and hddtemp. It's a little initrd that boots up and puts the system under heavy, heavy load using a program called HPL (well known in HPC) compiled against Intel- and AMD-specific BLAS libraries. We've been using it internally for a few years but decided that others might benefit from it as well.
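To give a feel for the kind of check involved, here is a minimal sketch of reading the kernel's EDAC error counters from sysfs. This is not the code from our boot image, just an illustration. It assumes the EDAC driver for your chipset is loaded and exposes the usual `ce_count` (corrected errors) and `ue_count` (uncorrected errors) files under `/sys/devices/system/edac/mc/mcN/`. The function names `read_edac_counts` and `report`, and the OK/WARN/FAIL labels, are made up for this example.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} for each memory controller.

    Returns an empty dict if the EDAC directory is missing, which usually
    means no EDAC driver is loaded (assumption: standard sysfs layout).
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are unreadable or malformed.
            continue
        counts[mc.name] = (ce, ue)
    return counts


def report(counts):
    """Turn the counters into one status line per memory controller."""
    lines = []
    for name, (ce, ue) in counts.items():
        if ue > 0:
            status = "FAIL"  # uncorrected (multi-bit) errors seen
        elif ce > 0:
            status = "WARN"  # corrected (single-bit) errors seen
        else:
            status = "OK"
        lines.append(f"{name}: {status} corrected={ce} uncorrected={ue}")
    return lines


if __name__ == "__main__":
    for line in report(read_edac_counts()):
        print(line)
```

Run it once before the load starts and again afterwards. Counters that climb under load are the interesting ones.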

So, I set up a public git mirror here:

git clone bootimage

And I put up some pre-compiled binaries and an explanation here:

If you have ECC memory, be sure to enable multi-bit error checking in your BIOS.