2011.02.19

Worst bug ever

Today I fixed the most time-consuming bug ever. I believe I've spent well over 50 hours actively trying to find it, and countless more thinking about it. I first encountered the problem sometime in early December, on my first pass through the new CAL/CAL++ Milkyway@Home separation application for ATI GPUs. The final results differed on the order of 10⁻⁶, far outside the required accuracy of 10⁻¹². I took a long break in December and early January to apply to grad school and deal with other things, but I was still thinking about the problem. Nearly all the time I've spent working on Milkyway@Home since I got back to RPI almost a month ago has gone into hunting this bug. It's stolen dozens of hours from me, but it's finally working.

The problem was errors that looked like this:
[Plot: CAL++ version errors]

This is the μ · r area integrated over one ν step of the integral, from the broken CAL version compared against the correct OpenCL results.

It looks like there's almost some order to the error, with grids and deep lines at certain points, though the grids are much less prominent when the full range of errors is plotted, except for the deep line in the middle of r. I spent many hours looking for reads or writes at invalid positions in the various buffers (I find dealing with image pitch and the like very annoying), and every time, after hours of tedious number-by-number comparison, found that everything was correct.
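For reference, this is roughly the kind of pitch-aware copy I kept checking and rechecking. It's only a sketch in plain CAL (the real code goes through CAL++), and the helper name, the dimensions, and the assumption that one element of the resource is 8 bytes (one double) are mine, not from the actual application:

```c
#include <cal.h>
#include <string.h>

/* Copy a width x height block of doubles into a 2D CAL resource.
 * calResMap reports the row pitch in elements of the resource's format,
 * not in bytes, so every row has to be written at offset row * pitch;
 * getting this wrong quietly corrupts part of the buffer.  Assumes the
 * format's element size is 8 bytes (e.g. CAL_FORMAT_DOUBLE_1), so one
 * element holds exactly one double. */
static CALresult write_doubles(CALresource res, const double* src,
                               CALuint width, CALuint height)
{
    CALvoid* ptr;
    CALuint pitch;                      /* row stride, in elements */
    CALresult err = calResMap(&ptr, &pitch, res, 0);
    if (err != CAL_RESULT_OK)
        return err;

    for (CALuint i = 0; i < height; ++i)
    {
        memcpy((double*) ptr + (size_t) i * pitch,  /* padded row in the resource */
               src + (size_t) i * width,            /* tightly packed source row */
               width * sizeof(double));
    }

    return calResUnmap(res);
}
```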

It also didn't help that, for a long time, my test inputs were not what I thought they were, which finally pushed me to set up the automated testing I've been meaning to get working since I started on the project last May.

I eventually happened to find this on the internet.

I vaguely remember finding this before, reading "Declaring data buffers which contain IEEE double-precision floating point values as CAL_DOUBLE_1 or CAL_DOUBLE_2 would also prevent any issues with current AMD GPUs." and deciding that whatever changes it was talking about didn't apply to me, since I was already using the latest releases and a Radeon 5870. Apparently that was wrong.

This claims CAL_FORMAT_DOUBLE_2 should work correctly, but it apparently doesn't. I also don't understand why I can't use the integer formats for stuff put into constant buffers. I spent way too much of my time searching for random details in ATI documentation. It's rather annoying. Switching to the CAL_FORMAT_UNSIGNED_INT32_{2 | 4} formats for my buffers solved the stupid problem. I guess some kind of terrible sampling was going on? I don't understand how that results in the error plots, with half the buffer being much worse, and the grids.
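For what it's worth, the change itself is tiny. Roughly this, as a plain-CAL sketch (the real code allocates through CAL++, and the helper name, dimensions, and flags here are just illustrative):

```c
#include <cal.h>

/* Allocate a 2D buffer that will hold double-precision values.
 * Declaring it as CAL_FORMAT_UNSIGNED_INT32_2 (2 x 4 bytes, the same
 * size as one double) instead of CAL_FORMAT_DOUBLE_2 is what finally
 * made the kernel see the bits unmodified. */
static CALresult alloc_double_buffer(CALresource* res, CALdevice dev,
                                     CALuint width, CALuint height)
{
    return calResAllocLocal2D(res, dev, width, height,
                              CAL_FORMAT_UNSIGNED_INT32_2,
                              0 /* no special allocation flags */);
}
```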

I really don't understand why this isn't in the actual documentation rather than something I just happened to stumble across. Only one of the provided CAL examples uses doubles, and it is a very simple one.
