GCN OpenCL memory fences update (and inline PTX)

This is an update to my previous post about the global behaviour of mem_fence() on GPUs, covering hardware released since then.

On previous AMD architectures the caches were not really used except for read-only images. The latest Tahiti/GCN GPUs have a read/write, incoherent L1 cache local to each compute unit. Since a single workgroup always runs on a single compute unit, memory within that group stays consistent through the cache.

According to the OpenCL specification, global memory is only guaranteed to be consistent across workgroups after the kernel finishes executing, so memory will be consistent before the next kernel invocation. I found this really annoying: it would ruin the structure of my kernels, and in some cases the multiple kernel launches added high overhead.

The write does seem to be committed to memory as the IL documentation indicates; however, the read is still problematic outside of the workgroup. You must bypass the L1 cache to ensure you read an updated value.

In some cases I found it faster and more convenient to use atomics to bypass the L1 cache: for example, you can read any given int value with atomic_or(&address, 0).

If you need strong memory consistency across workgroups, atomics are an option when the reads aren't performance-critical. That was true for one of the N-body kernels; for another it was many times slower than running a single workgroup at a time to ensure global consistency.
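As a sketch of the atomic trick (the kernel and variable names here are hypothetical), an atomic read-modify-write that changes nothing, like OR with 0, forces a coherent access that bypasses the incoherent L1:

```c
/* Sketch only: reading through an atomic so a write from a different
   compute unit is actually observed. Assumes the values fit in int and
   that global int atomics are supported (core in OpenCL 1.1). */
__kernel void read_coherent(__global volatile int* flags, __global int* out)
{
    size_t i = get_global_id(0);
    /* A plain load may return a stale L1 line written by another
       compute unit; atomic_or(&x, 0) reads the globally visible value
       without modifying it. */
    out[i] = atomic_or(&flags[i], 0);
}
```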

In the future when the GDS hardware becomes available as an extension, it will probably be a better option for global synchronization. It's been in the hardware at least since Cayman (and maybe Evergreen?) but we don't (yet) have a way to access it from the OpenCL layer.

On the Nvidia side, there is the potential that mem_fence() will stop providing a truly global fence in a future compiler update. Since around CUDA 4.0, the OpenCL compiler has supported inline PTX. You can get the same effect as __threadfence() by emitting the instruction directly:

inline void strong_global_mem_fence_ptx()
{
    /* membar.gl is the device-wide fence that __threadfence() (and,
       currently, mem_fence(CLK_GLOBAL_MEM_FENCE)) compiles to. */
    asm("membar.gl;");
}


OpenCL vs. CUDA GPU memory fences

Update: This information is only partially applicable on the most recent Tahiti/GCN GPUs, and a safer option is now available for Nvidia.

I've been working on the Milkyway N-body GPU application using OpenCL, mostly basing it off of a CUDA implementation of the Barnes-Hut algorithm. The port to OpenCL (and the addition of a slew of other minor features and details to match the original) has been quite a bit more painful and time consuming than I anticipated. In its current state, it seems to work correctly on Nvidia GPUs (somewhat unsurprisingly).

The most recent problem I've been exploring is an apparent "tree incest" problem which seems to happen quite frequently in situations where it should not. In the traversal of the tree to compute the force on a body, nearby cells should be entered so a more accurate force calculation can be done against individual bodies (as opposed to using the center of mass of a collection of farther-away bodies, which is what makes this an O(n log n) approximation rather than the basic O(n²) algorithm). Logically, the cell a body itself belongs to should be entered, and forces calculated from its neighbors while skipping the interaction with itself. If, when calculating forces on a body, the traversal doesn't run into the body itself, something is wrong. This can happen legitimately depending on the distribution of bodies, usually when bodies are very close to the edges of a cell, and most often with the classic cell opening criterion, particularly with opening angles close to the maximum of 1.0. However, it is happening nondeterministically and in all cases on AMD GPUs (usually for some small number of bodies relative to the total I'm testing with), so something is slightly broken.

The CUDA implementation uses the __threadfence() and __threadfence_block() functions in several places. The CUDA documentation for these functions is mostly clear: they stall the current thread until its memory accesses complete. The closest equivalents in OpenCL are the mem_fence() functions. The AMD CUDA porting guide says of __threadfence() that there is "no direct equivalent" in OpenCL, but that mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE) is an equivalent of __threadfence_block(). My guess was that the potentially different behaviour between mem_fence() and __threadfence() might be responsible, so I went looking for what actually happens.

Ignoring the supposedly identical __threadfence_block() and mem_fence(GLOBAL|LOCAL), I went looking at __threadfence(). According to the CUDA documentation:

__threadfence() waits until all global and shared memory accesses made by the calling thread prior to __threadfence() are visible to:

  • All threads in the thread block for shared memory accesses
  • All threads in the device for global memory accesses
According to the OpenCL spec, a mem_fence() "Orders loads and stores of a work-item executing a kernel. This means that loads and stores preceding the mem_fence will be committed to memory before any loads and stores following the mem_fence." Earlier in the spec (section 3.3.1 Memory Consistency), it states that "Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel."

This says that there is no concept of device-wide global memory consistency. Global memory accesses are only guaranteed to be completed and visible to other threads in the same workgroup, and only at a barrier, which mem_fence() is not. I guess that means the writes could be trapped in some kind of cache and be visible only to threads in the other wavefronts executing on the same compute unit making up the workgroup. This is quite a difference from the much stronger __threadfence(), where the writes become visible to all threads in the device. From this it unfortunately sounds like what I need can't be done without some hackery involving atomics, or without splitting into multiple kernels to achieve a weak sort of global "synchronization." Breaking (some of) these pieces into separate kernels isn't really practical in this case; it would have been painful to do and slower. So I figured I would look into what actually is happening.
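Concretely, the pattern at issue is cross-workgroup signalling through global memory, sketched below (the kernel and variable names are hypothetical). Per the spec, the fence only orders this work-item's own accesses; nothing promises that a work-item in another workgroup ever observes the stores before the kernel ends:

```c
/* Hypothetical producer side of a cross-workgroup handoff. */
__kernel void produce(__global int* data, __global volatile int* ready)
{
    if (get_global_id(0) == 0) {
        data[0] = 42;                     /* payload */
        mem_fence(CLK_GLOBAL_MEM_FENCE);  /* orders payload before flag,
                                             but only for THIS work-item */
        ready[0] = 1;                     /* flag another workgroup polls on */
    }
}
```

The spec only guarantees that a consumer which does see ready[0] == 1 would also see data[0] in order; it makes no guarantee that either store becomes visible outside the workgroup at all before the kernel completes.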

Since things seemed to be working correctly on Nvidia, I checked what happens there. Inspecting the PTX from CUDA and from my sample OpenCL kernels, it appears that the CUDA __threadfence() and __threadfence_block() compile into the same instructions as OpenCL's mem_fence() (as well as read_mem_fence() and write_mem_fence()) with the corresponding flags. Any of the fences with CLK_GLOBAL_MEM_FENCE compiles to membar.gl, and mem_fences with only CLK_LOCAL_MEM_FENCE compile to membar.cta. The PTX documentation is clearer about what actually happens here.

According to the PTX documentation, membar.cta "waits for prior memory accesses to complete relative to other threads in the CTA." CTA stands for "Cooperative Thread Array," which is a CUDA block (an OpenCL workgroup). This would seem to confirm the same behaviour as mem_fence(LOCAL). More interestingly, membar.gl "waits for prior memory accesses to complete relative to other threads in the device," confirming that __threadfence() and mem_fence(GLOBAL) have the same behaviour on Nvidia. If the problem I'm debugging is this issue, this explains why it does work as expected on Nvidia.
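Putting the observed mapping together (this is from inspecting the generated PTX, i.e. current compiler behaviour, not something any spec guarantees):

```c
/* Observed fence-to-PTX mapping on Nvidia:
 *
 *   OpenCL mem_fence(CLK_GLOBAL_MEM_FENCE ...)  ->  membar.gl   (device-wide)
 *   OpenCL mem_fence(CLK_LOCAL_MEM_FENCE)       ->  membar.cta  (CTA/workgroup)
 *   CUDA   __threadfence()                      ->  membar.gl
 *   CUDA   __threadfence_block()                ->  membar.cta
 *
 * read_mem_fence() and write_mem_fence() follow the same pattern
 * according to the flags passed.
 */
```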

Now that I was sure the correct thing should in fact be happening on Nvidia, I checked the AMD IL from my sample kernels, and found fence_lds_memory in the places I was most interested in. AMD IL instructions are built up from a base name (in this case "fence") with underscore-prefixed modifiers appended. Here, the _lds modifier is the local fence within the workgroup; the LDS is the "local data share," which is the same as OpenCL __local memory. Once again, mem_fence(GLOBAL|LOCAL) appears to have the same behaviour as __threadfence_block(), as expected.

Specifically, it states that:

_lds - shared memory fence. It ensures that:

  • no LDS read/write instructions can be re-ordered or moved across this fence instruction.
  • all LDS write instructions are complete (the data has been written to LDS memory, not in internal buffers) and visible to other threads.
What I'm actually looking for is the global behaviour, given by the _memory modifier:

_memory - global/scatter memory fence. It ensures that:

  • No memory import/export instructions can be re-ordered or moved across this fence instruction.
  • All memory export instructions are complete (the data has been written to physical memory, not in the cache) and is visible to other threads.
I suppose I should also have checked the final ISA to be sure, but I'm lazy and gave up on finding the Cayman ISA reference. There does appear to be some sort of waiting for the write:
  04 WAIT_ACK:  Outstanding_acks <= 0
I guess this might kill my hypothesis about the different mem_fence() behaviour. I would feel better if the documentation included the phrase "in the device" at the end, but my reading is still that it does what I hoped: a mem_fence() is consistent across the device in both AMD's and Nvidia's GPU implementations of OpenCL. So now I need to do more work to find what's actually broken.

So now I'm relying on implementation detail behaviour beyond the spec (it's not the only place...), but oh well. It's much nicer than the alternative (more work).

The conclusion of all of this is that, relying on OpenCL implementation behaviour rather than the spec, a mem_fence() with CLK_GLOBAL_MEM_FENCE should work among all threads in the device for both Nvidia and AMD GPUs, at least on current hardware and as far as I can tell from chasing the documentation.


Milkyway@Home for iPhone / iOS

Milkyway@Home has now been released for iPhone / iOS! Now you can not only use your GPU to figure out the structure of our galaxy, but your phone too! You can get install instructions on the news post.

Technically, the new client is only semi-functional. The actual important core code is there and working, but the BOINC pieces that would make it more (sort of) useful are missing. It downloads 1 of about 5 sample workunits (which also happen to be 1000x smaller than the real versions so that they can actually complete on the phone). Checkpointing and some other things still work normally. There wasn't really time to port BOINC / the libraries, so there isn't any accounting or actual task fetching. The results are just optionally emailed to me; I'm curious what results I'll get back.

The small tests that run in about 10 seconds on my desktop take over 20 minutes on my iPhone 3G and drain the battery about 8%, so it's about 2 orders of magnitude slower. I don't think the full-size units would complete in a week, although I haven't actually tried it.


Milkyway N-body 0.40

I've just finished making the next release of the Milkyway@Home N-body simulation. These should start going out soon along with new searches. The major change in this version is that Lua is now used for configuration instead of the old static JSON file. You can now script a totally arbitrary initial particle distribution, and you can run a simulation without the milkyway background potential, so it's probably more useful for anyone else to use now for other N-body simulations.

This all started one night when I was planning on finishing the few remaining issues with the new CAL/CAL++/ATI Separation application, but needed some reference results from the working OpenCL application on Nvidia. But then Tim Horton left Jayne (the system with the GTX 480 donated by a Milkyway@Home user) running Windows and disappeared somewhere, which was something of an obstacle to what I was planning, and I thought it would be cool to have an embedded dynamic language instead. My original thought was to use JavaScript after Seed, but JavaScriptCore isn't really split out from Webkit, and the full Webkit dependency was too much for what I needed. I considered Python, but the Lua libraries were an order of magnitude smaller, so I chose Lua and spent the next few days getting basic things working. I spent a long time writing boring bindings for many of the structs in the N-body simulation; I later found that if I had used LuaJIT, it seems it would have done most of that work for me. About a week later I had something working. Although the mini-API isn't really close to what I originally envisioned or wanted, you can now script any simulation you want. I'm considering making it possible to use a Lua closure for the external potential, but I'm not sure how much slower that would be.

This enables more complex models to be fit without a need to make new releases, and avoids a need to mangle the code every time the model changes to support more combinations of features. The current plan is to try multi-component dwarfs, with an extra component of dark matter which will be ignored in the final likelihood calculation.

I've fixed a few other small random bugs. For example, in some cases on Windows a system I/O error would happen when checkpointing, producing a "something something failed: No error" type of error message. I'm also using newer BOINC libraries for Linux to hopefully avoid the random crashes coming from the 6.10.58 libraries.

I've switched to shipping only the multithreaded versions using OpenMP on all systems. The 64-bit Windows build is now actually 64-bit as well. Building on Windows is still an agonizing process I don't understand (mostly getting dependencies and a functioning build environment, not actually building it), but it's getting better.

You can download the source here.


Worst bug ever

Today I fixed the most time-consuming bug ever. I believe I've spent well over 50 hours total actively trying to find it, and countless more thinking about it. I first encountered this problem some time in early December on my first pass through the new CAL/CAL++ Milkyway@Home separation application for ATI GPUs. The final results differed on the order of 10⁻⁶, much larger than the required accuracy of 10⁻¹². I took a long break in December and early January to apply to grad school and do other things, but I was still thinking about the problem. Nearly all the time I've spent working on Milkyway@Home since I got back to RPI almost a month ago has gone into finding this bug. It's stolen dozens of hours from me, but it's finally working.

The problem was errors that looked like this:
CAL++ version errors

This is the μ · r area integrated over one ν step in the integral from the broken CAL version compared to the correct OpenCL results.

It looks like there's almost a sort of order to the error, with the grids and deep lines at some points, though the grids are much less prominent when the full range of errors is plotted, except for the deep line in the middle of r. I spent many hours looking for something related to reading/writing invalid positions in the various buffers used (I find dealing with image pitch and such very annoying), every time finding, after many tedious hours of comparing numbers, that everything was right.

It also didn't help that for a long time, my test inputs were not what I thought they were, which finally pushed me to get the automated testing I've meant to get working since I started working on the project last May.

I eventually happened to find this on the internet.

I vaguely remember finding this before and reading "Declaring data buffers which contain IEEE double-precision floating point values as CAL_DOUBLE_1 or CAL_DOUBLE_2 would also prevent any issues with current AMD GPUs." and decided whatever changes it was talking about didn't apply to me, since I was already using the latest stuff and using the Radeon 5870. Apparently this is wrong.

This claims CAL_FORMAT_DOUBLE_2 should work correctly, but it apparently doesn't. I also don't understand why I can't use the integer formats for things put into constant buffers. I spent way too much of my time searching for random details in ATI documentation; it's rather annoying. Switching to the CAL_FORMAT_UNSIGNED_INT32_{2 | 4} formats for my buffers solved the stupid problem. I guess some kind of terrible sampling was going on? I don't understand how that would produce the error plots, with half the buffer being much worse, and the grids.

I really don't understand why this wasn't in the actual documentation, and instead I just happened to find this. Only one of the provided examples for CAL uses doubles, and it is a very simple example.


Milkyway@Home Windows Dependencies

I'm posting these so I don't accidentally almost lose them again, since they're rather painful to build, and in the unlikely event they're actually useful for someone else. I have binaries for the Milkyway@Home dependencies, in particular popt built with Visual Studio (which I ported last semester; that mostly involved removing C99 features and casting mallocs, since MSVC brings lots of C++ strictness into C while somehow also not supporting the C99 features that are in C++). For MinGW, the BOINC libraries are 6.10.58 with my patch from last June to fix the MinGW build, which still hasn't been applied. Crlibm, required by N-body to get consistent math results, I never got to truly work on Windows; these binaries have the crlibm_init() function stripped out, since the headers it used to set the x87 FPU control flags seemed to be missing. Getting Crlibm properly working on Windows (and maybe with MSVC too) has been on my TODO list, but I haven't gotten to it.

These are binaries for MSVC 2010, and for MinGW with GCC 4.4 (from TakeoffGW), for use with the static runtime (/MT). I had some trouble recently when trying to link these with GCC 4.5 stuff (at least the BOINC/C++ parts since C++ linking tends to stop working with new releases of things) from the new MinGW installer, so I'm not sure it will work with those.

MinGW dependencies
MSVC dependencies
Popt built with Visual Studio


Random Friday Adventure

The evening started out with a simple goal: to avoid paying extra for an Apple-approved video card, and to get more screens working for Peter. For his Mac Pro, Peter made the unfortunate choice of a Radeon 5750, a GPU which does not support doubles. Doubles are a critical feature for Milkyway@Home. Apple's OpenCL implementation also unfortunately seems to be missing support for doubles on GPUs, which is also problematic.

We started out leaving Open Source Software Practice class and headed to Peter's apartment, stopping for pizza on the way. Peter gave us a tour of his scary apartment in the sketchy parts of Troy, and showed us the demon GPU. Tim and Peter argued for a while about something dealing with backups and new hard drives. Earlier Tim had been giving out hard drives after giving up on Jayne's disk situation and getting a new NAS box, in addition to random sale drives Peter had. Peter was concerned about preserving his torrents; his ratio is important to him. There was some kind of copy that was never going to finish.

To use this random GPU from Newegg in the Mac Pro, it needed new EFI-friendly firmware or something like that; it's not exactly clear to me why it needs to be different. I had suggested that Peter get the 5830, since the 5830 is a much more acceptable card, with doubles! I found a comment on a blog post about these hacky firmwares which claimed to have it sort of working, with some weird caveats about which outputs worked. The correct, good GPU requiring no work would be a 5870 from Apple, though that would be more expensive.

We sat around Peter's apartment for a while before heading to our apartment to use my desktop for flashing purposes. Peter drove us, but there wasn't parking near our apartment. We went one street over, a strange place I have never been before. I have had to swap my GPU on a daily basis recently to work on Milkyway@Home, so I put in Peter's new GPU. While waiting to be told the next step to try, I played with the AMD OpenCL examples. To flash the video card, we thought we needed to get a DOS boot disk with the flash utility, which couldn't be used from Windows. It turns out this is close to impossible. I don't really remember all the nonsense we tried, but there was all kinds of boot sector hackery, and 16-bit tools which wouldn't run. We spent many hours making various disks and trying to flash with each one. Files would just mysteriously not show up and pretty much everything was failing. While the garbage utilities we were trying to use were old, the whole experience seemed like an adventure in computer use from the very early 90s. We booted off of Tim's camera, although it didn't actually help.

We eventually gave up and decided we could try flashing it again after it was installed in the Mac Pro and see if it would just happen to work. We went outside and couldn't find Peter's car. We walked up and down the street looking for it, and Peter came to the conclusion that it was towed again. It turns out it was; the fence we parked in front of was apparently a driveway of sorts. The next day Peter was supposed to drive some person he had never met through some extended relation, and didn't really want to do it. He now had the perfect excuse to get out of it, but for some reason was determined to go get the car now. We didn't have any other way to get there, so we were going to walk.

We stopped at the pizza place briefly so Peter could get money for the tow from the ATM. We then started our walking adventure through Troy after midnight; I hadn't really wandered through Troy in almost 4 years here. The walk to the fabulous garage took a bit more than half an hour. Some random people in a car parked there said the guy was calling back, but Peter called him anyway, and he confirmed he was coming back. We headed around back to where the cars were and paid the tow guy, while some kind of sketchy guy showed up behind us.

Peter paid $200 I think for the tow. It would have been cheaper to buy the Apple 5870 in the first place and avoid all of this. After that, we went to Denny's. Peter had commentary on some of the other patrons here at 2 am. I noticed that many of the people were old and wearing fancy clothes which I found strange.

I ordered some kind of food which I didn't quite understand what it was, so I pointed to it on the menu using my formidable social skills. I didn't quite get the right thing, but close enough. While sitting there eating, Peter explained that he had talked to Netkas about something or other related to the original GPU problem. Tim and I got excited, and sort of asked in disbelief about this conversation with him. It was kind of sad and hilarious that we both instantly recognized this completely random OSx86 hacker. We left Denny's and still didn't have a working GPU in a Mac Pro so everything was a complete failure.


End of the semester

I've mostly spent the semester doing homework; I didn't make as much progress on Clutterhs as I had planned. I hope that once I'm done with finals, I'll have time to make the various changes I want and finish binding Clutter 1.2. The release of Clutterhs 0.2 is still waiting on those, as well as on a release of gtk2hs.

In total, I've completed 58 physics homework assignments, 3 essays, 6 tests, a large number of class exercises and quizzes, and 3 upcoming finals. I've spent countless hours doing nothing but math. On occasion, there was lots of help from Mathematica.

In the 3 physics classes I took this semester (Electromagnetic Theory, Intro to Quantum Mechanics, Thermodynamics and Statistical Mechanics), I did a total of 315 pages of LaTeX/LyX homework, at an average length of 5.4 pages each. Nearly all of that was just math, with an occasional graph or chart. This is in addition to the countless scraps of paper I occasionally used (although as time passes I find myself doing nearly all of the simple algebra in my head and entering it directly into LyX).

I'm rather disappointed to not be going anywhere this summer. I applied to over a dozen different research programs for the summer for physics and astronomy at a wide array of random places; I was rejected from all of them. I finally did get accepted to work on milkyway@home, which is what I'll be working on this summer. I'm glad to be working on this, although I really don't want to stay in Troy. It may be my ideal research project.

I accidentally got accepted to GSoC. I applied as a final fallback in case I didn't get accepted to do research with milkyway@home. I found out after the GSoC deadline, and some time before the selected students were announced I dropped out as a student, at least according to melange. A while later it seems I was accepted anyway to work on the gobject introspection based binding generator for gtk2hs. I'll probably end up doing some work on this over the summer; it might also become what I work on next semester for RCOS. One of the other rejected Haskell proposals is now taking my place.



One of the primary features of this blog from near the beginning was the addition of the Cornify button, currently at the top of the page. It makes unicorns happen. Last weekend at ROFLCon I had this picture taken of me with a unicorn, and with the guy who made Cornify right behind me wearing a unicorn shirt.

Unicorns are pretty cool.


Clutterhs 0.1 Released

I've made an early release of Clutterhs, and it's on Hackage. There's still a fair number of things that don't work yet, and lots of messy code that needs to be cleaned up, but most of the core stuff should work. There isn't a tutorial yet, but overall usage is similar to gtk2hs. If you want to try it, the best thing to do is look at the simple demos in the git repo. They aren't very good examples, and just sort of use a bunch of random stuff, but they should help with the general idea of how to use it (which is still quite likely to change).

See Older Posts