
Atomic Incrementing

I found a great quote:

Locks are like tanks - powerful, slow, safe, expensive, and prone to getting you stuck.

It goes well with this picture:

Creating atomic increments (and avoiding locks) on various architectures is absurdly difficult, and far harder than it should be. And portable? HA! You said “portable.” I’ve been looking into it, and here’s my little collection of implementations.

So, there’s a small amount of information out there about doing it on PowerPC. In GCC’s inline-assembly, on Linux (using gcc 4.1.2) it looks like this:

void atomic_incr(int * operand, int incr)
{
    asm volatile (
        "loop:\n\t" /* repeat until this succeeds */
        "lwarx  6,0,%0\n\t" /* reserve the operand */
        "add    6,6,%1\n\t" /* add incr to it */
        "stwcx. 6,0,%0\n\t" /* put the sum back, and release it */
        "bne- loop" /* start-over on failure */
        :
        : "r" (operand), "r" (incr)
        : "r6"
     );
}

For some idiotic reason, this is different on MacOS X:

void atomic_incr(int * operand, int incr)
{
    asm volatile (
        "loop:\n" /* repeat until this succeeds */
        "lwarx  r6,0,%0\n\t" /* reserve the operand */
        "add    r6,r6,%1\n\t" /* add incr to it */
        "stwcx. r6,0,%0\n\t" /* put the sum back, and release it */
        "bne- loop\n" /* start-over on failure */
        :
        : "r" (operand), "r" (incr)
        : "r6"
     );
}

Note the addition of r’s in front of all of the register numbers.

But that’s just fine. It makes some sense, and above all: it works!

Actually, here’s an even better one (more optimizable, and more cross-platform):

void atomic_incr(int * operand, int incr)
{
    register int tmp;
    asm volatile (
        "loop:\n" /* repeat until this succeeds */
        "lwarx  %2,0,%0\n\t" /* reserve the operand */
        "add    %2,%2,%1\n\t" /* add incr to it */
        "stwcx. %2,0,%0\n\t" /* put the sum back, and release it */
        "bne- loop\n" /* start-over on failure */
        :: "r" (operand), "r" (incr), "r" (tmp)
     );
}
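
If you want to convince yourself that any of these actually work, the easiest test is to hammer the counter from several threads and check the total afterward. Here’s a minimal pthread sketch (the names and counts are mine and entirely arbitrary; it assumes you link in whichever atomic_incr above matches your machine):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   1000000

void atomic_incr(int * operand, int incr); /* any of the versions above */

static int counter = 0;

static void * hammer(void * arg)
{
    int i;
    for (i = 0; i < NITERS; i++)
        atomic_incr(&counter, 1); /* every single increment should survive */
    return arg;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    int i;
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, hammer, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    /* if the increment really is atomic, nothing gets lost */
    printf("expected %d, got %d\n", NTHREADS * NITERS, counter);
    return 0;
}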

Now, show me the same thing for Intel processors. What’s that? You can’t? Of course not. Because Intel loves mucking with this stuff, so each of their processors uses a slightly different syntax to get the best speed. That, and getting it into GCC syntax is additionally difficult because some genius keeps changing the GCC assembler semantics.

Here’s a basic atomic increment that works on most x86 machines (NOT 386s; xaddl didn’t exist until the 486):

void atomic_add(int * operand, int incr)
{
    asm volatile (
        "lock xaddl %1, %0\n" // add incr to operand
        : // no output
        : "m" (*operand), "r" (incr)
    );
}

It uses xaddl so that I don’t have to declare an output operand, and it can operate on memory directly. The following also works, but may not be as fast (or it may be equivalent):

void atomic_add(int * operand, int incr)
{
    asm volatile (
        "lock xaddl %1, (%0)\n" // add incr to operand
        : // no output
        : "r" (operand), "r" (incr)
    );
}

Here’s a version that works on 386s. I don’t know which is “faster” on all architectures:

void atomic_add(int * operand, int incr)
{
    asm volatile (
        "lock addl %1, %0\n" // add incr to operand
        : "=m" (*operand)
        : "r" (incr), "m" (*operand)
    );
}
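
Incidentally, if you actually want the old value back (a true fetch-and-add), xaddl will hand it to you. Here’s a sketch of how I’d write that; note the "+r"/"+m" constraints, which tell the compiler that both the register and the memory location get modified:

int atomic_fetch_add(int * operand, int incr)
{
    int result = incr;
    asm volatile (
        "lock xaddl %0, %1\n" // adds result into *operand, leaves the old value in %0
        : "+r" (result), "+m" (*operand)
        : // no other inputs
        : "cc", "memory"
    );
    return result; // the value *operand held before the add
}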

But does ANY of that work on fancy Itanium (ia64) machines? Hell no!

You may be wondering: how does the Linux kernel implement atomic increments on ia64 machines? Like this:

#define atomic_add(i,v) \
({ \
    int __ia64_aar_i = (i); \
    (__builtin_constant_p(i) \
     && (   (__ia64_aar_i ==  1) || (__ia64_aar_i ==   4) \
         || (__ia64_aar_i ==  8) || (__ia64_aar_i ==  16) \
         || (__ia64_aar_i == -1) || (__ia64_aar_i ==  -4) \
         || (__ia64_aar_i == -8) || (__ia64_aar_i == -16))) \
            ? ia64_fetch_and_add(__ia64_aar_i, &(v)->counter) \
            : ia64_atomic_add(__ia64_aar_i, v); \
})

Yeah, pretty sweet, no? One code path for the handful of constant increments that fetchadd can encode directly, and another for everything else. Here’s a simple increment:

void atomic_oneup(int * operand)
{
    uint64_t res; // this MUST be 64 bits
    asm volatile (
        "fetchadd4.rel %0=%1,%2"
        : "=r" (res)
        : "m" (*operand), "i" (1)
        : "memory"
    );
}

Note that it HAS to return something, because it’s a fetch instruction. But it’s also conveniently atomic. And that increment doesn’t have to be 1, obviously, but could be any of the ones listed in that ugly #define. But what about adding arbitrary numbers? For this we must use cmpxchg. Here’s how Linux uses it (tidied up a LOT):

static __inline int ia64_atomic_add (int i, atomic_t *v)
{
    __s32 old, new;
    do {
        old = atomic_read(v);
        new = old + i;
    } while (ia64_cmpxchg(acq, v, old, new, sizeof(atomic_t)) != old);
    return new;
}

How can you not love something like that? Boiling it down, that call to ia64_cmpxchg turns into: ia64_cmpxchg4_acq((__u32*)v, new, o); (that o variable is a place-holder). THAT, in turn, is this:

#define ia64_cmpxchg4_acq(ptr, new, old) \
({ \
    __u64 ia64_intri_res; \
    asm volatile ("mov ar.ccv=%0;;" :: "rO" (old)); \
    asm volatile ("cmpxchg4.acq %0=[%1],%2,ar.ccv": \
        "=r" (ia64_intri_res) : "r"(ptr), "r"(new) : "memory"); \
    ia64_intri_res; \
})

Just so we don’t go insane here, here’s a condensed loop:

void atomic_incr(int * operand, int incr)
{
    int64_t res, old, new;
    do {
        old = *operand; // atomic if operand is 32-bit aligned
        new = old + incr;
        asm volatile ("mov ar.ccv=%0;;" :: "rO" (old));
        // these are separate because the compiler inserts stuff in between
        asm volatile ("cmpxchg4.acq %0=[%1],%2,ar.ccv":
                    "=r" (res) : "r"(operand), "r"(new) : "memory");
    } while (res != old); // if res==old, our computation is out of date
}

Ugly much? Just a bit.

Something else I’ve discovered, which is kinda critical: there is a big difference between

asm volatile ( /* stuff */ );

and

asm __volatile__ ( /* stuff */ );

The difference is that gcc is MUCH more careful not to muck with the latter, but feels free to modify the former if it feels it would be a good idea. The only difference between

asm volatile ( /* stuff */ );

and

__asm__ volatile ( /* stuff */ );

is that the latter can be used if the former conflicts with something.
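
For example, compile with -std=c99 (strict ISO mode) and plain asm isn’t even a keyword anymore, while the underscored spellings still work fine. A tiny sketch (the function name is just mine):

/* Under -std=c99 (no GNU extensions), `asm` is not a keyword,
 * but the double-underscore spellings always are. */
static void compiler_barrier(void)
{
    __asm__ __volatile__ ("" ::: "memory"); /* emits no instructions; it just
                                               keeps gcc from reordering memory
                                               accesses across this point */
}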


Comments (2)

mwgamera:

With gcc 4.1.2 you can use __sync_fetch_and_add instead of writing custom assembly, I believe.
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html

Kyle:

I found that 4.1.2 didn't *reliably* support those builtins, particularly on OSX. But yes, those builtins (when available) are quite nice.
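
For reference, using the builtin looks something like this (assuming your gcc actually provides it for your target; the names here are just for illustration):

static int counter = 0;

void bump(int incr)
{
    /* atomically adds incr; the old value comes back if you want it */
    int old = __sync_fetch_and_add(&counter, incr);
    (void)old;
}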

