
Atomic Incrementing

I found a great quote:

Locks are like tanks - powerful, slow, safe, expensive, and prone to getting you stuck.

It goes well with this picture:

Creating atomic increments (and avoiding locks) on various architectures is absurdly difficult, and far harder than it should be. And portable? HA! You said “portable.” I’ve been looking into it, and here’s my little collection of implementations.

So, there’s a small amount of information out there about doing it on PowerPC. In GCC’s inline-assembly, on Linux (using gcc 4.1.2) it looks like this:

void atomic_incr(int * operand, int incr)
{
    asm volatile (
        "loop:\n\t" /* repeat until this succeeds */
        "lwarx  6,0,%0\n\t" /* reserve the operand */
        "add    6,6,%1\n\t" /* add incr to it */
        "stwcx. 6,0,%0\n\t" /* put the sum back, and release it */
        "bne- loop" /* start-over on failure */
        :
        : "r" (operand), "r" (incr)
        : "r6"
     );
}

For some idiotic reason, this is different on MacOS X:

void atomic_incr(int * operand, int incr)
{
    asm volatile (
        "loop:\n" /* repeat until this succeeds */
        "lwarx  r6,0,%0\n\t" /* reserve the operand */
        "add    r6,r6,%1\n\t" /* add incr to it */
        "stwcx. r6,0,%0\n\t" /* put the sum back, and release it */
        "bne- loop\n" /* start-over on failure */
        :
        : "r" (operand), "r" (incr)
        : "r6"
     );
}

Note the addition of r’s in front of all of the register numbers.

But that’s just fine. It makes some sense, and above all: it works!

Actually, here’s an even better one (more optimizable, and more cross-platform):

void atomic_incr(int * operand, int incr)
{
    register int tmp;
    asm volatile (
        "loop:\n" /* repeat until this succeeds */
        "lwarx  %2,0,%0\n\t" /* reserve the operand */
        "add    %2,%2,%1\n\t" /* add incr to it */
        "stwcx. %2,0,%0\n\t" /* put the sum back, and release it */
        "bne- loop\n" /* start-over on failure */
        :: "r" (operand), "r" (incr), "r" (tmp)
     );
}
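
If you want to convince yourself that any of these actually work, the easiest test is to hammer the counter from several threads and check the total afterward. Here’s a minimal pthread sketch (the names and counts are mine and entirely arbitrary; it assumes you link in whichever atomic_incr above matches your machine):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   1000000

void atomic_incr(int * operand, int incr); /* any of the versions above */

static int counter = 0;

static void * hammer(void * arg)
{
    int i;
    for (i = 0; i < NITERS; i++)
        atomic_incr(&counter, 1); /* every single increment should survive */
    return arg;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    int i;
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, hammer, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    /* if the increment really is atomic, nothing gets lost */
    printf("expected %d, got %d\n", NTHREADS * NITERS, counter);
    return 0;
}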

Now, show me the same thing for Intel processors. What’s that? You can’t? Of course not. Because Intel loves mucking with this stuff, so each of their processors uses a slightly different syntax to get the best speed. That, and getting it into GCC syntax is additionally difficult because some genius keeps changing the GCC assembler semantics.

Here’s a basic atomic increment that works on most x86 machines (NOT 386s; xaddl didn’t exist until the 486):

void atomic_add(int * operand, int incr)
{
    asm volatile (
        "lock xaddl %1, %0\n" // add incr to operand
        : // no output
        : "m" (*operand), "r" (incr)
    );
}

It uses xaddl so that I don’t have to declare an output operand, and it can operate on memory directly. The following also works, but may not be as fast (or it may be equivalent):

void atomic_add(int * operand, int incr)
{
    asm volatile (
        "lock xaddl %1, (%0)\n" // add incr to operand
        : // no output
        : "r" (operand), "r" (incr)
    );
}

Here’s a version that works on 386s. I don’t know which is “faster” on all architectures:

void atomic_add(int * operand, int incr)
{
    asm volatile (
        "lock addl %1, %0\n" // add incr to operand
        : "=m" (*operand)
        : "r" (incr), "m" (*operand)
    );
}
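
Incidentally, if you actually want the old value back (a true fetch-and-add), xaddl will hand it to you. Here’s a sketch of how I’d write that; note the "+r"/"+m" constraints, which tell the compiler that both the register and the memory location get modified:

int atomic_fetch_add(int * operand, int incr)
{
    int result = incr;
    asm volatile (
        "lock xaddl %0, %1\n" // adds result into *operand, leaves the old value in %0
        : "+r" (result), "+m" (*operand)
        : // no other inputs
        : "cc", "memory"
    );
    return result; // the value *operand held before the add
}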

But does ANY of that work on fancy Itanium (ia64) machines? Hell no!

You may be wondering: how does the Linux kernel implement atomic increments on ia64 machines? Like this:

#define atomic_add(i,v) \
({ \
    int __ia64_aar_i = (i); \
    (__builtin_constant_p(i) \
     && (   (__ia64_aar_i ==  1) || (__ia64_aar_i ==   4) \
         || (__ia64_aar_i ==  8) || (__ia64_aar_i ==  16) \
         || (__ia64_aar_i == -1) || (__ia64_aar_i ==  -4) \
         || (__ia64_aar_i == -8) || (__ia64_aar_i == -16))) \
            ? ia64_fetch_and_add(__ia64_aar_i, &(v)->counter) \
            : ia64_atomic_add(__ia64_aar_i, v); \
})

Yeah, pretty sweet, no? One code path for the handful of constant increments that fetchadd can encode directly, and another for everything else. Here’s a simple increment:

void atomic_oneup(int * operand)
{
    uint64_t res; // this MUST be 64 bits
    asm volatile (
        "fetchadd4.rel %0=%1,%2"
        : "=r" (res)
        : "m" (*operand), "i" (1)
        : "memory"
    );
}

Note that it HAS to return something, because it’s a fetch instruction. But it’s also conveniently atomic. And that increment doesn’t have to be 1, obviously, but could be any of the ones listed in that ugly #define. But what about adding arbitrary numbers? For this we must use cmpxchg. Here’s how Linux uses it (tidied up a LOT):

static __inline int ia64_atomic_add (int i, atomic_t *v)
{
    __s32 old, new;
    do {
        old = atomic_read(v);
        new = old + i;
    } while (ia64_cmpxchg(acq, v, old, new, sizeof(atomic_t)) != old);
    return new;
}

How can you not love something like that? Boiling it down, that call to ia64_cmpxchg turns into: ia64_cmpxchg4_acq((__u32*)v, new, o); (that o variable is a place-holder). THAT, in turn, is this:

#define ia64_cmpxchg4_acq(ptr, new, old) \
({ \
    __u64 ia64_intri_res; \
    asm volatile ("mov ar.ccv=%0;;" :: "rO" (old)); \
    asm volatile ("cmpxchg4.acq %0=[%1],%2,ar.ccv": \
        "=r" (ia64_intri_res) : "r"(ptr), "r"(new) : "memory"); \
    ia64_intri_res; \
})

Just so we don’t go insane here, here’s a condensed loop:

void atomic_incr(int * operand, int incr)
{
    int64_t res, old, new;
    do {
        old = *operand; // atomic if operand is 32-bit aligned
        new = old + incr;
        asm volatile ("mov ar.ccv=%0;;" :: "rO" (old));
        // these are separate because the compiler inserts stuff in between
        asm volatile ("cmpxchg4.acq %0=[%1],%2,ar.ccv":
                    "=r" (res) : "r"(operand), "r"(new) : "memory");
    } while (res != old); // if res==old, our computation is out of date
}

Ugly much? Just a bit.

Something else I’ve discovered, which is kinda critical: there is a big difference between

asm volatile ( /* stuff */ );

and

asm __volatile__ ( /* stuff */ );

The difference is that gcc is MUCH more careful not to muck with the latter, but feels free to modify the former if it feels it would be a good idea. The only difference between

asm volatile ( /* stuff */ );

and

__asm__ volatile ( /* stuff */ );

is that the latter can be used if the former conflicts with something.
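
For example, compile with -std=c99 (strict ISO mode) and plain asm isn’t even a keyword anymore, while the underscored spellings still work fine. A tiny sketch (the function name is just mine):

/* Under -std=c99 (no GNU extensions), `asm` is not a keyword,
 * but the double-underscore spellings always are. */
static void compiler_barrier(void)
{
    __asm__ __volatile__ ("" ::: "memory"); /* emits no instructions; it just
                                               keeps gcc from reordering memory
                                               accesses across this point */
}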


Comments (2)

mwgamera:

With gcc 4.1.2 you can use __sync_fetch_and_add instead of writing custom assembly, I believe.
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html

Kyle:

I found that 4.1.2 didn't *reliably* support those builtins, particularly on OSX. But yes, those builtins (when available) are quite nice.
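
For reference, using the builtin looks something like this (assuming your gcc actually provides it for your target; the names here are just for illustration):

static int counter = 0;

void bump(int incr)
{
    /* atomically adds incr; the old value comes back if you want it */
    int old = __sync_fetch_and_add(&counter, incr);
    (void)old;
}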

