Main

Research Archives

May 2, 2003

Research Idea

In the Lord of the Rings, the body of the actor was used to map and control the body of Gollum in a realistic fashion. The same was not done with his face, but it could be - although perhaps not to the same amount of detail.

Now imagine a simple line drawing of a face. How detailed would a camera's model of a human face need to be in order to extract just that much information from the face - the size of the smile, dimples, the angle and placement of the brows and perhaps lines on the forehead. Could this be extracted using a simple webcam?

How cool would it be to have this cariacature of a face (simple smiley line drawing) represent the other person in IM conversations, as a two way conversation? How much data would need to be sent in order to get decent responsiveness?

How well could you map this line drawing to a 3D face, that merely morphs along the same control lines as this line-drawing face?

How possible is this?

February 11, 2004

Academic Writing...

Academic writing is obtuse, yes. But it can also be fun! see?

March 14, 2005

Misprediction

Why is misprediction not a word?

September 7, 2006

Why Are Compilers So Idiotic?

It astonishes me how idiotic some compilers can be.

I’ve been working with some unusual compilers recently: the PGI compilers from the Portland Group, and the XL compilers from IBM. I’ve been attempting to get them to compile some standard libraries for testing parallel operations. For example, I want them both to compile LAM-MPI, and BLAS, and LAPACK.

In fighting with them to get them to function, I’ve discovered all sorts of quirks. To vent my frustrations, I am documenting them here. Specifically, IBM’s XL compilers today.

I’m compiling things with the arguments -O4 -qstrict -qarch=ppc970 -Q -qcpluscmt. This takes some explanation. I’m using -O4 instead of -O5 because with the latter, the LAPACK libraries segfault. That’s right. Fortran code, with nary a pointer in sight, segfaults. How that happens is beyond me. The -qarch=ppc970 flag is because, without it, code segfaults. What does this mean? This means that the compiler can’t figure out what cpu it’s running on (which, hey, I’ll give them a pass on that one: I’m not running this compiler on a “supported” distribution of Linux) and is inserting not only bad code but bad pointer dereferences (HUH?!?).

When compiling LAPACK, you’re going to discover that the standard Fortran function ETIME, which LAPACK uses, doesn’t exist in XL-world. Instead, they decided it would be more useful to have an ETIME_ function. See the underscore? That was a beautiful addition, wasn’t it? I feel better already.

While compiling LAM with any sort of interesting optimization (the benefits of which are unclear in LAM’s case), you’re going to discover that XL’s -qipa flag (which is implicitly turned on by -O3 and above) can cause extremely long compile times for some files. How extreme? I’m talking over an hour on a 2Ghz PPC with several gigabytes of RAM. But don’t worry! Even though it looks like the compiler is stuck in an infinite loop, it’s really not, and will actually finish if you give it enough time. Or you could just compile LAM without optimization, it’s up to you.

Next nifty discovery: some genius at IBM decided that all inline functions MUST be static. They have to be, otherwise the world just comes apart at the seams. Nevermind the fact that the C standard defines the inline keyword as a hint to the compiler, and specifically forbids the compiler from changing the semantics of the language. What does this matter? A common, sensible thing a library can do is to define two init functions, like so:

inline int init_with_options(int foo, int bar, int baz)
{
        ...do stuff...
        return n;
}
int init(void)
{
        return init_with_options(0, 0, 0);
}

Now what do you suppose the author of such code intends? You guessed it! He wants to let the compiler know that dumping the contents of init_with_options() into init() is a fine thing to do. The author is not trying to tell the compiler “nobody will ever call init_with_options().” But that’s what the XL compilers think the author is saying. Better still, the documentation for XL explains that there’s a compiler option that may help: -qnostaticinline “Wow!” you say to yourself, that sounds promising! Nope. The option doesn’t seem to do a thing. You should have been clued in by the fact that the documentation says that that option is on by default. No, sorry, all inline functions are static, and there’s nothing you can do about it. If you didn’t want them static, you shouldn’t have given such hints to the compiler.

Here’s another good one: what does the user mean when he specifies the -M compiler flag? Well, let’s think about this. The documentation for that option says:

Creates an output file that contains information to be included in a “make” description file. This is equivalent to specifying -qmakedep without a suboption.

Now, what do you think -M really does? Oh, it does what it says, alright: creates a .d file. But it also doesn’t stop the compiler from actually attempting to COMPILE the file. So, now that you’ve got your dependencies built so that you know in what order to compile things, you tell make to have another go at building things. But what’s this? It’s already been compiled (incorrectly!)! Joy! My workaround is to run the compiler like so:

rm -f /tmp/foo
xlc -M -c -o /tmp/foo file.c

Now, when gcc and other compilers handle the -M flag, they spit out dependencies to stdout, rather than creating a file. Many complex Makefiles that you really don’t want to go mutzing with rely on that behavior. How do we get XL to do the same? Here’s one you wouldn’t have suspected: -MF/dev/stdout What’s the -MF flag supposed to do? Modify an existing Makefile, that’s what. See, isn’t that an excellent idea?

Speaking of excellent ideas, IBM decided that the C language needed some extensions. And I can’t begrudge them that; everybody does it. Among the extensions they added was a __align() directive, along the lines of sizeof(), that allows you to specify the alignment of variables that you create. You’d use it like so:

int __align(8) foo;

Unfortunately, in the standard pthread library headers, there are several structs defined that look like this:

struct something {
        void * __func;
        int __align;
}

You see the problem? Of course, there’s no way to tell XL to turn off the __align() extension. You would think that using -qlanglvl might do it, because it supposedly allows you to specify “strict K&R C conformance”. You’d be wrong. Your only option is to edit the headers and rename the variable.

Other ways in which XL tries to be “intelligent” but just ends up being idiotic is it’s handling of GCC’s __extension__ keyword. For example, in the pthreads headers, there is a function that looks like this:

int pthread_cancel_exists(void)
{
        static void * pointer =
        __extension__ (void *) pthread_cancel;
        return pointer != 0;
}

The reason GCC puts __extension__ there is because pthread_cancel may not exist, and it wants to have pointer be set to null in that case. Normally, however, if you attempt to point to a symbol that doesn’t exist, you’ll get a linking error. XL, of course, barfs when it sees this, but not in the way you think. XL attempts to be smart and recognize common uses of __extension__. Somehow, somewhere, the error you get will be:

found ), expected {

Really makes you wonder what the heck it thinks is going on there, doesn’t it? The solution? Remove “__extension__” and it works fine.

That’s all for now. I’m sure I’ll find more.

October 9, 2007

Concurrent Hash Table Tricks

So, I’m working on qthreads (which is open-sourced, but currently lacks a webpage), and thinking about its Unix implementation.

The Unix implementation emulates initialization-free synchronization (address locks and FEBs) by storing addresses in a hash table (well, okay, a striped hash table, but if we make the stripe 1, then it’s just a hash table). Let’s take the simplest: address locks. The semantics of the hash table at the moment are really pretty basic: if an address is in the hash, it’s locked. If it’s not in the hash, it’s not locked. The hash is the cp_hashtable from libcprops, a library which I appreciate greatly for giving C programmers easy access to high-quality basic data structures (I’ve contributed some significant code to it as well). Anyway, the downside of using this hash table is that it’s a bottleneck. The table is made thread-safe by simply wrapping it in a lock, and every operation (lock and unlock) requires locking the table to either insert an entry or remove an entry.

So how could we do this with a more concurrent hash table? I’ve seen two hash table APIs that are concurrent: the lock-free hash in Java that I talked about previously, and the concurrent_hash_map from Intel’s Thread Building Blocks library (which, given that it’s in C++, is something I can actually use).

The way the TBB hash works is that you can perform three basic operations on your hash: find(), insert(), and erase(). When you do either of the first two operations, you can lock that entry in the hash and prevent others from getting at it, or you can access it read-only. The erase function merely takes a key and removes it from the hash table, giving you no access to whatever might have been deleted from the hash table. Worse yet, you cannot erase something that you currently have a lock on, even if it’s a write lock!

Using this hash the way that I currently use the cprops hash is thus impossible. Why? Because erasing things from the TBB hash is ALWAYS a race condition. Put another way, all TBB hash erase operations are “blind erase” operations, when what you really want is “erase if it’s still in an erasable state”. You can never be certain that erasing an entry from the hash table is a good idea, because you can never be certain that someone else didn’t add something important to that entry in between the time that you decided the entry was erasable and the time you actually erased it. If I insert a value (to “lock” an address, say), I can associate that value with a queue of waiting threads (i.e. other threads that also want to lock that address), but I can never erase that entry in the hash table! The reason is that since I can’t erase something that I have access to (i.e. have a write-lock on), there’s a race condition between me fetching the contents of that hash table entry and me removing that entry from the hash table.

A different approach to this might be to simply never remove entries from the hash table, and to simply say that if the associated list of threads is empty (or NULL), then the lock is unlocked. That would work well, except for that tiny little problem of the hash table eternally growing and never reclaiming memory from unused entries. So, if I had an application that created lots of locks all over the place (i.e. inserted lots of different entries into the hash), but never had more than a handful locked (i.e. in the hash) at a time, I’d be wasting memory (and possibly, LOTS of it).

Is there another way to use such a hash table to implement locks more efficiently? I don’t know, but I don’t think so (I’d love to be proved wrong). Any way you slice it, you come back to the problem of deleting things that are in a deletable state, but not knowing if it’s safe to do so.

The Azul Java-only hash is an interesting hash that behaves differently. It is based upon compare-and-swap (CAS) atomic operations. Thus, for a given key, you can atomically read the contents of a value, but there’s no guarantee that that value isn’t changed the MOMENT you read it. Deleting an entry, in this case, means swapping a tombstone marker into place where the entry’s value is supposed to be, which you can avoid doing if that value changed before you got to the swap part (the C of the CAS). Thus, after you’ve extracted the last thread that’d locked that address (i.e. you’ve set the value to NULL) you can avoid marking a thing as “deleted” when it has really just been re-locked because if the value changed to non-NULL (and the compare part of the CAS fails), you can simply ignore the failure and assume that whoever changed it knew what they were doing. Thus, you CAN safely delete elements from the hash table. Better still, it easily integrates with (and may even require) a lock-free CAS-based linked list for queueing blocked threads. (You may be saying to yourself “um, dude, a hash table entry with a tombstone as a value is still taking up memory”, and I say to you: yeah? so? they get trimmed out of the hash table whenever the hash table is resized, thereby being an awesome idea.)

And, as I think about it, forcing users to do blind erases makes Intel TBB hash tables ALMOST unusable for an entire class of problems and/or algorithms. That category of algorithms is any algorithm that needs to delete entries that could potentially be added back at any time. They really ought to provide an equivalent of a CAS: let the user say “delete this hash entry if the value is equal to this”.

I say “ALMOST unusable” because it’s fixable. Consider the ramifications of having control over the comparison and equivalence functions: a key can be associated with a “deletable” flag that provides much of the needed functionality. With such a flag, the result of any find() operation can be considered invalid not only if it returns false but also if the deletable flag associated with the result’s key is true. Essentially, finding something in the hash becomes:

while (hash.find(result, &findme) && result->first->deletable) {
    result->release();
}

It’s an extra layer of indirection, and can cause something to spin once or twice, but it works. Your comparison struct functions must then be something like this:

typedef struct evilptrabstraction {
    bool deletable;
    void * key;
} epa_s;

typedef epa_s * epa;

struct EPAHashCompare {
    static size_t hash(const epa &x) {
        return (size_t)x->key; // or a more complex hash
    }
    static bool equal (const epa &x, const epa &y) {
        if (x->deletable && y->deletable) return true;
        if (x->deletable || y->deletable) return false;
        return x->key == y->key;
    }
};

Note that anything marked deletable is equivalent, but doesn’t match anything non-deletable. Thus, safely deleting something becomes the following (assuming findme is a version of the epa struct not marked deletable):

accessor *result = new accessor();

bool found = hash.find(*result, &findme);
while (found && (*result)->first->deletable)  {
    (*result)->release();
    found = hash.find(*result, &findme);
}

if (found) {
    (*result)->first->deletable = true;
    delete result; // release the lock
    findme.deletable = true;
    hash.erase(&findme);
} else {
    delete result;
}

This opens the question of inserting safely, though, because during the insertion process, your inserted object might have already existed, and if it already existed, it may have been in the process of being deleted (i.e. it might have been marked as deleted). There’s the potential that your “freshly-inserted” item got marked deletable if it was already in the hash. So how do you insert safely?

bool inserted = hash.insert(result, insertme);
// !inserted means insertme was already in the hash
while (!inserted && result->first->deletable) {
    result.release();
    inserted = hash.insert(result, insertme);
}
if (!inserted) delete insertme;

Note that we can’t simply toggle the deletable mark, because an erase() operation may already be waiting for the hash value, and it doesn’t expect that the key for the item may have changed while it was waiting for the item to be locked (so changing the deletable flag won’t stop it from being erased). The downside, of course, is that popularly erased/re-inserted items may cause a fair bit of memory churn, but that’s unavoidable with the TBB’s bare-bones erase() interface.

January 11, 2008

Apple's Compiler Idiocy

This is something that’s been bugging me for a while here, and I might as well write it down since I finally found a solution.

I have an atomic-increment function. To make it actually atomic, it uses assembly. Here’s the PPC version:

static inline int atomic_inc(int * operand)
{
    int retval;
    register unsigned int incrd = incrd; // silence initialization complaints
    asm volatile ("1:\n\t"
                  "lwarx  %0,0,%1\n\t" /* reserve operand into retval */
                  "addi   %2,%0,1\n\t" /* increment */
                  "stwcx. %2,0,%1\n\t" /* un-reserve operand */
                  "bne-   1b\n\t" /* if it failed, try again */
                  "isync" /* make sure it wasn't all just a dream */
                  :"=&r" (retval)
                  :"r" (operand), "r" (incrd)
                  :"cc","memory");
    return retval;
}

Now, what exactly is wrong with that, eh? This works great on Linux. The general GCC compiles this just fine, as does the PGI compiler, IBM’s compiler, and Intel’s compiler.

Apple’s compiler? Here’s the error I get:

gcc -c test.c
/var/tmp/ccqu2RmV.s:5949:Parameter error: r0 not allowed for parameter 2 (code as 0 not r0)

Okay, so, some kind of monkey business is going on. What does this look like in the .S file?

1:
    lwarx r0,0,r2
    addi   r3,r0,1
    stwcx. r3,0,r2
    bne-   1b
    isync
    mr r3,r0

It decided (retval) was going to be r0! Even though that’s apparently not allowed! (FYI it’s the addi that generates the error).

The correct workaround is to use the barely documented “b” option, like this:

static inline int atomic_inc(int * operand)
{
    int retval;
    register unsigned int incrd = incrd; // silence initialization complaints
    asm volatile ("1:\n\t"
                  "lwarx  %0,0,%1\n\t" /* reserve operand into retval */
                  "addi   %2,%0,1\n\t" /* increment */
                  "stwcx. %2,0,%1\n\t" /* un-reserve operand */
                  "bne-   1b\n\t" /* if it failed, try again */
                  "isync" /* make sure it wasn't all just a dream */
                  :"=&b" (retval) /* note the b instead of the r */
                  :"r" (operand), "r" (incrd)
                  :"cc","memory");
    return retval;
}

That ensures, on PPC machines, that the value is a “base” register (aka not r0).

How gcc on Linux gets it right all the time, I have no idea. But it does.

August 18, 2008

More Compiler Complaints: Sparc Edition

Unlike my previous whining about compilers, this one I have no explanation for. It’s not me specifying things incorrectly, it’s just the compiler being broken.

So, here’s the goal: atomically increment a variable. On a Sparc (specifically, SparcV9), the function looks something like this:

static inline int atomic_inc(int * operand)
{
    register uint32_t oldval, newval;
    newval = *operand;
    do {
        oldval = newval;
        newval++;
        __asm__ __volatile__ ("cas [%1], %2, %0"
            : "=&r" (newval)
            : "r" (operand), "r"(oldval)
            : "cc", "memory");
    } while (oldval != newval);
    return oldval+1;
}

Seems trivial, right? We use the CAS instruction (compare and swap). Conveniently, whenever the comparison fails, it stores the value of *operand in the second register (i.e. %0 aka newval), so there are no extraneous memory operations in this little loop. Right? Right. Does it work? NO.

Let’s take a look at the assembly that the compiler (gcc) generates with -O2 optimization:

save    %sp, -0x60, %sp
ld      [%i0], %i5      /* newval = *operand; */
mov     %i0, %o1        /* operand is copied into %o1 */
mov     %i5, %o2        /* oldval = newval; */
cas     [%o1], %o2, %o0
ret
restore %i5, 0x1, %o0

Say what? Does that have ANYTHING to do with what I told it? Nope! Of course, gcc is awful, you say! Use SUN’s compiler! Sorry, it’s the EXACT SAME output.

But let’s be a bit more explicit about the fact that the newval register is an input to the assembly block:

static inline int atomic_inc(int * operand)
{
    register uint32_t oldval, newval;
    newval = *operand;
    do {
        oldval = newval;
        newval++;
        __asm__ __volatile__ ("cas [%1], %2, %0"
            : "=&r" (newval)
            : "r" (operand), "r"(oldval), "0"(newval)
            : "cc", "memory");
    } while (oldval != newval);
    return oldval+1;
}

Now, Sun’s compiler complains: warning: parameter in inline asm statement unused: %3. But at least, gcc leaves the add operation in:

save    %sp, -0x60, %sp
ld      [%i0], %i5      /* oldval = *operand; */
mov     %i0, %o1        /* operand is copied to %o1 */
add     %i5, 0x1, %o0   /* newval = oldval + 1; */
mov     %i5, %o2        /* oldval is copied to %o2 */
cas     [%o1], %o2, %o0
ret
restore %i5, 0x1, %o0

Still, though, the do{ }while() loop was optimized away, because the compiler refuses to acknowledge that newval can change values! Sun’s compiler will indeed leave the while loop in, but will often use the WRONG REGISTER for comparison (such as %i2 instead of %o0).

But check out this minor change:

static inline int atomic_inc(int * operand)
{
    register uint32_t oldval, newval;
    do {
        newval = *operand;
        oldval = newval;
        newval++;
        __asm__ __volatile__ ("cas [%1], %2, %0"
            : "=&r" (newval)
            : "r" (operand), "r"(oldval), "0"(newval)
            : "cc", "memory");
    } while (oldval != newval);
    return oldval+1;
}

Note that, rather than using the output of the cas instruction, we’re throwing it away and re-reading operand no matter what. And what happens:

save     %sp, -0x60, %sp
ld       [%i0], %i5           /* oldval = *operand; */
add      %i5, 0x1, %o0        /* newval = oldval + 1; */
mov      %i0, %o1             /* operand is copied to %o1 */
mov      %i5, %o2             /* oldval is copied to %o2 */
cas      [%o1], %o2, %o0
cmp      %i5, %o0             /* if (oldval != newval) */
bne,a,pt %icc, atomic_inc+0x8 /* then go back and try again */
ld       [%i0], %i5
ret
restore  %i5, 0x1, %o0

AHA! The while loop returns! And best of all, both GCC and Sun’s compiler suddenly, magically, use the right registers for the loop comparison! It’s amazing!

And yet, also completely idiotic. So, we can get it to work… but we have to be inefficient in order to do it, because otherwise (inexplicably) the compiler refuses to acknowledge that our output register can change.

In case you’re curious, the gcc version is:
sparc-sun-solaris2.10-gcc (GCC) 4.0.4 (gccfss)
and the Sun compiler is:
cc: Sun C 5.9 SunOS_sparc 2007/05/03

About Research

This page contains an archive of all entries posted to Kyle in the Research category. They are listed from oldest to newest.

Religion is the previous category.

Many more can be found on the main index page or by looking through the archives.

Creative Commons License
This weblog is licensed under a Creative Commons License.
Powered by
Movable Type 3.34