Kyle: Research Archives

Research Idea

In the Lord of the Rings, the body of the actor was used to map and control the body of Gollum in a realistic fashion. The same was not done with his face, but it could be - although perhaps not to the same amount of detail.

Now imagine a simple line drawing of a face. How detailed would a camera's model of a human face need to be in order to extract just that much information from the face - the size of the smile, dimples, the angle and placement of the brows and perhaps lines on the forehead. Could this be extracted using a simple webcam?

How cool would it be to have this cariacature of a face (simple smiley line drawing) represent the other person in IM conversations, as a two way conversation? How much data would need to be sent in order to get decent responsiveness?

How well could you map this line drawing to a 3D face, that merely morphs along the same control lines as this line-drawing face?

How possible is this?

Posted by Kyle Wheeler on May 2, 2003 8:42 AM | Permalink | Comments (0)

Academic Writing...

Academic writing is obtuse, yes. But it can also be fun! see?

Posted by Kyle Wheeler on February 11, 2004 12:04 PM | Permalink | Comments (0)

Misprediction

Why is misprediction not a word?

Posted by Kyle Wheeler on March 14, 2005 10:29 PM | Permalink | Comments (0) | TrackBacks (0)

Why Are Compilers So Idiotic?

It astonishes me how idiotic some compilers can be.

I’ve been working with some unusual compilers recently: the PGI compilers from the Portland Group, and the XL compilers from IBM. I’ve been attempting to get them to compile some standard libraries for testing parallel operations. For example, I want them both to compile LAM-MPI, and BLAS, and LAPACK.

In fighting with them to get them to function, I’ve discovered all sorts of quirks. To vent my frustrations, I am documenting them here. Specifically, IBM’s XL compilers today.

I’m compiling things with the arguments -O4 -qstrict -qarch=ppc970 -Q -qcpluscmt. This takes some explanation. I’m using -O4 instead of -O5 because with the latter, the LAPACK libraries segfault. That’s right. Fortran code, with nary a pointer in sight, segfaults. How that happens is beyond me. The -qarch=ppc970 flag is because, without it, code segfaults. What does this mean? This means that the compiler can’t figure out what cpu it’s running on (which, hey, I’ll give them a pass on that one: I’m not running this compiler on a “supported” distribution of Linux) and is inserting not only bad code but bad pointer dereferences (HUH?!?).

When compiling LAPACK, you’re going to discover that the standard Fortran function ETIME, which LAPACK uses, doesn’t exist in XL-world. Instead, they decided it would be more useful to have an ETIME_ function. See the underscore? That was a beautiful addition, wasn’t it? I feel better already.

While compiling LAM with any sort of interesting optimization (the benefits of which are unclear in LAM’s case), you’re going to discover that XL’s -qipa flag (which is implicitly turned on by -O3 and above) can cause extremely long compile times for some files. How extreme? I’m talking over an hour on a 2Ghz PPC with several gigabytes of RAM. But don’t worry! Even though it looks like the compiler is stuck in an infinite loop, it’s really not, and will actually finish if you give it enough time. Or you could just compile LAM without optimization, it’s up to you.

Next nifty discovery: some genius at IBM decided that all inline functions MUST be static. They have to be, otherwise the world just comes apart at the seams. Nevermind the fact that the C standard defines the inline keyword as a hint to the compiler, and specifically forbids the compiler from changing the semantics of the language. What does this matter? A common, sensible thing a library can do is to define two init functions, like so:

inline int init_with_options(int foo, int bar, int baz)
{
        ...do stuff...
        return n;
}

int init(void)
{
        return init_with_options(0, 0, 0);
}

Now what do you suppose the author of such code intends? You guessed it! He wants to let the compiler know that dumping the contents of init_with_options() into init() is a fine thing to do. The author is not trying to tell the compiler “nobody will ever call init_with_options().” But that’s what the XL compilers think the author is saying. Better still, the documentation for XL explains that there’s a compiler option that may help: -qnostaticinline “Wow!” you say to yourself, that sounds promising! Nope. The option doesn’t seem to do a thing. You should have been clued in by the fact that the documentation says that that option is on by default. No, sorry, all inline functions are static, and there’s nothing you can do about it. If you didn’t want them static, you shouldn’t have given such hints to the compiler.

Here’s another good one: what does the user mean when he specifies the -M compiler flag? Well, let’s think about this. The documentation for that option says:

Creates an output file that contains information to be included in a “make” description file. This is equivalent to specifying -qmakedep without a suboption.

Now, what do you think -M really does? Oh, it does what it says, alright: creates a .d file. But it also doesn’t stop the compiler from actually attempting to COMPILE the file. So, now that you’ve got your dependencies built so that you know in what order to compile things, you tell make to have another go at building things. But what’s this? It’s already been compiled (incorrectly!)! Joy! My workaround is to run the compiler like so:

rm -f /tmp/foo
xlc -M -c -o /tmp/foo file.c

Now, when gcc and other compilers handle the -M flag, they spit out dependencies to stdout, rather than creating a file. Many complex Makefiles that you really don’t want to go mutzing with rely on that behavior. How do we get XL to do the same? Here’s one you wouldn’t have suspected: -MF/dev/stdout What’s the -MF flag supposed to do? Modify an existing Makefile, that’s what. See, isn’t that an excellent idea?

Speaking of excellent ideas, IBM decided that the C language needed some extensions. And I can’t begrudge them that; everybody does it. Among the extensions they added was a __align() directive, along the lines of sizeof(), that allows you to specify the alignment of variables that you create. You’d use it like so:

int __align(8) foo;

Unfortunately, in the standard pthread library headers, there are several structs defined that look like this:

struct something {
        void * __func;
        int __align;
}

You see the problem? Of course, there’s no way to tell XL to turn off the __align() extension. You would think that using -qlanglvl might do it, because it supposedly allows you to specify “strict K&R C conformance”. You’d be wrong. Your only option is to edit the headers and rename the variable.

Other ways in which XL tries to be “intelligent” but just ends up being idiotic is it’s handling of GCC’s __extension__ keyword. For example, in the pthreads headers, there is a function that looks like this:

int pthread_cancel_exists(void)
{
        static void * pointer =
        __extension__ (void *) pthread_cancel;
        return pointer != 0;
}

The reason GCC puts __extension__ there is because pthread_cancel may not exist, and it wants to have pointer be set to null in that case. Normally, however, if you attempt to point to a symbol that doesn’t exist, you’ll get a linking error. XL, of course, barfs when it sees this, but not in the way you think. XL attempts to be smart and recognize common uses of __extension__. Somehow, somewhere, the error you get will be:

found ), expected {

Really makes you wonder what the heck it thinks is going on there, doesn’t it? The solution? Remove “__extension__” and it works fine.

That’s all for now. I’m sure I’ll find more.

Posted by Kyle Wheeler on September 7, 2006 9:40 AM | Permalink | Comments (0) | TrackBacks (0)

Concurrent Hash Table Tricks

So, I’m working on qthreads (which is open-sourced, but currently lacks a webpage), and thinking about its Unix implementation.

The Unix implementation emulates initialization-free synchronization (address locks and FEBs) by storing addresses in a hash table (well, okay, a striped hash table, but if we make the stripe 1, then it’s just a hash table). Let’s take the simplest: address locks. The semantics of the hash table at the moment are really pretty basic: if an address is in the hash, it’s locked. If it’s not in the hash, it’s not locked. The hash is the cp_hashtable from libcprops, a library which I appreciate greatly for giving C programmers easy access to high-quality basic data structures (I’ve contributed some significant code to it as well). Anyway, the downside of using this hash table is that it’s a bottleneck. The table is made thread-safe by simply wrapping it in a lock, and every operation (lock and unlock) requires locking the table to either insert an entry or remove an entry.

So how could we do this with a more concurrent hash table? I’ve seen two hash table APIs that are concurrent: the lock-free hash in Java that I talked about previously, and the concurrent_hash_map from Intel’s Thread Building Blocks library (which, given that it’s in C++, is something I can actually use).

The way the TBB hash works is that you can perform three basic operations on your hash: find(), insert(), and erase(). When you do either of the first two operations, you can lock that entry in the hash and prevent others from getting at it, or you can access it read-only. The erase function merely takes a key and removes it from the hash table, giving you no access to whatever might have been deleted from the hash table. Worse yet, you cannot erase something that you currently have a lock on, even if it’s a write lock!

Using this hash the way that I currently use the cprops hash is thus impossible. Why? Because erasing things from the TBB hash is ALWAYS a race condition. Put another way, all TBB hash erase operations are “blind erase” operations, when what you really want is “erase if it’s still in an erasable state”. You can never be certain that erasing an entry from the hash table is a good idea, because you can never be certain that someone else didn’t add something important to that entry in between the time that you decided the entry was erasable and the time you actually erased it. If I insert a value (to “lock” an address, say), I can associate that value with a queue of waiting threads (i.e. other threads that also want to lock that address), but I can never erase that entry in the hash table! The reason is that since I can’t erase something that I have access to (i.e. have a write-lock on), there’s a race condition between me fetching the contents of that hash table entry and me removing that entry from the hash table.

A different approach to this might be to simply never remove entries from the hash table, and to simply say that if the associated list of threads is empty (or NULL), then the lock is unlocked. That would work well, except for that tiny little problem of the hash table eternally growing and never reclaiming memory from unused entries. So, if I had an application that created lots of locks all over the place (i.e. inserted lots of different entries into the hash), but never had more than a handful locked (i.e. in the hash) at a time, I’d be wasting memory (and possibly, LOTS of it).

Is there another way to use such a hash table to implement locks more efficiently? I don’t know, but I don’t think so (I’d love to be proved wrong). Any way you slice it, you come back to the problem of deleting things that are in a deletable state, but not knowing if it’s safe to do so.

The Azul Java-only hash is an interesting hash that behaves differently. It is based upon compare-and-swap (CAS) atomic operations. Thus, for a given key, you can atomically read the contents of a value, but there’s no guarantee that that value isn’t changed the MOMENT you read it. Deleting an entry, in this case, means swapping a tombstone marker into place where the entry’s value is supposed to be, which you can avoid doing if that value changed before you got to the swap part (the C of the CAS). Thus, after you’ve extracted the last thread that’d locked that address (i.e. you’ve set the value to NULL) you can avoid marking a thing as “deleted” when it has really just been re-locked because if the value changed to non-NULL (and the compare part of the CAS fails), you can simply ignore the failure and assume that whoever changed it knew what they were doing. Thus, you CAN safely delete elements from the hash table. Better still, it easily integrates with (and may even require) a lock-free CAS-based linked list for queueing blocked threads. (You may be saying to yourself “um, dude, a hash table entry with a tombstone as a value is still taking up memory”, and I say to you: yeah? so? they get trimmed out of the hash table whenever the hash table is resized, thereby being an awesome idea.)

And, as I think about it, forcing users to do blind erases makes Intel TBB hash tables ALMOST unusable for an entire class of problems and/or algorithms. That category of algorithms is any algorithm that needs to delete entries that could potentially be added back at any time. They really ought to provide an equivalent of a CAS: let the user say “delete this hash entry if the value is equal to this”.

I say “ALMOST unusable” because it’s fixable. Consider the ramifications of having control over the comparison and equivalence functions: a key can be associated with a “deletable” flag that provides much of the needed functionality. With such a flag, the result of any find() operation can be considered invalid not only if it returns false but also if the deletable flag associated with the result’s key is true. Essentially, finding something in the hash becomes:

while (hash.find(result, &findme) && result->first->deletable) {
    result->release();
}

It’s an extra layer of indirection, and can cause something to spin once or twice, but it works. Your comparison struct functions must then be something like this:

typedef struct evilptrabstraction {
    bool deletable;
    void * key;
} epa_s;

typedef epa_s * epa;

struct EPAHashCompare {
    static size_t hash(const epa &x) {
        return (size_t)x->key; // or a more complex hash
    }
    static bool equal (const epa &x, const epa &y) {
        if (x->deletable && y->deletable) return true;
        if (x->deletable || y->deletable) return false;
        return x->key == y->key;
    }
};

Note that anything marked deletable is equivalent, but doesn’t match anything non-deletable. Thus, safely deleting something becomes the following (assuming findme is a version of the epa struct not marked deletable):

accessor *result = new accessor();

bool found = hash.find(*result, &findme);
while (found && (*result)->first->deletable)  {
    (*result)->release();
    found = hash.find(*result, &findme);
}

if (found) {
    (*result)->first->deletable = true;
    delete result; // release the lock
    findme.deletable = true;
    hash.erase(&findme);
} else {
    delete result;
}

This opens the question of inserting safely, though, because during the insertion process, your inserted object might have already existed, and if it already existed, it may have been in the process of being deleted (i.e. it might have been marked as deleted). There’s the potential that your “freshly-inserted” item got marked deletable if it was already in the hash. So how do you insert safely?

bool inserted = hash.insert(result, insertme);
// !inserted means insertme was already in the hash
while (!inserted && result->first->deletable) {
    result.release();
    inserted = hash.insert(result, insertme);
}
if (!inserted) delete insertme;

Note that we can’t simply toggle the deletable mark, because an erase() operation may already be waiting for the hash value, and it doesn’t expect that the key for the item may have changed while it was waiting for the item to be locked (so changing the deletable flag won’t stop it from being erased). The downside, of course, is that popularly erased/re-inserted items may cause a fair bit of memory churn, but that’s unavoidable with the TBB’s bare-bones erase() interface.

Posted by Kyle Wheeler on October 9, 2007 2:50 PM | Permalink | Comments (3) | TrackBacks (0)

A C Lock-Free Hash Table Implementation

Well, I finally have one: a lock-free hash table implementation in C. The hash table I implemented is based on the work by Ori Shalev and Nir Shavit. Much of the code is similar to the example code they published in their paper Split-Ordered Lists: Lock-Free Extensible Hash Tables, back in 2006, but has been modified to support additional conveniences so it has a library-esque interface.

There is one problem unique to this algorithm: It doesn’t compact well if you go from lots and lots of entries back down to just a few entries. Not that it compacts horribly, but it doesn’t shrink all the way back down; it keeps some meta-data around (what the algorithm calls “dummy keys”). On the other hand, it does quite well if you tend to have a rapidly rotating set of keys (even with random-ish key values) but with a ceiling on the number of them in the hash table at any one time: it uses a static set of dummy keys in that case. In an effort to reduce the impact of the compaction problem, I didn’t implement the dynamic-array extension, so this implementation is performance-limited to around 2048 concurrent keys or so (though you can insert as many as you like, more than that will begin to get slow).

Of course, there are three problems with this algorithm (and implementation) that are pretty typical:

It has the ABA problem. I think this can be solved (or at least mitigated) with generation counters, so I don’t see it as a big issue.
It has the blind-delete problem. This is a really tough one. I think this can be solved, but my current best thinking on the matter ends up devolving to a highly-contended CAS operation, which is obviously sub-optimal.
For simplicity, I didn’t use memory pools for the nodes, so I end up doing malloc/free calls on insertion and delete. This isn’t a correctness problem, but can become a performance bottleneck.

Anyway, without further ado, here’s the code. You can download it and compile it yourself; it is self-contained C99 code with a trivial (single-threaded) driver in main().

I’ve talked about concurrent hash tables before, here and here, and those discussions may be interesting to anyone who found this code interesting and/or useful. Having re-read Chris Purcells’ paper, I should point out that while he does have some very worthwhile code, his algorithm also has the compaction problem. He does, however, have a good summary of several designs, and of the drawbacks of each (primarily in the area of a need for garbage collection).

Posted by Kyle Wheeler on June 2, 2011 11:26 PM | Permalink | Comments (0) | TrackBacks (0)

Kyle

High catecholamine levels

Research Archives

May 2, 2003

Research Idea

February 11, 2004

Academic Writing...

March 14, 2005

Misprediction

September 7, 2006

Why Are Compilers So Idiotic?

October 9, 2007

Concurrent Hash Table Tricks

June 2, 2011

A C Lock-Free Hash Table Implementation

Search

About Research