Database Back-end for Large Email Systems

The (mostly) bad idea that just won't die

Contents

Introduction

There has been a recent discussion on the qmail mailing list regarding using a relational database (à la MySQL or Postgres or Oracle) as the back end storage for email. This is an idea that comes up many times, in many contexts, and I am writing this to hopefully educate those who have come up with the same idea again. Thanks to Charles Cazabon and the rest of the qmail mailing list for the ideas contained here.

The primary motivation typically behind desires to store email in a database tends to be a desire to search through email quickly. This is an understandable desire, and the urge to go to a database is obvious: databases are built for searching (aka. retrieving data with desired characteristics). Delivering mail to a database has been done many times by many people. One example is the DBMail project. Another example is the Microsoft Exchange server, which uses a relational database to store everything.

Let's be clear about something: a filesystem is a type of database. In the realm of information theory, any collection of data is a database. The difference between a filesystem and a generic database is the complexity of the organization schemes. The complexity becomes important for two reasons: complex systems are more difficult to map to disk operations, and require more computational overhead. The latter is becoming less important, as overpowered CPUs take advantage of the relative glacial speed of disks, however, disk I/O can be clustered and cached, while CPU operations generally cannot (this affects the scalability of both). In any case, when translating from one data organization scheme to another, there is overhead. Think of it as encoding: when transforming a binary file into a BinHex'd string of text, there is spacial and computational overhead, even though both are functionally identical and contain the same information.

However, this is not to suggest that it is impossible to improve on a filesystem for mail storage; highly customized, optimized storage mechanisms (possibly that incorporate features from traditional relational databases) could be extremely good at mail storage. However, traditional relational databases are not the way to go for scalable email storage. For more information about the utility of flat files, read this O'Reilly Radar from May 2006.

In any case, I am not trying to argue about the general information-theoretic concept of a database. For the purposes of this document, when I refer to a "database", I am referring to a traditional relational database. I have no doubt that a database (in the information-theoretic sense) can be designed and implemented that is more efficient for storing email than current filesystems. However, filesystems are more capable than most people give them credit for, and unless you have very particular, very unusual, very specific requirements that justify using a database and which overcome the drawbacks inherent in using a database (or are willing to design a custom database implementation). Storing email in a relational database rather than a file-system is usually a bad idea for many reasons, which tend to fall into two categories: Databases Are Bad for Email, and Email is Bad for Databases.

Databases Are Bad for Email

In general, a good email system must have certain specific characteristics:

  1. Reading email must be fast. (goto)
  2. Many users must be able to access their mail at the same time, quickly. (goto)
  3. Backup must be easy. (goto)
  4. Corruption must be easy to recover from. (goto)

Let's take them one at a time:

1. Reading email must be fast.

In the abstract, your server has an email message that you want to retrieve. With email fetching protocols like POP3 and IMAP, messages are identified by a unique ID (UID), and so can be referenced and their contents requested this way.

When a user requests an email message's contents, that email message may be in one of two places: loaded into memory on the server, or on disk on the server. For the moment, let's presume that the message is on disk, and hasn't been recently loaded into memory for whatever reason.

In a simple server using a file-system using (for example) Maildir format for storing messages, the way reading an email works is: I send a request for a message to (for example) my POP3 server, which trivially constructs the path of the file I want (for example, based on the user's home directory and the UID requested), and asks the OS for it. The OS fetches the message (say, mmap()'s it into the server's process space), and returns, at which point the POP3 server sends it to me (which transfers the data back into the OS to talk to the network). That architecture requires the message to be copied at least three times, probably four: once from disk into kernel memory, once from kernel memory into the server's memory, back from the server's memory into the kernel's memory, and from the kernel's memory into the network card's memory. The speed of this operation can be significantly improved if the server and OS both support something more direct, such as the sendfile() interface. If they can, then this operation is much simpler and much faster: the POP3 server tells the OS to send the file containing the email to user, and the OS returns to the POP3 server when it has been done; a process which requires one copy, from the disk to the network interface.

This operation (reading an email) is more difficult if the message is stored in even a simple database (such as an sqlite file). In such a case the POP3 server is forced to do several disk accesses (going in and out of the kernel several times) to find out first where the message is stored in the file by reading the database meta data, and then fetching the message out of the file into a local buffer (reassembling it from the extents of the internal database format), and then sending that buffer out to the client. Reading this meta data is faster if the entire database has been mmap()'d into the server's memory space, and generally this meta data is both conceptually similar and performance-wise similar to the meta data on the filesystem that must be read to locate the file. However, the larger the database's meta data is, the more memory intensive this operation becomes, and thus the slower it becomes. The key detail is that unless the database can store the email contiguously in its internal format (in which case the sendfile()-like interfaces may be used), the message must first be extracted from the database (at least one copy). Sending a message from a buffer requires another, and possibly two, copies: from user space into kernel space, and from kernel space to the network interface.

In essence, because the OS knows the format of the filesystem, it can "cheat" and avoid copies. Unless the database can guarantee that the message is stored contiguously in it's database file, it cannot pull the same trick. This is not impossible, but requires a lot of custom work.

2. Many users must be able to access their mail at the same time, quickly.

The last word of this requirement, "quickly," is the difficult part, of course. Generally, which solution is best depends on what kind of concurrency you're going to need to handle. The solution is different if your user base is 2,000 people checking their email occasionally versus 50,000 people whose mail programs check their email every five minutes.

It's possible to use a poor OS/file-system combination with this, though this sort of thing gets down to very basic sysadmin training. For example, many OS's (e.g. Linux before 2.6) use a single lock for each file-system. This makes dealing with concurrent access slower (and harder) than it needs to be. Choosing a good OS/file-system implementation can improve the available concurrency quite a bit. For example, Linux 2.6 with an XFS partition is capable of far higher concurrency than Linux 2.4 with EXT2 partition. Storing your email on a RAID system can also help improve the available concurrency, depending on what variety of RAID you use (RAID 1 is better than RAID 0, for example, because it allows multiple requests to use different disks for their accesses.

With databases, the questions you need to ask are very similar, and just as file-system concurrency depends on the file-system, database concurrency depends on the database. However, the documentation on these details is often harder to come by. The important questions are all the same for databases, just phrased slightly differently: what's the format of the back-end, what's the front-end like, and what kind of concurrency does it support? What are the performance characteristics of the database? And, of particular importance for databases, is it optimized for insertion, deletion, or for selection? A big database, like Oracle, is going to support more concurrent accesses than something like sqlite, but is likely to require heavy-duty hardware in order to do it. Databases also tend to be optimized for input and output of small chunks of data (less than 1k or so) than a standard email (4k-20k), and the performance (and on-disk format) when storing a typical email should be considered.

3. Backup must be easy.

Backing up a file-system is a very well studied procedure, and is basically trivial. There's the basics of backup (tar, rsync, etc.), and then the more advanced solutions (amanda, etc.).

Backing up a database is significantly more of a pain, especially if you want to be able to do it without taking the database off-line temporarily or want to be able to restore from backup very easily and very selectively. The reason is, of course, the database back-end file is not guaranteed to be in a consistent state at any time while the database is running: the database front-end caches LOTS of things that it hasn't necessarily written back to disk. The back-end file is, in busy databases, frequently in an inconsistent state. The ACID standard for database transactions is designed to remedy this problem in databases that support it, but compliance imposes a performance penalty (not to mention is very difficult to do correctly), which is why the standard's requirements are often ignored or relaxed. (For example, MySQL's InnoDB tables are not fully ACID.)

The easiest safe way to do a backup of, for example, MySQL is to use the "mysql-dump" utility, which outputs the database as a text file full of all of the SQL statements necessary to regenerate the database from scratch. This is not exactly convenient if you want to do something like recover a single message from the backup. Some databases give you things like replication and such, and have complex solutions to the backup problem that function much better than mysql-dump. These solutions are, without exception, extremely complicated and tend to require significant hardware investments (they tend to be part of "high availability" packages).

4. Corruption must be easy to recover from.

There are times when unexpected events, or software bugs, cause data corruption of some kind. A typical example is a power outage or surge, causing your system to shut down suddenly. This sort of thing causes problems primarily because of useful speed enhancements like write-caching. In a file-system, both the meta-data (i.e. information about the files stored) and the files themselves may become out of sync, in which case the corruption may be unrecoverable. Fortunately, this is a well-studied problem for file-systems, and there are several common and easy solutions. For example, journaling file-systems (available for all good server operating systems) ensure that the meta-data is never in an unrecoverable inconsistent state. Additionally, in an email system based on the Maildir storage format, messages are not considered "delivered" until they are fully flushed to disk, making them essentially crash-proof. With a well-designed mail server (such as qmail), even the mail queue is protected against unexpected crashes, which covers all of the places where mail may be lost unexpectedly.

Even if, say, you were using something like an mh back-end or a simple database like an mbox back-end (assuming you're using a journaled file-system), the worst that can happen is that messages in the middle of delivery (in the mh case) or an entire mailbox (in the mbox case) can be corrupted. The rest of the system is isolated from corruption. Additionally, on most OSs, disk caches get flushed every couple of minutes, so the potential corruption is limited even further.

If power goes out on a database, however, you are at significantly more risk. More so than the OS, databases cache many many things, all of which vanish when the power goes out. For the same reason backups are hard, this is essentially instant corruption of your database, which means you're going to have to rely heavily on a high-quality, frequent, backup system for any sort of reliability through unexpected events (power outages, kernel panics, what have you). Many complex databases (e.g. PostgreSQL and Oracle) do journaling to help with this problem. The only other way to address unpredictable events (which is a losing game) is to go to extreme extents like putting your system on battery backups, in an underground waterproof chamber somewhere far from a seismic fault.

File-systems are also more convenient than databases for dealing with problems. Namely, there are many filtering, parsing, searching, and editing utilities for email files that have been available and refined for decades. Storing mail in a database means that you cannot rely on anyone else's solution for pretty much anything tricky. DBMail, for example, gives you utilities for putting mail into the database and for getting mail back out, but doing something like correcting messages that got mangled by a poorly implemented filter, or searching for a nonstandard header (to name a few things off the top of my head), is something you'd have to write and debug your own tools for.

Email is Bad for Databases

While databases are not good for storing email (or at least, aren't as good as file-systems at doing so), email is not particularly good for storing in databases, for several reasons.

Perhaps the most understandable of these reasons is that email is not relational data. While email does have some structured meta-data (like sender, receiver, subject, date, and so forth), it's general structure is as a blob of unstructured text. Standard files in file-systems have a similar amount of meta-data (name, access times, modify times, owner, etc.), and are similarly unsuited for relational databases.

The content of email is essentially unstructured, as I said. This sort of storage is not what relational databases are designed to handle efficiently. The BLOB datatype in relational databases is intended for very limited use, such as encoding a small icon for a data type. While it can be used as the only thing that is stored in a database, databases are not structured for storing generic BLOBs as efficiently as filesystems.

Typical relational databases are designed for storage of large collections of relations between relatively small blocks of data (for example, associating client names, addresses, account number, purchase history, etc.) A typical size for a frequently used database is in the several hundred megabytes. I worked on a system using a database for tracking ticket sales for moderately busy theaters, and the database took several months to get above a hundred megabytes. The volume of email, on the other hand, is typically measured in gigabytes; a medium-sized mail server I administer currently requires approximately an additional gigabyte every month or so to store email. Even a modestly active user can easily accumulate several hundred megabytes of mail, and the storage requirements of email continue to increase with evolving mail uses (such as generic file transfer and Comcast Video Mail). While there are databases that can handle even multiple terabytes of data (such as Oracle), such databases require serious hardware, complex setup, and significant amounts of administration to keep running at their peak.

But What About Searching Speed?

When trying to solve a problem like "searching speed," you should not merely cast about and grab any random buzzword (like "database") and throw out your existing solution in favor of this buzzword. Instead, first carefully examine your current system and determine why it is being slow. What is the common operation that is slow (and no, I don't mean "searches", I mean things like "searching for the word X in the body of all messages" versus "searching for all mails from user X in folder Y" or "searching for all unread mails"). What is the bottleneck? What is the limiting factor?

If your method of accessing the email is a POP3 server, your options are limited: by the POP3 protocol, you can only fetch messages, and then search them on your own time. Storing those messages, on the server, in a database doesn't help with that. Things like "searching speed" typically only apply to mail servers when the mail is stored on the server and you wish to have the server do the work of searching through them rather than the client; as with an IMAP server. Offloading client searching to the IMAP server does, however, make the IMAP server a bottleneck, and requires beefier server hardware than would be necessary otherwise.

It is important to point out that most IMAP mail clients treat mail similarly to POP3 mail: it is downloaded, cached, and searched locally as necessary. Very few clients rely on a central IMAP server for such features, in part because IMAP servers tend to be bad at it, but also because it meshes well with offline functionality (being able to read your old email without connecting to the IMAP server). For clients that download or cache email messages, server searching speed is generally irrelevant: unless the client is poorly designed or does not have the messages cached, it is always faster for the client to search through it's cache than to send the search to the central IMAP server (which most likely has other things to do, like send other people their email). On the other hand, there are valid situations where this is either difficult or impossible. For example, if the majority of email access is done with a webmail client, client message caching is significantly less effective, and server-based searches often become very important.

There are several common reasons that searching on an IMAP server is slow. The most common problem: the mail client is not actually using IMAP's SEARCH command, but is instead fetching every message and searching it by hand (treating the IMAP server like a POP3 server). This is a common behavior of many mail clients, in part because messages are often encoded in strange and unusual ways (PGP-encryption, Base64 encoding, and unusual charsets (like UTF-8) come to mind), but more importantly because they have to do it anyway in order to support offline mail reading. A database back-end will not help with this, just as it will not help with the POP3 protocol. Using a different mail client, on the other hand, will help. On the other hand, there are valid reasons not to use the IMAP SEARCH command; one of them being those unusually encoded emails. Most IMAP servers simply treat email as blobs of ASCII text (rather than decoding each one into an internal representation that can handle all possible unusal characters).

There are other limitations to consider that may be slowing your IMAP server's searches. Maybe the file-system you are using isn't designed for use in a server (i.e. for many essentially random accesses); for example, perhaps you should move to an EXT3 or XFS or ReiserFS file-system instead of FAT32 or NTFS. Maybe your file-system isn't set up properly; for example, turning on read-caching, increasing the size of the read-cache, and/or turning on directory hashing/indexing will usually help a great deal. In some cases, simply adding RAM to your system, thereby allowing your kernel to cache more of the filesystem data, can drastically increase performance. On the other hand, maybe your OS has a poor IO scheduler and you'd be better off switching to either a different scheduler or a different OS; for example, OpenBSD's IO system tends to be very slow, whereas Linux's can be tailored to the types of access patterns on your system. Another possibility is that your storage system (disks, RAID, SAN, NAS, etc.) is the bottleneck: perhaps getting a faster one (or configuring the one you have differently) is the real answer. The default drive settings (configured with hdparm on Linux) are not always optimal: setting the read-ahead size, setting the bit width of requests (16-bit versus 32-bit), enabling DMA transfers and setting the DMA transfer mode, changing the size of read/write buffers, changing the number of sectors that can be transferred per interrupt, altering acoustic management (slower disk speed to reduce noise), tagged queueing (for out-of-order responses), and more can be used to get the most out of your disk drive. If you're using a RAID-based storage, you may not be using the fastest RAID level for your needs. SAN/NAS storage may have it's own problems (for example, you need to upgrade to a faster link between your server and the storage device). These are all changes and improvements that can be made without radically altering your mail setup.

Finally, examine your IMAP server software itself. It may not be able to handle the concurrency you want, or it may simply not be configured properly to handle what you want it to do: never trust the defaults, if your goal is to extract every last drop of performance.

Using Databases to Speed Searches

Many IMAP servers have very simplistic SEARCH implementations; for example, BincIMAP's SEARCH implementation is roughly equivalent to grep. This sort of searching is easy to implement, but is sub-optimal for searches that only need specific, well-identified header information within emails (such as looking for all messages with a specific sender). Even full-text searches can be made faster through judicious use of common indexing techniques. This sort of meta-data about your email is indeed relational data: it is related to the email itself, and relates email together. This is precisely what relational databases are good at. Some IMAP servers, such as CyrusIMAP and to some extent Courier's IMAP server, use small, simple database files as caches and indexes into email collections. Note that they do not normally store the email itself in a database.

Storing such meta-data in a database is often convenient for mail clients as well (which is, for most situations, a better idea anyway). The mutt mail client is a good example of how mail clients handle email. Mutt uses a small database (such as qdbm, gdbm, etc.) to cache and index email header information, which makes searching for messages by their header information very fast. However, mutt also uses a file-based message cache for full messages, rather than putting it into one of these databases.

Benchmarks

Other people have examined this issue. The following benchmarks have been run:

It is important to take all benchmarks with a grain of salt; who runs them and under what conditions is crucial to deciding how accurate and/or applicable it is to your situation.

Conclusion

If you're trying to store things that can easily map to storage as files (like email) in a safe, fast, reliable way, you need a very good reason to avoid using a FILE-system. File-systems have been refined and improved since the dawn of disk drives to do one thing really well: store files. There's basically nothing better suited for the job.

Databases can be used to improve the speed of searching mail, but that doesn't mean that a database should be used to store email. A database can be used as a cache of mail meta-data, and thus provide the indexing features generic filesystems lack without all of the penalties of exclusively using a database. For example, the mutt mail client does this to great effect, as does the Cyrus IMAP server.

In any case, if the answer to your email question is to replace the back-end with a relational database, chances are you're asking the wrong question.

Acknowledgements

Contributors to this page: