Qmail Patch Explanations

This is frequently described as a "bug" in qmail. It's not. Let me explain...

40x errors mean, in essence, "try again later". We can quibble over whether that's the smartest thing to do, but as-such, qmail's behavior is consistent with the RFC. The usual complaint is to point out that RFC 2821 says (in section 5):

To provide reliable mail transmission, the SMTP client MUST be able to try (and retry) each of the relevant addresses in this list in order, until a delivery attempt succeeds.

Qmail is able, and does so if the lowest-preference MX is unreachable. The RFC does not specify the relationship between failure type and MX choice (for example, the language it uses in section 4.2.1 suggests that retrying within the same connection is acceptable), so again there, qmail is consistent with the RFC's stricture. I think there's a real point to be made out of the fact that the RFC authors require that the client must be able to try other MX entries, rather than saying that the client must always try other MX entries. It leaves plenty of room for choosing different retry policies based on the type of error.

The policy at issue here, however, is regarding errors given as part of the greeting. When qmail connects to a mail server and is greeted with an error (instead of "250 Hi there!"), how should that be interpreted, and how should it be handled? What should it mean for that particular message you're trying to deliver? (This is (primarily) what Matthias Andree's patch changes.)

Some (vocal) mail administrators seem to be of the opinion that a greeting error is a resonable thing to use to indicate an overload situation (for example, that the server is overwhelmed by a spam attack, and cannot handle additional email at the moment). But consider: is this a reasonable thing to do? If a server cannot (or will not) accept email, and this is known at connection time, why accept the connection? It wastes bandwidth, it wastes server resources, it wastes time. Why would anyone use scarce resources—in the middle of being overloaded—to tell senders about it? Why accept a connection when you cannot accept email? It's more efficient to simply refuse the connection. Imagine if taxicabs worked on the same principle. When they're hired and full, they cannot accept new riders. The easiest and most direct way of not accepting new riders is to ignore the folks on the sidewalk waving at the taxi. The idea that the taxi would pull over to tell them "sorry, I'm busy" seems downright goofy (almost as if the taxi driver is taunting the people on the sidewalk). Similarly, if a server is overloaded, the idea that it would accept connections for the sole purpose of telling the sender "sorry, I'm busy" also seems goofy. If you're busy, you should be using your resources to do your job rather than using them to tell everyone how terribly busy you are. So it seems reasonable to conclude that if a server is willing to accept new connections, then it's probably not overloaded.

However, the more important issue is that when a server KNOWS it cannot accept email for whatever reason, what should it do? First it needs to decide if it wants that email delivery attempt be retried immediately or later? And, in either case, should the sender re-contact the most-preferable MX (i.e. the one that is currently overloaded) or a less-preferable MX when it retries? And once the answers to those questions are determined, what is the correct (and/or most reliable) means of expressing that intent?

Answering the latter question requires answering another question first: what do backup MX records mean? Overload situations (or, more typically, spammers) are NOT the reason to have secondary/backup MX servers (with higher MX priorities). If you have multiple servers available to deal with high load, why not assign them all the same priority and use them ALL during low-load situations as well (e.g. to decrease latency)? Waiting for one server to become overloaded before involving another (that you already had available) is a lousy management policy because it leads to wasted resources (read: wasted money) during normal-load situations (i.e. most of the time). Additionally, overload is particularly undesirable because it leads to slow response time; it's better to avoid overload completely—if you can—by using all the resources at your disposal. Using backup servers to handle overload is not impossible, obviously, because some people do it, but it is not a smart use of resources.

So, what IS the reason to have secondary MX servers? Since SMTP-compliant senders are already required to queue undeliverable messages and retry later, the primary benefit of a backup MX is to reduce latency in catastrophic situations. For example, you may have an arrangement where you can tell the backup MX to deliver all of the messages it's holding for you in a single block immediately after your primary email system comes back online. That way the messages get delivered as soon as you're ready, rather than waiting for all the myriad of senders to realize that you're back online and retry—depending on their retry schedules, that could take hours. So what would you use as a backup MX? If it's just another server you have in the machine room, there's no reason not to use it as one of your primary mail servers... To reiterate, there has to be a reason why the backup is less preferable and is not part of your primary mail system. By knowing that, we can know what sort of penalty is associated with using a backup MX, and how much effort should be put into avoiding paying that penalty.

In my opinion, using the MX preference ordering system as an overload failover system is a bad setup. The best way to handle overload is to avoid ever being overloaded, rather than by arranging a failover. Additionally, it seems wiser to assume that the admin is smart and isn't using the MX ordering as a bad overload compensation technique. We ought to assume that the admin is using the MX preference for a good reason (such as "the backup is an arrangement we have with another company in case of emergencies") rather than a bad reason (such as "the admin doesn't know what he's doing"). Thus, it is reasonable to avoid using the backup MX records unless the primary cannot be contacted at all.

Now then, if one of the primary MX servers cannot accept additional email and intends for deliveries to be retried, what is the most effective and efficient means of ensuring that deliveries are retried to the next-most-preferable MX record? Simple: refuse to accept the connection. Accepting the connection just to emit an error message is not only wasteful, but unreliable. At best, it depends on a particular interpretation of a "proposed standard" revision of the email RFC that not everyone agrees on.

For what it's worth, qmail always tries to get the most-preferable MX. Unreachable IPs, however, get added to a 1-hour do-not-try list (see qmail-tcpto man page; it's slightly more complicated, but not much). If the lowest MX was unreachable, qmail won't retry that unreachable IP for an hour, and so will retry the higher MXs. After the one hour timeout, qmail will allow itself to retry the lower MX to see if it came back. Since the standard retry schedule has two retries in less than an hour (one 6 minutes after the first, and the next 26 minutes after the first), if the lowest MX was unreachable, the next two retries will be to the higher MX, but the fourth delivery attempt will start with the lowest MX again.

Hide Comment

#!/bin/sh GoodRecipient=0 BadRecipient=100 Error=120 User=$( echo "$RECIPIENT" | cut -d@ -f1 ) Domain=$( echo "$RECIPIENT" | cut -d@ -f2- ) if ! type id >/dev/null 2>&1 ; then # id program not in PATH exit $Error fi if id "$User" >/dev/null 2>&1 ; then exit $GoodRecipient else exit $BadRecipient fi

Qmail Patch Explanations and Recommendations

Contents

Introduction

Minimum Patches

Small Site Patches

Patches for Big Sites