he problem of unsolicited e-mail has been increasing for years, but help has arrived. In this article, David discusses and compares several broad approaches to the automatic elimination of unwanted e-mail while introducing and testing some popular tools that follow these approaches. The problem of unsolicited e-mail has been increasing for years, but help has arrived. In this article, David discusses and compares several broad approaches to the automatic elimination of unwanted e-mail while introducing and testing some popular tools that follow these approaches.
Unethical e-mail senders bear little or no cost for mass distribution of messages, yet normal e-mail users are forced to spend time and effort purging fraudulent and otherwise unwanted mail from their mailboxes. In this article, I describe ways that computer code canhelp eliminate unsolicited commercial e-mail, viruses, trojans, and worms, as well as frauds perpetrated electronically and other undesired and troublesome e-mail. In some sense, the final and best solution for eliminating spam will probably take place on a legal level. In the meantime, however, you can do some things from a code perspective that can serve as an interim solution to the problem, until (if ever) the laws begin to evolve at the same rate as public frustration. Considering matters technically — but also with common sense — what is generally called “spam” is somewhat broader than the category “unsolicited commercial e-mail”; spam encompasses all the e-mail that we do not want and that is only very loosely directed at us.
Such messages are not always commercial per se, and some push the limits of what it means to be solicited. For example, we do not want to get viruses (even from our unwary friends); nor do we generally want chain letters, even if they don’t ask for money; nor proselytizing messages from strangers; nor outright attempts to defraud us. In any case, it is usually unambiguous whether a message is spam, and many, many people get the same such e-mails. The problem of unsolicited e-mail has been increasing for years, but help has arrived. In this article, David discusses and compares several broad approaches to the automatic elimination of unwanted e-mail while introducing and testing some popular tools that follow these approaches.Unethical e-mail senders bear little or no cost for mass distribution of messages, yet normal e-mail users are forced to spend time and effort purging fraudulent and otherwise unwanted mail from their mailboxes. In this article, I describe ways that computer code can help eliminate unsolicited commercial e-mail, viruses, trojans, and worms, as well as frauds perpetrated electronically and other undesired and troublesome e-mail. In some sense, the final and best solution for eliminating spam will probably take place on a legal level. In the meantime, however, you can do some things from a code perspective that can serve as an interim solution to the problem, until (if ever) the laws begin to evolve at the same rate as public frustration.
Considering matters technically — but also with common sense — what is generally called “spam” is somewhat broader than the category “unsolicited commercial e-mail”; spam encompasses all the e-mail that we do not want and that is only very loosely directed at us. Such messages are not always commercial per se, and some push the limits of what it means to be solicited. For example, we do not want to get viruses (even from our unwary friends); nor do we generally want chain letters, even if they don’t ask for money; nor proselytizing messages from strangers; nor outright attempts to defraud us. In any case, it is usually unambiguous whether a message is spam, and many, many people get the same such e-mails. The problem with spam is that it tends to sware spams than I did legitimate correspondences. On average, I probably get 10 spams for every appropriate e-mail. In some ways I am unusual — as a public writer, I maintain a widely published e-mail address; moreover, I both welcome and receive frequent correspondence from strangers related to my published writing and to my software libraries. Unfortunately, a letter from a stranger — with who-knows-which e-mail application, OS, native natural language, and so on, is not immediately obvious in its purpose; and spammers try to slip their messages underneath such ambiguities.
My seconds are valuable to me, especially when they are claimed many times during every hour of a day. Hiding contact information:For some e-mail users, a reasonable, sufficient, and very simple approach to avoiding spam is simply to guard e-mail addresses closely. For these people, an e-mail address is something to be revealed only to selected, trusted parties. As extra precautions, an e-mail address can be chosen to avoid easily guessed names and dictionary words, and addresses can be disguised when posting to public areas. We have all seen e-mail addresses cutely encoded in forms like “echo zregm@tabfvf.pk | tr A-Za-z N-ZA-Mn-za-m”. In addition to hiding addresses, a secretive e-mailer often uses one or more of the free e-mail services for “throwaway” addresses. If you need to transact e-mail with a semi-trusted party, a temporary address can be used for a few days, then abandoned along with any spam it might thereafter accumulate. The real “confidantes only” address is kept protected. In my informal survey of discussions of spam on Web-boards, mailing lists, the Usenet, and so on, I’ve found that a category of e-mail users gains sufficient protection from these basic precautions. Looking at filtering software:This article looks at filtering software from a particular perspective. I want to know how well different approaches work in correctly identifying spam as spam and desirable messages as legitimate. For purposes of answering this question, I am not particularly interested in the details of configuring filter applications to work with various Mail Transfer Agents (MTAs). There is certainly a great deal of arcana surrounding the best configuration of MTAs such as Sendmail, QMail, Procmail, Fetchmail, and others. Further, many e-mail clients have their own filtering options and plug-in APIs. Fortunately, most of the filters I look at come with pretty good documentation covering how to configure them with various MTAs. For purposes of my testing, I developed two collections of messages: spam and legitimate. Both collections were taken from mail I actually received in the last couple of months, but I added a significant subset of messages up to several years old to broaden the test. I cannot know exactly what will be contained in next month’s e-mails, but the past provides the best clue to what the future holds. That sounds cryptic, but all I mean is that I do not want to limit the patterns to a few words, phrases, regular expressions, etc. that might characterize the very latest e-mails but fail to generalize to the two types. A general comment on testing is worth emphasizing. False negatives in spam filters just mean that some unwanted messages make it to your inbox. Not a good thing, but not horrible in itself.
False positives are cases where legitimate messages are misidentified as spam. This can potentially be very bad, as some legitimate messages are important, even urgent, in nature, and even those that are merely conversational are ones we do not want to lose. Most filtering software allows you to save rejected messages in temporary folders pending review — but if you need to review a folder full of spam, the usefulness of the software is thereby reduced. 1. Basic structured text filtersThe e-mail client I use has the capability to sort incoming e-mail based on simple strings found in specific header fields, the header in general, and/or in the body. Its capability is very simple and does not even include regular expression matching. Almost all e-mail clients have this much filtering capability. Over the last few months, I have developed a fairly small number of text filters. These few simple filters correctly catch about 80% of the spam I receive. Unfortunately, they also have a relatively high false positive rate — enough that I need to manually examine some of the spam folders from time to time. (I sort probable spam into several different folders, and I save them all to develop message corpora.) Although exact details will differ among users, a general pattern will be useful to most readers: • Set 1: A few people or mailing lists do funny things with their headers that get them flagged on other rules. I catch something in the header (usually the From:) and whitelist it (either to INBOX or somewhere else). • Set 2: In no particular order, I run the following spam filters: o Identify a specific bad sender. o Look for “<>” as the From: header. o Look for “@<” in the header. The training sets are about twice as large. A general comment on testing is worth emphasizing. False negatives in spam filters just mean that some unwanted messages make it to your inbox. Not a good thing, but not horrible in itself. False positives are cases where legitimate messages are misidentified as spam. This can potentially be very bad, as some legitimate messages are important, even urgent, in nature, and even those that are merely conversational are ones we do not want to lose. Most filtering software allows you to save rejected messages in temporary folders pending review — but if you need to review a folder full of spam, the usefulness of the software is thereby reduced. 1. Basic structured text filters: The e-mail client I use has the capability to sort incoming e-mail based on simple strings found in specific header fields, the header in general, and/or in the body. Its capability is very simple and does not even include regular expression matching. Almost all e-mail clients have this much filtering capability.
Over the last few months, I have developedhis for some reason). o Look for “Content-Type: audio”. Nothing I want has this, only virii (your mileage may vary). o Look for “euc-kr” and “ks_c_5601-1987″ in the headers. I can’t read that language, but for some reason I get a huge volume of Korean spam (of course, for an actual Korean reader, this isn’t a good rule). • Set 3: Store messages to known legitimate addresses. I have several such rules, but they all just match a literal To: field. • Set 4: Look for messages that have a legit address in the header, but that weren’t caught by the previous To: filters. I find that when I am only in the Bcc: field, it’s almost always an unsolicited mailing to a list of alphabetically sequential addresses (mertz1@…, mertz37@…, etc). • Set 5: Anything left at this point is probably spam (it probably has forged headers to avoid identification of the sender). 2. Whitelist/verification filters: A fairly aggressive technique for spam filtering is what I would call the “whitelist plus automated verification” approach. There are several tools that implement a whitelist with verification: TDMA is a popular multi-platform open source tool; ChoiceMail is a commercial tool for Windows; most others seem more preliminary. (See Resources later in this article for links.) A whitelist filter connects to an MTA and passes mail only from explicitly approved recipients on to the inbox. Other messages generate a special challenge response to the sender. The whitelist filter’s response contains some kind of unique code that identifies the original message, such as a hash or sequential ID. This challenge message contains instructions for the sender to reply in order to be added to the whitelist (the response message must contain the code generated by the whitelist filter). When a legitimate sender answers a challenge, her/his address is added to the whitelist so that any future messages from the same address are passed through automatically. Although I have not used any of these tools more than experimentally myself, I would expect whitelist/verification filters to be very nearly 100% effective in blocking spam messages. It is conceivable that spammers will start adding challenge responses to their systems, but this could be countered by making challenges slightly more sophisticated (for example, by requiring small human modification to a code). Spammers who respond, moreover, make themselves more easily traceable for people seeking legal remedies against them.
The problem with whitelist/verification filters is the extra burden they place on legitimate senders. Inasmuch as some correspondents may fail to respond to challenges — for any reason — this makes for a type of false positive. In the best case, a slight extra effort is required for legitimate senders. But senders who have unreliable ISPs, picky firewalls, multiple e-mail addresses, non-native understanding of English (or whatever language the challenge is written in), or who simply overlook or cannot be bothered with challenges, may not have their legitimate messages delivered. Moreover, sometimes legitimate “correspondents” are not people at all, but automated response systems with no capability of challenge response. Whitelist/verification filters are likely to require extra efforts to deal with mailing-list signups, online purchases, Web site registrations, and other “robot correspondences”. 3. Distributed adaptive blacklists: Spam is almost by definition delivered to a large number of recipients. And as a matter of practice, there is little if any customization of spam messages to individual recipients. Each recipient of a spam, however, in the absence of prior filtering, must press his own “Delete” button to get rid of the message. Tools such as Razor and Pyzor (see Resources) operate around servers that store digests of known spams. When a message is received by an MTA, a distributed blacklist filter is called to determine whether the message is a known spam.
These tools use clever statistical techniques for creating digests, so that spams with minor or automated mutations (or just different headers resulting from transport routes) do not prevent recognition of message identity. In addition, maintainers of distributed blacklist servers frequently create “honey-pot” addresses specifically for the purpose of attracting spam (but never for any legitimate correspondences). In my testing, I found zero false positive spam categorizations by Pyzor. I would not expect any to occur using other similar tools, such as Razor.
There is some common sense to this. Even those ill-intentioned enough to taint legitimate messages would not have samples of my good messages to report to the servers — it is generally only the spam messages that are widely distributed. It is conceivable that a widely sent, but legitimate message such as the developerWorks newsletter could be misreported, but the maintainers of distributed blacklist servers would almost certainly detect this and quickly correct such problems. As the summary table below shows, however, false negatives are far more common using distributed blacklists than with any of the other techniques I tested. The authors of Pyzor recommend using the tool in conjunction with other techniques rather than as a single line of defense. While this seems reasonable, it is not clear that such combined filtering will actually produce many more spam identifications than the other techniques by themselves. In addition, since distributed blacklists require talking to a server to perform verification, Pyzor performed far more slowly against my test corpora than did any other techniques. 4. Rule-based rankings: The most popular tool for rule-based spam filtering, by a good margin, is SpamAssassin. There are other tools, but they are not as widely used or actively maintained. SpamAssassin (and similar tools) evaluate a large number of patterns — mostly regular expressions — against a candidate message.
Popularity: unranked [?]


