A theoretical foundation for Spam Filtering
Question asked by Douglas Foster - 1/15/2020 at 9:02 AM

While reviewing the email filtering marketplace, I was surprised by the limitations in many products, and the inconsistency in features between products.   It left the impression that the vendors have no theoretical foundation for the features that they implement.

This is my attempt to provide that theoretical foundation.   Declude is the only product that seems capable of getting me to the configuration I want, although it will require integration of third-party tests to fully implement this architecture.   I will discuss my Declude approach in a later post.

I am very interested in whether the community finds this document coherent and whether you can agree with, or improve on, the architecture.

Threat Actors

I classify threat actors into two primary categories:

  • Organizations that are fully untrusted, as they send exclusively malicious or otherwise unwanted messages.
  • Legitimate organizations that have a compromised device or compromised account which is being used to send malicious or unwanted messages.

In practice, the threat landscape is slightly more complicated, because some traffic involves two organizations.  For these configurations, discerning the sender trust level has additional complexity.

  • An email hosting service may serve a mix of acceptable and unacceptable clients, or
  • A mass-mailing service may send messages on behalf of both acceptable and unacceptable clients.

Message Filtering Overview

I propose this model for discussing how message filtering must work to be effective.

  1. The message is evaluated to determine the apparent sending organization.   If the sending organization is not acceptable, the message should be blocked.
  2. If the apparent sending organization is not blocked, then the message is evaluated to determine if the apparent sending organization has been spoofed.  If the source can be determined to be spoofed, the message should be blocked.
  3. If the message has not been blocked based on sender identity, the message content is filtered for disallowed or suspicious content.     
  4. To prevent false positives, some filters will be disabled or ignored for some highly trusted senders.
  5. Based on the results of sender-specific content filtering, the message may be blocked, quarantined, tagged, or delivered to the end-recipient.
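
The five steps above can be sketched as one decision function. This is only a rough illustration in Python, not a description of how any product works. The names (Message, filter_message, the rule and exemption dictionaries) are hypothetical, and the precedence of block over quarantine over tag is my assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set


@dataclass
class Message:
    sender_org: str      # apparent sending organization (step 1)
    spoofed: bool        # outcome of sender authentication (step 2)
    subject: str = ""


# A content rule returns "block", "quarantine", "tag", or None (no opinion).
ContentRule = Callable[[Message], Optional[str]]


def filter_message(msg: Message,
                   blocked_orgs: Set[str],
                   rules: Dict[str, ContentRule],
                   exemptions: Dict[str, Set[str]]) -> str:
    # Step 1: unacceptable sending organization.
    if msg.sender_org in blocked_orgs:
        return "block"
    # Step 2: apparent organization has been spoofed.
    if msg.spoofed:
        return "block"
    # Step 4: rules disabled for this (trusted, authenticated) sender.
    skipped = exemptions.get(msg.sender_org, set())
    # Step 3: run the remaining content rules.
    verdicts = {rule(msg) for name, rule in rules.items() if name not in skipped}
    # Step 5: the most severe verdict wins (assumed ordering).
    for outcome in ("block", "quarantine", "tag"):
        if outcome in verdicts:
            return outcome
    return "deliver"
```

Note that the exemption lookup is only reached after steps 1 and 2, which is the point of the model: sender-specific exceptions are applied only to senders that have survived identity checks.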

Key implications of this model

  • Content filtering is irrelevant and unnecessary if the sender is untrusted.
  • Sender-specific content filtering exceptions are only safe when sender identification is reliable.

Spoofing vs. Agency

Agency occurs when one organization sends messages on behalf of another organization, with authorization to do so.   The vast majority of incoming mail is generated by mass mailing services, and involves a contracted agency relationship.   Spoofing occurs when an organization sends messages pretending to be another organization, and does so without such authorization.   A receiving organization has no visibility into the contracting relationship between other organizations, so the difference between agency and spoofing is not inherently obvious.

Solving Sender Authentication is the most important, and most neglected, aspect of email filtering.   As discussed below, SPF and DKIM provide a starting point, but because they have limited applicability, they do not fully solve the problem.

Elements of Sender Identity

When a sender connects to an incoming mail gateway, it sends a connection request containing important identity information:

  1. The Source IP address.
  2. The host name of the sending server (HELO name).   The HELO name can be used in a forward DNS lookup to look for a match with the Source IP address.
  3. The Source IP address can also be used with a reverse DNS lookup to obtain another possible name for the sending host.
  4. The sender email address (Envelope-From).
  5. The recipient email address (Envelope-To).

If the connection is accepted, the message is transmitted.   The transmitted message will contain Message-From and Message-To, and may contain other headers that have parameters in email address format.
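
For discussion purposes, the identity attributes listed above can be collected into one record. The sketch below is my own illustration; the class name, field names, and the domain_of helper are hypothetical, and most of the rules discussed later only need the domain part of each address.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SenderIdentity:
    source_ip: str
    helo_name: str
    reverse_dns: Optional[str]          # None if the reverse lookup returns nothing
    envelope_from: str                  # "" represents the null sender
    envelope_to: str
    message_from: Optional[str] = None  # only known after the message is transmitted


def domain_of(address: str) -> str:
    """Return the lowercased domain part of an address, or "" if there is none."""
    _local, sep, domain = address.rpartition("@")
    return domain.lower().rstrip(".") if sep else ""
```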

Sender Authentication - Testing for Identity Fraud

Sender Authentication is the process used to determine whether the identity attributes of a message are valid or fraudulent.   The recipient needs to validate both individual attributes and the relationships between attributes.   The primary tools for this process are well known:  SPF, DKIM, and DMARC.   However, each of these mechanisms is published at the sender’s option, and each has known implementation problems.   As a result, sender-initiated mechanisms must be supplemented by local policies enforced by the email filter.   With the appropriate local policies, the recipient organization should be able to configure an email filtering mechanism that requires successful sender authentication.

These are the fraud opportunities that create risk, and the validation techniques available to mitigate that risk.

Source IP Address:  The Source IP address is partially validated because the recipient server must reply to the source IP for the connection to be completed and the message to be transmitted.   While one can conceive of a NAT-translation device creating false Source IP addresses, detecting such an attack is outside the capability of email systems, and the consequences of such an attack are much greater than the effect on email integrity.  Consequently, email filters assume that the Source IP is valid.

HELO name:  The HELO name can be set to anything the sender desires, and it is often an internal-only name, such as host1.something.local.  The HELO name can be considered verified if a forward DNS lookup on the HELO name returns a set of addresses that includes the Source IP.
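
A minimal sketch of the HELO check, assuming the system resolver is good enough for the forward lookup. The function names are hypothetical. The lookup is passed in as a parameter so the logic can be tried without touching DNS.

```python
import ipaddress
import socket


def system_forward_lookup(name):
    """Forward lookup through the system resolver; returns a set of IP strings."""
    try:
        infos = socket.getaddrinfo(name, None)
    except (socket.gaierror, UnicodeError):
        return set()
    return {info[4][0] for info in infos}


def helo_verified(helo_name, source_ip, forward_lookup=system_forward_lookup):
    """True if a forward lookup on the HELO name includes the Source IP."""
    source = ipaddress.ip_address(source_ip)
    for candidate in forward_lookup(helo_name):
        try:
            # Compare parsed addresses so that differently written
            # IPv6 literals still match.
            if ipaddress.ip_address(candidate) == source:
                return True
        except ValueError:
            continue  # skip anything that is not a plain IP literal
    return False
```

An internal-only name such as host1.something.local will normally fail to resolve from the outside, so it simply comes back as not verified.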

Reverse DNS host name:   The Reverse DNS name can only be configured with the assistance of the IP address owner, usually the Internet Service Provider.   Consequently, it is often an arbitrary value unrelated to the sending domain, and sometimes is misconfigured to confuse email filters.    If the Reverse DNS name can be forward-confirmed to the Source IP, the DNS name can be considered verified.  This process can be complex, as the Reverse DNS lookup can produce multiple names, although this seems to be rare, and the forward lookup on each name can produce multiple IP addresses.
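
The forward-confirmation step can be sketched as follows. Both lookups are supplied by the caller (hypothetical callables returning lists or sets of strings), because a reverse lookup can return several names and each name can resolve to several addresses. In practice they might wrap socket.gethostbyaddr and socket.getaddrinfo, or a DNS library.

```python
def verified_reverse_names(source_ip, reverse_lookup, forward_lookup):
    """Return the reverse DNS names that forward-resolve back to source_ip.

    reverse_lookup(ip)   -> iterable of host names (PTR results)
    forward_lookup(name) -> iterable of IP address strings
    """
    verified = []
    for name in reverse_lookup(source_ip):
        # A name only counts if one of its addresses is the Source IP.
        if source_ip in forward_lookup(name):
            verified.append(name)
    return verified
```

An empty result means that no Reverse DNS name is verified, so any rule that matches on the Reverse DNS name would be working from an unverified attribute.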

Envelope-From Sender Address:   The Envelope-From address is best conceived as the login account used to send the message.    In the case of intentional fraud, this field could be set to any arbitrary value, and it can also be null.   Validation involves demonstrating that the Source IP is authorized to send on behalf of the Envelope-From domain.   The primary method for this is Sender Policy Framework (SPF).

  • The recipient organization should have the ability to either BLOCK or PASS any specific Envelope-From and Source Server pair.    If a local policy exists, that action is applied and the SPF test does not need to be performed.
  • If the SPF test returns PASS, the Envelope-From address and Source IP relationship is verified.
  • If the SPF test returns TempError, the preferred action is to return a TempError to the sending system, deferring reception until a later time.
  • If the SPF test returns NONE or PERMERROR, the sender has not configured a usable SPF policy.    The suggested workaround is to use these criteria:
    • PASS if any of the MX entries can be resolved to the Source IP.
    • PASS if the HELO name resolves to the Source IP
    • PASS if the Reverse DNS of the Source IP can be forward-resolved back into the Source IP.
  • If none of these conditions is true, then the SPF status is not validated.   The root cause may be spoofing, auto-forwarding, or an omission from the sender’s SPF policy.   The entire message must be processed before a final disposition can be made.
  • If DKIM validates the Message-From address, it is recommended that the SPF non-validation be ignored.   Exceptions based on the Message-From can be safely activated.   However, exceptions based on the Envelope-From (if different) may still be risky.
  • If content filtering causes the message to be blocked, SPF non-validation becomes a moot point.
  • If no other rule provides a disposition, it is recommended that the non-validated message should be quarantined, so that the recipient system administrator can configure a local policy to BLOCK or PASS subsequent messages with this identity.
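
The decision list above can be written out as two small functions. This is a sketch under my own assumptions: the status strings and function names are hypothetical, and SPF results not mentioned in the list (FAIL, SOFTFAIL, NEUTRAL) are treated as not validated.

```python
def envelope_from_status(spf_result, local_policy=None,
                         mx_match=False, helo_match=False, fcrdns_match=False):
    """Classify the Envelope-From / Source IP relationship."""
    # A local BLOCK or PASS rule is applied first; SPF is then not needed.
    if local_policy in ("BLOCK", "PASS"):
        return local_policy
    result = spf_result.upper()
    if result == "PASS":
        return "PASS"
    if result == "TEMPERROR":
        return "DEFER"  # ask the sender to retry later
    if result in ("NONE", "PERMERROR"):
        # No usable SPF policy: fall back to the DNS-based workarounds.
        if mx_match or helo_match or fcrdns_match:
            return "PASS"
    return "UNVALIDATED"


def final_disposition(status, dkim_validates_from, content_blocked):
    """Combine the Envelope-From status with later processing results."""
    if status == "BLOCK" or content_blocked:
        return "block"
    if status == "DEFER":
        return "defer"
    if status == "PASS" or dkim_validates_from:
        return "continue"
    # Nothing else decided: hold the message for an administrator.
    return "quarantine"
```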

Message-From Header:   The Message-From header can contain two parts:  The From-Address and an optional Friendly Name.  Conceptually, if the Message-From is different from the Envelope-From, then the sender is claiming authority to impersonate the Message-From address.   The Message-From can be considered verified:

  • If the Envelope-From domain is the same as the Message-From Domain.
  • If the Envelope-From domain is a parent of the Message-From Domain.
  • If the domain of a verified DKIM signature matches the Message-From domain.
  • If the domain of a verified DKIM signature is a parent domain of the Message-From domain.

If the Message-From is validated by a DKIM signature, SPF PASS is typically not necessary.   This permits auto-forwarded mail to be considered validated, even though the delivery path is indirect.
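
The four Message-From conditions reduce to a "same domain or parent domain" comparison. The sketch below uses hypothetical function names and makes two simplifying assumptions: it does not check whether the Envelope-From domain has itself been validated, and it does not stop a bare top-level name such as "com" from counting as a parent, which a real filter would need to prevent.

```python
def same_or_parent(candidate, domain):
    """True if candidate equals domain or is a parent domain of it."""
    candidate = candidate.lower().rstrip(".")
    domain = domain.lower().rstrip(".")
    if not candidate:
        return False
    # The leading dot prevents "example.com" from matching "badexample.com".
    return domain == candidate or domain.endswith("." + candidate)


def message_from_verified(message_from_domain, envelope_from_domain,
                          verified_dkim_domains=()):
    """Apply the Envelope-From and DKIM conditions to the Message-From domain."""
    if same_or_parent(envelope_from_domain, message_from_domain):
        return True
    return any(same_or_parent(d, message_from_domain)
               for d in verified_dkim_domains)
```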

The Friendly Name is unrestricted and therefore difficult to validate.    If it is present, and has the format of an email address, but does not duplicate the Message-From address, then spoofing can be inferred and the message should be blocked.
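
The Friendly Name test could look something like this. It relies on parseaddr from the Python standard library to split the header; the regular expression for "looks like an email address" is a rough approximation of my own, and the function name is hypothetical.

```python
import re
from email.utils import parseaddr

# Rough pattern for something shaped like an email address.
_ADDRESS_LIKE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def friendly_name_spoofed(from_header):
    """True if the Friendly Name contains an address that differs
    from the actual Message-From address."""
    friendly_name, address = parseaddr(from_header)
    claimed = _ADDRESS_LIKE.search(friendly_name)
    if claimed is None:
        return False  # the name is ordinary text, nothing to compare
    return claimed.group(0).lower() != address.lower()
```

For example, a header of the form "ceo@bank.example" <attacker@evil.example> is flagged, while a header whose Friendly Name merely repeats the real address is not.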

Blacklisting vs. Whitelisting

For Blacklisting, an attribute match can be assumed valid without the need for attribute verification.    There is no incentive for a legitimate sender to masquerade as an illegitimate sender, so there is no reason to doubt a negative attribute.

For Whitelisting (or more properly, Exceptions), the receiving system must use caution when implementing an exception using unverified attributes, as this creates a risk that a spoofed message will be accepted.  As much as possible, exceptions should be predicated on validated attributes, and exceptions should be granular.   For example, an exception to override SPF validation should not also override content filtering, unless the recipient system administrator specifically requests this behavior on a specific exception rule.

For most exceptions, the exception process must evaluate multiple attributes:  such as the Envelope-From email address, the validation status, DNS verification status, or the Message-From email address.   
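
To make "granular" and "multi-attribute" concrete, here is a sketch of what one exception rule might hold. The class, the field names, and the filter names ("spf", "content") are all hypothetical. Each rule matches on a combination of attributes and names exactly which filters it bypasses.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ExceptionRule:
    bypass: FrozenSet[str]                       # filters this rule switches off
    envelope_from_domain: Optional[str] = None   # None matches any value
    message_from_domain: Optional[str] = None
    require_verified: bool = True                # only match verified attributes

    def applies(self, envelope_from_domain, message_from_domain, attributes_verified):
        if self.require_verified and not attributes_verified:
            return False
        if self.envelope_from_domain not in (None, envelope_from_domain):
            return False
        if self.message_from_domain not in (None, message_from_domain):
            return False
        return True


def bypassed_filters(rules, envelope_from_domain, message_from_domain, attributes_verified):
    """Union of the filters bypassed by every rule that applies."""
    skipped = set()
    for rule in rules:
        if rule.applies(envelope_from_domain, message_from_domain, attributes_verified):
            skipped |= rule.bypass
    return skipped
```

A rule that bypasses only "spf" leaves content filtering in force, which is the behavior argued for above.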

Commonly observed problems in email filtering products include:

  • Inability to define exceptions to correct SPF FAIL or SOFTFAIL false positives.  In most products, the solution to false positives is to disable SPF testing for that email domain.
  • Inability to override SPF based on the Source Server – Envelope-From domain pair.
  • Inability to define policies for handling other SPF status codes.
  • Inability to configure multi-attributed exceptions.
  • Inability to configure granular exceptions.
  • Inability to distinguish between validated and non-validated DNS names.

Site-Specific Policies

Single Attribute Blacklisting

Sender Blacklisting is typically applied if any of the available identification attributes are unacceptable.    Because spammers are known to change identities frequently, the appropriate response to detected spam is to block all of the identity attributes in the problem message, other than identifiers that have been spoofed.    Therefore, a good email filter should provide single-attribute blocking on all of these characteristics:

  • Source IP.
  • The domain portion of the HELO and Reverse DNS names.
  • The domain portion of the Envelope-From and Message-From header email addresses.

These blocks can occur based on locally-defined rules or externally-referenced Reputation Block Lists (RBLs).
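
A sketch of single-attribute blocking with locally defined lists (an RBL lookup would be one more source of entries). The function and attribute names are hypothetical. Any single hit is enough to block, and the returned list shows which attributes matched.

```python
import ipaddress


def domain_listed(domain, blocked_domains):
    """True if the domain, or one of its parent domains, is on the list."""
    domain = domain.lower().rstrip(".")
    return any(domain == b or domain.endswith("." + b) for b in blocked_domains)


def blacklist_hits(source_ip, domains, blocked_networks, blocked_domains):
    """Return the names of the identity attributes that are blacklisted.

    domains maps an attribute name (helo, reverse_dns, envelope_from,
    message_from) to the domain portion of that attribute.
    """
    hits = []
    ip = ipaddress.ip_address(source_ip)
    if any(ip in ipaddress.ip_network(net) for net in blocked_networks):
        hits.append("source_ip")
    for attribute, domain in domains.items():
        if domain and domain_listed(domain, blocked_domains):
            hits.append(attribute)
    return hits
```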

Unfortunately, many email filtering applications only provide filtering on a subset of these attributes.   An inability to filter on Reverse DNS is surprisingly common.   Filtering on HELO name is rare, even though SPF checking is common and it requires HELO name evaluation.   Filtering on IP Address is missing in a few products.   Filtering on the Message-From Header is generally limited to DMARC enforcement for domains that publish a DMARC policy, and surprisingly many email filtering products still lack any DMARC enforcement capability.   These limitations are only aggravated in products that cannot do granular exceptions or cannot do multi-attribute exceptions.

Content Filtering

Once confidence is established in the sender identity, sender-specific exceptions can be configured safely.   As an example, suppose you are receiving emails with either phony invoice attachments or phony invoice web links, so you want to block messages with the word “Invoice” in the subject line.   At the same time, you will have some trusted vendors who do send invoices by email.     You create a rule that says “Bypass the invoice filter if the sender is authenticated and the sender domain is <list>”.
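
The invoice rule could be expressed roughly like this. The function name and parameters are hypothetical. The bypass is only honored when the sender has been authenticated, since otherwise anyone could claim a trusted vendor’s domain.

```python
def invoice_filter(subject, sender_domain, sender_authenticated, trusted_invoice_domains):
    """Return "block" or "pass" for the subject-line invoice rule."""
    if "invoice" not in subject.lower():
        return "pass"   # the rule does not apply to this message
    if sender_authenticated and sender_domain.lower() in trusted_invoice_domains:
        return "pass"   # authenticated trusted vendor: bypass the filter
    return "block"
```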

Handling Exceptions

These are typical problems observed when trying to implement sender authentication.

Automatic Forwards

If a message is automatically forwarded, such as from user1@domain1 to user2@domain2, the final recipient will be able to detect that the message is not being transmitted by the originating domain, which violates SPF.   SPF PASS can be assured if the forwarding domain uses Sender Rewriting Scheme (SRS) to alter the Envelope-From address in a manner that takes responsibility for the forwarded message.  Since forwarding server behavior is outside the control of the receiving system, contingencies must be in place for both SRS and non-SRS forwards.
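
As an illustration of what the receiving system sees after an SRS forward, the sketch below pulls the original sender out of a rewritten Envelope-From. It assumes the commonly used single-hop layout SRS0=hash=timestamp=original-domain=original-local@forwarder; multi-hop SRS1 addresses are not handled, and the function name is hypothetical. The hash can only be checked by the forwarder, so the recovered address is a hint and not a validated identity.

```python
import re

_SRS0_PREFIX = re.compile(r"^SRS0[=+-]", re.IGNORECASE)


def original_sender_from_srs0(envelope_from):
    """Return the original sender encoded in an SRS0 address, or None."""
    local, sep, _forwarder = envelope_from.rpartition("@")
    if not sep or not _SRS0_PREFIX.match(local):
        return None
    # After the "SRS0=" prefix: hash, timestamp, original domain, original local part.
    parts = local[5:].split("=", 3)
    if len(parts) != 4:
        return None
    _hash, _timestamp, domain, user = parts
    return f"{user}@{domain}"
```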

In general, the receiving system can benefit from a registration process, so that filters can be tuned to the specific forwarding source and target addresses that are involved.    

If the forwarding system does not use SRS, and the receiving system still wants to allow the messages, they will need to configure a local policy to ignore SPF for messages received from the auto-forwarding servers.   

If the forward uses SRS, the Envelope-From sender will not authenticate with the Message-From address, unless the message has a verifiable DKIM signature, but for a general mail stream this cannot be assured.

In either case, the receiving system will need an exception to Sender Authentication, and defenses will need to rely on whatever screening is provided by the relay system and on the content filtering capabilities of the receiving system.

SPF Entries with Omissions

Senders can comply with SPF with little difficulty, so even spam messages will often pass SPF.   At the same time, legitimate messages will have SPF FAIL status.   Often this occurs because the organization has contracted with a vendor but failed to include the vendor in its SPF entry.   To allow the legitimate traffic without disabling SPF, the recipient organization needs a mechanism to correct the SPF entry.   The ideal mechanism would be an override written in SPF syntax, so that an Include clause could be configured based on how other clients reference that vendor.   In the absence of that syntax, the recipient needs a mechanism to specify an exception for that source and the problem mail domain.    The difficulty occurs when trying to decide how to specify the source.     IP lists are the most precise, but for a large vendor, the list is likely to change without warning.   DNS matching rules are usually adequate, but the recipient may have difficulty knowing whether the DNS name will always verify.   If the rule is applied without verification, spoofing cannot be ruled out.    All of this is dependent on the ability of the email filtering product to specify a condition of this type.   As stated previously, most cannot.

= = = = =

Sender Authentication

Sender Authentication addresses the question, “Is the Source  IP address authorized to send on behalf of this email address?”  Because every email has two “From” addresses, the problem can be segmented:

  • Is the Source server authorized to send messages using the Envelope-From sender address?    The primary tool for this test is SPF.
  • Is the Source Server – Envelope-From address pair authorized to send messages using the Message-From address?   The primary tool for this test is DKIM.

In both cases, the evaluation only needs to consider the domain name.   The domain owner is responsible for controlling whether one domain account can impersonate another account within the same domain.   If the recipient organization chooses to accept any messages from that domain, it implicitly assumes that the sending organization’s internal controls are sufficient to prevent unauthorized internal spoofing.

Envelope-From Validation using SPF and Other Techniques

Sender Policy Framework (SPF) is a technique that the sending organization can use to specify which IP addresses are allowed to send on its behalf.   SPF has many limitations for the receiving organization, because:

  • Some domains will not have an SPF policy.
  • Some domains will have syntax errors or excessive nesting which make the policy unusable.
  • Some domains will have omissions that produce false positives.
  • Many domains will have ambiguous policies which produce a neutral or softfail result.

One of the most common SPF violations occurs because an organization contracts with a service, giving them authority to send email on their behalf, but fails to update the SPF record to indicate this relationship.

Without correction, these problems render SPF virtually useless.   The receiving organization needs an email filter that can apply policies to supplement and override sender SPF entries.  The ideal would be to specify these exceptions using SPF syntax, but this capability has not been observed.    The key feature of SPF overrides and corrections is that they will typically require multi-attribute exceptions:  a message is blocked or not blocked based on the combination of Source Server and Envelope-From address.   Due to limits on information about sender configuration, Source Server may be indicated by IP addresses, Reverse DNS domain, or HELO domain.
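
Since SPF-syntax overrides have not been observed, here is one way a local override table might be modeled instead. Everything in it is hypothetical (class name, fields, first-match-wins ordering). Each rule pairs an Envelope-From domain with a Source Server given as an IP network, a Reverse DNS domain, or a HELO domain.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpfOverride:
    envelope_from_domain: str
    action: str                        # "PASS" or "BLOCK"
    ip_network: Optional[str] = None
    rdns_domain: Optional[str] = None  # ideally matched only against verified names
    helo_domain: Optional[str] = None


def _in_domain(name, domain):
    if not name or not domain:
        return False
    name = name.lower().rstrip(".")
    return name == domain or name.endswith("." + domain)


def find_override(overrides, envelope_from_domain, source_ip, rdns_name, helo_name):
    """Return the action of the first matching override, or None."""
    ip = ipaddress.ip_address(source_ip)
    for rule in overrides:
        if rule.envelope_from_domain != envelope_from_domain.lower():
            continue
        if rule.ip_network and ip in ipaddress.ip_network(rule.ip_network):
            return rule.action
        if _in_domain(rdns_name, rule.rdns_domain):
            return rule.action
        if _in_domain(helo_name, rule.helo_domain):
            return rule.action
    return None
```

A result of None means no local rule applies and the normal SPF evaluation should proceed.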

Regardless of syntax, the recipient organization should be able to configure rules to ensure that the message is inherently verifiable as authorized, or authorized by local policy.    This means the email filter should have these capabilities:

  • Messages can be blocked based on the combination of Source Server and Envelope-From sender.
  • Messages can be exempted from SPF based on Source Server alone or on the Source Server and Envelope-From pair.
  • Messages can be exempted from SPF if the Source IP is in the domain’s MX list.
  • All other messages must produce SPF PASS to be considered authorized.
  • Messages which do not meet any of these rules will be evaluated for authorized forwarding using DKIM and possibly other criteria.
  • If no such exception is found, the message is quarantined so that an appropriate local policy rule can be established.

Message-From validation using DomainKeys Identified Mail (DKIM)

DomainKeys Identified Mail uses a digital signature, validated using a public key published in a DNS entry, to prove that a domain owner has authorized a particular message.   Typically, this is used to validate the Message-From.

Because these technologies are optional, they provide no immediate benefit if the sending domain does not implement them.    Additionally, both have difficult challenges in operation:

  • Some domains have incorrect SPF information, producing false failures.
  • Some legitimate systems violate SPF, even though the messages they send are important to the recipient.
  • Some DKIM signatures will not validate because of changes in transit.  In one sample, the rate of signature verification failure was 13%.

Despite these difficulties, recipients need the ability to enforce mandatory sender authentication.    The policy should be configurable by the receiving system manager.   I propose the following sample policy:

SPF Enforcement

The following discussion assumes that all single-factor blacklist rules have been applied, and the message has not been blocked.

At this point, multi-factor block policies are applied.   Policy rules define unacceptable relationships between the source server (identified by IP address, HELO name, or Reverse DNS name), and the Envelope-From or Message-From address.

If the message is not explicitly blocked, then it must pass  one of the following criteria:

  • SPF returns PASS
  • The Source IP is in the sender domain’s MX list.

Additionally, every domain that accepts incoming mail will have a DNS MX entry indicating the servers used for incoming mail processing.   It seems reasonable to assume that if a message is also transmitted by one of the MX servers, the message is also authorized, unless it has been explicitly prohibited by other information.
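
The MX test can be sketched as below. The Python standard library has no MX lookup, so both lookups are passed in as hypothetical callables; in practice they would be backed by a DNS library.

```python
def sent_from_mx(sender_domain, source_ip, mx_lookup, forward_lookup):
    """True if the Source IP belongs to one of the sender domain's MX hosts.

    mx_lookup(domain)    -> iterable of MX host names
    forward_lookup(name) -> iterable of IP address strings
    """
    for mx_host in mx_lookup(sender_domain):
        if source_ip in forward_lookup(mx_host):
            return True
    return False
```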

Message Processing

This information is often sufficient to identify an unacceptable (blacklisted) sender.  When this occurs, the connection can be rejected immediately and the message content is never transmitted.

If the connection is accepted, the message is transmitted, and the contents evaluated.   The sender is given a return status indicating whether the message is accepted or rejected.   The message itself has From and To headers, and those email addresses can be completely unrelated to the Envelope-From and Envelope-To addresses.    This is fully comparable to a postal letter, where the addresses on the envelope and the addresses on the inside letter do not need to match.

If a device accepts a message for delivery, and then detects that delivery is not completed, it can (and typically will) notify the sender by generating a Non-Delivery Report (NDR) as a new email message, using the Envelope-From mail address as the recipient address for the NDR.   Typically, the Envelope-From address of an NDR is empty, although other options are allowed.   Since an Envelope-From address can be fraudulent, NDRs are now discouraged.    Nonetheless, most mail systems will have a default configuration to send NDRs, and some will have no ability to turn them off.

1 Reply

Matt Petty Replied
Employee Post
I just now saw this post; somehow it got lost in the mix, and I caught it while searching for something else. This is a very nice writeup. I'll be bookmarking this.
Matt Petty Software Developer SmarterTools Inc. (877) 357-6278 www.smartertools.com