Email Filtering Architecture / Strategy?
Question asked by Douglas Foster - 5/31/2020 at 12:28 PM
I am trying to define an optimal spam filtering architecture, as a step toward refining my implementation using a mixture of available products and custom code.   

I am interested in learning how others have approached, and attempted to solve, the spam filtering problem.   I am also interested to know how your current tools enable or inhibit your preferred solution. 

2 Replies

echoDreamz Replied
SmarterMail itself inhibits our preferred solution. SM used to put emails into spam/ham folders that we could use to train our gateways; this was removed when they removed Bayesian filtering.
Douglas Foster Replied
For my part, in addition to what I have previously posted:

Organization Attribution
Up to three organizations can be involved in a message:
  1. The host system owner is responsible for the server.
  2. The sender address domain owner is responsible for the mail system.
  3. The from address domain owner is responsible for the content (assuming the address is not forged.)
Email filtering involves identifying a message with each of the participating organizations.   Messages from organizations with negative reputation get blocked.   Messages from key business partners get preferred treatment.   RBLs help with this process, but locally-defined policies are needed as well.   
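
As a rough illustration of the attribution step, here is a minimal Python sketch that collects the three identities from one message. The `Attribution` class and the `attribute` and `domain_of` functions are hypothetical names made up for this example, and it assumes the receiving server already supplies the ReverseDNS host name and the MailFrom address from the SMTP session.

```python
from dataclasses import dataclass
from email import message_from_string
from email.utils import parseaddr


@dataclass
class Attribution:
    host_name: str      # identifies the host system owner (the server)
    sender_domain: str  # identifies the sender address domain owner (the mail system)
    from_domain: str    # identifies the from address domain owner (the content)


def domain_of(address: str) -> str:
    """Return the lower-cased domain part of an address, or '' if there is none."""
    _, addr = parseaddr(address)
    return addr.rpartition("@")[2].lower() if "@" in addr else ""


def attribute(reverse_dns: str, mail_from: str, raw_message: str) -> Attribution:
    """Collect the three organization identities for one message."""
    msg = message_from_string(raw_message)
    return Attribution(
        host_name=reverse_dns.lower().rstrip("."),
        sender_domain=domain_of(mail_from),
        from_domain=domain_of(msg.get("From", "")),
    )
```

Each of the three values can then be looked up separately against RBLs and locally-defined policies.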

I am intrigued by products that use WHOIS information to help determine a reputation of an unknown source, by comparing the ownership of the unknown domain to the reputation of other domains with the same owner.

Mailing Services
Messages from mailing services can be especially difficult to categorize.   The desired message disposition is probably determined by the client, but the distinction between mailing service and client organization may be difficult to determine:
  • In many cases, the mailing service domain is used for the MailFrom / Sender Address, because this makes it easy for the mailing service to ensure that all messages pass SPF, since only their SPF record will be checked.
  • In some cases, the mailing service domain is only evident in the server host names, because the client domain is used for both Sender Address and Message From address.
  • In some cases, I have seen the mailing service address used for both Sender Address and Message From, so that the client can only be identified by the message subject or message body.
The rules engine needs a fair amount of sophistication to distinguish between primary and third-party mailing, and to allow dispositioning based on the client of a mass mailer. 

Content Filtering
Organization filtering is the primary defense, because the universe of possible content is infinite.   Content filtering is used 
  • to defend against compromised accounts or devices that are sending malicious messages from normally-trustworthy organizations 
  • to categorize message traffic where the reputation of the source organization is not yet known.
Since current attacks often involve malicious links, filtering products that evaluate links at time of reception, and then rewrite the URL so they can be checked again at click time, have significant advantages.   A good web filtering configuration can also provide click-time defenses, but only if the user device is under control of the web filter at click-time.   URL rewrite protects clicks made from cell phones or home devices.

Legitimate but Unwanted Messages
Even after malicious email is blocked, not all legitimate messages are equally wanted.    For legitimate messages, Allow / Block rules should be available based on category and recipient.   This is an extrapolation of the design used for web filtering.   For example, a recipient organization policy could be:
  • Mail from Job Search services will be allowed for the Personnel department staff, blocked for others.   Don't use the company email system to find a different job.
  • Mail from Social Networking sites will be allowed for the Marketing staff, blocked for others.   Use your personal account for Facebook.
  • Mail from gambling sites is blocked for everybody
I have not yet seen a spam filter product that provides a category-based mechanism.
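
A minimal sketch of what such a category-based rule table might look like, in Python. The category names, department names, and the `disposition` function are assumptions for illustration only, and the sketch assumes that some other component has already assigned a category to the sending organization.

```python
from typing import Dict, Set

# Hypothetical policy table: category -> departments allowed to receive it.
# An empty set means the category is blocked for everybody.
CATEGORY_POLICY: Dict[str, Set[str]] = {
    "job-search": {"personnel"},
    "social-networking": {"marketing"},
    "gambling": set(),
}


def disposition(category: str, recipient_department: str) -> str:
    """Return 'allow' or 'block' for a legitimate message of a given category."""
    allowed = CATEGORY_POLICY.get(category)
    if allowed is None:
        # Assumption: categories without a rule fall through to 'allow'.
        return "allow"
    return "allow" if recipient_department.lower() in allowed else "block"
```

For example, `disposition("job-search", "Personnel")` gives "allow", while the same category for any other department gives "block".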

Allow (whitelist) rules require a positive identification of the source
Allow rules should only be applied when the qualifying information can be verified.  There are multiple levels of detail available for verifying a source; the level of detail required for a particular Allow rule will depend on local policy.  In order from most liberal to most strict, they are:

  • Allow based on IP Address alone requires no secondary verification.  
    Source IP addresses are assumed to be true. However, this technique is useful only when the Sender's Source IPs are well known and few in number.

  • Allow based on ReverseDNS or HELO host name requires Forward-Confirmed DNS to the Source IP.

  • Allow based on MailFrom / Sender Address requires:
    • SPF PASS, or 
    • a local policy that links the MailFrom / Sender Address to an IP Address, or 
    • a local policy that links the MailFrom / Sender Address to a verified host name.

  • Allow based on Message From header requires:
    • A valid DKIM signature from the Message From domain, or 
    • SPF PASS with domain alignment between Sender Address and Message From Address (equivalent to applying a DMARC rule whether or not a DMARC policy exists), or
    • a local policy that links the SPF PASS domain to the Message From address.
Which email filtering products support this concept?   Too many cannot begin to do this because they do not implement multiple-attribute rules. 
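
The verification levels above could be expressed as a multiple-attribute check along the following lines. This is only a sketch under stated assumptions: the `Evidence` class, the `may_apply_allow` function, and the tuple format of the local policy links are all hypothetical, the SPF, DKIM, and forward-confirmed DNS results are assumed to come from earlier processing, and "alignment" is simplified to an exact domain match.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple

# Hypothetical local policy link: (kind, domain, linked_kind, linked_value)
Link = Tuple[str, str, str, str]


@dataclass
class Evidence:
    source_ip: str
    verified_host: Optional[str]  # host name that passed forward-confirmed DNS, else None
    mail_from_domain: str
    spf_pass: bool
    from_domain: str
    dkim_valid_domains: Set[str] = field(default_factory=set)


def may_apply_allow(rule_kind: str, ev: Evidence, links: Set[Link]) -> bool:
    """Return True if an Allow rule of this kind is backed by verified information."""
    if rule_kind == "ip":
        return True  # the source IP needs no secondary verification
    if rule_kind == "host":
        return ev.verified_host is not None
    if rule_kind == "mail_from":
        return (
            ev.spf_pass
            or ("mail_from", ev.mail_from_domain, "ip", ev.source_ip) in links
            or (
                ev.verified_host is not None
                and ("mail_from", ev.mail_from_domain, "host", ev.verified_host) in links
            )
        )
    if rule_kind == "from":
        signed = ev.from_domain in ev.dkim_valid_domains
        # Simplification: exact match rather than organizational-domain alignment.
        aligned = ev.spf_pass and ev.mail_from_domain == ev.from_domain
        linked = ev.spf_pass and ("spf", ev.mail_from_domain, "from", ev.from_domain) in links
        return signed or aligned or linked
    raise ValueError(f"unknown rule kind: {rule_kind}")
```

An Allow rule that matches a message is then applied only when `may_apply_allow` returns True for that rule's kind; otherwise the message continues through normal filtering.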

More efficient match rules
I want to use database indexes to permit high performance even when I have many thousands of Allow / Block rules.   Nearly all of my current comparisons are defined using inefficient "ends with" comparisons, but what I am really trying to do is to match an ending segment of the address or host name.    This can be used to implement more efficient matching.   Instead of comparing "john.doe @ bounce.email.example.tld" to many ends-with rules, look for Allow / Block rules that match one of the few sub-segments exactly: 
  • john.doe @ bounce.email.example.tld
  • .bounce.email.example.tld
  • .email.example.tld
  • .example.tld
  • .tld
This reduces flexibility a little while increasing expected performance a lot.   It also permits allow and block rules to be searched at the same time, with the winning entry being the most complete match.
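
A minimal Python sketch of the segment lookup, for illustration. The function names `candidate_keys` and `lookup` are hypothetical, and a plain dictionary stands in for an indexed database table; against a real table the same handful of keys would be used in a single exact-match query.

```python
from typing import Dict, List, Optional, Tuple


def candidate_keys(address: str) -> List[str]:
    """List the exact-match keys for an address or host name, most specific first."""
    address = address.lower()
    _, at, domain = address.rpartition("@")
    keys = [address] if at else []  # the full address is only a key for email addresses
    labels = domain.split(".")
    keys += ["." + ".".join(labels[i:]) for i in range(len(labels))]
    return keys


def lookup(address: str, rules: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (matched_key, action) for the most complete match, or None."""
    for key in candidate_keys(address):
        action = rules.get(key)
        if action is not None:
            return key, action
    return None


# Allow and Block rules live in the same table; the most specific entry wins.
rules = {".example.tld": "block", ".email.example.tld": "allow"}
result = lookup("john.doe@bounce.email.example.tld", rules)
```

Here `candidate_keys` produces the five keys listed above, and `result` is the ".email.example.tld" Allow entry because it is a more complete match than the ".example.tld" Block entry.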

Keep building the list of untrusted sources
When a message is blocked by an RBL, I probably want to ask whether messages from that source could ever become desirable and relevant to my domain.   If not, the RBL block, which may be temporary, should be converted to a local policy, which blocks permanently.

Efficient correction of Dispositioning Errors
After a message is received and dispositioned, I want a simple mechanism for flagging incorrect dispositions, including blocked messages that should have been allowed, and allowed messages that should have been blocked.   Then this data can be used to update the Allow / Block rules.   Tagged and Quarantined messages are evidence of imprecise filtering rules, so these messages are prioritized for review.

If a message is determined to be from a malicious source, I typically want to block all or many of the message attributes:  IP Address, HELO host name, ReverseDNS host name, sender address domain, from address domain. This helps to ensure that I remain protected if the malicious sender changes an IP Address, host domain name, or email address domain in an attempt to evade detection.    Flagging individual messages, and then aggregating the results, seems like the most effective tool for determining what policy changes are required.   I have not seen a spam filtering product which makes this process efficient, or which helps ensure that all of the attributes of a hostile message are blocked simultaneously.
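
The flag-then-aggregate idea might look something like the Python sketch below. The attribute names, the `proposed_blocks` function, and the `min_reports` threshold are assumptions for illustration; each flagged message is assumed to be available as a dictionary of its identifying attributes.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# The attributes of a hostile message that are candidates for blocking.
ATTRIBUTES = ("ip", "helo", "reverse_dns", "sender_domain", "from_domain")


def proposed_blocks(
    flagged: Iterable[Dict[str, str]], min_reports: int = 1
) -> List[Tuple[str, str]]:
    """Aggregate flagged messages into (attribute, value) pairs proposed for blocking."""
    counts: Counter = Counter()
    for msg in flagged:
        for attr in ATTRIBUTES:
            value = msg.get(attr, "").lower()
            if value:
                counts[(attr, value)] += 1
    # Keep only the pairs seen in at least min_reports flagged messages.
    return sorted(pair for pair, n in counts.items() if n >= min_reports)
```

With `min_reports=1`, every attribute of every flagged message is proposed, which corresponds to blocking all attributes at once; a higher threshold proposes only the attributes that recur across flagged messages.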

Message Review User Interface
Finally, I need a good user interface to evaluate incorrect dispositions.  This includes being able to drill down from message summaries to message headers.   Many filtering products provide a sanitized view of the formatted message, to prevent accidentally launching a malicious link while performing message review.

Are there any spam filtering products that come close to this?   At present I am pursuing custom development to move in this direction.
