Optimizing SPF
Problem reported by Douglas Foster - Today at 8:31 AM
SPF Pass tells me that the SMTP Mail From address is accurate, and therefore not impersonated, to within the limits of the technology.  Any other result, whether an outright failure or an ambiguous result, is best handled by quarantine.  I want to minimize risk while also minimizing quarantine review effort, so I want to maximize the frequency of SPF Pass.  There are two different contexts for SPF evaluation:

  • When I am evaluating my own SPF policy, I want to interpret rules strictly, to ensure that other organizations will use my SPF record to trust my messages.
  • When I am evaluating someone else’s SPF policy, I want to interpret rules with grace, because getting the correct answer is more important than enforcing the rules.
Because I am evaluating other organizations’ policies, I apply relaxed rules to minimize PermError and TempError results.  The impact has been pretty dramatic, as shown in the Results section at the end.

Notes:
  • I evaluate SPF in Declude, using a modified version of the Python PYSPF module.  Messages are sent to quarantine if they do not produce SPF PASS and are not authenticated by DKIM or local policy settings.
  • I detect messages with no valid recipients and silently discard them, without doing any SPF checks.  This process quickly excludes about 65% of all incoming messages.  The statistics in the Results section apply only to the subset that is not discarded.

Minimizing PermError results:

  • Multiple SPF policy records are not allowed.  Just pick one to use.
     Experience indicates that this is a simple human error: the SPF record needed to be modified, but the user created a second one instead.  The differences between the two records are usually minor, and may not be relevant to the source IP that I need to check.  So, I pick one of the records and use it.  An alternate strategy would be to check both of them, at the cost of extra processing effort, much of which is expected to be redundant.
     
  • Minor syntax errors make the whole policy invalid.  Fix common errors.
     Substitute the correct token for these common data entry errors:
     prt -> ptr
     ip -> ip4
     ipv4 -> ip4
     ipv6 -> ip6
     all. -> all
     Some errors can also be fixed by finding and inserting a missing space.  For example, a '+' character should always be preceded by a space, so a space can be inserted if it is missing.
     
  • Some policy records have too much recursion.   Relax the limits.
     The specification limits the number of terms that trigger DNS lookups (such as include, a, mx, ptr, and exists) to 10 per evaluation.  A value of 20 is sufficient to avoid some PermErrors without creating a denial-of-service risk.
     
  • Some policy records have too many void lookups.  Relax the limits.
     DNS lookups that return no result are called Void lookups.   This applies to “A”, “MX”, and “INCLUDE” references that cannot be resolved in DNS.   The specification says to throw PermError after 2 void lookups.   I have found that increasing the limit to 5 allows me to get a valid result without creating a denial-of-service risk.
     
  • Some INCLUDE references are invalid.   Ignore them and keep checking.
     If a PASS result can be determined without using the Include, use it.   If no result is achieved, the default “ALL” result will be applied.
  • Change order of processing to evade some errors and improve performance.
     Instead of evaluating policy terms from left to right, consider reordering them from simple to complex.  If a simple term returns Pass, the complex terms do not need to be evaluated.  This appears to be the optimal order:
     Ip4 (no DNS lookups)
     Ip6 (no DNS lookups)
     A (one DNS lookup)
     Exists (one DNS lookup, but usually requires macro expansion as well)
     MX (multiple DNS lookups)
     Ptr (multiple DNS lookups, possibly many, therefore discouraged)
     Include (at least one DNS lookup, plus additional parsing effort)

  • Processing timeouts can cause PermError.   Relax the limits.
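
The record clean-up steps and relaxed limits listed above can be sketched roughly as follows.  This is a minimal illustration only, not the modified pyspf code described in this post: the names select_record, repair_record, reorder_record and LookupBudget are hypothetical, and the example domain and addresses are placeholders.

```python
import re

# Common mechanism-name typos and their corrections, as listed in the post.
TOKEN_FIXES = {"prt": "ptr", "ip": "ip4", "ipv4": "ip4", "ipv6": "ip6", "all.": "all"}

# Rough cost rank, cheapest first, following the order suggested in the post.
COST = {"ip4": 0, "ip6": 1, "a": 2, "exists": 3, "mx": 4, "ptr": 5, "include": 6}

# Optional qualifier, then the term name, then everything else (":domain", "/cidr", "=value").
TERM_RE = re.compile(r"^([+\-~?]?)([A-Za-z0-9.]+)(.*)$")


def split_term(term):
    """Split a term into (qualifier, name, rest); unparseable terms pass through whole."""
    m = TERM_RE.match(term)
    return m.groups() if m else ("", term, "")


def select_record(txt_records):
    """Return the first TXT record that looks like an SPF policy, or None."""
    for txt in txt_records:
        lowered = txt.strip().lower()
        if lowered == "v=spf1" or lowered.startswith("v=spf1 "):
            return txt.strip()
    return None


def repair_record(record):
    """Insert a missing space before '+' and correct common mechanism typos."""
    record = re.sub(r"(?<=\S)\+", " +", record)
    version, *terms = record.split()
    fixed = []
    for term in terms:
        qualifier, name, rest = split_term(term)
        fixed.append(qualifier + TOKEN_FIXES.get(name.lower(), name) + rest)
    return " ".join([version] + fixed)


def rank(term):
    """Sort key: cheap mechanisms first, then 'all', then modifiers such as redirect=."""
    _, name, rest = split_term(term)
    if rest.startswith("="):
        return 9
    if name.lower() == "all":
        return 8
    return COST.get(name.lower(), 7)


def reorder_record(record):
    """Reorder terms from simple to complex; sorted() is stable, so ties keep their order."""
    version, *terms = record.split()
    return " ".join([version] + sorted(terms, key=rank))


class LookupBudget:
    """Counts DNS lookups and void lookups against relaxed limits (20 and 5 here)."""

    def __init__(self, max_terms=20, max_void=5):
        self.max_terms, self.max_void = max_terms, max_void
        self.terms = self.void = 0

    def charge(self, answers):
        """Record one DNS lookup; return False once either limit has been exceeded."""
        self.terms += 1
        if not answers:
            self.void += 1
        return self.terms <= self.max_terms and self.void <= self.max_void


if __name__ == "__main__":
    txt = ["v=spf1 include:_spf.example.net prt mx ipv4:192.0.2.0/24+a all.",
           "v=spf1 mx -all"]
    print(reorder_record(repair_record(select_record(txt))))
```

One caution about the reordering step: SPF is defined as first-match, so moving a term ahead of an earlier term that carries a "-" or "~" qualifier can change the result.  A real evaluator would probably want to use the reordered list only to look for a Pass, and fall back to the original order otherwise.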

Minimizing TempError results

This problem has mystified me.  Because most Internet activity works so well, I expected DNS timeouts to occur on the order of 1 per billion lookups.  I was surprised to find that the actual error rate is as high as 1 per several hundred.  This result is not unique to my system.  The same level of TempError also appears in DMARC reports received from other organizations, and in Authentication-Results headers that others add to messages in my incoming mail stream.
I noticed that senders with very high mail volumes were also highly important, and I did not want their messages to be mistreated because of a TempError result.  I also realized that these high-volume senders were triggering a lot of SPF processing effort and DNS lookups, simply to check a result that was already known.

I pulled a list of the top 50 IP-domain combinations that had SPF Pass.  To my surprise, these 50 pairs account for 44% of all messages processed.   Then I created a lookaside list for those pairs.   If the message has that combination of identifiers, the result is treated as SPF Pass without doing any policy lookup or policy processing.    This cut processing effort as well as avoiding false TempError results.
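
A lookaside list like this could be sketched as below.  This is only an illustration under assumptions: build_lookaside and check_spf are hypothetical names, the log is assumed to be a list of (source IP, Mail From domain) pairs that already produced SPF Pass, and full_check stands in for whatever normal SPF evaluation is in use.

```python
import ipaddress
from collections import Counter


def build_lookaside(pass_log, top_n=50):
    """Return the top_n most frequent (ip, domain) pairs seen with SPF Pass."""
    counts = Counter(
        (ipaddress.ip_address(ip), domain.lower()) for ip, domain in pass_log
    )
    return {pair for pair, _ in counts.most_common(top_n)}


def check_spf(ip, mail_from_domain, lookaside, full_check):
    """Treat known pairs as Pass; otherwise run the normal SPF evaluation."""
    key = (ipaddress.ip_address(ip), mail_from_domain.lower())
    if key in lookaside:
        return "pass"  # no policy lookup, so no chance of a DNS TempError
    return full_check(ip, mail_from_domain)


if __name__ == "__main__":
    log = [("192.0.2.1", "example.com")] * 3 + [("198.51.100.7", "example.org")] * 2
    lookaside = build_lookaside(log, top_n=2)
    # The stub below stands in for a real evaluator and is never reached here.
    print(check_spf("192.0.2.1", "Example.COM", lookaside, lambda ip, d: "temperror"))
```

Normalizing the IP address and lowercasing the domain keeps trivially different spellings of the same pair from missing the list.  How often the list should be rebuilt is a separate decision that this sketch does not address.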

Results

Prior to implementing these changes:
  • SPF Pass was 93.66%
  • TempError was one per 114 messages
  • PermError was one per 366 messages
Since implementing these changes:
  • SPF Pass is 96.20%
  • TempError is one per 6,732 messages
  • PermError is one per 1,339 messages
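
To put the figures above on a common scale: TempError became roughly 59 times rarer, PermError about 3.7 times rarer, and the share of messages without SPF Pass fell by about 40%.  The arithmetic can be reproduced with a short calculation (the function name improvement is just a label chosen for this sketch):

```python
def improvement(before_one_in, after_one_in):
    """How many times rarer an error became, given 'one per N messages' figures."""
    return after_one_in / before_one_in


temperror_gain = improvement(114, 6732)
permerror_gain = improvement(366, 1339)

# Share of messages that did not get SPF Pass, before and after, in percent.
non_pass_before = 100 - 93.66
non_pass_after = 100 - 96.20
non_pass_reduction = (non_pass_before - non_pass_after) / non_pass_before

print(f"TempError: {temperror_gain:.1f}x rarer")
print(f"PermError: {permerror_gain:.1f}x rarer")
print(f"Non-Pass messages reduced by {non_pass_reduction:.0%}")
```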
 
