Optimizing SPF
Problem reported by Douglas Foster - 3/19/2026 at 8:31 AM
Submitted
SPF Pass tells me that the SMTP Mail From address is accurate, and therefore not impersonated, within the limits of the technology. Any other result is either some form of failure or an ambiguous result, and the optimal disposition is quarantine. I want to minimize risk while minimizing quarantine review effort, so I want to maximize the frequency of SPF Pass. There are two different contexts for SPF evaluation:

  • When I am evaluating my own SPF policy, I want to interpret the rules strictly, to ensure that other organizations can use my SPF record to trust my messages.
  • When I am evaluating someone else’s SPF policy, I want to interpret the rules with grace, because getting the correct answer is more important than enforcing the rules.
Because I am using SPF to evaluate other organizations’ policies, I want to apply relaxed rules to minimize PermError and TempError results. The impact has been pretty dramatic, as shown in the Results section at the end.

Notes:
  • I evaluate SPF in Declude, using a modified version of the Python pyspf module. Messages are sent to quarantine if they do not produce SPF Pass and are not authenticated by DKIM or by local policy settings.
  • I detect messages with no valid recipients and silently discard them without doing any SPF checks. This step quickly excludes about 65% of all incoming messages. The statistics in the Results section apply only to the subset that is not discarded.
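The disposition rules in these two notes can be summarized as a small decision function. This is a minimal sketch, not Declude configuration: the `Disposition` enum, the `disposition` function, and its boolean inputs are hypothetical stand-ins for whatever the filter actually exposes.

```python
from enum import Enum


class Disposition(Enum):
    DISCARD = "discard"        # no valid recipients: dropped before any SPF work
    ACCEPT = "accept"          # authenticated: continues to content filtering
    QUARANTINE = "quarantine"  # not authenticated by any method


def disposition(has_valid_recipient: bool, spf_result: str,
                dkim_pass: bool, local_policy_pass: bool) -> Disposition:
    """Hypothetical sketch of the notes above, not Declude syntax.

    spf_result is a lowercase SPF result name such as "pass", "softfail",
    "neutral", "none", "temperror" or "permerror".
    """
    if not has_valid_recipient:
        # Silent discard comes first, so no SPF lookup is ever made.
        return Disposition.DISCARD
    if spf_result == "pass" or dkim_pass or local_policy_pass:
        return Disposition.ACCEPT
    # Every other SPF result without DKIM or local-policy authentication.
    return Disposition.QUARANTINE
```

For example, `disposition(True, "softfail", dkim_pass=False, local_policy_pass=False)` returns `Disposition.QUARANTINE`, while the same call with `dkim_pass=True` returns `Disposition.ACCEPT`.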

Minimizing PermError results:

  • Multiple SPF policy records are not allowed.  Just pick one to use.
     Experience indicates that this is usually a simple human error: the SPF record needed to be modified, but the user created a second one instead. The differences between the two records are usually minor, and may not be relevant to the Source IP that I need to check. So I pick one of the records and use it. An alternate strategy would be to check both of them, at the cost of extra processing effort, much of which is expected to be redundant.
     
  • Minor syntax errors make the whole policy invalid.  Fix common errors.
     Substitute the correct token for these common data entry errors:
     prt -> ptr
     ip -> ip4
     ipv4 -> ip4
     ipv6 -> ip6
     all. -> all
     Some errors can also be fixed by finding and inserting a missing space. For example, a ‘+’ qualifier should always be preceded by a space, so a space can be inserted if it is missing.
     
  • Some policy records have too much recursion.   Relax the limits.
     The specification (RFC 7208) limits the DNS-querying terms (include, a, mx, ptr, exists, and redirect) to 10 per evaluation. A limit of 20 is sufficient to avoid some PermErrors without creating a denial-of-service risk.
     
  • Some policy records have too many void lookups.  Relax the limits.
     DNS lookups that return no result are called void lookups. This applies to “A”, “MX”, and “INCLUDE” references that cannot be resolved in DNS. The specification says to return PermError once more than 2 void lookups occur. I have found that increasing the limit to 5 allows me to get a valid result without creating a denial-of-service risk.
     
  • Some INCLUDE references are invalid.   Ignore them and keep checking.
     If a PASS result can be determined without using the Include, use it.   If no result is achieved, the default “ALL” result will be applied.
  • Change order of processing to evade some errors and improve performance.
     Instead of evaluating a policy’s terms from left to right, consider reordering them from simple to complex. If a simple term returns Pass, the complex terms do not need to be evaluated. This appears to be the optimal order:
     Ip4 (no DNS lookups)
     Ip6 (no DNS lookups)
     A (one DNS lookup)
     Exists (one DNS lookup, but usually requires macro expansion as well)
     MX (multiple DNS lookups)
     Ptr (multiple DNS lookups, possibly many, therefore discouraged)
     Include (at least one DNS lookup, plus additional parsing effort)

  • Processing timeouts can cause PermError.   Relax the limits.
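The first two relaxations (choosing one record, repairing common typos) can be sketched as a pre-processing step that runs before the record is handed to the SPF evaluator. This is an illustration under assumptions, not the modified pyspf code: `pick_spf_record`, `repair_record`, and `TYPO_FIXES` are hypothetical names, and the missing-space fix is narrowed to a `+` followed by a mechanism name so that macro delimiters such as `%{l1r+-}` are not split.

```python
from __future__ import annotations

import re

# Common typos from the list above, keyed by the mistyped mechanism name.
TYPO_FIXES = {"prt": "ptr", "ip": "ip4", "ipv4": "ip4", "ipv6": "ip6", "all.": "all"}

# A "+" glued to the previous term, e.g. "ip4:192.0.2.1+mx". The lookahead
# requires a mechanism name (or one of the typos) so that macro delimiters
# such as "%{l1r+-}" are left alone.
GLUED_PLUS = re.compile(
    r"(?<=\S)\+(?=(?:all|include|a|mx|ptr|prt|ip4|ip6|ipv4|ipv6|ip|exists)\b)")


def pick_spf_record(txt_records: list[str]) -> str | None:
    """Relaxed rule: if several v=spf1 records exist, use the first one found."""
    candidates = [r for r in txt_records
                  if r.lower() == "v=spf1" or r.lower().startswith("v=spf1 ")]
    return candidates[0] if candidates else None


def repair_record(record: str) -> str:
    """Insert missing spaces before "+" and substitute the listed typos."""
    record = GLUED_PLUS.sub(" +", record)
    repaired = []
    for term in record.split():
        qualifier = term[0] if term[0] in "+-~?" else ""
        name, sep, value = term[len(qualifier):].partition(":")
        repaired.append(qualifier + TYPO_FIXES.get(name.lower(), name) + sep + value)
    return " ".join(repaired)
```

For example, `repair_record("v=spf1 ipv4:192.0.2.0/24 prt ip:198.51.100.7+mx -all.")` returns `"v=spf1 ip4:192.0.2.0/24 ptr ip4:198.51.100.7 +mx -all"`.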

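The relaxed limits, the tolerant handling of broken includes, and the cheapest-first ordering can be illustrated with a toy evaluator over an in-memory DNS table. This is a sketch under stated assumptions, not pyspf: `ToyEvaluator` and its stub `dns` dictionary are hypothetical, only `ip4`, `ip6`, `a`, `mx`, `include`, and `all` are modelled, and modifiers are ignored. To stay close to the text, only Pass-qualified terms are reordered; the remaining terms are then checked left to right. This can still differ from strict RFC 7208 evaluation when a Pass term and an earlier `-` or `~` term both match the same IP.

```python
from __future__ import annotations

import ipaddress

# Relaxed limits from the bullets above (RFC 7208 defaults are 10 and 2).
MAX_DNS_TERMS = 20
MAX_VOID_LOOKUPS = 5

# Cheapest mechanisms first, following the proposed order; "all" goes last.
COST_ORDER = ["ip4", "ip6", "a", "exists", "mx", "ptr", "include", "all"]
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


class PermError(Exception):
    pass


def cost(mech: str) -> int:
    return COST_ORDER.index(mech) if mech in COST_ORDER else len(COST_ORDER)


class ToyEvaluator:
    """Illustrative SPF evaluator over a stub DNS table (hypothetical).

    dns maps (record_type, name) to a list of answers; a missing key or an
    empty list is a void lookup. CIDR suffixes on a/mx are not handled.
    Use one instance per message so the counters start at zero.
    """

    def __init__(self, dns: dict[tuple[str, str], list[str]]):
        self.dns = dns
        self.dns_terms = 0
        self.voids = 0

    def lookup(self, rtype: str, name: str) -> list[str]:
        answers = self.dns.get((rtype, name.lower()), [])
        if not answers:
            self.voids += 1
            if self.voids > MAX_VOID_LOOKUPS:
                raise PermError("void lookup limit exceeded")
        return answers

    def matches(self, ip: str, mech: str, value: str, domain: str) -> bool:
        target = value or domain
        if mech in ("ip4", "ip6"):
            net = ipaddress.ip_network(value, strict=False)
            addr = ipaddress.ip_address(ip)
            return addr.version == net.version and addr in net
        if mech == "all":
            return True
        if mech not in ("a", "mx", "include"):
            return False  # exists, ptr and unknown terms are not modelled here
        self.dns_terms += 1
        if self.dns_terms > MAX_DNS_TERMS:
            raise PermError("DNS lookup limit exceeded")
        if mech == "a":  # IPv4 A records only in this toy
            return ip in self.lookup("A", target)
        if mech == "mx":
            return any(ip in self.lookup("A", host)
                       for host in self.lookup("MX", target))
        # Relaxed include: a target without an SPF record (result "none") is
        # simply a non-match instead of a PermError.
        return self.check(ip, target) == "pass"

    def check(self, ip: str, domain: str) -> str:
        records = [r for r in self.lookup("TXT", domain) if r.startswith("v=spf1")]
        if not records:
            return "none"
        terms = []
        for raw in records[0].split()[1:]:
            if "=" in raw:
                continue  # modifiers such as redirect= are ignored in this toy
            qualifier = raw[0] if raw[0] in QUALIFIERS else "+"
            mech, _, value = raw.lstrip("+-~?").partition(":")
            terms.append((qualifier, mech.lower(), value))
        # Phase 1: Pass-qualified terms, cheapest first; stop at the first match.
        for _, mech, value in sorted((t for t in terms if t[0] == "+"),
                                     key=lambda t: cost(t[1])):
            if self.matches(ip, mech, value, domain):
                return "pass"
        # Phase 2: remaining terms in their original left-to-right order.
        for qualifier, mech, value in terms:
            if qualifier != "+" and self.matches(ip, mech, value, domain):
                return QUALIFIERS[qualifier]
        return "neutral"
```

With a record such as `v=spf1 include:missing.example -ip4:203.0.113.0/24 mx ip4:192.0.2.0/24 ~all`, a message from 192.0.2.10 passes on the `ip4` term without any DNS query for `mx` or the include, and the missing include is treated as a non-match rather than a PermError.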
Minimizing TempError results

This problem mystified me. Because most Internet activity works so well, I expected DNS timeouts to be extremely rare, on the order of 1 per billion. I was surprised to find that the actual error rate is as high as 1 per several hundred. This is not unique to my system: the same level of TempError also appears in DMARC reports received from other organizations and in Authentication-Results headers added by others and included in my incoming mail stream.
I noticed that senders with very high mail volumes were also of high importance, and I did not want their messages to be mistreated because of a TempError result. I also realized that these high-volume senders were triggering a lot of SPF processing effort and DNS lookups simply to re-check a result that was already known.

I pulled a list of the top 50 IP-domain combinations that had SPF Pass.  To my surprise, these 50 pairs account for 44% of all messages processed.   Then I created a lookaside list for those pairs.   If the message has that combination of identifiers, the result is treated as SPF Pass without doing any policy lookup or policy processing.    This cut processing effort as well as avoiding false TempError results.
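The lookaside list can be sketched in a few lines: count the (IP, domain) pairs that produced SPF Pass, keep the most frequent ones, and short-circuit evaluation for those pairs. The log row format, the function names, and the `evaluate` callback are hypothetical; `evaluate` stands for the normal SPF check. Rebuilding the list periodically, presumably from recent logs, lets sender IPs that have been retired drop out.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable


def build_lookaside(log: list[tuple[str, str, str]], top_n: int = 50) -> set[tuple[str, str]]:
    """Pick the top_n most frequent (ip, domain) pairs that produced SPF Pass.

    log rows are (source_ip, mail_from_domain, spf_result); this format is
    a hypothetical example.
    """
    passes = Counter((ip, domain.lower())
                     for ip, domain, result in log if result == "pass")
    return {pair for pair, _count in passes.most_common(top_n)}


def coverage(log: list[tuple[str, str, str]], lookaside: set[tuple[str, str]]) -> float:
    """Fraction of logged messages whose pair is on the lookaside list."""
    hits = sum(1 for ip, domain, _ in log if (ip, domain.lower()) in lookaside)
    return hits / len(log) if log else 0.0


def spf_with_lookaside(ip: str, domain: str, lookaside: set[tuple[str, str]],
                       evaluate: Callable[[str, str], str]) -> str:
    """Return Pass for known pairs without any DNS query; otherwise evaluate."""
    if (ip, domain.lower()) in lookaside:
        return "pass"  # no lookup, so no chance of a TempError
    return evaluate(ip, domain)
```

The `coverage` helper is the kind of measurement behind the "50 pairs account for 44% of all messages" observation.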

Results

Prior to implementing these changes:
  • SPF Pass was 93.66%
  • TempError was one per 114 messages
  • PermError was one per 366 messages
Since implementing these changes:
  • SPF Pass is 96.20%
  • TempError is one per 6,732 messages
  • PermError is one per 1,339 messages
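To put the results on a common scale, the "one per N" figures can be converted to percentages. This is only arithmetic on the numbers listed above.

```python
from __future__ import annotations

# Figures copied from the Results list: SPF Pass as a percentage, errors as "one per N".
BEFORE = {"pass": 93.66, "temperror": 114, "permerror": 366}
AFTER = {"pass": 96.20, "temperror": 6732, "permerror": 1339}


def one_per_n_to_pct(n: int) -> float:
    """Convert 'one per n messages' into a percentage of messages."""
    return 100.0 / n


def summarize() -> dict[str, tuple[float, float, float]]:
    """Return (before %, after %, improvement factor) for each error type."""
    out = {}
    for kind in ("temperror", "permerror"):
        b, a = BEFORE[kind], AFTER[kind]
        out[kind] = (one_per_n_to_pct(b), one_per_n_to_pct(a), a / b)
    return out


for kind, (b, a, factor) in summarize().items():
    print(f"{kind}: {b:.3f}% -> {a:.4f}%, about {factor:.1f}x less frequent")
non_pass_before, non_pass_after = 100 - BEFORE["pass"], 100 - AFTER["pass"]
print(f"non-Pass share: {non_pass_before:.2f}% -> {non_pass_after:.2f}%")
```

Running it shows TempError falling from about 0.877% to 0.0149% of messages (roughly 59 times less frequent), PermError from about 0.273% to 0.0747% (roughly 3.7 times less frequent), and the non-Pass share falling from 6.34% to 3.80%.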
 
John Quest Replied
I was taught long ago, and still operate on the premise, that the only truly actionable SPF result is SPF Fail.
Douglas Foster Replied
You hit my passion point.

You are half right. SPF Pass does not require action, and SPF non-Pass is not immediately actionable. However, an authentication failure is an alarm that tells you to collect more information. The problem is that most spam tools do not provide the features you need to manage authentication, so people give up.

The typical scenario:
You start with "Block on SPF FAIL", because you want to prevent impersonation. Then you get a false positive because Example.com has messed up their SPF record. To fix this, your tool requires you to whitelist the Example.com domain. Now if an attacker impersonates Example.com, he not only gets past the authentication filter, he also gets past the content filter. In an attempt to protect your network from impersonation, you have to create a security hole that facilitates impersonation. This makes no sense, so you turn off SPF checking. The problem is not authentication; the problem is the lousy tools from people who want our money and claim to be experts.

Here's the correct solution:
If Example.com throws a false positive, you create a local policy record that provides alternate authentication:
"If the server domain is Outlook.com, and the server name is verified by fcDNS, and the SMTP Mail From domain is Example.com, then the message is treated as equivalent to SPF Pass".
Now you have distinguished Example.Com from the impersonators, increased the trust score for legitimate messages from Example.com, and blocked anyone attempting to impersonate Example.com.
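A local policy record like the one quoted above is essentially a set of conditions that must all hold. The sketch below assumes the forward-confirmed host name has already been computed (or is `None` when verification failed); `AltAuthRule` and `alt_auth_pass` are hypothetical names, not a feature of any particular product. The suffix is stored with a leading dot so that a name like `mail.evil-outlook.com` does not match `.outlook.com`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AltAuthRule:
    """One local SPF-equivalent rule (hypothetical structure)."""
    host_suffix: str       # verified server name must end with this, e.g. ".outlook.com"
    mail_from_domain: str  # SMTP Mail From domain, e.g. "example.com"


def alt_auth_pass(rules: list[AltAuthRule], verified_host: str | None,
                  mail_from: str) -> bool:
    """Return True if any rule treats the message as equivalent to SPF Pass.

    verified_host is the reverse-DNS name only when it passed forward
    confirmation; None means it did not, so no rule can match.
    mail_from may be a full address or just the domain.
    """
    if verified_host is None:
        return False
    host = verified_host.lower().rstrip(".")
    domain = mail_from.rpartition("@")[2].lower()
    return any(host.endswith(rule.host_suffix) and domain == rule.mail_from_domain
               for rule in rules)
```

Keeping rules as data means each quarantine review that ends in "acceptable" adds one entry to the list, rather than a domain-wide whitelist.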
 
Anything that needs whitelisting MUST be authenticated, but any message MAY need whitelisting, regardless of sender sophistication, so you NEED the ability to configure alternate authentication on ANY legitimate messages.   But if you CAN do that on ANY message, you SHOULD do it on EVERY message.

The first step in the process is to build tools to handle the exceptions. Then you can start sending unauthenticated messages to quarantine. Once they are there, you have to decide whether the correct response is an alternate authentication rule or a block rule, but you only have to decide once. The type of SPF failure does not matter: any message without Pass is possibly a malicious impersonation, and we are paid to keep that threat from getting through our filters.

The same principle applies to authentication of the From address.   DMARC is harmful if you only block impersonation when the other domain owner gives you permission to do so.   You are responsible for your network security, not them.
Sébastien Riccio Replied
Or maybe, tell example.com admin that they have an issue with their SPF record and that their mails are probably rejected elsewhere too.

Personally, I'm not really a big fan of adding exceptions and workarounds for mechanisms when the issue is at the sender side. Better tell them, it will benefit everyone.

SPF Fail with no DKIM or a failing DKIM signature -> immediately reject at SMTP level (or send to the spam folder, or follow the sender's wishes according to their published DMARC record, if any).

No SPF, No DKIM -> same -> reject at SMTP level (avoid sending bounces)

TempFail on SPF -> reject with a 4xx (the sender MTA should retry later).

A sender must at least have either valid DKIM or SPF Pass to get through.
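Sébastien's rules above translate into an SMTP-time decision roughly like this. The reply codes and texts are conventional choices made for illustration, `smtp_reply` is a hypothetical name, and the alternatives he mentions (spam folder, or following the sender's DMARC policy) are not modelled.

```python
def smtp_reply(spf_result: str, dkim_valid: bool) -> str:
    """Sketch of the SMTP-time rules listed above.

    spf_result is a lowercase SPF result name. The 250/451/550 codes are
    conventional SMTP replies chosen for illustration.
    """
    if spf_result == "pass" or dkim_valid:
        return "250 OK"  # at least one of SPF Pass or valid DKIM
    if spf_result == "temperror":
        return "451 Temporary SPF error, try again later"  # sender MTA retries
    # SPF fail, none, softfail, neutral or permerror with no valid DKIM:
    # rejecting in-session means the sending server generates any bounce.
    return "550 Authentication required"
```

Because the rejection happens during the SMTP session, the receiving side never has to send a bounce to a possibly forged return path.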

Sébastien Riccio System & Network Admin https://swisscenter.com
Douglas Foster Replied
The "You start with Block on SPF FAIL" scenario was my own story. I went looking for a vendor who could do whitelisting without creating a security hole and have not found one. Declude custom filters allowed me to build my solution. Today, 100% of my messages are authenticated by algorithm or local policy. About 1% are accepted on SPF Pass without From alignment, because I have learned that the risk is minimal.

Every human communication is interpreted in the context of the author or speaker, and acceptable content is also author-dependent. So all impersonation is fraud and should be blocked. You block all of it by requiring authentication.

There is a huge payback from inspecting authentication failures. They are about 10% malicious impersonation, 10% acceptable mail, and 80% spam without impersonation.

Authentication is the one filter that has a finite solution. Reputation knowledge is limited by an infinite number of sender addresses. Content filtering is limited by an infinite number of message possibilities. By comparison, authentication is simple. It has some risk of a false Pass that is very hard to detect, but it is still very effective.

I seem to be the only one in the world doing this, and that feels weird.  The rest of the world agrees with you.   It is a great disappointment to me that no one else has caught the vision.   This design seems to me to be both obvious and the only defensible one.
Douglas Foster Replied
I will also observe that there is a problem even when "block on Fail" correctly blocks an impersonation.   If true impersonation is detected,  you should be identifying and blocking the impersonator.   You should assume that the impersonator will change tactics and eventually attack with something that does not trigger "block on Fail".   If you tell him that he has been blocked, rather than using silent discard, you advise him to change attack strategies sooner rather than later.

But attackers are not stupid, so we should expect them to attack domains that produce neutral results, not hard fail results, whether the test is SPF or DMARC.   Our standard defenses are poorly matched to the probable attack vectors.
Douglas Foster Replied
To return to the original topic, whether you prioritize Pass or prioritize Fail, either approach benefits from minimizing error results and minimizing wasted effort.

But I need to respond to Sébastien's concerns about creating exceptions, because they reflect a misunderstanding of the difference between an alternate authentication rule and a content filtering exception.

Pass by itself says that the message is authentic, but it does not say whether the message is useful, because it does not involve any human judgement. An alternate authentication rule says that the message is authentic because I have inspected messages from this source, and it also says that the message is useful, because I chose to create this authentication rule. So Pass by local policy is a stronger result than Pass by algorithm, because it communicates a higher level of trust.

Alternate authentication does not exempt a message from content filtering unless you consider the messages so useful that you also want to whitelist them; that is a separate decision.

Alternate authentication does not allow impersonation, because it is contingent on an underlying identifier that is authenticated. Any message can be given proxy authentication because every message has at least one identifier that is presumed true: the Source IP.

A default rule of "Block on Fail" will allow impersonation of any domain that produces an uncertain result. A default rule of "Quarantine on non-Pass" will never allow impersonation.
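The difference between the two default rules can be shown by listing which SPF results reach the user without any authentication under each. This minimal sketch ignores DKIM for simplicity and treats "pass" as Pass by algorithm or by local policy; the function names are hypothetical.

```python
from __future__ import annotations

SPF_RESULTS = ["pass", "fail", "softfail", "neutral", "none", "temperror", "permerror"]


def block_on_fail(result: str) -> str:
    return "block" if result == "fail" else "deliver"


def quarantine_on_non_pass(result: str) -> str:
    # "pass" covers Pass by algorithm and Pass by a local alternate-authentication rule.
    return "deliver" if result == "pass" else "quarantine"


def delivered_unauthenticated(policy) -> list[str]:
    """SPF results that reach the user without authentication under a policy."""
    return [r for r in SPF_RESULTS if r != "pass" and policy(r) == "deliver"]


print(delivered_unauthenticated(block_on_fail))
# ['softfail', 'neutral', 'none', 'temperror', 'permerror']
print(delivered_unauthenticated(quarantine_on_non_pass))
# []
```

Every result in the first list is a possible impersonation path under "Block on Fail"; the second policy leaves none.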

When identity is confidently known, content filtering can be made contingent on identity. For example, we noticed that messages with emojis were consistently unwanted. Filtering on emojis proved difficult, so we configured content filtering to quarantine any subject text that started with the UTF-8 prefix. We knew that this would produce a lot of false positives, but we have accurate identity, so we keep a long list of identities that are exempt from the UTF-8 rule. The rule still applies to everyone else, and it filters out a lot of nuisance advertising.

(If someone can contribute the PCRE RegEx formula for identifying the Emoji character class within UTF-8, I would be grateful for the tip.) 
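As a partial answer to the regex question, here is a sketch in Python, with the PCRE spelling noted in a comment. It assumes that "the UTF-8 prefix" means an RFC 2047 encoded-word such as `=?UTF-8?B?...?=` at the start of the raw Subject header. The emoji class covers only the main Unicode emoji blocks, so it is an approximation rather than the full Unicode emoji property, and `needs_quarantine` and its `identity`/`exempt` inputs are hypothetical.

```python
from __future__ import annotations

import re
from email.header import decode_header, make_header

# Approximate emoji class made of the main emoji blocks: regional indicators
# (flags), Misc Symbols and Pictographs + Emoticons, Transport and Map,
# Supplemental Symbols and Pictographs, Symbols and Pictographs Extended-A,
# Misc Symbols + Dingbats. It is not the complete Unicode emoji property.
# PCRE equivalent (UTF mode, e.g. the /u modifier in PHP):
#   [\x{1F1E6}-\x{1F1FF}\x{1F300}-\x{1F64F}\x{1F680}-\x{1F6FF}\x{1F900}-\x{1F9FF}\x{1FA70}-\x{1FAFF}\x{2600}-\x{27BF}]
EMOJI = re.compile(
    "[\U0001F1E6-\U0001F1FF\U0001F300-\U0001F64F\U0001F680-\U0001F6FF"
    "\U0001F900-\U0001F9FF\U0001FA70-\U0001FAFF\u2600-\u27BF]"
)


def decoded_subject(raw_subject: str) -> str:
    """Decode RFC 2047 encoded-words such as =?UTF-8?B?...?= into text."""
    return str(make_header(decode_header(raw_subject)))


def needs_quarantine(raw_subject: str, identity: str, exempt: set[str],
                     emoji_only: bool = False) -> bool:
    """Broad UTF-8 prefix rule, or a narrower emoji test when emoji_only is set."""
    if identity in exempt:
        return False  # verified identities on the exemption list skip the rule
    if emoji_only:
        return EMOJI.search(decoded_subject(raw_subject)) is not None
    return raw_subject.lstrip().lower().startswith("=?utf-8?")
```

With the broad rule, `=?UTF-8?Q?Caf=C3=A9_menu?=` is quarantined (one of the expected false positives), while `emoji_only=True` lets it through and still catches `=?UTF-8?B?8J+OiSBCaWcgc2FsZQ==?=`, which decodes to "🎉 Big sale".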
John Quest Replied
The typical scenario:
You start with "Block on SPF FAIL", because you want to prevent impersonation.   Then you get a false positive because Example.com has messed up their SPF record.   To fix this, your tool requires you to whitelist the Example.com domain.    Now if an attacker impersonates Example.com, he not only gets past the authentication filter, he also gets past the content filter.   In an attempt to protect your network from impersonation, you have to create a security hole that facilitates impersonation.   This makes no sense, so you turn off SPF checking.   The problem is not authentication; the problem is the lousy tools from people who want our money and claim to be experts.

Call me a glutton for punishment, but in that scenario, EXAMPLE.com does not get a pass, but rather a specific filter whose hits end up in my HOLD_SPECIFIC manual review folder. Then I work on very specific allowances for any legitimate messages that get held by that specific filter.
Douglas Foster Replied
You are agreeing with me.
The correct process:
Example.com has an SPF policy that is missing, malformed, or contains omissions, producing some result other than Fail. (Assume that there is no DKIM signature to override the SPF problem.) Because the message cannot be authenticated, I send it to Quarantine. In my configuration, content checking is still applied, so the message may be blocked, but it will proceed to Quarantine if there are no other problems.
I review quarantine and determine that this message is acceptable and therefore future messages are expected to be acceptable.   I also note that the message comes from an Outlook.com server, and the Reverse DNS name can be verified with forward-confirmed DNS.
I do not wish to repeat this process on future messages of this type, so I create an SPF-equivalent allow rule:
- If server name ends with .outlook.com
- and the server name is verified by forward-confirmed DNS
- and the SMTP Mail From is example.com
- Then the Mail From address is considered authenticated.
(It should be obvious that I want to use a verified host name rather than an IP list, since I cannot keep track of Outlook.com's IP address list.) 
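The forward-confirmed DNS condition in the rule above can be checked with Python's standard `socket` module: look up the PTR name for the source IP, resolve that name forward, and accept it only if the original IP is among the answers. This is a sketch, not the Declude filter; `fcdns_name` and `example_com_via_outlook` are hypothetical names, the resolver functions can be swapped out for testing, and the Outlook.com/example.com values are hard-coded only to mirror the example.

```python
from __future__ import annotations

import ipaddress
import socket
from typing import Callable, Iterable


def system_reverse(ip: str) -> list[str]:
    """PTR names for ip via the system resolver."""
    name, aliases, _addrs = socket.gethostbyaddr(ip)
    return [name, *aliases]


def system_forward(name: str) -> set[str]:
    """Addresses (A and AAAA) that name resolves to via the system resolver."""
    return {info[4][0] for info in socket.getaddrinfo(name, None)}


def fcdns_name(ip: str,
               reverse: Callable[[str], Iterable[str]] = system_reverse,
               forward: Callable[[str], set[str]] = system_forward) -> str | None:
    """Return a reverse-DNS name for ip that resolves back to ip, else None."""
    target = ipaddress.ip_address(ip)
    try:
        names = list(reverse(ip))
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return None
    for name in names:
        try:
            addresses = forward(name)
        except OSError:
            continue
        if any(ipaddress.ip_address(a) == target for a in addresses):
            return name.lower().rstrip(".")
    return None


def example_com_via_outlook(ip: str, mail_from_domain: str, **resolvers) -> bool:
    """The allow rule above: verified *.outlook.com server + Mail From example.com."""
    name = fcdns_name(ip, **resolvers)
    return (name is not None
            and name.endswith(".outlook.com")
            and mail_from_domain.lower() == "example.com")
```

Matching on the verified host name rather than on addresses is what lets the rule survive changes to Outlook.com's IP ranges.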

If Example.Com also uses SendGrid.net, the first message from that source will also get quarantined and will need a separate allow rule.

The first message from anybody who tries to impersonate Example.com will be quarantined.   When quarantine review exposes the impersonation, the responsible identifier will be determined by inspection, and the responsible identifier(s) will be blocked.

The problem is that I have yet to find a commercial product that can do rules of this type. Most do not permit multiple-attribute rules. The few that support multiple-attribute rules cannot filter on a forward-confirmed host name. With either limitation, I cannot create an allow rule with the necessary characteristics, so these products do not provide the tools needed to defend against impersonation by moving to mandatory authentication through a combination of algorithm and local policy. If these experts cannot provide an effective defense against impersonation, the easiest problem to solve, they don't warrant my respect or my money.

I have lots of problems with spam, but impersonation is not one of them, and that is a good thing.
