IOwait causing SmarterMail Linux to freeze
Problem reported by Rami El-Zein - 4/7/2026 at 2:03 PM
Submitted
Hello,
I’m experiencing an intermittent issue with SmarterMail that I haven’t been able to isolate, and I’d appreciate any guidance from the community.
I’m running the latest SmarterMail Enterprise (version 100.0.9560.29387 – 03/05/2026) with 1,770 users on a Linux-based AWS instance (16 vCPU, 32 GB RAM, and 13 TB gp3 storage).
Outgoing mail is routed through MailChannels.
System storage:
  • / (root): 48 GB total, 35% used
  • /data: 12 TB total, 92% used
The issue:
Once or twice during workdays, IOwait suddenly spikes to 70–90%, and load average climbs as high as 50. At that point, the server becomes unresponsive until I restart SmarterMail. It feels like a process or user action gets out of control, but I haven’t been able to identify the root cause or determine what limits I should enforce to prevent this.
Usage details:
  • Peak concurrent users: 700+ (webmail, IMAP, MAPI/EWS mix)
  • The issue can also occur at lower load (around half that number)
  • Some mailboxes are quite large (40–60 GB)
Storage tuning:
  • Increased gp3 IOPS from 3,000 → 6,000
  • Increased throughput from 125 MB/s → 375 MB/s
  • Despite this, the issue still occurred twice today
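For context on the numbers above, here is a rough sketch of how the two provisioned limits interact. It assumes the throughput figure is in MiB/s and that each request counts as a single I/O; the I/O sizes in the loop are illustrative, not measured from this server.

```python
KIB = 1024
MIB = 1024 * KIB

PROVISIONED_IOPS = 6000   # from the post
PROVISIONED_MIB_S = 375   # from the post, treated as MiB/s (assumption)


def throughput_at_iops(iops: int, io_size_bytes: int) -> float:
    """MiB/s moved if every one of `iops` operations transfers io_size_bytes."""
    return iops * io_size_bytes / MIB


def crossover_io_size_kib(iops: int, throughput_mib_s: float) -> float:
    """I/O size (KiB) at which the throughput limit is reached at full IOPS."""
    return throughput_mib_s * MIB / iops / KIB


# Illustrative I/O sizes only; real mail workloads mix many sizes.
for size_kib in (4, 16, 64, 256):
    demand = throughput_at_iops(PROVISIONED_IOPS, size_kib * KIB)
    binding = "IOPS" if demand < PROVISIONED_MIB_S else "throughput"
    print(f"{size_kib:>3} KiB I/O: {demand:7.1f} MiB/s at {PROVISIONED_IOPS} IOPS "
          f"-> {binding} limit binds first")

print(f"crossover at {crossover_io_size_kib(PROVISIONED_IOPS, PROVISIONED_MIB_S):.0f} KiB")
```

Under these assumptions the crossover is 64 KiB: with 4 KiB operations, 6,000 IOPS moves only about 23 MiB/s, so for small-file workloads the IOPS limit (and per-request latency) is likely to matter well before the 375 MB/s throughput limit does.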
Today’s stats (Reports):
  • Bandwidth:
    • SMTP In: 55 GB
    • SMTP Out: 2.2 GB
    • IMAP: 86.8 GB
    • POP: 3.2 GB
  • Messages:
    • Inbound: 47K
    • Outbound: 7.7K
Sessions:
  • SMTP In:
    • New: 57K
    • Bad Commands: 31K
    • Terminations: 6.3K
  • SMTP Out:
    • New: 2.9K
    • Terminations: 93
  • IMAP:
    • New: 84.8K
    • Bad Commands: 1.1K
    • Terminations: 1.9K
  • POP:
    • New: 5K
    • Bad Commands: 2.2K
    • Terminations: 38
Spam:
  • Total inbound spam: ~3.8K (seems reasonable)
IDS settings (current):
  • Bad SMTP Sessions (Fast): Block, 5 min / 10 threshold / 60 min block
  • Bad SMTP Sessions (Slow): Block, 60 min / 25 threshold / 360 min block
  • Bounces: Quarantine, 5 min / 10 threshold / 30 min block
  • DoS: Block, 2 min / 100 threshold / 30 min block
  • Internal Spammer: Block, 10 min / 100 threshold / 60 min block
  • Password Brute Force/IP: Block, 5 min / 200 threshold / 30 min block
  • Password Retrieval Brute: Block, 5 min / 50 threshold / 30 min block
Other notes:
  • Max 3 concurrent migrations allowed (and rarely reached)
  • Indexing settings:
    • Max threads: 5
    • Items per pass: 100
    • Queue delay: 30 seconds
Questions:
  • Has anyone encountered similar IOwait spikes tied to SmarterMail?
  • Could this be related to indexing, large mailboxes, or IMAP behavior?
  • Are there recommended limits (connections, indexing, mailbox size, etc.) to prevent this type of resource spike?
  • Do my IDS thresholds look reasonable, or should they be more aggressive?
Any suggestions on where to start troubleshooting would be greatly appreciated.
Thanks in advance.
Zach Sylvester Replied
Employee Post
Hey Rami, 

Thanks for the question. 

One thing you could try is setting your max indexing threads to 2 and the items per pass to 1000.
Generally I recommend only 1 indexing thread per thousand users, as indexing can use a lot of IO. 
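As a quick worked example of that rule of thumb: the helper below is hypothetical, and rounding up is an assumption (it happens to give 2 for the 1,770 users mentioned above).

```python
import math


def suggested_index_threads(users: int, users_per_thread: int = 1000) -> int:
    """Rule-of-thumb thread count: one thread per `users_per_thread` users.

    Rounding up and the minimum of 1 are assumptions, not documented behavior.
    """
    return max(1, math.ceil(users / users_per_thread))


print(suggested_index_threads(1770))
```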

One thing I have also heard about is that gp3 volumes start to slow down as you reach their max capacity. So you could try creating a second volume and enabling Secondary Storage for the domains. 

Per our help
  • Secondary Path / Secondary Storage - A secondary path, or Secondary Storage, can be used for older emails and files in SmarterMail File Storage so as to preserve disk space on a primary drive. SmarterMail allows system administrators to select an "Age" to use for automatically, and continually, moving these files in the background, starting at 90 days and then increasing in 30 day increments.

    For example, when spinning up a SmarterMail server, administrators can start by configuring a Primary Path. As a server grows, however, a Secondary Path can be added. This secondary path can point to a drive or an array of standard HDDs, which aren't as efficient in terms of disk i/o but can save on overall cost. In addition, should the Secondary Path not prove necessary -- say, a server is being decommissioned or users are being migrated to new hardware -- then the Secondary Path can be disabled, and any files stored there are moved back to the Primary Path. 
    Note: Regarding moving emails and files, SmarterMail will move entire messages, including any attachments associated to the messages. As for File Storage, only files uploaded directly to SmarterMail File Storage are affected. Files associated with (i.e., attached or uploaded to) SmarterMail Chat, Online Meetings, Calendars, Contacts, Tasks, or Notes, are not moved and will use the Primary Path. To change a Secondary Path, please see our KB article: Changing a Secondary Path Location for a Domain.


Let me know if this helps. 


Kind Regards,  

Zach Sylvester

Software Developer
SmarterTools Inc.
Rami El-Zein Replied
Thanks Zach, but this did not help and it's happening again, usually around midday when most of the users are online. It's as if someone is doing something that causes this. While it's happening now, there are only 500 connections, 100 of them using webmail. Do you think that doubling the server cores/RAM and increasing IOPS might help?
J. LaDow Replied
Look at server resource usage before just throwing memory and IOPS at it. There should be a way to tell if you're suffering from disk throughput bottlenecks or memory exhaustion -- 

It sounds more like you have a user with a corrupted mailbox than anything else --



MailEnable survivor / convert --
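One way to tell the two apart on Linux, as a minimal sketch: compare two snapshots of the aggregate `cpu` line in `/proc/stat` to get the iowait share, and read `MemAvailable` from `/proc/meminfo`. The function names are made up for illustration; tools such as `vmstat` or `iostat` report the same counters.

```python
import time


def parse_cpu_line(stat_text: str) -> list[int]:
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(x) for x in line.split()[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two /proc/stat snapshots."""
    # First 8 fields: user nice system idle iowait irq softirq steal
    a = parse_cpu_line(before)[:8]
    b = parse_cpu_line(after)[:8]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


def mem_available_percent(meminfo_text: str) -> float:
    """MemAvailable as a percentage of MemTotal from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key] = int(parts[0])
    return 100.0 * fields["MemAvailable"] / fields["MemTotal"]


def sample_live(interval: float = 5.0) -> None:
    """Take one live sample on a Linux host and print both figures."""
    with open("/proc/stat") as f:
        before = f.read()
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = f.read()
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    print(f"iowait {iowait_percent(before, after):.1f}%  "
          f"MemAvailable {mem_available_percent(meminfo):.1f}%")
```

Calling `sample_live()` in a loop during a good period and again during a spike would show whether high iowait coincides with low available memory (pointing at memory pressure) or occurs with plenty of memory free (pointing at the disk).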
Kyle Kerst Replied
Employee Post
I too suspect a problematic mailbox since it's happening in the middle of the day and seems to spike. If most of your users utilize IMAP or EWS, you can try increasing those logs to Detailed for a period of time to look for any exceptions and which users they pertain to. Keep in mind though that the verbose logging will increase disk IO on whichever disk you save logs to. 
Kyle Kerst
Lead Internal Network/System Administrator
SmarterTools Inc.
Tim Uzzanti Replied
Employee Post
Please look at the latency on your gp3 volume during both good and bad periods. Cloud storage is notoriously slower than what’s often marketed, especially for the small file operations required by mail servers.

Share the results here, and we’d be happy to help.
Tim Uzzanti
CEO
SmarterTools Inc.
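A minimal sketch of one way to capture that latency from the host itself, assuming the standard Linux `/proc/diskstats` layout: it computes the average time per completed I/O between two snapshots, similar to the await figure from `iostat -x`. The device name `nvme1n1` is only a placeholder; `lsblk` shows which device backs /data.

```python
import time


def parse_diskstats(text: str) -> dict[str, dict[str, int]]:
    """Extract completed I/O counts and time spent (ms) per device."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 11:
            continue
        stats[f[2]] = {
            "reads": int(f[3]),      # reads completed
            "read_ms": int(f[6]),    # ms spent reading
            "writes": int(f[7]),     # writes completed
            "write_ms": int(f[10]),  # ms spent writing
        }
    return stats


def average_latency_ms(before: str, after: str, device: str) -> float:
    """Average ms per completed I/O on `device` between two snapshots."""
    a = parse_diskstats(before)[device]
    b = parse_diskstats(after)[device]
    ios = (b["reads"] - a["reads"]) + (b["writes"] - a["writes"])
    ms = (b["read_ms"] - a["read_ms"]) + (b["write_ms"] - a["write_ms"])
    return ms / ios if ios else 0.0


def sample_live(device: str = "nvme1n1", interval: float = 10.0) -> None:
    """Take one live sample; the default device name is a placeholder."""
    with open("/proc/diskstats") as f:
        before = f.read()
    time.sleep(interval)
    with open("/proc/diskstats") as f:
        after = f.read()
    print(f"{device}: {average_latency_ms(before, after, device):.2f} ms per I/O")
```

Logging this every few seconds through a normal period and through a spike gives the good/bad comparison asked for above. The volume's CloudWatch metrics are another source for the same comparison.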
Rami El-Zein Replied
Thank you for all the feedback. I wanted to share something I tried: I copied and pasted the issue into ChatGPT and uploaded the SmarterMail error logs from the past few weeks. It quickly identified a few email accounts with high error counts and suggested reindexing and rebuilding them. I’ve done that and will report back tomorrow on the results.
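For anyone who wants to do the same tally locally, here is a rough sketch. The actual SmarterMail log format is not shown in this thread, so the script only assumes that a problem line contains the word "error" or "exception" plus an email address; the directory argument and `*.log` pattern are placeholders.

```python
import re
from collections import Counter
from pathlib import Path
from typing import Iterable

# Assumed patterns; adjust to match the real log lines.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ERROR_RE = re.compile(r"exception|error", re.IGNORECASE)


def count_errors_by_account(lines: Iterable[str]) -> Counter:
    """Count error/exception lines per email address mentioned on the line."""
    counts: Counter = Counter()
    for line in lines:
        if ERROR_RE.search(line):
            # set() so an address repeated on one line is counted once
            for address in {a.lower() for a in EMAIL_RE.findall(line)}:
                counts[address] += 1
    return counts


def scan_log_dir(log_dir: str, pattern: str = "*.log") -> Counter:
    """Tally every matching file under `log_dir` (placeholder path and pattern)."""
    totals: Counter = Counter()
    for path in Path(log_dir).glob(pattern):
        with open(path, errors="replace") as f:
            totals.update(count_errors_by_account(f))
    return totals
```

`scan_log_dir("/path/to/logs").most_common(10)` would then list the ten accounts with the most error lines, which are the first candidates to inspect.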
