3
EXTREME HIGH CPU after 7053 update
Problem reported by Rod Strumbel - 5/6/2019 at 7:38 AM
Resolved
Updated server from 6985 to 7053 over the weekend.
Now as we get into peak traffic we are seeing the CPU just getting hammered.

Server was rebooted between the uninstall of 6985 and the install of 7053 AND after the install of 7053.

It is the Smarter Mail service that is chewing up the CPU.
We normally ride along around 45 to 50% CPU all day.




No issues being reported by users yet, but our VM alert system is yelling at us pretty frequently about high CPU usage on that VM.

No IDS Blocks
Spool is normal at under 8 to 15 most of the time
Just a couple of Waiting to Deliver messages from time to time
I flushed the virus quarantine (there were 170 in there but going back months), didn't make a difference
User activity is not out of the norm, only around 120 simultaneous/online users
Only around 250 connections of all types at a time
Nothing in the Server Blacklist either

And yet the SM Service is just buzzing.

Any thoughts?


Rod Strumbel


24 Replies

0
JerseyConnect Team Replied
What is your indexing activity like?
0
Rod Strumbel Replied
449 boxes currently listed, but it is stepping through and completing and moving on to the next like normal.
Haven't really paid attention to know what a "normal" load of indexing would be on my system though.


0
Christian Schmit Replied
We saw the same issue on one of our servers after upgrading to build 7053. It seems to be related to traffic coming in over IIS, as restarting IIS (without restarting SmarterMail) brings CPU usage back to normal for some time in our case.

Support recommended to run procdump as follows to collect more information:

Download procdump: https://docs.microsoft.com/en-us/sysinternals/downloads/procdump
Run procdump: procdump.exe MailService.exe -ma -accepteula %USERPROFILE%\Desktop\ -c 75 -n 3
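
For anyone wondering what that command waits for: per the procdump documentation, -c is the CPU threshold, -n is the number of dumps to write before exiting, -ma writes a full-memory dump, and the consecutive-seconds window (-s) defaults to 10 when not given. The Python below is only a rough model of that trigger logic, not procdump's actual implementation; the function name and the "restart the count after each dump" behaviour are assumptions made for illustration.

```python
def dump_trigger_points(cpu_samples, threshold=75, consecutive=10, max_dumps=3):
    """Model when dumps would fire, given one CPU reading (percent) per second.

    Returns the sample indexes at which a dump would be written.
    Hypothetical helper: a simplified model, not procdump's real code.
    """
    triggers = []
    streak = 0  # how many seconds in a row CPU has been at/above threshold
    for second, cpu in enumerate(cpu_samples):
        streak = streak + 1 if cpu >= threshold else 0
        if streak == consecutive:
            triggers.append(second)
            streak = 0  # assumption: counting starts afresh after each dump
            if len(triggers) == max_dumps:
                break  # like -n 3: stop once three dumps are written
    return triggers


# A server pinned at 100% for 35 seconds would produce all three dumps.
pinned = dump_trigger_points([100] * 35)

# A one-second dip below the threshold restarts the 10-second window.
dipped = dump_trigger_points([50] * 5 + [100] * 9 + [50] + [100] * 10)
```

The practical upshot is that a short spike will not produce a dump; the CPU has to stay above 75% for the whole window, which is why this is best run while the problem is actually happening.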




0
Rod Strumbel Replied
I took down the IIS server altogether... 100% CPU remains.
Put back online now.... still 100% CPU.
0
Rod Strumbel Replied
I've opened a ticket on this, something is not right.

Thanks everyone for the ideas.

Will post results once a fix is figured out.

Rod
0
Kyle Kerst Replied
Employee Post
Rod, we did make some changes to the indexing max thread count in 7016 and corrected some high CPU causes in 7008 as well, so there were changes between 6985 and current that could explain this. My thoughts are that we're likely in the process of cleaning up the indexes for users who had issues in there to begin with. 
Kyle Kerst IT Coordinator SmarterTools Inc. www.smartertools.com
1
Rod Strumbel Replied
The SM folks are in the server now.

I hear where you are coming from, Kyle. My server was still littered with the XML vs JSON config issue too; was that cleanup part of this update as well?

Watching what's going on so far, it appears to be a combination of the SM service and, from time to time, clamd spiking as well. One thing they have identified is that they think we may be configured with too few CPUs in the VM currently.
0
Kyle Kerst Replied
Employee Post
Rod, we do keep quite a few of the XMLs around for safety's sake as they do come in handy when restoring old user configurations, etc. I know that cleanup is due to be looked at, but we'll likely still keep at least portions of the files around for restore purposes at least in the short term. 

As to the CPUs, this could definitely cause some of these problems. Any time SmarterMail runs short on resources the server will begin queuing up requests, consuming more memory, using garbage collection more frequently, and subsequently beginning to page for memory due to the increase in resource utilization caused by the queuing. Anything you can do to improve the available CPU cores and memory should help!
Kyle Kerst IT Coordinator SmarterTools Inc. www.smartertools.com
0
Rod Strumbel Replied
And my response to that is... sure, throw more CPU at it to make a CPU issue "vanish".

If it was an issue before the update I could see it, but it wasn't.

I'm waiting for my VMware guys to see if it is an option.
Not my area of expertise at all. :)
0
Scarab Replied
We had the exact same issue on all three of our servers immediately after upgrading to 7053. We doubled the CPU cores from 4 to 8 on the one server that runs on Hyper-V, but it was happening on the two bare-metal servers too. We let it run for 8 hours at 100% CPU and it never decreased, even for a second, so I doubt it is an indexing issue, especially since we had the exact same 100% CPU issue on our two Incoming Gateways after the upgrade (which only have one user). It was causing a > 12 min delivery delay from the SmarterMail Gateway to SmarterMail and another > 12 min delivery delay to get to the user's Inbox. Eventually the Spool filled up to 4000+ messages and we had to pull the plug.

Ultimately, we had to roll back to build 7040.

Post-Script: We originally used the MSI install for 7053. Even tried uninstalling, rebooting, and doing the EXE install with the exact same results. Restarting IIS didn't resolve the issue on any server for us.
0
Rod Strumbel Replied
Well, the techs pulled logs from the server; my understanding is that it is with engineering now.
They used the tool Christian Schmit mentioned above and also installed a dotnet tracing tool to capture what was going on in the program while the CPU issue was taking place.

Still not impacting clients so hoping a MOVE FORWARD fix is the answer.
I HATE reverting... it seems to never ever go well.

Rod
1
Rod Strumbel Replied
Temporary conclusion, but no resolution yet:

The development team seems to think the issue may stem from the White/Black list not being cached.
So every time an IP connects to the server, it has to do a database dip (or file read... I'm not sure which).
They are building a caching mechanism for that now.
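
To make the idea concrete, here is a minimal sketch of that kind of caching in Python. It is not SmarterMail's code; the thread only says that lookups were not cached and that a caching mechanism was being built. The class name, the loader callback, and the 60-second refresh interval are all assumptions for illustration.

```python
import time


class CachedIpList:
    """Keep a whitelist/blacklist in memory and reload it only occasionally.

    Hypothetical sketch: `load_entries` stands in for the expensive database
    dip or file read that would otherwise run on every incoming connection.
    """

    def __init__(self, load_entries, ttl_seconds=60.0, clock=time.monotonic):
        self._load_entries = load_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = frozenset()
        self._loaded_at = None  # None means "never loaded yet"

    def contains(self, ip):
        now = self._clock()
        # Reload only on first use or once the cached copy is older than the TTL.
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._entries = frozenset(self._load_entries())
            self._loaded_at = now
        return ip in self._entries  # set membership, no I/O


load_count = [0]  # counts how often the slow source is actually read


def load_blacklist():
    load_count[0] += 1
    return ["203.0.113.7", "198.51.100.22"]  # documentation-range example IPs


blacklist = CachedIpList(load_blacklist, ttl_seconds=60.0)
hits = [blacklist.contains(ip) for ip in ("203.0.113.7", "192.0.2.1", "203.0.113.7")]
```

Three connection checks cost one read of the list instead of three. With a list of thousands of entries and hundreds of connections at a time, that difference is the kind of thing that could plausibly separate a quiet CPU from a pegged one.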

No timeline for the fix as of yet... but I'm askin'

Rod
0
Scarab Replied
Rod,

That makes a lot more sense as the cause of the high CPU usage. I know we have *HUGE* Whitelist/Blacklist/SMTP Block lists on all of our SM servers, including the Gateways... I think around 13K entries, so it wouldn't surprise me that the build would behave perfectly fine for some installations but then go crazy nuts on others, including Gateways.

Good to know that a fix is in the pipeline. Thanks for keeping us updated.
0
Rod Strumbel Replied
Yeah ours is pretty big too... 
I have a rule that if I see 20 or more IPs hacking from a class C, the entire class C gets blocked.
May torque some people, but it works.
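
For illustration, that rule could be scripted roughly as below, treating "class C" as a /24 network in CIDR terms. This is only a sketch: the function name is made up, the threshold of 20 comes from the post above, and how the offending IPs get collected (logs, IDS, etc.) is left out entirely.

```python
import ipaddress
from collections import defaultdict


def networks_to_block(offending_ips, min_distinct_ips=20, prefix_len=24):
    """Return the /24 networks that contain at least `min_distinct_ips` offenders.

    Hypothetical helper; IPv6 addresses are skipped because the
    "class C" rule only makes sense for IPv4.
    """
    seen = defaultdict(set)  # network -> distinct offending addresses in it
    for raw in offending_ips:
        ip = ipaddress.ip_address(raw)
        if ip.version != 4:
            continue
        # strict=False lets us build the enclosing network from a host address.
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        seen[net].add(ip)
    return sorted(net for net, ips in seen.items() if len(ips) >= min_distinct_ips)


# 20 distinct offenders in 203.0.113.0/24, only 19 in 198.51.100.0/24.
offenders = (
    [f"203.0.113.{i}" for i in range(1, 21)]
    + [f"198.51.100.{i}" for i in range(1, 20)]
    + ["203.0.113.1"]  # a repeat offender is not counted twice
)
to_block = networks_to_block(offenders)
```

One /24 entry replaces up to 254 individual IP entries, so a rule like this also keeps the list itself shorter than blocking every address one at a time.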

1
Rod Strumbel Replied
Quick update... SM has a PATCH 7065 that I'll be installing late night Saturday 5/11/19... will post back after that installation and some monitoring / testing. My understanding is they simply set up some things that were not being cached to be cached. Guess we'll see how it plays out.

As an aside... for Enterprise Edition customers... shouldn't it really include 2 full licenses of whatever size you have purchased?   Sure would have been nice to be able to simply use failover here to switch traffic to one machine, install patch on other machine, then switch it back... just my $.02.
0
Rod Strumbel Replied
Heads up... make that patch 7066. It will be applied this weekend, as an issue was discovered during SM internal testing that had to be repaired.

Rod
0
Luis Zito Replied
Hello,

I have the same problem, after upgrade to 7053.

zito
DHO HOST
1
Employee Replied
Employee Post
Hi Luis. Can you uninstall your current version of SmarterMail, and then install this build?
0
Rod Strumbel Replied
WooHoo... an earlier guinea pig. :)
0
Christian Schmit Replied
Last night's build 7068 has solved the high CPU usage problem for us.
0
Rod Strumbel Replied
Kyle, should I be loading 7066 or the publicly released 7068 this weekend? The release notes seem to indicate the bl/wl caching is included in the released version.

0
Employee Replied
Employee Post
Rod, please use the public release 7068.  It does include the changes in the custom build.
0
Rod Strumbel Replied
Applied 7068 tonight, along with a cumulative patch to Win 2016. Appears to be "somewhat" fixed. I'm still seeing SPIKES up to 100% but they are not frequent and may just be the system running nightly processes of some sort (not sure, not normally watching at this time of night). Will know more as we ramp up the traffic levels on Monday morning.
2
Rod Strumbel Replied
Through the first peak period of traffic, and things are looking much better.

I have identified a potential issue with ClamAV churning a bit and using all the processors for up to 10 seconds at a time but in general... the 7068 update fixed the biggest part of the High CPU issue.

Thanks SM folks!
