Are you getting an error in your ePO deployment like this: “EPOCore – com.mcafee.epo.core.ServerNative.getFipsModeNative()I”? So was I; keep reading.
My ePO deployment had been chugging along without issue for four months since I installed it last October. From what I had heard, that was quite the feat. Not to say that ePO is a bad product; I quite like it. I think others have issues because the art of ePO is that you are perpetually tuning it, and any information system that has changes being made on it has a greater chance of going down. This is what took my installation down, and in the following I will explain all the frustrating troubleshooting steps my cohort and I performed, along with our solution and our determination of the cause.
It had been a thankfully normal day in a very hectic week. I had just put some finishing touches on a PowerShell script and was about to start hunting down a somewhat annoying latency problem in our virtual environment when my friend and colleague reminded me that the latest pool of virtual workstations was not getting the McAfee agent pushed to it, and subsequently none of the endpoint security (antivirus, etc.). So I jumped onto the ePO web console to take a look. The login page appeared but didn’t look quite right. I logged in and the dashboard really didn’t look right: there was no data in any of the widgets and all the extensions in the menu were gone. So I started simply by logging out. I then noticed that the scroll bar was very long, so I scrolled to the bottom and was greeted with 20+ red single-line error messages. All the messages were similar: epoMigration – Dependent plugin SoftwareMgmt failed with initialization error.
If you want to skip the long story and get to the solution, click here.
I backtracked from one plug-in to the next until I found the start of the failed plugins, which turned out to be EPOCore. “Well, that’s your problem right there,” I said to my buddy, who was now surfin’ on my shoulder. The next logical step was to check out the host server to see if any services had gotten bound up. We found that only one McAfee service set to automatic was not running: the McAfee ePolicy Orchestrator 5.3.0 Event Parser Service. When I tried to start it I got that letdown of a message stating that the service had started and then immediately stopped. Next I restarted the McAfee ePolicy Orchestrator 5.3.0 Application Server Service; doing so also restarted the McAfee ePolicy Orchestrator 5.3.0 Server Service (the web server). We received no errors, and the services stopped and started with no apparent issues. Unfortunately, the web console told a different story; it was still showing all those errors at the login screen. So, following in the footsteps of my fictional British TV show mentor, Roy, I tried “turning it off and on again,” the whole system that is. Still no fix.
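If you want to do this service check from a prompt instead of the Services MMC, a quick PowerShell sketch like the one below works. The display names are taken from this post (a 5.3.0 install); adjust the wildcard for your version, and note that the StartType property requires PowerShell 5 or later.

```powershell
# List the ePO services and their states so a stopped automatic service
# stands out at a glance.
Get-Service -DisplayName "McAfee ePolicy Orchestrator*" |
    Select-Object DisplayName, Status, StartType |
    Format-Table -AutoSize

# Try to start the stuck service directly; -ErrorAction Stop surfaces the
# failure instead of burying it.
Start-Service -DisplayName "McAfee ePolicy Orchestrator 5.3.0 Event Parser Service" -ErrorAction Stop
```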
Paging Dr. Google; Dr. Google please come to the rescue
Now that the standard stuff wasn’t working, we asked the patient what other symptoms, if any, it was having by poking around in the logs. The standard Windows logs (system, application, and security) provided no help. Next I went in search of McAfee logs, and I found what looked to be interesting in C:\Program Files (x86)\McAfee\ePolicy Orchestrator\Server\logs: the orion.log file. It was rather large, 60+ MB, and had a modified time stamp of only a few minutes ago. In it there was an error that had us thinking there was some issue with connecting to the SQL database, so we jumped to the ePO database configuration page, http://&lt;server name&gt;:8443/core/config. There we verified that the correct user name was in place and we re-entered the password for the account. (Just before going to the config page we had reset the account’s password to what it should have been, so we knew the password was correct, and we had verified that the account had full access to the SQL server and the ePO database.) Then we pushed the button to test the database connection, and the test was a success. We saved the settings, then restarted the McAfee ePolicy Orchestrator 5.3.0 Application Server Service. From all the Google search results this was going to be the answer. All the Google search results were wrong! Sorry, that was rude of me to say in such a tone. I don’t mean to imply that the solutions they provide are not correct for their particular situations; they are confirmed as working for those who requested help. I am just saying that none of those solutions fixed the problem we were having. Those solutions covered cases where account passwords had been changed in Active Directory or accounts had been deleted altogether. The service account I was using was alive, had had no changes to its password or group membership, and I had just verified the password by resetting it and re-entering it on the config site.
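Rather than scrolling a 60+ MB log by hand, you can pull just the recent error lines out of orion.log, and independently confirm SQL connectivity outside of ePO. This is a sketch: the log path comes from this post, the error keywords are common log4j-style levels rather than an exhaustive list, and the sqlcmd line assumes the SQL Server command-line tools are installed and that your database follows ePO’s default ePO_&lt;servername&gt; naming.

```powershell
# Surface recent error lines from the end of orion.log.
$log = "C:\Program Files (x86)\McAfee\ePolicy Orchestrator\Server\logs\orion.log"
Get-Content $log -Tail 500 |
    Select-String -Pattern "ERROR|SEVERE|Exception" |
    Select-Object -Last 20

# Test the database connection with the service account's SQL credentials,
# bypassing the ePO config page entirely. Placeholders in angle brackets.
sqlcmd -S "<sql server>" -U "<epo service account>" -P "<password>" -d "ePO_<servername>" -Q "SELECT 1"
```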
We were getting frustrated and more frustration was on its way.
“There has to be something wrong with the account” Troubleshooting
So, since we had verified the account was up and alive (not disabled), and we had reset the password and re-entered it into the ePO settings, we figured there just had to be some strange glitch with the account. With this in mind I set out to create a new account. First I cheated a little by copying the original account and giving it a new name; I also set the password for the new account to be the same as the old account’s. We put the new account’s information into the SQL server with the same permissions as the original account, then put the new account’s information into the ePO config site. We pushed the Test Database Connection button and got the “OK, we are good” message. We saved the configuration and restarted the McAfee ePolicy Orchestrator 5.3.0 Application Server Service. Hoping beyond hope, we waited for the services to restart, then refreshed the ePO logon page. Broken!

OK, so I had made a copy of the possibly broken account; maybe (we later found out: obviously) it copied whatever was broken to the new account. We threw this new account away. I then created a new account from scratch, added it to all the groups the broken account was a member of, and again set up the account in SQL with the same permissions as the broken account. Did the restart-and-wait thing and was let down again.

Being extra frustrated, we figured that maybe it just didn’t like the password; a special character it didn’t like, perhaps. So we reset the password to something simple with “normal” special characters, nothing SQL usually doesn’t like such as ()*&;”@. Reboot, wait, disappointment. “OK, so let’s use my account,” I said. “Can’t hurt.” Reboot, wait, FIXED! OK, so WTF: my account has a super complicated password, it’s a member of every group the broken account is a member of and then some, and it works. Next step: add the broken account to the other groups that my account is a member of. Reboot, wait, BROKEN!!! Hands thrown in air; I don’t know… I give up time.
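In hindsight, there was a quicker differential test available at this stage that separates “bad credentials” from “account denied a logon right,” which (spoiler) was the real problem. runas performs an interactive logon, so it exercises the same “Deny log on locally” right that later turned out to be the culprit: correct password plus a runas failure points at a rights problem, not a credential problem. The account name below is a placeholder.

```powershell
# Interactive-logon smoke test for a service account. If the password is
# known-good but this fails with a logon-denied error, suspect a User
# Rights Assignment (e.g. "Deny log on locally") rather than the password.
runas /user:DOMAIN\epo-svc cmd.exe
```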
But wait: the broken account was also a member of a group that my account was not a member of.
We removed that group membership from the broken account, didn’t change anything else, just restarted the services … wait … FIXED 🙂 HOORAY!!! Smiles all around. But why?
Reason for breakage and ultimate fix
So, back a few months ago, my buddy noticed that I had logged onto a system with a service account. Normally you don’t do this with service accounts, but it was a quick way to make sure that account could log on to the system. He asked something like, “How come you were able to do that?” I gave him an answer with an amount of sarcasm equivalent to his question’s stupidity: “I typed in the user name and password with this thing called a keyboard that has letters on it, then pressed the ‘Enter’ key telling the computer I was finished putting in the authentication credentials.” Or something like that. Anyway, he went on to explain that his question was more about policy: why were we allowing service accounts to log on locally to systems? They are not meant to be used by humans, so they don’t need to log on locally. I agreed with him, and we set out to make the needed changes to our security policy via a GPO. Luckily I had most of the needed components implemented; all we had to do was put them together. I had made it a practice to establish a shadow group for every OU that houses users (for more on shadow groups, check out this post: Shadow Groups). I also had a GPO that assigned User Rights Assignments. As is good practice, all the users in each OU were members of its corresponding shadow group, e.g., ServiceAccounts OU members are part of the _ShadowServiceAccounts group. We edited the User Rights Assignments GPO to add the _ShadowServiceAccounts group to the “Deny log on locally” User Rights Assignment. Feeling that we had done the right thing, we moved on. We made this change back in November 2015, and this ePO problem popped up at the end of February 2016. Servers, including the ePO server, had been rebooted a few times since the “Deny log on locally” change; I am pretty sure we even had a power outage, so we had to shut down everything. That is the only remaining mystery of this whole thing.
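If you suspect the same breakage, a quick check is whether your service account sits in a group tied to “Deny log on locally.” A sketch with the ActiveDirectory module (part of RSAT); the group name follows this post, and epo-svc is a placeholder account name:

```powershell
Import-Module ActiveDirectory

# Is the ePO service account in the shadow group that the GPO denies
# local logon to?
Get-ADGroupMember -Identity "_ShadowServiceAccounts" |
    Where-Object { $_.SamAccountName -eq "epo-svc" }

# If so, pulling it out of the group is the fix described in this post:
Remove-ADGroupMember -Identity "_ShadowServiceAccounts" -Members "epo-svc" -Confirm:$false
```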
Anyway, the ePO service account was a member of the shadow group, so in turn it was not being allowed to log on to the ePO server. The solution was to remove the ePO service account from the shadow group, but how do we keep things secure? Well, we set the Log On To property of the ePO service account so that the account could only log on to the ePO server. We fixed it all up, still maintained a good level of security, and went home. The next day I was playing around in the ePO management console and it wasn’t able to authenticate to the domain; I was trying to get the system tree to sync. Knowing everything that had happened before, I figured that now that the ePO service account was restricted to logging on only to the ePO server, it wasn’t able to authenticate to the domain. I added the domain controllers to the Log On To property of the ePO service account and everything worked.
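The Log On To property can be set from PowerShell as well as the ADUC GUI; it maps to the LogonWorkstations attribute, a comma-separated list of computer names. This sketch reflects the final working state described above (the ePO server plus the domain controllers); all the names are placeholders for your environment.

```powershell
Import-Module ActiveDirectory

# Restrict where the ePO service account may log on: the ePO server itself
# plus the domain controllers it needs to authenticate against.
Set-ADUser -Identity "epo-svc" -LogonWorkstations "EPOSERVER,DC01,DC02"

# Verify the restriction took effect.
Get-ADUser -Identity "epo-svc" -Properties LogonWorkstations |
    Select-Object SamAccountName, LogonWorkstations
```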