False alerts are quite possibly the most annoying thing for any sysadmin. Your monitoring tool shows that a resource is down, you go check on it, and it's fine. If it happens once, you are annoyed; if it happens repeatedly, you start to distrust your monitoring system.
With Hyperic, chances are that something really is wrong, and that it has to do with JVM timekeeping getting out of sync between the Hyperic Server and Agent. If the server and agent have different timestamps, a condition known as time drift, then even if the agent is reporting that everything is fine, it may not be reporting with a time that the server thinks is acceptable. This, in turn, causes the server to report an outage.
There is a secondary performance impact as well: the system incurs additional overhead rescheduling measurements for both the agent and the database repository as it tries to backfill (insert the late metrics) and correctly insert these elements into the database.
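The false-outage mechanism described above can be illustrated with a small sketch. This is not Hyperic's actual logic, which the article does not spell out; the function name and the 30-second tolerance are assumptions chosen only to show how a healthy report with a skewed timestamp can end up treated as missing.

```python
def report_is_acceptable(report_timestamp: float,
                         server_now: float,
                         tolerance_seconds: float = 30.0) -> bool:
    """Return True if the agent's report time is close enough to server time.

    Hypothetical rule for illustration: the report is accepted only when the
    absolute difference between the two clocks is within the tolerance.
    """
    return abs(server_now - report_timestamp) <= tolerance_seconds


def resource_status(agent_says_up: bool,
                    report_timestamp: float,
                    server_now: float) -> str:
    """Combine the agent's own verdict with the timestamp check."""
    if not report_is_acceptable(report_timestamp, server_now):
        # The agent may be perfectly healthy, but its report looks stale
        # (or from the future), so the server cannot trust it.
        return "down (report missing)"
    return "up" if agent_says_up else "down"
```

With these assumptions, an agent whose clock is 90 seconds behind the server yields "down (report missing)" even though it reported the resource as up, which is exactly the kind of false alert this article is about.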
How Time Works in Software
Time is kept in a number of ways depending on hardware, operating system, and subsystems like virtual machines or JVMs. Time drift occurs as the JVM time becomes separated from the system time. For Hyperic, it's important to remember that Hyperic HQ is an agent-based monitoring solution. The server on one machine keeps time separately from the agents, and the two must be kept in sync so that agent reports are not flagged as “missing” and do not cause false alerts.
Agents are usually the cause of the time drift, as they run as Java Virtual Machine (JVM) processes. There are two components to JVM timekeeping: the platform system time and the JVM internal clock.
System time usually comes from a physical hardware clock, whether built into your machine or attached as an external hardware device. Essentially, this clock ticks at a precise rate, and after a set number of ticks it raises an interrupt that the system uses to update the time. The interrupt rate varies across operating systems and hardware, but the principle is the same. (For more info on how time is kept, the intro to VMware’s Timekeeping in VMware Virtual Machines VMware ESX 4.0/ESXi 4.0, VMware Workstation 7.0 provides a detailed primer.)
JVM Internal Clock
For processes running in Java Virtual Machines, the JVM needs to provide a clock just as a physical machine would. While a lot of sophisticated engineering has gone into keeping this clock as closely synchronized as possible, the reality is that a JVM relies on software, not hardware, to keep time. System pressures, such as CPU slowdowns due to temperature changes or load, can affect time in a JVM. So it is essential to check these times regularly and realign them.
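One way to see two software clocks diverge inside a single process is to compare a wall clock against a steady (monotonic) clock over an interval. The sketch below does this in Python with time.time() and time.monotonic(); in Java the comparable pair would be System.currentTimeMillis() and System.nanoTime(). It is only an illustration of the idea of checking clocks against each other, not a description of how the JVM or Hyperic measures drift, and the function name is made up for this example.

```python
import time
from typing import Callable


def measure_clock_divergence(
    interval_seconds: float,
    wall_clock: Callable[[], float] = time.time,
    steady_clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Return how much further the wall clock advanced than the steady clock.

    A positive result means the wall clock moved ahead (for example, it was
    stepped forward); a negative result means it fell behind or was set back.
    The clock and sleep functions are parameters so they can be replaced
    with fakes when experimenting.
    """
    wall_start = wall_clock()
    steady_start = steady_clock()
    sleep(interval_seconds)
    wall_elapsed = wall_clock() - wall_start
    steady_elapsed = steady_clock() - steady_start
    return wall_elapsed - steady_elapsed
```

Calling measure_clock_divergence(60) on a healthy, NTP-disciplined host should return a value very close to zero; a noticeably larger magnitude suggests the wall clock was adjusted during the interval.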
Best Practices for Synchronizing Time
On both physical and virtual platforms, the system times of the HQ Server and HQ Agents should be closely synchronized using a mechanism such as NTP or a domain controller. For VMware virtualized systems, review section “3.4 VM Timekeeping Best Practices” in the Enterprise Java Applications on VMware Best Practices Guide.
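To spot-check how far a host's clock is from an NTP source, a simple SNTP query is enough. The sketch below is a rough diagnostic, not a replacement for a real NTP client: it ignores the server's processing time and estimates the offset from the midpoint of the round trip. The host name pool.ntp.org and the function names are assumptions for illustration; substitute your own time source.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def build_request() -> bytes:
    """Build a 48-byte SNTP client request (LI=0, version 3, mode 3)."""
    return b"\x1b" + 47 * b"\x00"


def transmit_time(packet: bytes) -> float:
    """Extract the server's transmit timestamp as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("NTP packet must be at least 48 bytes")
    # Bytes 40-47 hold the transmit timestamp: 32-bit seconds, 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def query_offset(host: str = "pool.ntp.org", timeout: float = 2.0) -> float:
    """Return the approximate (server time - local time) in seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sent_at = time.time()
        sock.sendto(build_request(), (host, 123))
        data, _ = sock.recvfrom(512)
        received_at = time.time()
    # Compare the server's timestamp with the midpoint of the round trip.
    return transmit_time(data) - (sent_at + received_at) / 2
```

Running query_offset() on both the server host and an agent host gives a quick view of whether the two system clocks agree with the same reference.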
When the agent JVM process starts up, it obtains an initial base time from the operating system scheduler clock. Restarting agents will reset the time back to system time and resynchronize the agent with the server as long as system times are in sync.
Ideally, the time offset between the agents and the server stays somewhere between sub-second and 5 seconds. Time differences greater than 30 seconds should be corrected to bring the clocks back into close synchronization.
A restart of the Agent process (JVM) lets the agent resynchronize to the system clock and returns the server and agents to a synchronized state.
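The two thresholds above translate directly into a simple classification. In this sketch the 5-second and 30-second limits come from the guidance above, while the function name, the labels, and the treatment of the band in between as "watch" are assumptions made for illustration.

```python
IDEAL_LIMIT_SECONDS = 5.0      # at or below this offset is considered ideal
CORRECT_LIMIT_SECONDS = 30.0   # above this offset should be corrected


def classify_offset(offset_seconds: float) -> str:
    """Classify an agent/server time offset given in seconds.

    The sign of the offset is ignored: an agent that is ahead of the server
    is treated the same as one that is behind by the same amount.
    """
    magnitude = abs(offset_seconds)
    if magnitude <= IDEAL_LIMIT_SECONDS:
        return "ok"
    if magnitude > CORRECT_LIMIT_SECONDS:
        return "correct"
    # Between the two limits: not yet urgent, but worth keeping an eye on
    # (this middle label is an assumption, not part of the guidance).
    return "watch"
```

For example, an offset of 0.4 seconds is "ok", 12 seconds is "watch", and -45 seconds is "correct".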
It is a recommended best practice to configure a centralized time source to synchronize platform times. This may need to be coordinated to address worldwide time zone configuration for a globally distributed enterprise with centralized monitoring.
Troubleshooting Time Sync Issues
The Hyperic UI Administration and Health Tabs and the HQ Health Report are effective means of checking agent time offsets on demand. To review agent time synchronization in HQ Administration, go to Administration Tab –> HQ Health Agents –> Time Offset.
Here you will see which agents have significant time offsets. To fix them quickly, an agent restart is recommended. To prevent drift from causing errors, first put a method in place to correct time drift, using either NTP or a domain controller. However, since that is not a perfect method, it is also recommended to set up alerts on the time drift itself.
Essentially, it is recommended to set up an alert that will notify when the Agent time sync value exceeds a pre-determined value appropriate for the environment.
If the agent time synchronization cannot be maintained, it is possible to restore synchronization by restarting the agent process. The restart can be done manually, or it can be automated by associating a control action or script action with the alert so that the action initiates a restart of the agent.
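A script action for such an alert could look roughly like the sketch below. Everything specific here is an assumption to adapt: the 30-second threshold, the install path, and the use of the agent launcher script with a restart argument should all be checked against your own installation before use.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical values: adjust both for your environment.
DRIFT_THRESHOLD_SECONDS = 30.0
RESTART_COMMAND = ["/opt/hyperic/agent/bin/hq-agent.sh", "restart"]


def restart_if_drifted(
    offset_seconds: float,
    threshold_seconds: float = DRIFT_THRESHOLD_SECONDS,
    command: Sequence[str] = RESTART_COMMAND,
    run: Callable[..., object] = subprocess.run,
) -> bool:
    """Restart the agent when its time offset exceeds the threshold.

    Returns True if a restart was triggered, False otherwise. The `run`
    parameter exists so the command runner can be replaced in a dry run.
    """
    if abs(offset_seconds) <= threshold_seconds:
        return False
    # check=True raises CalledProcessError if the restart command fails,
    # so a failed restart does not pass silently.
    run(list(command), check=True)
    return True
```

Passing the offset that triggered the alert, for example restart_if_drifted(45.0), would run the restart command, while a small offset such as 2.0 leaves the agent alone.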
VMware has some must-read articles on timekeeping that should be on every system administrator's reading list. This is not just a virtualization issue; it affects hardware platforms as well and may propagate up through virtualization layers.
Enterprise Java Applications on VMware Best Practices Guide
Timekeeping in VMware Virtual Machines VMware ESX 4.0/ESXi 4.0, VMware Workstation 7.0
Best practices for running Java in a virtual machine (KB article 1008480)
Time in virtual machine drifts due to hardware timer drift (KB article 1006072)
Timekeeping best practices for Linux (KB article 1006427)
Time in a Linux guest operating system runs faster than real time due to lost tick overcompensation (KB article 1006113)
Time runs too fast in a Windows virtual machine when Multimedia Timer interface is used (KB article 1005953)