Correlating user activity with system data for fast detection and diagnosis of system outages

Human error has been identified as one of the major factors behind system outages and network downtime in a
number of previous research papers and surveys. Gartner statistics show that almost 40% of unplanned application
downtime is caused by operator errors, such as unintentional changes to network configuration that result in a
network outage, patch installations, service restarts, etc. Yet system admin activities on production IT systems are
rarely properly logged and monitored. Existing tools for tracking user activities produce either too much information,
with no hint of a potential outage scenario, or too little information to be useful.
In this paper, we describe the design and implementation of iTrack - a framework for monitoring user activities
and correlating them with system data for fast detection and diagnosis of service outages. iTrack uses
commonly available native monitoring and diagnostic utilities on operating systems to monitor system events as
well as system admin activity, correlates these two sets of information, and categorizes each activity as potentially
abnormal or harmful based on its impact on the system in terms of file system, network and memory activity. Our
results confirm that iTrack's overhead in terms of CPU time, activity completion time and data generated is within the
tolerance range of most production systems. In cases where the overhead was found to be unacceptable, we detect
the underlying cause and provide solutions that improve performance by up to 20% and 90% in terms of managed
server and iTrack server CPU utilization, respectively, and by up to 2 times in terms of completion time of certain
system admin activities on the managed server.
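
The correlate-then-categorize idea described above can be illustrated with a minimal sketch. This is not iTrack's implementation: the record types, field names, time-window matching, slack value and per-category thresholds are all assumptions made for illustration, since the abstract does not specify how the correlation or categorization is performed.

```python
from dataclasses import dataclass

# Hypothetical record types; the abstract does not define iTrack's data model.
@dataclass
class AdminActivity:
    user: str
    command: str
    start: float  # seconds since epoch
    end: float

@dataclass
class SystemEvent:
    timestamp: float
    kind: str    # assumed categories: "file", "network", "memory"
    detail: str

def correlate(activities, events, slack=1.0):
    """Pair each admin activity with the system events seen while it ran.

    Assumption: an event belongs to an activity if its timestamp falls
    between the activity's start and its end plus `slack` seconds.
    """
    pairs = []
    for act in activities:
        matched = [e for e in events
                   if act.start <= e.timestamp <= act.end + slack]
        pairs.append((act, matched))
    return pairs

def categorize(matched, thresholds):
    """Label an activity by the volume of events of each kind it triggered.

    Assumption: exceeding any per-kind count threshold marks the activity
    as potentially abnormal; the thresholds themselves are illustrative.
    """
    counts = {}
    for e in matched:
        counts[e.kind] = counts.get(e.kind, 0) + 1
    for kind, limit in thresholds.items():
        if counts.get(kind, 0) > limit:
            return "potentially abnormal"
    return "normal"

# Illustrative data only; not taken from the report.
activities = [
    AdminActivity("root", "ifconfig eth0 down", 100.0, 101.0),
    AdminActivity("root", "ls", 200.0, 200.5),
]
events = [
    SystemEvent(100.2, "network", "eth0 link down"),
    SystemEvent(100.5, "network", "route removed"),
    SystemEvent(200.1, "file", "read /tmp"),
]
thresholds = {"network": 1, "file": 5, "memory": 5}

labels = [(act.command, categorize(matched, thresholds))
          for act, matched in correlate(activities, events)]
```

With this sample data, the interface shutdown is paired with two network events and is flagged, while the directory listing is not. A real deployment would draw both streams from native operating system utilities, as the abstract describes, rather than from in-memory lists.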

By: Vijay Mann and Anilkumar Vishnoi

Published in: RI11013 in 2011

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

