Sunday, June 9, 2013

Monitoring & Complex Event Processing

   To begin with, I'm sorry for my terrible English! Most likely you will feel that I am testing your patience with this post, but bear with me a bit (after all, this is my first post ever).

    I want to talk about existing problems in IT infrastructure monitoring and ways of solving them from my point of view. Also, I would really appreciate any feedback on the subject.
   
    Just recently, I started heading a working group based at my department at the National Technical University of Ukraine "KPI". We are trying to develop software in the area of infrastructure monitoring.
 
    While supporting information infrastructure, system administrators face problems of various kinds. In order to find and solve these problems quickly, they use different tools. The most popular one at the moment (in my opinion) is Zabbix, although there are many other approaches and interesting pieces of software, which you can see in this great post, for example.
   
    I started to dig deeper into the various applications of Zabbix and found a great blog by Tim Bass on Complex Event Processing. This post in particular became a point of great interest for me.
 
     So, what is this all about?

Predictive analysis

As Tim says:

"The best technology we have today are human eyes and brains looking at time-series charts and graphs."

So, the common scenario when you use one of these monitoring tools is (a minimal sketch follows the list):
  • Set up the metrics you want to monitor
  • Set up rules and thresholds for generating events (or send events to the server with zabbix_sender)
  • Set up rules for alerting when an event occurs (or even perform simple actions, in the case of Zabbix, like a process restart or a system shutdown)
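For illustration, here is a minimal Python sketch of that scenario, assuming a Unix host; the item key is a made-up placeholder, and send_event() stands in for whatever delivery mechanism the agent actually uses (a zabbix_sender-style trap, for instance):

    import os
    import time

    LOAD_THRESHOLD = 4.0  # arbitrary example threshold for the 1-minute load average

    def send_event(key: str, value: float) -> None:
        # Placeholder: a real agent would push this to the monitoring server.
        print(f"EVENT {key} = {value:.2f}")

    while True:
        load1, _, _ = os.getloadavg()   # poll the metric (Unix only)
        if load1 > LOAD_THRESHOLD:      # static rule: threshold crossed
            send_event("system.load.high", load1)
        time.sleep(60)                  # fixed polling interval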
But none of this gives the system an ability to predict a real outage from a set of different events and react to it in real time. So, basically, if your server goes down, you will end up digging into monitoring data to find out the real reason for the failure and, finally, performing some actions to prevent further problems.
Also, I'd like to quote Tim here again:

"We could build models and implement myriad rules; however, experience teaches us that a pure “expert systems” approach is too time and resource consuming (and the models change so often, that it is like a cat chasing it’s tail).   Building a system that can predict outages based only on rules is inefficient and suboptimal.   To accomplish what we desire, we need software to baseline the “event characteristics” of the system, a machine learning algorithm or two, that will “listen” to the myriad events in our “mini event cloud” and create a baseline of what are normal ebbs and flows, peaks and valleys, and other anomalies."

    So, I want to come up with a very abstract concept of a possible way for NMS (network management systems) to evolve, from my point of view.

    Let's assume we have agent software which can perform both "trapping" and "polling" on a target host. Also, we have an obvious and simple way to calculate the desired metrics and generate events using basic rules. With all of this, we can accumulate a usable amount of history to rely on for the more complex calculations and event generation needed for making decisions. Moreover, the agent can perform predefined actions on the host. Nothing new so far.
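A rough sketch of such an agent, under the assumptions above (every name here is illustrative, not a real API): it polls registered metrics itself, accepts trapped values pushed from outside, keeps a bounded history for later analysis, and can run predefined actions.

    import time
    from collections import deque
    from typing import Callable

    class Agent:
        def __init__(self, history_size: int = 10_000):
            self.history = deque(maxlen=history_size)        # (timestamp, key, value) samples
            self.pollers: dict[str, Callable[[], float]] = {}
            self.actions: dict[str, Callable[[], None]] = {}

        def register_poller(self, key: str, fn: Callable[[], float]) -> None:
            self.pollers[key] = fn

        def trap(self, key: str, value: float) -> None:
            # Trapping: an external source pushes a value to the agent.
            self.history.append((time.time(), key, value))

        def poll_once(self) -> None:
            # Polling: the agent collects every registered metric itself.
            for key, fn in self.pollers.items():
                self.history.append((time.time(), key, fn()))

        def act(self, name: str) -> None:
            # Run a predefined action, e.g. restart a process.
            self.actions[name]()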

    Suppose that on the server side we have a machine learning algorithm which we use to teach the system. We can organize each learned pattern and its data in a template, like in Zabbix, which can then be imported into an existing knowledge base for other users to apply. Each server has access to this data and makes use of a statistical algorithm which generates appropriate actions.
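To make the "baselining" idea from the quote above concrete, here is one very simple statistical approach, a rolling-window z-score (just an assumed example of a learning scheme, not a claim about what Zabbix or Tim Bass proposes); the learned mean and deviation are exactly the kind of thing that could be packaged into a shareable template:

    import statistics
    from collections import deque

    class Baseline:
        def __init__(self, window: int = 1440, z_limit: float = 3.0):
            self.samples = deque(maxlen=window)  # e.g. one day of per-minute samples
            self.z_limit = z_limit

        def is_anomaly(self, value: float) -> bool:
            anomalous = False
            if len(self.samples) >= 30:          # need some history before judging
                mean = statistics.fmean(self.samples)
                std = statistics.pstdev(self.samples)
                anomalous = std > 0 and abs(value - mean) / std > self.z_limit
            self.samples.append(value)           # the sample becomes part of the baseline
            return anomalous

An anomaly flag coming out of such a baseline, rather than a static threshold, would be what triggers the "appropriate actions" mentioned above.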

   All of this could give the system an ability to tune itself automatically, without administrator intervention.

So, what do you think? Does this approach have a future? What would be the simplest scenario of server or process behavior for such a system to deal with?

Please comment! Thank you.