As per Gartner, “AIOps is the application of machine learning and data science to IT operations problems. AIOps platforms combine big data and ML functionality to enhance and partially replace all primary IT operations functions, including availability and performance monitoring, event correlation and analysis, and IT service management and automation.”
Gartner predicted that exclusive use of AIOps by large enterprises would rise from 5% in 2018 to 30% in 2023. Indeed, adoption of AIOps platforms has increased rapidly over the past few years, and the acceleration of digital transformation brought on by the pandemic has further reinforced their importance.
AIOps typically consists of various components such as monitoring agents, analysis components, AI peripherals, highlighting tools and others. In this article, I will be focusing on the monitoring agent.
Monitoring agents proactively monitor, manage, and resolve performance issues across the entire IT landscape before they impact end-user productivity.
Based on how they function, monitoring agents can be broadly categorized as follows:
- Fundamental functioning
- AI functioning
Fundamental Functioning of Monitoring Agents
This covers what we are monitoring, how we are monitoring it, which insights are handled, and where they are stored. In this structure, the AI and prediction components are independent of the agent. The agent acts only as an observer, or catcher, of insights.
The below structure is common for most AIOps tools.
AI Functioning of Monitoring Agents
In addition to all the features of fundamental functioning, these monitoring agents have AI algorithms and processing mechanisms that give them intelligence of their own. These algorithms turn a monitoring agent into a reactive machine.
Reactive Machines
Reactive machines are a type of AI that works based on predefined algorithms. They have no memory and no predictive capabilities. A reactive machine responds to identical situations in exactly the same way every time; if the input is the same, the action never varies. This is desirable when the AI system must be trustworthy. However, it also means reactive machines can’t learn from the past.
Spam filters and the Netflix recommendation engine are commonly cited examples of reactive AI.
Reactive machines work well in scenarios that require pattern recognition and where all parameters are known.
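To make the "same input, same action" property concrete, here is a minimal sketch of a reactive rule in Python. The `Reading` fields, the thresholds, and the action labels are illustrative assumptions, not taken from any specific AIOps product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One polled sample; the field names are hypothetical."""
    cpu_percent: float
    memory_percent: float


def react(reading: Reading) -> str:
    """Map a reading to an action using fixed rules only.

    No history is kept between calls, so identical readings always
    produce the same action.
    """
    # Assumed thresholds: 90% is critical, 70% is a warning.
    if reading.cpu_percent >= 90 or reading.memory_percent >= 90:
        return "alert"
    if reading.cpu_percent >= 70 or reading.memory_percent >= 70:
        return "warn"
    return "ok"
```

Because `react` reads nothing except its argument, calling it twice with the same `Reading` can never give two different answers, which is the trade-off described above: predictable behavior, but no learning from past readings.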
Benefits of replacing basic monitoring agents with reactive machines:
In the monitoring context, reactive machines are referred to as reactive AI agents.
- Data processing & size reduction
Handling and processing huge volumes of data is extraordinarily complex. Moreover, the data keeps growing and needs to be maintained.
Reactive AI agents have their own intelligence to filter the properties of polled data. Not every property in the raw data is always used; only a few properties are needed in certain impact cases. However, most monitoring agents don’t have the intelligence to identify these situations (impact cases) and filter properties accordingly, so they post all properties at every polling interval. Filtering the properties of monitoring data on a need basis reduces overall data size by 5-10%.
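The filtering idea can be sketched as a lookup from impact case to the properties that case needs. The impact-case names, the property names, and the 85% threshold below are assumptions made for illustration only.

```python
# Hypothetical mapping from impact case to the properties that case needs.
REQUIRED_PROPERTIES = {
    "normal": {"timestamp", "cpu_percent"},
    "high_cpu": {"timestamp", "cpu_percent", "top_process", "load_average"},
}


def classify(sample: dict) -> str:
    """Pick the impact case with a fixed rule; the 85% threshold is assumed."""
    return "high_cpu" if sample.get("cpu_percent", 0) >= 85 else "normal"


def filter_properties(sample: dict, impact_case: str) -> dict:
    """Keep only the properties needed for the given impact case.

    Unknown impact cases fall back to the full sample, so nothing is
    silently dropped.
    """
    wanted = REQUIRED_PROPERTIES.get(impact_case)
    if wanted is None:
        return dict(sample)
    return {key: value for key, value in sample.items() if key in wanted}
```

In a quiet period, `filter_properties(sample, classify(sample))` posts only the timestamp and CPU value; when CPU crosses the assumed threshold, the extra diagnostic properties are included again.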
On the processing side, reactive agents group the common properties that repeat across the various transaction documents collected at the same polling time. Reactive AI agents have the intelligence to know what can be grouped, how it can be grouped, and how to keep the raw data consistent. For example, if we collect 200+ event details at a specific time, they will all have a machine name, IP address, location, and a few more common properties. These properties are repeated in all 200+ documents, which can add up to a few gigabytes. If we store those details once as an independent document and reference that document’s ID from all 200+ documents, the data is reduced to around 15%-20% overall size.
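One possible way to implement this grouping is to move the repeated properties into a single shared record and leave only a reference in each event. The sketch below assumes the common properties are `machine_name`, `ip_address`, and `location`, and derives the shared record's ID from a hash of its content; both choices are illustrative.

```python
import hashlib
import json

# Properties assumed to repeat across documents from the same poll.
COMMON_KEYS = ("machine_name", "ip_address", "location")


def group_common(documents):
    """Split documents into shared common records and slimmed-down events.

    Returns (common_by_id, events). Each event keeps its own properties
    plus a 'common_id' that points at the shared record.
    """
    common_by_id = {}
    events = []
    for doc in documents:
        common = {k: doc[k] for k in COMMON_KEYS if k in doc}
        # Derive the id from the content so equal records share one id.
        encoded = json.dumps(common, sort_keys=True).encode("utf-8")
        common_id = hashlib.sha256(encoded).hexdigest()[:12]
        common_by_id[common_id] = common
        event = {k: v for k, v in doc.items() if k not in COMMON_KEYS}
        event["common_id"] = common_id
        events.append(event)
    return common_by_id, events


def rebuild(common_by_id, event):
    """Reconstruct the original document from an event and its common record."""
    doc = {k: v for k, v in event.items() if k != "common_id"}
    doc.update(common_by_id[event["common_id"]])
    return doc
```

The `rebuild` function is there to show that the grouping is lossless: every raw document can be recovered from its event plus the shared record, which is the consistency requirement mentioned above.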
- Self-healing
Reactive AI agents have self-healing intelligence to optimize their own CPU and memory utilization, refresh their in-memory state, and recycle logs based on age and the information in the logs. They also report their status when they are forced down, except in a few exceptional cases.
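As a rough sketch of what such self-healing rules might look like, the functions below decide which log files to recycle and which corrective steps to take for the agent's own resource use. The `LogFile` type, the limits, and the action names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogFile:
    """Metadata for one agent log file; the fields are hypothetical."""
    name: str
    modified_at: float  # Unix timestamp of the last write
    size_bytes: int


def logs_to_recycle(logs, now, max_age_seconds=7 * 24 * 3600,
                    max_total_bytes=50 * 1024 * 1024):
    """Pick log files to delete: anything older than the age limit, then
    the oldest remaining files until the total size fits the budget."""
    expired = [log for log in logs if now - log.modified_at > max_age_seconds]
    kept = sorted(
        (log for log in logs if now - log.modified_at <= max_age_seconds),
        key=lambda log: log.modified_at,
    )
    total = sum(log.size_bytes for log in kept)
    over_budget = []
    for log in kept:  # oldest first
        if total <= max_total_bytes:
            break
        over_budget.append(log)
        total -= log.size_bytes
    return [log.name for log in expired + over_budget]


def healing_actions(memory_mb, cpu_percent, memory_limit_mb=200,
                    cpu_limit_percent=20):
    """Return the corrective steps for the agent's own resource use."""
    actions = []
    if memory_mb > memory_limit_mb:
        actions.append("refresh_in_memory_cache")
    if cpu_percent > cpu_limit_percent:
        actions.append("throttle_collection")
    return actions
```

Both functions only decide what to do; actually deleting files or clearing caches is left to the caller, which keeps the rules easy to test.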
- Data Accuracy
Reactive AI agents convert units accurately based on the unit in which the data was polled. They also realign the polling frequency to account for delays in the data polling cycle.
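The sketch below shows one way to read these two ideas: normalizing every polled value to a single base unit, and keeping polls on the original schedule grid when a cycle runs late. The unit table and the skip-missed-slots policy are assumptions, not a description of a particular agent.

```python
# Conversion factors to bytes; the unit labels are assumed for illustration.
TO_BYTES = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def to_bytes(value, unit):
    """Normalize a polled value to bytes based on the unit it was polled in."""
    try:
        return value * TO_BYTES[unit.upper()]
    except KeyError:
        raise ValueError(f"unknown unit: {unit!r}") from None


def next_poll_time(scheduled_at, finished_at, interval):
    """Return the next poll time on the original schedule grid.

    If a poll overran by one or more intervals, the missed slots are
    skipped instead of letting the schedule drift, so timestamps stay
    aligned with the configured frequency.
    """
    next_slot = scheduled_at + interval
    if finished_at >= next_slot:
        missed = int((finished_at - scheduled_at) // interval)
        next_slot = scheduled_at + (missed + 1) * interval
    return next_slot
```

For example, with a 60-second interval, a poll scheduled at second 0 that finishes at second 130 is followed by a poll at second 180 rather than at second 190.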
- Security
Usually, data is encrypted with a static encryption key shared by agents and back-end components. Reactive AI agents generate a unique encryption key, update it frequently (weekly or monthly), and notify the back end of the new key through back-end calls so that it can decrypt the data. This is more secure than a static key.
The agent’s configuration details are also periodically obfuscated by the agent itself.
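A minimal sketch of the key rotation described above, using only Python's standard library. It covers key generation, expiry, and back-end notification; it does not perform encryption itself. The `RotatingKey` class, the `notify_backend` callback, and the injectable clock are hypothetical names introduced for this example.

```python
import secrets
import time


class RotatingKey:
    """Holds a random key and replaces it once it is older than a set period."""

    def __init__(self, notify_backend, rotation_seconds=7 * 24 * 3600,
                 clock=time.time):
        self._notify_backend = notify_backend  # hypothetical back-end callback
        self._rotation_seconds = rotation_seconds
        self._clock = clock
        self._rotate()

    def _rotate(self):
        self.key_id = secrets.token_hex(8)  # identifier sent with ciphertext
        self.key = secrets.token_bytes(32)  # 256-bit random key
        self.created_at = self._clock()
        # The back end needs the new key to decrypt. A real agent would send
        # it over an authenticated, encrypted channel (assumed here).
        self._notify_backend(self.key_id, self.key)

    def current(self):
        """Return (key_id, key), rotating first if the key has expired."""
        if self._clock() - self.created_at >= self._rotation_seconds:
            self._rotate()
        return self.key_id, self.key
```

Tagging each payload with `key_id` lets the back end keep a short history of keys, so data encrypted just before a rotation can still be decrypted.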
- Dynamic Polling Frequency
Most agents have a static polling frequency, set either through a back-end component or in the agent’s own configuration. Reactive AI agents, however, have the intelligence to decide when the frequency should change, based on data impact (low and high frequency levels). These frequency changes are also notified to the back-end components and take into account the impact on back-end processes.
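A simple version of this decision can be sketched as a rule that shortens the polling interval when readings are high-impact and lengthens it when they are low-impact. The thresholds, the interval bounds, and the `notify_backend` callback are assumptions for illustration.

```python
def next_interval(current, value, high=80.0, low=30.0, fastest=10, slowest=300):
    """Choose the next polling interval in seconds from the latest reading.

    High-impact readings halve the interval (poll more often), low-impact
    readings double it (poll less often), and anything in between keeps it.
    The result is clamped to the [fastest, slowest] range.
    """
    if value >= high:
        proposed = current // 2
    elif value <= low:
        proposed = current * 2
    else:
        proposed = current
    return max(fastest, min(slowest, proposed))


def update_frequency(current, value, notify_backend):
    """Apply the new interval and tell the back end only when it changes."""
    new = next_interval(current, value)
    if new != current:
        notify_backend(new)
    return new
```

In this sketch the back end is only informed of the change. An agent that also weighs back-end load could have the callback accept or reject the proposed interval before it takes effect.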
Monitoring agents usually do not have any intelligence or algorithms of their own. Implementing AI in monitoring agents is much needed to make them more efficient and to make AIOps a true enabler of digital transformation.
About the author
July 6, 2021
Natarajan Veerasekaran
“Natarajan is a Lead Engineer for ZIF Monitoring at GAVS. He is deeply passionate about programming and broadening his technical boundaries.”