Exploring AIOps? Here’s Where to Begin

IT operations are the lifeline of a business; they form its active response system. Constant innovation has significantly changed operational infrastructure over the last few years and brought new challenges to the fore that continue to test the limits of existing structures. With the digital revolution increasingly transforming the way businesses function, ITOps teams struggle to cope with mounting data volumes and to leverage inputs from core systems.

This is prompting businesses to push their boundaries and turn to Artificial Intelligence (AI) solutions with greater bandwidth. AIOps, or Artificial Intelligence for IT Operations, reassesses traditional IT services and management by integrating machine learning capabilities into existing operational data and applying them across a broader spectrum. This involves functions like automation, availability, event correlation and alerting, and delivery that keeps pace with the rising complexity of the business.

AIOps as an evolving platform

The penetration of AIOps into modern IT operations has been slow but steady. Large enterprises now look at AIOps as an advanced tool that can single-handedly manage extensive data monitoring and analysis while processing a burgeoning data and information load at a remarkably fast pace. AIOps platforms handle diverse outputs for accurate and efficient error spotting in real time, followed by critical problem-solving and the handling of high-risk outages. Moreover, the predictive capabilities of AIOps give a clear sense of shortcomings and help teams prepare for unforeseen disruptions in the workflow.

According to Gartner, the world’s leading research and advisory company and IT management specialist, AIOps adoption may rise sharply over the next 3-5 years, reaching up to 30% of large enterprises, which can remarkably elevate a business’s standing. However, harnessing this technology requires a solid investment, which is why businesses are interested in a model that is reliable, beneficial to clients, and profitable to the venture at large.

The roadmap for AIOps

Launching and incorporating a versatile technology like AIOps into the current business machinery can seem daunting initially. However, before proceeding with the AI setup, it is essential to consider both the immediate and long-term implications of such an arrangement. Most often, organizing pre-existing records is the key to implementing AIOps to the best of its capacity.

Here are a few pointers a business can follow to launch its own AIOps interface:

  • It’s always wise to understand the technology before applying it. AI is not rocket science, but it requires a focused approach to make sense of its terminology. Persistent engagement with AI tools results in a better grasp of the subject matter and more assured involvement with related projects and other stakeholders.
  • Start small and simple and demonstrate the power of AIOps to your team in the most convincing manner possible. Highlight the specific problems that you expect your AIOps to fix, including the anomalies that escape human surveillance. This could imply automatic troubleshooting and recovery responses if the system reports any malfunction. In the larger picture, this will help mobilize isolated IT entities and integrate the work ecosystem with a more robust yet feasible strategy.
  • Allow room for experiments to evaluate the true potential of an AIOps application. Even if the whole initiative is cost-intensive, there are resources available at a reasonable rate that can be used to expand knowledge and tap into its wide range of functionality.
  • Be well-versed in digital analytics and statistical figures to enable the management of big data and to track and monitor performance updates. This gives direct insight into the positioning of your business and the measures that can be taken to scale up possibilities.
  • Equip your infrastructure with state-of-the-art facilities that can shoulder the AIOps network and fully support the system upgrades that may emerge in the process.
  • When it comes to monetization, AIOps capitalizes on the Return on Investment factor, helping the company earn well and reinvest in raising the threshold further. This gives a fair sense of how to fund AI-driven initiatives at every stage.

Conclusion

The partnership between IT and AIOps is the new currency for optimizing the digital operations of a company. AI could very well be the game-changer, but it takes informed use of this capability to take an enterprise to the next level. AIOps also acts as a ‘guardian of security’ for confidential data, owing to its prompt detection abilities that minimize the risk of breaches, leaks, cloning, and other potential threats that may hamper credibility.

Therefore, embrace this top-notch technology without fear or hesitation, as it not only challenges the operational status quo in theory but also sets revolutionary yet achievable standards for your business. As a result, your projects become more ambitious and future-oriented, and the technology more approachable and ubiquitous with each passing day.

Top 6 Things AIOps Can Do for Your IT Performance

With technological advancement and reliance on IT-centric infrastructure, it is essential to analyze large volumes of data daily. This process becomes challenging and often overwhelming for an enterprise. To ensure the IT performance of your business is on par with the industry, Artificial Intelligence for IT Operations (AIOps) can help structure and monitor large volumes of data at a faster pace.

What is AIOps?

It is the application of artificial intelligence, machine learning, and data science to monitor, automate and analyze data generated by IT in an organization. It replaces the traditional IT service management functions and improves the efficiency and performance of IT in your business.

AIOps eliminates the necessity of hiring more IT experts to monitor, manage, and analyze the ever-evolving complexities in IT operations. AIOps is faster, more efficient, less error-prone, and more reliable in providing solutions to the issues and challenges involved in IT.

Top 6 Things AIOps can do for your IT Performance

By moving to AIOps you save a lot of time and money involved in monitoring and analyzing using the traditional methods. You can also eliminate the risk of faulty data or outdated reports by opting for AIOps. Here are six reasons to choose AIOps and how they can enhance your IT performance.

1. Resource Allocation and Utilization

AIOps makes it easy for an enterprise to plan its resources. Real-time analytics provides data on the infrastructure necessary for a seamless experience, be it bandwidth, servers, memory, or other details.

AI-based analytics also helps an enterprise plan out the capacity required for their IT teams and reduce operational costs. With AI-driven analytics, the enterprise knows the number of people required to address and resolve events and incidents. It can also plan the work shifts and allocate resources based on the number of incidents during any given time.

2. Real-time Notification and Quick Remediation

Real-time analytics has made it easy to make quick business decisions. With AIOps, businesses can create triggers for incidents and can also narrow down business-critical notifications.

According to a study, about 40% of businesses deal with over a million events daily. Assessing priority events becomes an issue in such cases. AIOps helps businesses prioritize anomalies and effect quick remedies. The priority incidents can then be assigned to the IT team for immediate resolution.

3. Automated Event and Incident Management

Using data collected by AIOps, both historical and real-time, businesses can plan for different events and incidents and thus offer automated remedies for such incidents.

Traditionally, detection and resolution of such events took a long time and required larger incident management teams. It also meant that the data collected would not be real-time.

Using AI-based automation reduces the workload and ensures that an enterprise is equipped to handle current incidents and planned events. It also requires less manpower to deal with such incidents saving a business from hiring costs.

4. Dependency Mapping

AIOps helps in understanding the dependencies across various domains like systems, services, and applications. Operators can monitor and collect data to map dependencies that would otherwise remain hidden due to the complexities involved.

AIOps even analyze interdependencies that might be missed unless there is thorough monitoring of data. It helps enterprises in the process of configuration management, cross-domain management, and change management.

Businesses can collect real-time data to map the dependencies and create a database to use in change management decisions like when, how, and where to effect system changes.

5. Root-cause Analysis

For improved IT efficiency and performance, understanding the root cause of anomalies and correlating them with incidents is important. Early detection helps effect quicker remedies.

AIOps gives IT teams in a business visibility into anomalies and their relation to abnormal incidents. Thus, they can respond quickly with efficient resolutions for a smooth experience.

The root-cause analysis also helps in improving the domain and ensuring that the business runs efficiently with less exposure to unknown anomalies. Businesses are equipped to investigate and remedy the issue with better diagnoses.

6. Manage IoT

With many Internet of Things devices in wide use, managing their data and complexity is of utmost importance. The sheer volume of devices can make IT operations overwhelming; AIOps sees wide application in this field and helps manage several devices at the same time.

IoT devices have several variables in play and operators require AIOps to manage them with ease. Machine learning helps leverage IoT and monitor, manage and run this complex system.

AIOps ensures that IT performance thrives with consistent efficiency. It not only helps monitor large volumes of data in real time but also detects issues, analyzes correlations, and ensures quick resolutions. Automated resolutions and management can eliminate downtime and save time and money for any business.

In a nutshell, AIOps aid in the consolidation of data from various IT streams and ensures you receive the highest benefit out of it. Whether it results in automation, resolving incidents at a quick pace, or finding anomalies and making data-driven decisions, AIOps help an organization while ensuring the IT performance is efficient.

Lack of Visibility into User Experience: A CIO’s Nightmare

Have you hired a CIO (Chief Information Officer) for your organization? A CIO in an organization is responsible for managing the computer technologies used by the employees. However, sometimes CIOs find it hard to analyze the technological standard of an organization due to lower visibility. Read on to know more about the lack of visibility into the user experience.

What is a CIO?

A CIO monitors the technologies used by an organization and the usability of the information produced within it. Since more and more firms operate on digital platforms, the role of the CIO has grown significantly over the years. CIOs identify the benefits of the technologies used within the organization and make sure that each technology serves a particular business process.

A CIO also analyses the technologies a firm offers to its users and makes sure that the information and technologies within the organization are used for its betterment. CIOs help a firm adapt to change and use the latest technologies that can make business processes less tedious.

Digital experience monitoring

DEM (Digital Experience Monitoring) is one of the main job responsibilities of a CIO. DEM is the monitoring of the way a customer or internal employee interacts with the digital interfaces of the firm. It analyzes user behavior within an enterprise application or digital interface and focuses on checking the availability of enterprise applications. DEM helps in knowing a user’s experience with any particular application and how to improve it.

DEM is done for both customers and internal employees. The DEM for internal employees is often referred to as EUEM (End User Experience Management). Digital interfaces can be any technology used by the firm to connect with customers: the firm’s website used by customers to access the offered services, or a management system accessed by internal employees. You can provide a better customer experience with the help of DEM. Sometimes CIOs face hassles while improving the user experience as they have very little visibility into applications and system software.

What does poor visibility mean?

Good visibility signifies how well users can view and access the offered services. Every firm has services that are visible to its users. If a user cannot find or interact with your offerings easily, your firm offers poor visibility. Visibility is usually discussed in the context of the user-facing applications that act as a bridge between the firm and the user. For example, an e-commerce website may contain a link to check the real-time availability of products in the warehouses; this is an example of enhanced visibility that helps customers know about the availability of services in their geographical location.

Enhanced visibility implies a better user experience and also better marketing. When customers can know about the availability of your services easily, the conversion rate will also be high. CIOs aim at offering greater visibility to customers when they interact with enterprise applications or software(s). Poor visibility will not only hamper the user experience but will also drive away potential customers.

Challenges with poor visibility

Poor visibility leads to various issues that hamper the user experience. The challenges with poor visibility are as follows:

  • A wide range of applications and systems are in use at a firm. With poor visibility, your employees may not be able to complete business processes effectively. A particular business process may be suffering from a bad user experience, and you may not be able to see that shortcoming because of poor visibility.
  • When you have poor visibility into user experience, you cannot determine the source of any problem. Your IT teams may blame each other as there is no dedicated communication pipeline.
  • As users do not get a good user experience, they might move to the services offered by your competitors.

Possible reasons for poor visibility into user experience

There can be many possible reasons for poor visibility into the user experience that are as follows:

  • Your organization does not have a single view of performance metrics across different IT systems. Your administrators have to view the performance metrics of each IT system separately, which is not only time-consuming but also less accurate.
  • The existing business metrics for digital experience monitoring are not up to the standards. You need to choose the correct business metrics for gaining visibility into the user experience.
  • Your employees are not able to realize the cost impacts of poor user experience on your business. You may not even be aware of the problems arising due to poor visibility into the user experience.
  • Your employees may not have access to real-time performance metrics. You may not know about bad user experience until someone has reported it.

Pros of better visibility into user experience

The benefits of greater visibility into user experience are as follows:

  • You not only monitor the performance of digital interfaces for your customers but also for your employees.
  • With better visibility, you can decrease MTTD (Mean Time to Detect) and MTTR (Mean Time to Resolve) drastically.
  • You can solve issues with the digital experience immediately if your engineers have better visibility.
  • It is easy to track the root cause of an IT issue with enhanced visibility. You can significantly increase the uptime of your digital interfaces.
  • With better visibility, a CIO can understand the issues faced by customers/employees and can provide a personalized digital experience to them.

What’s the solution?

Many businesses are choosing AI-based platforms for better monitoring of IT infrastructure. AIOps platforms help in gaining more visibility into the user experience. With AIOps, you can automate the DEM process and can monitor user experience in real-time.

In a nutshell

The global AIOps market size is projected to exceed USD 20 billion by 2026. You can use an AIOps platform for better visibility into the user experience. A CIO can automate steps for digital experience monitoring and save time and effort. Enhance visibility into the user experience for better results!

Empowering VMware Landscapes with AIOps

VMware has been at the forefront of everything good in the modern IT infrastructure landscape for a very long time. After it came up with solutions like VMware Server and Workstation in the early 2000s, its reputation grew tremendously among businesses looking to upgrade their IT infrastructure. VMware has since expanded its offering by moving into public and private cloud, and has brought sophisticated automation and management tools to simplify IT processes within organizations.

The technology world is not static; it is consistently changing to provide better IT solutions in line with the growing and diverse demands of organizations across the world. The newest wave revolves around IT operations and supporting the business services that depend on those IT environments. AIOps platforms find their origin primarily in the world that VMware has created – a world built on IT infrastructure that is defined by software and capable of modifying itself according to need. This world consists of components that change and move at a rapid pace, and keeping up with these changes requires newer approaches to operating environments. AIOps solutions are emerging as the ideal way to run IT operations without reliance on static service models or fragile systems. The AIOps framework promises optimal utilization of skills and effort, targeted at delivering maximum value.

In order to make the most of AIOps tools, it is important that they be used in ways that can complement the existing VMware infrastructure strategy. Here are a few of those:

Software-defined is the way to go

Software-defined everything (SDx) is here and making its mark, even though its adoption is unevenly distributed, and that uneven distribution is a problem. There is still a need to manage physical network infrastructure along with aspects of VMware SDN. To ensure that you get the most out of VMware NFV/SDN, it is important to maintain a thorough, combined overview of all these aspects. By investing in an AIOps solution, you get a unified view of the different infrastructure types. This helps you not only identify problems faster but also align IT operations resources to deal with them so that they don’t interfere with the service you provide to your users, which is the ultimate objective of investing in any IT solution.

Integrated service-related view across the infrastructure

Not too many IT organizations out there can afford to use only one technology across the board. Every organization has to deal with decisions made before switching to AIOps, and IT choices made in the past can have a strong bearing on how easy or difficult the transition is. Beyond managing virtual networks and compute, among other things, organizations have their work cut out managing the physical aspects of these as well. If that’s not enough, there are public cloud and applications to manage too.

Having an overview of the performance and availability of services that depend on all these different types of infrastructure is very important. Having said that, this unified view should be independent of the time-consuming manual work associated with entering service definitions at every point of change. Also, whenever it is updated, it should update at the speed of the infrastructure. Whether or not your IT operations can support software-defined frameworks depends a lot on minimizing reliance on static models. AIOps can consolidate isolated data sources into a unified overview of services, allowing IT operations teams to make the most of their time and focus only on the important things.

Automation is the key

You have to detect issues early if you want to reduce incident duration – that’s a fact. But there is no point in detecting issues early if you are not able to resolve them faster. AIOps tools connect with third-party automation tools, as well as those that come with VMware, to provide operators a variety of authorized actions to diagnose and resolve issues. This means different people are not working with different automation tools and actions; everyone gets to make the most of the best tools. The result is that IT operations teams can deliver the desired outcomes, such as faster service restoration.

No-risk virtual desktops

There is no denying the benefits of virtual desktops. However, there are disadvantages to taking the virtual route as well. With virtual desktops, there is a chain of failure points, any of which can have a huge impact on the service delivered to end-users. The risk comes from the different VDI chain links being owned by different teams. This can cause outages, especially if support teams don’t go beyond their area of specialization and don’t communicate with other support teams; in such cases, outages last longer. AIOps can detect developing issues early and provide context for the entire problem throughout the VDI chain. This helps the different support teams collaborate and provide a resolution faster, consequently saving end-users from disruption.

Collaboration across service teams

VMware admins have little problem in getting a clear overview of the infrastructure that they are working on. However, it is a struggle when it comes to visibility and collaboration across different teams. The problem with this lack of collaboration is the non-resolution of issues. When issues are raised, they only move from one team to another while their status remains unresolved. AIOps can improve the issue resolution rate and bring down issue resolution time considerably. It does this by associating events with their respective data source, aligning the issue to the team that holds expertise in troubleshooting that particular type of issue. AIOps also facilitates collaboration between different teams to fast-track issue resolution.

AI in Monitoring Agents

As per Gartner, “AIOps is the application of machine learning and data science to IT operations problems. AIOps platforms combine big data and ML functionality to enhance and partially replace all primary IT operations functions, including availability and performance monitoring, event correlation and analysis, and IT service management and automation.”

Gartner had predicted that exclusive use of AIOps by large enterprises will rise from 5% in 2018 to 30% in 2023. Indeed, we have seen a rapid increase in the adoption of AIOps platforms over the past few years. The acceleration of digital transformation brought on by the pandemic has further reinforced the importance of such platforms.

AIOps typically consists of various components such as monitoring agents, analysis components, AI peripherals, highlighting tools and others. In this article, I will be focusing on the monitoring agent.

Monitoring agents proactively monitor, manage, and resolve performance issues across the entire IT landscape before they impact end-user productivity.

Based on their functioning, monitoring agents can be broadly categorized as below:

  • Fundamental functioning
  • AI functioning

Fundamental Functioning of Monitoring Agents

This covers what we are monitoring, how we are monitoring it, what insights are handled, and where they are stored. In this structure, the AI and prediction components are independent. Here, agents act only as observers or collectors of insights.

This structure is common to most AIOps tools.

AI Functioning of Monitoring Agent

Apart from all the features of fundamental functioning, these monitoring agents also have AI algorithms and processing mechanisms for self-intelligence. These algorithms turn a monitoring agent into a reactive machine.

Reactive Machines

Reactive machines are a type of AI that works based on predefined algorithms. They do not have any memory, nor do they have predictive capabilities. Reactive machines respond to identical situations in the exact same way every time; there is never a variance in action if the input is the same. This feature is desirable when it must be ensured that the AI system is trustworthy. However, it also means they can’t learn from the past.

Spam filters and the Netflix recommendation engine are examples of reactive AI.

Reactive machines work well in scenarios that require pattern recognition and where all parameters are known.

Benefits of replacing basic monitoring agents with reactive machines:

In the monitoring aspect, Reactive machines = Reactive AI Agents

  • Data processing & size deduction

Handling and processing huge volumes of data is extraordinarily complex. Moreover, the data keeps growing and needs to be maintained.

Reactive AI agents have the intelligence to filter the properties of polling data. Not all properties of the raw data are used in every insight; only a few properties are needed in certain impact cases. Most monitoring agents, however, don’t have the intelligence to identify such situations (impact cases) and filter properties, so they post all properties at every monitoring interval. Filtering the properties of monitoring data on a need basis can reduce overall data size by 5-10%.

On the processing side, reactive agents group the repeated common properties of the various transaction documents collected at the same polling time. Reactive AI agents have the intelligence to know what can be grouped, how it can be grouped, and how to maintain the consistency of the raw data. For example, if we collect 200+ event details at a specific time, they will all carry a machine name, IP address, location, and a few more common properties. These properties are repeated on all 200+ documents and can add up to a few gigabytes. If we group those details into a single independent document and reference its id on all 200+ documents, the overall size can be reduced by around 15-20%, as sketched below.
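To make the grouping idea concrete, here is a minimal, hypothetical Python sketch of that step. The document structure, field names, and the `common_ref` id are illustrative assumptions, not the actual agent implementation.

```python
import hashlib
import json

def group_common_properties(events):
    """Split events into a shared reference doc plus slimmed-down events.

    Properties repeated across all events collected at the same polling
    time (machine name, IP, location, ...) are stored once and referenced
    by id instead of being repeated on every document.
    """
    # Hypothetical set of properties known to repeat across documents
    common_keys = {"machine_name", "ip_address", "location"}

    reference_docs = {}   # ref_id -> shared properties
    slim_events = []

    for event in events:
        shared = {k: event[k] for k in common_keys if k in event}
        # Deterministic id for this combination of shared properties
        ref_id = hashlib.sha1(
            json.dumps(shared, sort_keys=True).encode()
        ).hexdigest()[:12]
        reference_docs[ref_id] = shared

        slim = {k: v for k, v in event.items() if k not in common_keys}
        slim["common_ref"] = ref_id
        slim_events.append(slim)

    return reference_docs, slim_events

# Example: three events sharing the same machine/IP/location
events = [
    {"machine_name": "host-01", "ip_address": "10.0.0.5",
     "location": "DC-East", "event": "cpu_high", "value": 91},
    {"machine_name": "host-01", "ip_address": "10.0.0.5",
     "location": "DC-East", "event": "mem_high", "value": 78},
    {"machine_name": "host-01", "ip_address": "10.0.0.5",
     "location": "DC-East", "event": "disk_io", "value": 64},
]
refs, slim = group_common_properties(events)
print(len(refs), "reference doc(s);", slim[0])
```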

  • Self-healing

Reactive AI agents have self-healing intelligence to optimize their CPU and memory utilization, refresh in-memory data, and handle log recycling based on time and log information. They also notify their status in a forced-shutdown situation, except in a few exceptional cases.

  • Data Accuracy

Reactive AI agents perform accurate unit conversion based on the polling unit. They also realign the frequency to account for delays in the data polling cycle.

  • Security

Usually, data encryption on agents and back-end components relies on a static encryption key. Reactive AI agents generate a unique encryption key that is updated frequently (weekly or monthly) and communicated through back-end calls for the decryption process. This is more secure than a static key.

The agent’s configuration details also get periodically obfuscated on their own.

  • Dynamic Polling Frequency

For most agents, the polling frequency is static, set either through a back-end component or the agent’s own configuration. Reactive AI agents, however, have the intelligence to decide on frequency changes. They decide the frequency based on data impact (low and high frequency levels). These frequency changes are also notified to the back-end components and considered based on their impact on back-end processes, as in the sketch below.
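The following is a minimal, hypothetical Python sketch of how an agent might adjust its polling interval based on data impact; the thresholds, interval values, and the notification call are illustrative assumptions rather than actual agent logic.

```python
# Hypothetical polling intervals in seconds
LOW_IMPACT_INTERVAL = 300    # relax polling when metrics look stable
HIGH_IMPACT_INTERVAL = 30    # tighten polling when metrics look risky

def notify_backend(interval):
    # Placeholder: a real agent would make an API call to the back end here
    print(f"notifying back end: polling interval changed to {interval}s")

def decide_polling_interval(cpu_pct, error_count, current_interval):
    """Pick the next polling interval from the latest sample's impact."""
    high_impact = cpu_pct > 80 or error_count > 0
    new_interval = HIGH_IMPACT_INTERVAL if high_impact else LOW_IMPACT_INTERVAL
    if new_interval != current_interval:
        notify_backend(new_interval)   # keep back-end processes in sync
    return new_interval

# Example usage
interval = LOW_IMPACT_INTERVAL
interval = decide_polling_interval(cpu_pct=92, error_count=3,
                                   current_interval=interval)
```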

Monitoring agents usually do not have any built-in intelligence or algorithms. Implementing AI in monitoring agents is much needed to make them more efficient and to make AIOps a true enabler of digital transformation.


About the Author –

Natarajan Veerasekaran

Natarajan is a Lead Engineer for ZIF Monitoring at GAVS. He is deeply passionate about programming and broadening his technical boundaries.

AIOps for Service Reliability Engineering (SRE)

Data is the single most accountable yet siloed component within any IT infrastructure. According to a Gartner report, an average enterprise IT infrastructure generates up to 3 times more IT operational data with each passing year. Large businesses find themselves challenged by frequent unplanned downtime of their services, high IT issue resolution times, and consequently poor user experience caused by inefficient management of this data overload, reactive IT operations, and other reasons such as:

  • Traditional legacy systems that do not scale
  • Siloed environments preventing unified visibility into IT landscape
  • Unattended warning signs due to alert fatigue
  • Lack of advanced tools to intelligently identify root causes of cross-tier events
  • Multiple hand-offs that require manual intervention affecting problem remediation workflow

Managing data and automation with AIOps

The surge of AI in IT operations, or AIOps, is helping bridge the gap between the need for meaningful insights and human intervention, to ensure service reliability and business growth. AIOps is fast becoming a critical need since effective management of the humongous data volumes has surpassed human capabilities. AIOps is powered by AI/ML algorithms that enable automatic discovery of infra & applications, 360° observability into the entire IT environment, noise reduction, anomaly detection, predictive and prescriptive analytics, and automatic incident triage and remediation!

AIOps provides clear insights into application & infrastructure performance and user experience, and alerts IT on potential outages or performance degradation. AIOps delivers a single, intelligent, and automated layer of intelligence across all IT operations, enabling proactive & autonomous IT operations, improved operational efficiencies through reduction of manual effort/fatigue/errors, and improved user experience as predictive & prescriptive analytics drive consistent service levels.

The Need for AIOps for SRE

SRE mandates that the IT team always stays ahead of IT outages and proactively resolves incidents before they impact the user. However, even the most mature teams face challenges due to the rapidly increasing data volumes and expanding IT boundaries, created by modern technologies such as the cloud, and IoT. SRE faces challenges such as lack of visibility and technology fragmentation while executing these tasks in real-time.

SRE teams have started to leverage AI capabilities to detect & analyze patterns in the data, eliminate noise & gain meaningful insights from current & historical data. As AIOps enters the SRE realm, it has enabled accelerated and automated incident management and resolution. With AI at the core, SRE teams can now redirect their time towards strategic initiatives and focus on delivering high value to users.

Transform SRE with AIOps

SREs are moving towards AIOps to achieve these main goals:

  • Improved visibility across the organization’s remote & distributed systems
  • Reduced response time through automation
  • Prevention of incidents through proactive operations

AIOps Platform ZIF™ from GAVS allows enterprises focused on digital transformation to become proactive with IT incidents, by delivering AI-led predictions and auto-remediation. ZIF is a unified platform with a centralized NOC powered by AI-led capabilities for automatic environment discovery, going beyond monitoring to observability, predictive & prescriptive analytics, automation & self-remediation, enabling outcomes such as:

  • Elimination of digital dirt
  • IT team empowered with end-to-end visibility
  • Breaking away the silos in IT infrastructure systems and operations
  • Intuitive visualization of application health and user experience from the digital delivery chain
  • Increasingly precise intelligent root cause analysis, helping drastically cut resolution time (MTTR)
  • ML algorithms for continuous learning from the environment driving huge improvements with time
  • Zero-touch automation across the spectrum of services, including delivery of cloud-native applications, traditional mainframes, and process workflows

The future of AIOps

Gartner predicts rapid growth in market size from USD 1.5 billion in 2020. Gartner also claims that the future of IT operations cannot operate without AIOps due to these four main drivers:

  • Redundancy of traditional approaches to handling IT complexities
  • The proliferation of IoT devices, mobile applications & devices, APIs
  • Lack of infrastructure to support IT events that require immediate action
  • Growth of third-party services and cloud infrastructure

AIOps has a strong role in five major areas — anomaly detection, event correlation and advanced data analysis, performance analysis, automation, and IT service management. However, to get the most out of AIOps, it is crucial to choose the right AIOps platform, as selecting the right partner is critical to the success of such an important organizational initiative. Gartner recommends prioritizing vendors based on their ability to address challenges, data ingestion & analysis, storage & access, and process automation capabilities. We believe ZIF is that AIOps solution for you! For more on ZIF, please visit www.zif.ai.

Anomaly Detection in AIOps

Before we get into anomalies, let us understand what AIOps is and what its role in IT operations is. Artificial Intelligence for IT Operations is the monitoring and analysis of the large volumes of data generated by IT platforms using Artificial Intelligence and Machine Learning. This helps enterprises with event correlation and root cause analysis to enable faster resolution. Anomalies and issues are all but inevitable, and this is where enough experience and talent is needed to take them to closure.

Let us simplify the significance of anomalies and how they can be identified, flagged, and resolved.

What are anomalies?

Anomalies are instances when performance metrics deviate from normal, expected behavior. There are several ways in which this can occur. However, we’ll be focusing on identifying such anomalies using thresholds.

How are they flagged?

With current monitoring systems, anomalies are flagged based on static thresholds. These are constant values that define the upper limits of normal behavior. For example, if the threshold is set at 85%, any CPU usage above that value is considered anomalous. When anomalies are detected, alerts are sent out to the operations team to inspect.

Why is it important?

Monitoring the health of servers is necessary to ensure the efficient allocation of resources. Unexpected spikes or drops in metrics such as CPU usage might be the sign of a resource constraint. These problems need to be addressed by the operations team in a timely manner; failing to do so may result in the applications associated with those servers failing.

So, what are thresholds, how are they significant?

Thresholds are the limits of acceptable performance. Any value that breaches a threshold is flagged as an alert and hence subject to resolution at the earliest. Note that thresholds are set at the tool level, so whenever one is breached, an alert is generated. If set manually, these thresholds can be adjusted based on demand.

There are 2 types of thresholds:

  1. Static monitoring thresholds: These thresholds are fixed values indicating the limits of acceptable performance.
  2. Dynamic monitoring thresholds: These thresholds adapt over time, which is what an intelligent IT monitoring tool provides. The tool learns the normal range, both a high and a low threshold, for each point in a day, week, month, and so on. For instance, a dynamic system will know that high CPU utilization is normal during a backup window and abnormal at other times. A simple sketch of learning such ranges follows this list.
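As an illustration only, here is a small Python sketch of one possible way to learn such time-of-day dependent ranges from historical samples (mean ± 3 standard deviations per hour); real AIOps tools use far more sophisticated models, so treat the field names and the 3-sigma rule as assumptions.

```python
import statistics
from collections import defaultdict

def learn_hourly_thresholds(samples):
    """samples: list of (hour_of_day, cpu_pct) tuples from history.

    Returns {hour: (low, high)} learned as mean +/- 3 standard deviations,
    a simple stand-in for a dynamic threshold model.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)

    thresholds = {}
    for hour, values in by_hour.items():
        mean = statistics.mean(values)
        stdev = statistics.pstdev(values) or 1.0  # avoid zero-width bands
        thresholds[hour] = (mean - 3 * stdev, mean + 3 * stdev)
    return thresholds

def is_anomalous(hour, value, thresholds):
    low, high = thresholds[hour]
    return value < low or value > high

# Example: 2 a.m. backups run hot, so 85% CPU is normal at that hour
history = [(2, 80), (2, 85), (2, 88), (14, 30), (14, 35), (14, 28)]
t = learn_hourly_thresholds(history)
print(is_anomalous(2, 85, t))    # False: normal during the backup window
print(is_anomalous(14, 85, t))   # True: abnormal mid-afternoon
```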

Are there no disadvantages in the threshold way of identifying alerts?

This is definitely not the case. Like most things in life, it has its fair share of problems. Returning to our topic, there are disadvantages to the static threshold way of doing things, although the ones with a dynamic threshold are minimal. We should also understand that, with the appropriate domain knowledge, there are many ways to overcome these.

Consider this scenario: imagine a CPU threshold set at 85%. We know that anything breaching it generates anomalies in the form of alerts. Now suppose the same value is normal behavior for a particular Virtual Machine (VM). This time, the monitoring tool will generate alerts continuously until utilization falls below the threshold. Left unattended, this becomes a mess: a flood of false alerts may cause the team to miss the actual issue, producing a chain of false positives. This can disrupt the entire IT platform and create unnecessary workload for the team. Once an IT platform is down, it leads to downtime and losses for our clients.

As mentioned, there are ways to overcome this with domain knowledge. Every organization has its own trade secrets to prevent it from happening. With the right knowledge, this behavior can be modified and swiftly resolved.

What do we do now? Should anomalies be resolved?

Of course, anomalies should be resolved at the earliest to prevent the platform from being jeopardized. There are many methods and machine learning techniques to address this. Before we get into them, recall that there are two major classes of machine learning techniques: Supervised Learning and Unsupervised Learning. There are many articles on the internet one can go through to get an idea of these techniques, and a variety of approaches can be categorized under them. In this article, however, we’ll discuss one unsupervised learning technique: Isolation Forest.

Isolation Forest

The algorithm isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

The way the algorithm constructs the separation is by first creating isolation trees, or random decision trees. Then, the score is calculated as the path length needed to isolate the observation. The following example shows how easy it is to separate an anomalous observation:


In the above image, the blue points denote the anomalous points whereas the brown ones denote the normal points.
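For readers who want to try this out, below is a minimal sketch using scikit-learn’s IsolationForest on synthetic CPU-utilization-like data; the data, contamination value, and single-feature setup are illustrative assumptions, not a production configuration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Synthetic 'normal' CPU readings clustered around 40%, plus a few spikes
normal = 40 + 5 * rng.randn(200, 1)
anomalies = np.array([[95.0], [3.0], [99.0]])
X = np.vstack([normal, anomalies])

# contamination is the expected fraction of anomalies in the data
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
model.fit(X)

labels = model.predict(X)             # +1 = normal, -1 = anomaly
scores = model.decision_function(X)   # lower scores are more anomalous

print("flagged as anomalous:", X[labels == -1].ravel())
```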

Anomaly detection allows you to detect abnormal patterns and take appropriate actions. One can use anomaly-detection tools to monitor any data source and identify unusual behaviors quickly. It is a good practice to research methods to determine the best organizational fit. One way of doing this is to ideally check with the clients, understand their requirements, tune algorithms, and hit the sweet spot in developing an everlasting relationship between organizations and clients.

Zero Incident Framework™, as the name suggests, focuses on trending organizations towards zero incidents. With the knowledge we’ve accumulated over the years, Anomaly Detection has been made as robust as possible, resulting in exponential outcomes.


About the Author –

Vimalraj Subash

Vimalraj is a seasoned Data Scientist working with vast data sets to break down information, gather relevant points, and solve advanced business problems. He has over 8 years of experience in the Analytics domain and is currently a lead consultant at GAVS.

AIOps Myth Busters

The explosion of technology & data is impacting every aspect of business. While modern technologies have enabled transformational digitalization of enterprises, they have also infused tremendous complexities in infrastructure & applications. We have reached a point where effective management of IT assets mandates supplementing human capabilities with Artificial Intelligence & Machine Learning (AI/ML).      

AIOps is the application of Artificial Intelligence (AI) to IT operations (Ops). AIOps leverages AI/ML technologies to optimize, automate, and supercharge all aspects of IT Operations. Gartner predicts that the use of AIOps and digital experience monitoring tools for monitoring applications and infrastructure will rise to 30% of large enterprises by 2023. In this blog, we hope to debunk some common misconceptions about AIOps.

MYTH 1: AIOps mainly involves alert correlation and event management

AIOps can deliver enormous value to enterprises that harness the wide range of use cases it comes with. While alert correlation & management are key, AIOps can add a lot of value in areas like monitoring, user experience enhancement, and automation.  

AIOps monitoring cuts across infrastructure layers & silos in real-time, focusing on metrics that impact business outcomes and user experience. It sifts through monitoring data clutter to intelligently eliminate noise, uncover patterns, and detect anomalies. Monitoring the right UX metrics eliminates blind spots and provides actionable insights to improve user experience. AIOps can go beyond traditional monitoring to complete observability, by observing patterns in the IT environment, and externalizing the internal state of systems/services/applications. AIOps can also automate remediation of issues through automated workflows & standard operating procedures.

MYTH 2: AIOps increases human effort

Forbes says data scientists spend around 80% of their time preparing and managing data for analysis. This leaves them with little time for productive work! With data pouring in from monitoring tools, quite often ITOps teams find themselves facing alert fatigue and even missing critical alerts.

AIOps can effectively process the deluge of monitoring data by AI-led multi-layered correlation across silos to nullify noise and eliminate duplicates & false positives. The heavy lifting and exhausting work of ingesting, analyzing, weeding out noise, correlating meaningful alerts, finding the probable root causes, and fixing them, can all be accomplished by AIOps. In short, AIOps augments human capabilities and frees up their bandwidth for more strategic work.

MYTH 3: It is hard to ‘sell’ AIOps to businesses

While most enterprises acknowledge the immense potential for AI in ITOps, there are some concerns that are holding back widespread adoption. The trust factor with AI systems, the lack of understanding of the inner workings of AI/ML algorithms, prohibitive costs, and complexities of implementation are some contributing factors. While AIOps can cater to the full spectrum of ITOps needs, enterprises can start small & focus on one aspect at a time like say alert correlation or application performance monitoring, and then move forward one step at a time to leverage the power of AI for more use cases. Finding the right balance between adoption and disruption can lead to a successful transition.  

MYTH 4: AIOps doesn’t work in complex environments!

With Machine Learning and Big Data technologies at its core, AIOps is built to thrive in complex environments. The USP of AIOps is its ability to effortlessly sift through & garner insights from huge volumes of data, and perform complex, repetitive tasks without fatigue. AIOps systems constantly learn & adapt from analysis of data & patterns in complex environments. Through this self-learning, they can discover the components of the IT ecosystem, and the complex network of underlying physical & logical relationships between them – laying the foundation for effective ITOps.   

MYTH 5: AIOps is only useful for implementing changes across IT teams

An AIOps implementation has an impact across all business processes, and not just on IT infrastructure or software delivery. Isolated processes can be transformed into synchronized organizational procedures. The ability to work with colossal amounts of data; perform highly repetitive tasks to perfection; collate past & current data to provide rich inferences; learn from patterns to predict future events; prescribe remedies based on learnings; automate & self-heal; are all intrinsic features that can be leveraged across the organization. When businesses acknowledge these capabilities of AIOps and intelligently identify the right target areas within their organizations, it will give a tremendous boost to quality of business offerings, while drastically reducing costs.

MYTH 6: AIOps platforms offer only warnings and no insights

With its ability to analyze and contextualize large volumes of data, AIOps can help in extracting relevant insights and making data-driven decisions. With continuous analysis of data, events & patterns in the IT environment – both current & historic – AIOps acquires in-depth knowledge about the functioning of the various components of the IT ecosystem. Leveraging this information, it detects anomalies, predicts potential issues, forecasts spikes and lulls in resource utilization, and even prescribes appropriate remedies. All of this insight gives the IT team lead time to fix issues before they strike and enables resource optimization. Also, these insights gain increasing precision with time, as AI models mature with training on more & more data.

MYTH 7: AIOps is suitable only for Operations

AIOps is a new generation of shared services that has a considerable impact on all aspects of application development and support. With AIOps integrated into the dev pipeline, development teams can code, test, release, and monitor software more efficiently. With continuous monitoring of the development process, problems can be identified early, issues fixed, and changes rolled back as appropriate. AIOps can promote better collaboration between development & ops teams, and proactive identification & resolution of defects through AI-led predictive & prescriptive insights. This way AIOps enables a shift left in the development process, smarter resource management, and significantly improves software quality & time to market.  

Kappa (κ) Architecture – Streaming at Scale

We are in the era of stream-processing-as-a-service, and for any data-driven organization, stream-based computing has become the norm. In the last three parts (https://bit.ly/2WgnILP, https://bit.ly/3a6ij2k, https://bit.ly/3gICm88), I explored the Lambda Architecture and its variants. In this article, let's discover streaming in the big data world. ‘Real-time analytics’, ‘real-time data’, and ‘streaming data’ have become mandatory in any big data platform. The aspiration to extend data analysis (predictive, descriptive, or otherwise) to streaming event data is common across enterprises, and there is growing interest in real-time big data architectures. Kappa (κ) Architecture is one that deals with streaming. Let's see why real-time analytics matters more than ever and mandates data streaming, how a streaming architecture like Kappa works, and whether Kappa is an alternative to Lambda.

“You and I are streaming data engines.” – Jeff Hawkins


Questioning Lambda

The Lambda Architecture fits many real-time use cases very well, mainly those involving re-computing algorithms. At the same time, it has inherent development and operational complexities: every algorithm must be implemented twice, once in the cold path (the batch layer) and again in the hot path (the real-time layer). Apart from this dual execution path, the Lambda Architecture has the inevitable issue of debugging, because operating two distributed multi-node services is more complex than operating one.

Given these obvious shortcomings of the Lambda Architecture, Jay Kreps, CEO of Confluent and co-creator of Apache Kafka, started the discussion on the need for a new architectural paradigm that uses fewer code resources and performs well in certain enterprise scenarios. This gave rise to the Kappa (κ) Architecture. The real motivation for the Kappa Architecture isn't efficiency at all, but rather allowing people to develop, test, debug, and operate their systems on top of a single processing framework. In fact, Kappa is not seen as a competitor to Lambda; on the contrary, it is seen as an alternative.


What is Streaming & Streaming Architecture?

Modern business requirements necessitate a paradigm shift from the traditional approach of batch processing to real-time data streams. Data-centric organizations mandate a stream-first approach: processing data at the very moment it arrives. Real-time analytics, whether on-demand or continuous, is the capability to process data right as it enters the system, with no batch processing involved. It enhances the ability to make better decisions and take meaningful action on a timely basis. By combining and analyzing data at the right place and the right time, real-time analytics generates value from disparate data.

Typically, most of the streaming architectures will have the following 3 components:

  • an aggregator that gathers event streams and batch files from a variety of data sources,
  • a broker that makes data available for consumption,
  • an analytics engine that analyzes the data, correlates values and blends streams together.

Kappa (K) Architecture for Big Data era

Kappa (κ) Architecture is one of the new software architecture patterns for the new data era. It is mainly used for processing streaming data. The architecture gets its name from the Greek letter Kappa (κ), and its introduction is attributed to Jay Kreps.

The main idea behind the Kappa Architecture is that both real-time and batch processing can be carried out, especially for analytics, with a single technology stack. The data from IoT, streaming, and static/batch sources, or near real-time sources like change data capture, is ingested into messaging/pub-sub platforms like Apache Kafka.

An append-only immutable log store is used in the Kappa Architecture as the canonical store. Following are the pub/sub or message buses or log databases that can be used for ingestion:

  • Amazon Quantum Ledger Database (QLDB)
  • Apache Kafka
  • Apache Pulsar
  • Amazon Kinesis
  • Amazon DynamoDB Streams
  • Azure Cosmos DB Change Feed
  • Azure EventHub
  • DistributedLog
  • EventStore
  • Chronicle Queue
  • Pravega

Distributed stream processing engines like Apache Spark, Apache Flink, etc. read the data from the streaming platform, transform it into an analyzable format, and then store it in an analytics database in the serving layer. Following are some of the distributed streaming computation systems:

  • Amazon Kinesis
  • Apache Flink
  • Apache Samza
  • Apache Spark
  • Apache Storm
  • Apache Beam
  • Azure Stream Analytics
  • Hazelcast Jet
  • Kafka Streams
  • Onyx
  • Siddhi

In short, any query in the Kappa Architecture is defined by the following functional equation.

Query = K (New Data) = K (Live streaming data)

The equation means that all queries can be catered to by applying the Kappa function to the live stream of data at the speed layer. It also signifies that stream processing occurs on the speed layer in the Kappa architecture.
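To make the single-path idea concrete, here is a minimal, hypothetical sketch of a Kappa-style consumer using the kafka-python client: one stream-processing path reads the append-only log, maintains an aggregate, and serves the continuously updated view. The topic name, message fields, and the in-memory ‘serving store’ are assumptions for illustration.

```python
import json
from collections import defaultdict

from kafka import KafkaConsumer  # pip install kafka-python

# Single processing path: consume the immutable log from the beginning.
# Re-computation in Kappa is just replaying the same topic with new code.
consumer = KafkaConsumer(
    "it-events",                               # hypothetical topic name
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

# Stand-in for the serving-layer analytics store
event_counts_by_host = defaultdict(int)

for message in consumer:
    event = message.value                      # e.g. {"host": "web-01", "type": "error"}
    event_counts_by_host[event["host"]] += 1
    # Serve the continuously updated view (here, simply print it)
    print(dict(event_counts_by_host))
```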

Pros and Cons of Kappa architecture

Pros

  • Data systems that don’t need a batch layer, such as online learning or real-time monitoring & alerting systems, can use the Kappa Architecture.
  • If computations and analysis done in the batch and streaming layer are identical, then using Kappa is likely the best solution.
  • Re-computations or re-iterations are required only when the code changes.
  • It can be deployed with fixed memory.
  • It can be used for horizontally scalable systems.
  • Fewer resources are required as the machine learning is done on a real-time basis.

Cons

The absence of a batch layer might result in errors during data processing or while updating the database, which requires an exception manager to reprocess the data or perform reconciliation.

Finding the right architecture for a data-driven organization involves many considerations. As with most successful analytics projects that take a streaming-first approach, the key is to start small in scope with well-defined deliverables, then iterate. The reason for considering a distributed systems architecture (generic Lambda, unified Lambda, or Kappa) is the minimized time to value.


About the Author

Bargunan Somasundaram

Bargunan is a Big Data Engineer and a programming enthusiast. His passion is to share his knowledge by writing his experiences about them. He believes “Gaining knowledge is the first step to wisdom and sharing it is the first step to humanity.”

Addressing Web Application Performance Issues

With the use of hybrid technologies and distributed components, applications are becoming increasingly complex. Irrespective of the complexity, it is important to ensure the end-user gets an excellent experience using the application. Hence, it is essential to monitor the performance of an application to provide greater satisfaction to the end-user.

External factors

When the web applications face performance issues, here are some questions you need to ask:

  • Does the application always face performance issues or just during a specific period?
  • Whether a particular user or group of users face the issue or is the problem omnipresent for all the users?
  • Are you treating your production environment as a real production environment, or have you loaded it with applications, services, and background processes running without proper consideration?
  • Was there any recent release to any of the application stack like Web, Middle Tier, API, DB, etc., and how was the performance before this release?
  • Have there been any hardware or software upgrades recently?

Action items on the ground

Answering the above set of questions should have brought you closer to the root cause. If not, given below are some steps you can take to troubleshoot the performance issue:

  • Look at the number of incoming requests; is the application facing unusual load?
  • Identify how many requests are taking longer than usual, say more than 5,000 milliseconds, to serve a request or a web page (see the sketch after this list).
  • Is the load being generated by a specific user or group of users – is someone trying to create intentional load?
  • Look at the web pages/methods/functions in the source code that are taking more time. Check the web server logs; this can be identified provided the application does that level of custom logging.
  • Identify whether any 3rd party links or APIs used in the application are causing slowness.
  • Check whether the database queries are taking more time.
  • Identify whether the problem is related to a certain browser.
  • Check whether the server side or client side is facing any uncaught exceptions that are impacting performance.
  • Check the performance of the CPU, memory, and disk of the server(s) on which the application is hosted.
  • Check the sibling processes that are consuming more memory/CPU/disk on all servers and take appropriate action depending on whether those background processes need to be on that server, can be moved elsewhere, or can be removed entirely.
  • Look at the web server performance to fine-tune the cache, session timeout, pool size, and queue length.
  • Check for deadlocks, buffer hit ratio, IO busy, etc. to fine-tune the performance.
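As a simple illustration of the second item above, here is a hypothetical Python sketch that scans a web server access log for requests slower than 5,000 milliseconds. The log format assumed here (method, URL, status, and elapsed milliseconds as the last fields of each line) is an assumption; adjust the parsing to your server’s actual log format.

```python
from collections import Counter

SLOW_MS = 5000  # threshold from the checklist above

def find_slow_requests(log_path):
    """Return a Counter of URLs whose response time exceeded SLOW_MS.

    Assumes each line ends with: <method> <url> <status> <elapsed_ms>
    e.g.  ... GET /checkout 200 7312
    """
    slow = Counter()
    with open(log_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 4:
                continue
            url, elapsed_ms = parts[-3], parts[-1]
            try:
                if float(elapsed_ms) > SLOW_MS:
                    slow[url] += 1
            except ValueError:
                continue  # skip malformed lines
    return slow

# Example usage (hypothetical log file path)
# for url, count in find_slow_requests("/var/log/webapp/access.log").most_common(10):
#     print(f"{count:5d} slow requests  {url}")
```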

Challenges 

  • Performing all these steps exactly when there is a performance issue may not always be practical. By the time you collect some of this data, you may lose important data for the rest of the items, unless historical data is collected and stored for reference.
  • Even if the data is collected, correlating it to arrive at the exact root cause is not an easy task.
  • You need to be tech-savvy across all layers to know what parameters to collect and how to collect them.

And the list of challenges goes on…

Think of an ideal situation where you have the metrics for all the action items described above right in front of you. Is there such a magic bullet available? Yes. Zero Incident Framework™ Application Performance Monitoring (ZIF APM) gives you the above details at your fingertips, thereby making troubleshooting a simple task.

ZIF APM has more to offer than a regular APM. The APM Engine has built-in AI features. It monitors the application across all layers, starting from the end-user, through the web application, web server, API layers, and databases, down to the underlying infrastructure including the OS and its performance factors, irrespective of whether these layers are hosted on the cloud, on-premise, or both. It also applies AI to monitoring, mapping, and tracing, and analyzes patterns to provide observability and insights. Given below is a typical representation of a distributed application and its components. The rest of the section covers how ZIF APM provides such a deep level of insight.

ZIF APM

Once the APM Engine is installed and run on portfolio servers, the built-in AI engine does the following automatically:

  1. Monitors the performance of the application (web) layer, service layer, API, and middle tier, and maps the insights from User <–> Web <–> API <–> Database for each and every application. There is no need to manually link Application 1 in Web Server A with API 1 in Middle Tier B, and so on.
  2. Traces the end-to-end user transaction journey for all transactions with Unique ID.
  3. Monitors the performance of the 3rd party calls (e.g. web service, API calls, etc.), no need to map them.
  4. Monitors the End User Experience through RUM (Real User Monitoring) without any end-user agent.

<A reference screenshot of how APM maps the user transaction journey across different nodes. The screenshot also gives the Method level performance insights>

Why choose ZIF APM? Key Features and Benefits

  1. All-in-One – Provides complete insight into the infrastructure metrics of the underlying web server, API server, and DB server, like CPU, memory, disk, and others.
  2. End-user experience (RUM) – Captures performance issues and anomalies faced by end-user at the browser side.
  3. Anomalies detection – Offers deeper insights on the exceptions faced by the application including the line number in the source code where the issue has occurred.
  4. Code-level insights – Gives details about which method and function calls within the source code are taking more time or slowing down the application.
  5. 3rd Party and DB Layer visibility – Provides the details about 3rd party APIs or Database calls and Queries which are delaying the web application response.
  6. AHI – The Application Health Index is a scorecard based on A) end-user experience, B) application anomalies, C) server performance, and D) database performance factors that apply to the given environment or application. The weightage and number of components A, B, C, D are variable. For instance, if ‘web server performance’ or ‘network performance’ needs to be brought in as a new variable ‘E’, the weightages are adjusted/recalculated so that they still total 100% (see the sketch after this list).
  7. Pattern Analysis – Analyzes unusual spikes through pattern matching and alerts are provided.
  8. GTrace – Provides the transaction journey of the user transaction and the layers it is passing through and where the transaction slows down, by capturing the performance of each transaction of all users.
  9. JVM and CLR – Provides the Performance of the underlying operating system, Web server, and run time (JVM, CLR).
  10. LOG Monitoring – Provides deeper insight on the application logs.
  11. Problem isolation – ZIF APM helps in problem isolation by comparing the performance with another user in the same location at the same time.
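The weighted-scorecard idea behind AHI can be illustrated with a small, purely hypothetical Python sketch; the component names, weights, and scores below are examples, not ZIF’s actual AHI formula.

```python
def health_index(component_scores, weights):
    """Weighted health index on a 0-100 scale.

    component_scores: {component: score 0-100}
    weights: {component: weight}; weights are normalized to total 100%.
    """
    total_weight = sum(weights.values())
    return sum(component_scores[c] * weights[c] for c in weights) / total_weight

# A) end-user experience, B) anomalies, C) server, D) database (example values)
scores = {"A": 92, "B": 75, "C": 88, "D": 95}
weights = {"A": 40, "B": 20, "C": 20, "D": 20}
print(round(health_index(scores, weights), 1))   # 88.4

# Bringing in a new component 'E' (e.g. web server performance) just adds
# another weight; normalization keeps the index calculated against 100%.
scores["E"], weights["E"] = 70, 10
print(round(health_index(scores, weights), 1))   # 86.7
```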

Visit www.zif.ai for more details.

About the Author –

Suresh Kumar Ramasamy

Suresh heads the Monitor component of ZIF at GAVS. He has 20 years of experience in Native Applications, Web, Cloud, and Hybrid platforms from Engineering to Product Management. He has designed & hosted the monitoring solutions. He has been instrumental in conglomerating components to structure the Environment Performance Management suite of ZIF Monitor. Suresh enjoys playing badminton with his children. He is passionate about gardening, especially medicinal plants.