Algorithmic Alert Correlation

Today’s always-on businesses and 24×7 uptime demands have necessitated IT monitoring to go into overdrive. While constant monitoring is a good thing, the downside is that the flood of alerts generated can quickly get overwhelming. Constantly having to deal with thousands of alerts each day causes alert fatigue, and impacts the overall efficiency of the monitoring process.

Hence, chalking out an optimal strategy for alert generation & management becomes critical. Pattern-based thresholding is an important first step, since it tunes thresholds continuously, to adapt to what ‘normal’ is, for the real-time environment. Threshold accuracy eliminates false positives and prevents alerts from getting fired incorrectly. Selective alert suppression during routine IT Ops maintenance activities like backups, patches, or upgrades, is another. While there are many other strategies to keep alert numbers under control, a key process in alert management is the grouping of alerts, known as alert correlation. It groups similar alerts under one actionable incident, thereby reducing the number of alerts to be handled individually.

But, how is alert ‘similarity’ determined? One way to do this is through similarity definitions, in the context of that IT landscape. A definition, for instance, would group together alerts generated from applications on the same host, or connectivity issues from the same data center. This implies that similarity definitions depend on the physical and logical relationships in the environment – in other words – the topology map. Topology mappers detect dependencies between applications, processes, networks, infrastructure, etc., and construct an enterprise blueprint that is used for alert correlation.

But what about related alerts generated by entities that are neither physically nor logically linked? To give a hypothetical example, let’s say application A accesses a server S which is responding slowly, and so A triggers alert A1. This slow communication of A with S eats up host bandwidth, and hence affects another application B in the same host. Due to this, if a third application C from another host calls B, alert A2 is fired by C due to the delayed response from B.  Now, although we see the link between alerts A1 & A2, they are neither physically nor logically related, so how can they be correlated? In reality, such situations could imply thousands of individual alerts that cannot be combined.

This is one of the many challenges in IT operations that we have been trying to solve at GAVS. The correlation engine of our AIOps Platform ZIF uses algorithmic alert correlation to find a solution for this problem. We are working on two unsupervised machine learning algorithms that are fundamentally different in their approach – one based on pattern recognition and the other based on spatial clustering. Both algorithms can function with or without a topology map, and work around what is supplied and available. The pattern learning algorithm derives associations based on learnings from historic patterns of alert relationships. The spatial clustering algorithm works on the principle of similarity based on multiple features of alerts, including problem similarity derived by applying Natural Language Processing (NLP), and relationships, among several others. Tuning parameters enable customization of algorithmic behavior to meet specific demands, without requiring modifications to the core algorithms. Time is also another important dimension factored into these algorithms, since the clustering of alerts generated over an extended period of time will not give meaningful results.

The traditional alert correlation has not been able to scale up to handle the volume and complexity of alerts generated by the modern-day hybrid and dynamic IT infrastructure. We have reached a point where our ITOps needs have surpassed the limits of human capabilities, and so, supplementing our intelligence with Artificial Intelligence and Machine Learning has now become indispensable.

About the Author –

Padmapriya Sridhar

Priya is part of the Marketing team at GAVS. She is passionate about Technology, Indian Classical Arts, Travel, and Yoga. She aspires to become a Yoga Instructor someday!

Gireesh Sreedhar KP

Gireesh is a part of the projects run in collaboration with IIT Madras for developing AI solutions and algorithms. His interest includes Data Science, Machine Learning, Financial markets, and Geo-politics. He believes that he is competing against himself to become better than who he was yesterday. He aspires to become a well-recognized subject matter expert in the field of Artificial Intelligence.

Cloud Adoption, Challenges, and Solution

Cloud Adoption

Cloud computing is the delivery of computing services including Servers, Database, Storage, Networking, and others over the internet. Public, Private, and Hybrid clouds are different ways of deploying cloud computing.  

  • In a public cloud, the cloud resources are owned by 3rd party cloud service providers
  • A private cloud consists of computing resources exclusively by one business or organization
  • Hybrid provides the best of both worlds, combines on-premises infrastructure, private cloud with public cloud

Microsoft, Google, Amazon, Oracle, IBM, and others are providing cloud platforms to users to host and experience practical business solutions. The worldwide public cloud services market is forecast to grow 17% in 2020 to total $266.4 billion and $354.6 billion in 2022, up from $227.8 billion in 2019, per Gartner, Inc.

There are various types of Instances, workloads and options available as part of the cloud ecosystem, i.e. IaaS, PaaS, SaaS, Multi-cloud, Serverless.


When very large, large and medium enterprises decide to move their IT environment from on-premise to cloud, they try to move some/most of their on-premises into cloud and keep the rest under their control on-premise. There are various factors that impact the decision, to name a few,

  1. ROI vs Cost of Cloud Instance, Operation cost
  2. Architecture dependency of the application, i.e. whether it is monolithic or multi-tier or polyglot or hybrid cloud
  3. Requirement and need for elasticity and scalability
  4. Availability of right solution from the cloud provider
  5. Security of some key data

After crossing all, once the IT environment is cloud enabled, the challenge comes in ensuring the monitoring of the cloud enabled IT environment. Here are some of the business and IT challenges

  • How to ensure the various workloads and Instances are working as expected?

While the cloud provider may give high availability and uptime depending on the tier we choose, it is important that our IT team monitors the environment, as in the case of IaaS and to some extent in PaaS as well.

  • How to ensure the Instances are optimally used in terms of computing and storage?

Cloud providers give most of the metrics around the Instances, though it may not provide all the metrics that we may need to make decisions in every possible scenario.

The disadvantages of this model are cost, latency and not straight forward, e.g. the LOG analytics which comes in Azure involves cost for every MB/GB of data that is stored and the latency in getting the right metrics at right time, if there is latency/delay, you may not get the right result.

  • How to ensure the Application or the components of a single solution that are spread across On-Premise and Cloud environment is working as expected?

Some cloud providers give tools for integrating the metrics from on-premise to the cloud environment to have a shared view.

The disadvantage of this model is that it is not possible to bring in all sorts of data together to get the insights straight. That is, observability is always a question. The ownership of getting the observability lies with the IT team who handles the data.

  • How to ensure that the Multi-Cloud + On-Premise environment is effectively monitored and utilized for the best end-user experience?

Multi-Cloud environment – With rapidly growing Microservices Architecture and Container-based cloud enabled model, it is quite natural that the enterprise may choose the best from different cloud providers like Azure, AWS, Google, and others.

There is little support from cloud provider on this space. In fact, some cloud providers do not even support this scenario.

  • How to get a single panel of view for troubleshooting and root cause analysis?

Especially when issues crop up in Application, Database, Middle Tier, Network and 3rd party layers that are spread across multi-cluster, multi-cloud, elastic environment, it is very important to get a unified view of entire environment.

ZIF (Zero Incident FrameworkTM) provides a single platform for Cloud Monitoring.

ZIF has Discovering, Monitoring, Predicting and Remediating capabilities. It provides the unified dashboard with insights across all layers of IT infrastructure that is distributed across On-premise host, Cloud Instance and Containers.

Cloud Adoption

Core features of ZIF for Cloud are,

  • Discovery and Topology
    • Real-time mapping of applications and its dependent layers irrespective of whether the components live on-premise, or on cloud or containerized in cloud.
    • Dynamically built topology of all layers which helps in taking effective decisions.
  • Observability across Multi-Cloud and On-Premise tiers
    • Analysis of the monitored data to come up with meaningful insights.
    • Unified view of the entire IT environment, especially important when the IT infrastructure is spread across multiple cloud platform like Azure, AWS, Google Cloud and others.
  • Root cause analysis
    • Quick root cause analysis by analysing various causes captured by ZIF Monitor instead of going through layer by layer. This saves time to focus on problem solving and arresting instead of spending effort on identifying the root cause.
    • Insights across your workload including the impact due to 3rd party layers.
  • Container and Microservice support
    • Understand the resource utilization of your containers that are hosted in the cloud and on-premise.
    • Know the bottlenecks around the Microservices and tune your environment for the spikes in load.
    • Get full support for monitoring applications distributed across your local host and containers in cloud in a multi-cluster setup.
  • End-User Experience
    • Helps improve the experience of the end-user getting served by the workload from cloud.
    • Helps to trace each and every request of each and every user, thus it is quite natural for ZIF to unearth the performance bottlenecks across all layers which in turn improves the user experience.
  • Metrics driven decision for resource optimization
    • Provides meaningful insights and alerts in terms of the surge in the load, the growth in number of VMs, containers and the usage of resource across other workloads.
    • Enables authorization of Elasticity and Scalability through well informed metrics.

ZIF Seamlessly integrates with following Cloud and Container environments,

  • Microsoft Azure
    • AWS
    • Google Cloud
    • Docker
    • Kubernetes

Watch for this space for more Use cases around ZIF for Cloud.

About the Author

Suresh Kumar Ramasamy

Suresh heads the Monitor component of ZIF at GAVS. He has 20 years of experience in Native Applications, Web, Cloud, and Hybrid platforms from Engineering to Product Management. He has designed & hosted the monitoring solutions. He has been instrumental in conglomerating components to structure the Environment Performance Management suite of ZIF Monitor.

Suresh enjoys playing badminton with his children. He is passionate about gardening, especially medicinal plants.

Generative Adversarial Networks (GAN)

In my previous article (, I had introduced Inverse Reinforcement Learning and explained how it differs from Reinforcement Learning. In this article, let’s explore Generative Adversarial Networks or GAN; both GAN and reinforcement learning help us understand how deep learning is trying to imitate human thinking.

With access to greater hardware power, Neural Networks have made great progress. We use them to recognize images and voice at levels comparable to humans sometimes with even better accuracy. Even with all of that we are very far from automating human tasks with machines because a tremendous amount of information is out there and to a large extent easily accessible in the digital world of bits. The tricky part is to develop models and algorithms that can analyze and understand this humongous amount of data.

GAN in a way comes close to achieving the above goal with what we call automation, we will see the use cases of GAN later in this article.

This technique is very new to the Machine Learning (ML) world. GAN is a deep learning, unsupervised machine learning technique proposed by Ian Goodfellow and few other researchers including Yoshua Bengio in 2014. One of the most prominent researcher in the deep learning area, Yann LeCun described it as “the most interesting idea in the last 10 years in Machine Learning”.

What is Generative Adversarial Network (GAN)?

A GAN is a machine learning model in which two neural networks compete to become more accurate in their predictions. GANs typically run unsupervised and use a cooperative zero-sum game framework to learn.

The logic of GANs lie in the rivalry between the two Neural Nets. It mimics the idea of rivalry between a picture forger and an art detective who repeatedly try to outwit one another. Both networks are trained on the same data set.

A generative adversarial network (GAN) has two parts:

  • The generator (the artist) learns to generate plausible data. The generated instances become negative training examples for the discriminator.
  • The discriminator (the critic) learns to distinguish the generator’s fake data from real data. The discriminator penalizes the generator for producing implausible results.

GAN can be compared with Reinforcement Learning, where the generator is receiving a reward signal from the discriminator letting it know whether the generated data is accurate or not.

Generative Adversarial Networks

During training, the generator tries to become better at generating real looking images, while the discriminator trains to be better classify those images as fake. The process reaches equilibrium at a point when the discriminator can no longer distinguish real images from fakes.

Generative Adversarial Networks

Here are the steps a GAN takes:

  • The input to the generator is random numbers which returns an image.
  • The output image of the generator is fed as input to the discriminator along with a stream of images taken from the actual dataset.
  • Both real and fake images are given to the discriminator which returns probabilities, a number between 0 and 1, 1 meaning a prediction of authenticity and 0 meaning fake.

So, you have a double feedback loop in the architecture of GAN:

  • We have a feedback loop with the discriminator having ground truth of the images from actual training dataset
  • The generator is, in turn, in a feedback loop along with the discriminator.

Most GANs today are at least loosely based on the DCGAN architecture (Radford et al., 2015). DCGAN stands for “deep, convolution GAN.” Though GANs were both deep and convolutional prior to DCGANs, the name DCGAN is useful to refer to this specific style of architecture.

Applications of GAN

Now that we know what GAN is and how it works, it is time to dive into the interesting applications of GANs that are commonly used in the industry right now.

Generative Adversarial Networks

Can you guess what’s common among all the faces in this image?

None of these people are real! These faces were generated by GANs, exciting and at the same time scary, right? We will focus about the ethical application of the GAN in the article.

GANs for Image Editing

Using GANs, appearances can be drastically changed by reconstructing the images.

GANs for Security

GANs has been able to address the concern of ‘adversarial attacks’.

These adversarial attacks use a variety of techniques to fool deep learning architectures. Existing deep learning models are made more robust to these techniques by GANs by creating more such fake examples and training the model to identify them.

Generating Data with GANs

The availability of data in certain domains is a necessity, especially in domains where training data is needed to model learning algorithms. The healthcare industry comes to mind here. GANs shine again as they can be used to generate synthetic data for supervision.

GANs for 3D Object Generation

GANs are quite popular in the gaming industry. Game designers work countless hours recreating 3D avatars and backgrounds to give them a realistic feel. And, it certainly takes a lot of effort to create 3D models by imagination. With the incredible power of GANs, wherein they can be used to automate the entire process!

GANs are one of the few successful techniques in unsupervised machine learning and it is evolving quickly and improving our ability to perform generative tasks. Since most of the successful applications of GANs have been in the domain of computer vision, generative model sure has a lot of potential, but is not without some drawbacks.

About the Author –

Naresh B

Naresh is a part of Location Zero at GAVS as an AI/ML solutions developer. His focus is on solving problems leveraging AI/ML.
He strongly believes in making success as a habit rather than considering it as a destination.
In his free time, he likes to spend time with his pet dogs and likes sketching and gardening.

Lambda (λ), Kappa (κ) and Zeta (ζ) – The Tale of 3 AIOps Musketeers (PART-3)

“Data that sit unused are no different from data that were never collected in the first place.” – Doug Fisher

In the part 1 (, we delved into Lambda Architecture and in part 2 ( about Generic Lambda. Given the limitations of the Generic lambda architecture and its inherent complexity, the data is replicated in two layers and keeping them in-sync is quite challenging in an already complex distributed system.There is a growing interest to find the simpler alternative to the Generic Lambda, that would bring just about the same benefits and handle the full problem set. The solution is Unified Lambda (λ) Architecture.

Unified Lambda (λ) Architecture

The unified approach addresses the velocity and volume problems of Big Data as it uses a hybrid computation model. This model combines both batch data and instantaneous data transparently.

There are basically three approaches:

  1. Pure Streaming Framework
  2. Pure Batch Framework
  3. Lambdoop Framework

1. Pure streaming framework

In this approach, a pure streaming model is adopted and a flexible framework like Apache Samza can be employed to provide unified data processing model for both stream and batch processing using the same data flow structure.

Pure streaming framework

To avoid the large turn-around times involved in Hadoop’s batch processing, LinkedIn came up with a distributed stream processing framework Apache Samza. It is built on top of distributed messaging bus; Apache Kafka, so that it can be a lightweight framework for streaming platform. i.e. for continuous data processing. Samza has built-in integration with Apache Kafka, which is comparable to HDFS and MapReduce. In the Hadoop world, HDFS is the storage layer and MapReduce, the processing layer. In the similar way, Apache Kafka ingests and stores the data in topics, which is then streamed and processed by Samza. Samza normally computes results continuously as and when the data arrives, thus delivering sub-second response times.

Albeit it’s a distributed stream processing framework, its architecture is pluggable i.e. can be integrated with umpteen sources like HDFS, Azure EventHubs, Kinensis etc. apart from Kafka. It follows the principle of WRITE ONCE, RUN ANYWHERE; meaning, the same code can run in both stream and batch mode. Apache Samza’s streams are re-playable, ordered partitions.

Unified API for Batch & Streaming in pure Streaming

Apache Samza offers a unified data processing model for both real-time as well as batch processing.  Based on the input data size, bounded or unbounded the data processing model can be identified, whether batch or stream.Typically bounded (e.g. static files on HDFS) are Batch data sources and streams are unbounded (e.g. a topic in Kafka). Under the bonnet, Apache Samza’s stream-processing engine handles both types with high efficiency.

Unified API for Batch & Streaming in pure Streaming

Another advantage of this unified API for Batch and Streaming in Apache Samza, is that makes it convenient for the developers to focus on the processing logic, without treating bounded and unbounded sources differently. Samza differentiates the bounded and unboundeddata by a special token end-of-stream. Also, only config change is needed, and no code changes are required, in case of switching gears between batch and streaming, e.g. Kafka to HDFS.Let us take an example of Count PageViewEvent for each mobile Device OS in a 5-minute window and send the counts to PageViewEventPerDeviceOS

Pure Batch framework

This is the reverse approach of pure streaming where a flexible Batch framework is employed, which would offer both the batch processing and real-time data processing ability. The streaming is achieved by using mini batches which is small enough to be close to real-time, with Apache Spark/Spark Streaming or Storm’s Trident. Under the hood, Spark streaming is a sequence of micro-batch processes with the sub-second latency. Trident is a high-level abstraction for doing streaming computation on top of Storm. The core data model of Trident is the “Stream”, processed as a series of batches.

Apache Spark achieves the dual goal of Batch as well as real-time processing by the following modes.

  • Micro-batch processing model
  • Continuous Processing model

Micro-batch processing model

Micro-batch processing is analogous to the traditional batch processing in that data are usually processed as a group. The primary difference is that the batches are smaller and processed more often. In spark streaming, the micro-batches are created based on the time rather than on the accumulated data size. The smaller the time to trigger a micro-batch to process, lesser the latency.

Continuous Processing model

Apache Spark 2.3, introduced Low-latency Continuous Processing Mode in Structured Streaming whichenables low (~1 ms) end-to-end latency with at-least-once fault-tolerance guarantees. Comparing this with the default micro-batch processing engine which can achieve exactly-once guarantees but achieve latencies of ~100 ms at best. Without modifying the application logic i.e. DataFrame/Dataset operations mini-batching or continuous streaming can be chosen at runtime. Spark Streaming also has the abilityto work well with several data sources like HDFS, Flume or Kafka.

Example of Micro-batching and Continuous Batching

3. Lambdoop Approach

In many places, capability of both batch and real time processing is needed.It is cumbersome to develop a software architecture of such capabilities by tailoring suitable technologies, software layers, data sources, data storage solutions, smart algorithms and so on to achieve the good scalable solution. This is where the frameworks like Spring “XD”, Summingbird or Lambdoop comes in, since they already have a combined API for batch and real-time processing.


Lambdoop is a software framework based on the Lambda architecture which provides an abstraction layer to the developers. This feature makes the developers life easy to develop any Big Data applications by combining real time and batch processing approaches. Developers don’t have to deal with different technologies, configurations, data formats etc. They can use the Lambdoop framework as the only needed API. Also, Lambdoop includes other interesting tools such as input/output drivers, visualization tools, cluster management tools and widely accepted AI algorithms.

The Speed layer in Lambdoop runs on Storm and the Batch layer on Hadoop, Lambdoop (Lambda-Hadoop, with HBase, Storm and Redis) also combines batch/real-time by offering a single API for both processing models.


Summingbird aka‘Streaming MapReduce’ is a hybrid computational system where both the batch/streaming computations can be run at the same time and the results can be merged automatically. In Summingbird, the developer can write the code/job logic once and change the backend as and when needed. Following are the modes in which Summingbird Job/code can be executed.

  • batch mode (using Scalding on Hadoop)
  • real-time mode (using Storm)
  • hybrid batch/real-time mode (offers attractive fault-tolerance properties)

If the model assumes streaming, one-at-a-time semantics, then the code can be run in real-time e.g. Strom or in offline/batch mode e.g. Hadoop, spark etc. It can operate in a hybrid processing mode, when there is a need to transparently integrate batch and online results to efficiently generate up-to-date views over long time spans.


The volume of any Big Data platform is handled by building a batch processing application which requires, MapReduce, spark development, Use of other Hadoop related tools like Sqoop, Zookeeper, HCatalog etc. and storage systems like HBase, MongoDB, HDFS, Cassandra. At the same time the velocity of any Big Data platform is handled by building a real-time streaming application which requires, stream computing development using Storm, Samza, Kafka-connect, Apache Flink andS4, and use of temporal datastores like in-memory data stores, Apache Kafka messaging system etc.

The Unified Lambda handles the both Volume and Velocity if any Big Data platform by the intermixed approach of featuring a hybrid computation model, where both batch and real-time data processing are combined transparently. Also, the limitations of Generic Lambda like Dual execution mode, Replicating and maintaining the data sync between different layers are avoided and in the Unified Lambda, there would be only one system to learn and maintain.

About the Author:

Bargunan Somasundaram

Bargunan Somasundaram

Bargunan is a Big Data Engineer and a programming enthusiast. His passion is to share his knowledge by writing his experiences about them. He believes “Gaining knowledge is the first step to wisdom and sharing it is the first step to humanity.”

Inverse Reinforcement Learning

Naresh B

What is Inverse Reinforcement Learning(IRL)?

Inverse reinforcement learning is a recently developed Machine Learning framework that can solve the inverse problem of Reinforcement Learning (RL). Basically, IRL is about learning from humans. Inverse reinforcement learning is the field of learning an agent’s objectives, values, or rewards by observing its behavior.

Before getting into further details of IRL, let us recap RL.
Reinforcement learning is an area of Machine Learning (ML) that takes suitable actions to maximize rewards. The goal of reinforcement learning algorithms is to find the best possible action to take in a specific situation.

Challenges in RL

One of the hardest challenges in many reinforcement learning tasks is that it is often difficult to find a good reward function which is both learnable (i.e. rewards happen early and often enough) and correct (i.e. leads to the desired outcomes). Inverse reinforcement learning aims to deal with this problem by learning a reward function based on observations of expert behavior.

What distinguishes Inverse Reinforcement Learning from Reinforcement Learning?

In RL, our agent is provided with a reward function which, whenever it executes an action in some state, provides feedback about the agent’s performance. This reward function is used to obtain an optimal policy, one where the expected future reward (discounted by how far away it will occur) is maximal.

In IRL, the setting is (as the name suggests) inverse. We are now given some agent’s policy or a history of behavior and we try to find a reward function that explains the given behavior. Under the assumption that our agent acted optimally, i.e. always picks the best possible action for its reward function, we try to estimate a reward function that could have led to this behavior.

The biggest motivation for IRL

Maybe the biggest motivation for IRL is that it is often immensely difficult to manually specify a reward function for a task. So far, RL has been successfully applied in domains where the reward function is very clear. But in the real world, it is often not clear at all what the reward should be and there are rarely intrinsic reward signals such as a game score.

For example, consider we want to design an artificial intelligence for a self-driving car. A simple approach would be to create a reward function that captures the desired behavior of a driver, like stopping at red lights, staying off the sidewalk, avoiding pedestrians, and so on. In real life, this would require an exhaustive list of every behavior we’d want to consider, as well as a list of weights describing how important each behavior is.

Instead, in the IRL framework, the task is to take a set of human-generated driving data and extract an approximation of that human’s reward function for the task. Of course, this approximation necessarily deals with a simplified model of driving. Still, much of the information necessary for solving a problem is captured within the approximation of the true reward function. Since it quantifies how good or bad certain actions are. Once we have the right reward function, the problem is reduced to finding the right policy and can be solved with standard reinforcement learning methods.

For our self-driving car example, we’d be using human driving data to automatically learn the right feature weights for the reward. Since the task is described completely by the reward function, we do not even need to know the specifics of the human policy, so long as we have the right reward function to optimize. In the general case, algorithms that solve the IRL problem can be seen as a method for leveraging expert knowledge to convert a task description into a compact reward function.


The foundational methods of inverse reinforcement learning can achieve their results by leveraging information obtained from a policy executed by a human expert. However, in the long run, the goal is for machine learning systems to learn from a wide range of human data and perform tasks that are beyond the abilities of human experts.


About the Author

Naresh is a part of Location Zero at GAVS as an AI/ML solutions developer. His focus is on solving problems leveraging AI/ML. He strongly believes in making success as an habit rather than considering it a destination. In his free time, he likes to spend time with his pet dogs and likes sketching and gardening.

Is Your Investment in TRUE AI?

Yes, AIOps the messiah of ITOps is here to stay! The Executive decision now is on the who and how, rather than when. With a plethora of products in the market offering varying shades of AIOps capabilities, choosing the right vendor is critical, to say the least.

Exclusively AI-based Ops?

Simply put, AIOps platforms leverage Big Data & AI technologies to enhance IT operations. Gartner defines Acquire, Aggregate, Analyze & Act as the four stages of AIOps. These four fall under the purview of Monitoring tools, AIOps Platforms & Action Platforms. However, there is no Industry-recognized mandatory feature list to be supported, for a Platform to be classified as AIOps. Due to this ambiguity in what an AIOps Platform needs to Deliver, huge investments made in rosy AIOps promises can lead to sub-optimal ROI, disillusionment or even derailed projects. Some Points to Ponder…

  • Quality in, Quality out. The value delivered from an AIOps investment is heavily dependent on what data goes into the system. How sure can we be that IT Asset or Device monitoring data provided by the Customer is not outdated, inaccurate or patchy? How sure can we be that we have full visibility of the entire IT landscape? With Shadow IT becoming a tacitly approved aspect of modern Enterprises, are we seeing all devices, applications and users? Doesn’t this imply that only an AIOps Platform providing Application Discovery & Topology Mapping, Monitoring features would be able to deliver accurate insights?
  • There is a very thin line between Also AI and Purely AI. Behind the scenes, most AIOps Platforms are reliant on CMDB or similar tools, which makes Insights like Event Correlation, Noise Reduction etc., rule-based. Where is the AI here?
  • In Gartner’s Market Guide, apart from support features for the different data types, Automated Pattern Discovery is the only other Capability taken into account for the Capabilities of AIOps Vendors matrix. With Gartner being one of the most trusted Technology Research and Advisory companies, it is natural for decision makers to zero-in on one of these listed vendors. What is not immediately evident is that there is so much more to AIOps than just this, and with so much at stake, companies need to do their homework and take informed decisions before finalizing their vendor.
  • Most AIOps vendors ingest, provide access to & store heterogenous data for analysis, and provide actionable Insights and RCA; at which point the IT team takes over. This is a huge leap forward, since it helps IT work through the data clutter and significantly reduces MTTR. But, due to the absence of comprehensive Predictive, Prescriptive & Remediation features, these are not end-to-end AIOps Platforms.
  • At the bleeding edge of the Capability Spectrum is Auto-Remediation based on Predictive & Prescriptive insights. A Comprehensive end-to-end AIOps Platform would need to provide a Virtual Engineer for Auto-Remediation. But, this is a grey area not fully catered to by AIOps vendors.  

The big question now is, if an AIOps Platform requires human intervention or multiple external tools to take care of different missing aspects, can it rightfully claim to be true end-to-end AIOps?

So, what do we do?

Time for you to sit back and relax! Introducing ZIF- One Solution for all your ITOps ills!

We have you completely covered with the full suite of tools that an IT infrastructure team would need. We deliver the entire AIOps Capability spectrum and beyond.

ZIF (Zero Incident Framework™) is an AIOps based TechOps platform that enables proactive Detection and Remediation of incidents helping organizations drive towards a Zero Incident Enterprise™.

The Key Differentiator is that ZIF is a Pure-play AI Platform powered by Unsupervised Pattern-based Machine Learning Algorithms. This is what sets us a Class Apart.

  • Rightly aligns with the Gartner AIOps strategy. ZIF is based on and goes beyond the AIOps framework
  • Huge Investments in developing various patented AI Machine Learning algorithms, Auto-Discovery modules, Agent & Agentless Application Monitoring tools, Network sniffers, Process Automation, Remediation & Orchestration capabilities to form Zero Incident Framework™
  • Powered entirely by Unsupervised Pattern-based Machine Learning Algorithms, ZIF needs no further human intervention and is completely Self-Reliant
  • Unsupervised ML empowers ZIF to learn autonomously, glean Predictive & Prescriptive Intelligence and even uncover Latent Insights
  • The 5 Modules can work together cohesively or as independent stand-alone components
  • Can be Integrated with existing Monitoring and ITSM tools, as required
  • Applies LEAN IT Principle and is on an ambitious journey towards FRICTIONLESS IT.

Realizing a Zero Incident EnterpriseTM

Optimizing ITOps for Digital Transformation

The key focus of Digital Transformation is removing procedural bottlenecks and bending the curve on productivity. As Chief Insights Officer, Forbes Media says, Digital Transformation is now “essential for corporate survival”.

Emerging technologies are enabling dramatic innovations in IT infrastructure and operations. It is no longer just about hardware, software, data centers, the cloud or the service desk; it is about backing business strategies. So, here are some reasons why companies should think about redesigning their IT services to embrace digital disruption.

DevOps for Agility

As companies move away from the traditional Waterfall model of software development and adopt Agile methodologies, IT infrastructure and operations also need to become agile and malleable. Agility has become indispensible to stay competitive in this era of dynamism and constant change. What started off as a set of software development methodologies has now permeated all aspects of an organization, ITOps being one of them. Development, QA and IT teams need to come out of their silos and work in tandem for constant productive collaboration, in what is termed DevOps.

Shorter development & deployment cycles have necessitated overall ITOps efficiency and among other things, IT enviroment provisioning to be on-demand and self-service. Provisioning needs to be automated and built into the CI/CD pipeline.  

Downtime Mitigation

With agility being the org-wide mantra, predictable IT uptime becomes a mandate. Outages incur a very high cost and adversely affect the pace of innovation. The average cost of unplanned application downtime for Fortune 1000 companies is anywhere between $1.25 billion to $2.5 billion, says a report by It further goes on to say that, infrastructure failure can cost the bottom line $100,000/hr and the cost of critical application failure is $500,000 to $1 million/hr.

ITOps must stay ahead of the game by eliminating outdated legacy systems, tools, technologies and workflows. End-to-end automation is key. IT needs to modernize its stack by zeroing-in on tools for Discovery of the complete IT landscape, Monitoring of devices, Analytics for noise reduction and event correlation, AI-based tools for RCA, incident Prediction and Auto-Remediation. All of this intelligent automation will help proactive response rather than a reactive response after the fact, when the damage has already been done.

Moving away from the shadows

Shadow IT, the use of technology outside the IT purview, is becoming a tacitly approved aspect of most modern enterprises. It is a result of proliferation of technology and the cloud offering easy access to applications and storage. Users of Shadow IT systems bypass the IT approval and provisioning process to use unauthorized technology, without the consent of the IT department. There are huge security and compliance risks waiting to happen if this sprawling syndrome is not reined in. To bring Shadow IT under control, the IT dept must first know about it. This is where automated Discovery tools bring in a lot of value by automating the process of application discovery and topology mapping.

Moving towards Hybrid IT

Hybrid IT means the use of an optimal, cost-effective mix of public & private clouds and on-premise systems that enable an infrastructure that is dynamic, on-demand, scalable, and composable. IT spend on datacenters is seeing a downward trend. Most organizations are thinking beyond traditional datacentres to options in the cloud. Colocation is an important consideration since it delivers better availability, energy and time savings, scalability and reduces the impact of network latency. Organizations are only keeping mission-critical processes that require close monitoring & control, on-premise.

Edge computing

Gartner defines edge computing as solutions that facilitate data processing at or near the source of data generation. With huge volumes of data being churned out at rapid rates, for instance by monitoring or IoT devices, it is highly inefficient to stream all this data to a centralized datacenter or cloud for processing. Organizations now understand the value in a decentralized approach to address modern digital infrastructure needs. Edge computing serves as the decentralized extension of the datacenter/cloud and addresses the need for localized computing power.


Cyber attacks are on the rise and securing networks and protecting data is posing big challenges. With Hybrid IT, IoT, Edge computing etc, extension of the IT footprint beyond secure enterprise boundaries has increased the number of attack target points manifold. IT teams need to be well versed with the nuances of security set-up in different cloud vendor environments. There is a lot of ambiguity in ownership of data integrity, in the wake of data being spread across on-premise, cloud environments, shared workstations and virtual machines. With Hybrid IT deployments, a comprehensive security plan regardless of the data’s location has gained paramount importance.

Upskilling IT Teams

With blurring lines between Dev and IT, there is increasing demand for IT professionals equipped with a broad range of cross-functional skills in addition to core IT competencies. With constant emergence of new technologies, there is usually not much clarity on the exact skillsets required by the IT team in an organization. More than expertise in one specific area, IT teams need to be open to continuous learning to adapt to changing IT environments, to close the skills gap and support their organization’s Digital Transformation goals.


AIOps – IT Infrastructure Services for the Digital Age

The IT infrastructure services landscape is undergoing a significant shift, driven by digitalization. As focus shifts from cost efficiency to digital enablement, organizations need to re-imagine the IT infrastructure services model to deliver the necessary back-end agility, flexibility, and fluidity. Automation, analytics, and Artificial Intelligence (AI) – comprising the “codifying elements” for driving AIOps – help drive this desired level of adaptability within IT infrastructure services. Automation, analytics, and AI – which together comprise the “codifying elements” for driving AIOps– help drive the desired level of adaptiveness within IT infrastructure services. Intelligent automation, leveraging analytics and ML, embeds powerful, real-time business and user context and autonomy into IT infrastructure services. Intelligent automation has made inroads in enterprises in the last two to three years, backed by a rapid proliferation and maturation of solutions in the market.

Artificial Intelligence Operations (AIOps) . Everest Group 2018 Report . IT Infrastructure

Benefits of codification of IT infrastructure services

Progressive leverage of analytics and AI, to drive an AIOps strategy, enables the introduction of a broader and more complex set of operational use cases into IT infrastructure services automation. As adoption levels scale and processes become orchestrated, the benefits potentially expand beyond cost savings to offer exponential value around user experience enrichment, services agility and availability, and operations resilience. Intelligent automation helps maximize value from IT infrastructure services by:

  1. Improving the end-user experience through contextual and personalized support
  2. Driving faster resolution of known/identified incidents leveraging existing knowledge, intelligent diagnosis, and reusable, automated workflows
  3. Avoiding potential incidents and improving business systems performance through contextual learning (i.e., based on relationships among systems), proactive health monitoring and anomaly detection, and preemptive healing

Although the benefits of intelligent automation are manifold, enterprises are yet to realize commensurate advantage from investments in infrastructure services codification. Siloed adoption, lack of well-defined change management processes, and poor governance are some of the key barriers to achieving the expected value.  The design should involve an optimal level of human effort/intervention targeted primarily at training, governing, and enhancing the system, rather than executing routine, voluminous tasks.  A phased adoption of automation, analytics, and AI within IT infrastructure services has the potential to offer exponential business value. However, to realize the full potential of codification, enterprises need to embrace a lean operating model, underpinned by a technology-agnostic platform. The platform should embed the codifying elements within a tightly integrated infrastructure services ecosystem with end-to-end workflow orchestration and resolution.

The market today has a wide choice of AIOps solutions, but the onus is on enterprises to select the right set of tools / technologies that align with their overall codification strategy.

Click here to read the complete whitepaper by Everest Group