Prediction for Business Service Assurance

Artificial Intelligence for IT operations or AIOps has exploded over the past few years. As more and more enterprises set about their digital transformation journeys, AIOps becomes imperative to keep their businesses running smoothly. 

AIOps uses several technologies like Machine Learning and Big Data to automate the identification and resolution of common Information Technology (IT) problems. The systems, services, and applications in a large enterprise produce volumes of log and performance data. AIOps uses this data to monitor the assets and gain visibility into the behaviour and dependencies among these assets.

According to a Gartner publication, the adoption of AIOps by large enterprises would rise to 30% by 2023.

ZIF – The ideal AIOps platform of choice

Zero Incident Framework™ (ZIF) is an AIOps-based TechOps platform that enables proactive detection and remediation of incidents, helping organizations drive towards a Zero Incident Enterprise™.

ZIF comprises 5 modules, as outlined below.

At the heart of ZIF lie its Analyze and Predict (A&P) modules, which are powered by Artificial Intelligence and Machine Learning techniques. From the business perspective, the primary goal of A&P is 100% availability of applications and business processes.

Let us understand more about the Predict module of ZIF.

Predictive Analytics is one of the main USPs of the ZIF platform. ZIF encompasses Supervised, Unsupervised and Reinforcement Learning algorithms for the realization of various business use cases (as shown below).

How does the Predict Module of ZIF work?

Through its data ingestion capabilities, the ZIF platform can receive and process all types of data (both structured and unstructured) from various tools in the enterprise. The data can relate to alerts, events, logs, device performance, device relationships, workload topologies, network topologies, etc. By analyzing all this data, the platform predicts the anomalies that can occur in the environment. These anomalies are presented as ‘Opportunity Cards’ so that suitable action can be taken ahead of time to prevent undesired incidents from occurring. Since this is ‘Proactive’ and not ‘Reactive’, it brings about a paradigm shift in any organization’s endeavour to achieve 100% availability of its enterprise systems and platforms. Predictions are made at multiple levels: application level, business process level, device level, etc.

Sub-functions of Prediction Module

How does the Predict module manifest to enterprise users of the platform?

The Predict module categorizes the opportunity cards into three swim lanes.

  1. Warning swim lane – Opportunity Cards that have an “Expected Time of Impact” (ETI) beyond 60 minutes.
  2. Critical swim lane – Opportunity Cards that have an ETI within 60 minutes.
  3. Processed / Lost – Opportunity Cards that have been processed or lost without taking any action.
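The swim-lane rule above can be sketched in a few lines of Python. This is only an illustration of the 60-minute ETI cut-off described in the list. The OpportunityCard class, its field names, and the choice to treat a card with a non-positive ETI as lost are assumptions made for the example, not ZIF's actual data model.

```python
from dataclasses import dataclass
from typing import Optional

CRITICAL_WINDOW_MINUTES = 60  # cut-off taken from the swim-lane definitions above


@dataclass
class OpportunityCard:
    # Hypothetical structure; field names are illustrative, not ZIF's schema.
    card_id: str
    eti_minutes: Optional[float]  # Expected Time of Impact, in minutes from now
    closed: bool = False          # True once the card is processed or has lapsed


def swim_lane(card: OpportunityCard) -> str:
    """Return the swim lane for a card using the 60-minute ETI rule."""
    # Assumption: a card whose ETI has already passed counts as lost.
    if card.closed or card.eti_minutes is None or card.eti_minutes <= 0:
        return "Processed / Lost"
    if card.eti_minutes <= CRITICAL_WINDOW_MINUTES:
        return "Critical"
    return "Warning"


cards = [
    OpportunityCard("disk-usage", eti_minutes=240),
    OpportunityCard("db-latency", eti_minutes=35),
    OpportunityCard("cpu-spike", eti_minutes=10, closed=True),
]
lanes = {card.card_id: swim_lane(card) for card in cards}
print(lanes)
```

In this sketch a card moves from Warning to Critical simply because its ETI shrinks as time passes, so the function would be re-evaluated whenever the board is refreshed.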

A few of the enterprises that have realized the power of ZIF’s Prediction Module

  • A manufacturing giant in the US
  • A large non-profit mental health and social service provider in New York
  • A large mortgage loan service provider in the US
  • Two of the largest private sector banks in India

For more detailed information on GAVS’ Predict, or to request a demo, please visit https://zif.ai/products/predict/

References: https://www.gartner.com/smarterwithgartner/how-to-get-started-with-aiops/

About the Author:

Vasudevan Gopalan

Vasu heads the Engineering function for A&P. He is a Digital Transformation leader with ~20 years of IT industry experience spanning Product Engineering, Portfolio Delivery, Large Program Management, etc. Vasu has designed and delivered Open Systems, Core Banking, Web / Mobile Applications, etc.

Outside of his professional role, Vasu enjoys playing badminton and focusses on fitness routines.

Discover, Monitor, Analyze & Predict COVID-19

“Uber, the world’s largest taxi company, owns no vehicles. Facebook, the world’s most popular media owner, creates no content. Alibaba, the most valuable retailer, has no inventory. Netflix, the world’s largest movie house, owns no cinemas. And Airbnb, the world’s largest accommodation provider, owns no real estate. Something interesting is happening.”

– Tom Goodwin, an executive at the French media group Havas.

This new breed of companies is the fastest growing in history because they own the customer interface layer. It is the platform where all the value and profit is. “Platform business” is a more wholesome term for this model, for which data is the fuel; Big Data and AI/ML technologies are the harbingers of new waves of productivity growth and innovation.

With Big Data and AI/ML making a big difference in the area of public health, let’s see how they are helping us tackle the global emergency of the coronavirus disease, formally known as COVID-19.

“With rapidly spreading disease, a two-week lag is an eternity.”

DISCOVERING/ DETECTING

Chinese technology giant Alibaba has developed an AI system for detecting COVID-19 in CT scans of patients’ chests with 96% accuracy against viral pneumonia cases. The AI takes only 20 seconds to decide, whereas humans generally take about 15 minutes to diagnose the illness, as there can be upwards of 300 images to evaluate. The system was trained on images and data from 5,000 confirmed coronavirus cases and has been tested in hospitals throughout China. Per a report, at least 100 healthcare facilities are currently employing Alibaba’s AI to detect COVID-19.

Ping An Insurance (Group) Company of China, Ltd (Ping An) aims to address the issue of lack of radiologists by introducing the COVID-19 smart image-reading system. This image-reading system can read the huge volumes of CT scans in epidemic areas.

Ping An Smart Healthcare uses clinical data to train the AI model of the COVID-19 smart image-reading system. The AI analysis engine conducts a comparative analysis of multiple CT scan images of the same patient and measures the changes in lesions. It helps in tracking the development of the disease, evaluating the treatment and making a prognosis for patients. Ultimately, it assists doctors in diagnosing, triaging and evaluating COVID-19 patients swiftly and effectively.

Ping An Smart Healthcare’s COVID-19 smart image-reading system also supports remote AI image-reading by medical professionals outside the epidemic areas. Since its launch, the smart image-reading system has provided services to more than 1,500 medical institutions. More than 5,000 patients have received smart image-reading services for free.

The more solutions the better. At least when it comes to helping overwhelmed doctors provide better diagnoses and, thus, better outcomes.

MONITORING

  • AI based Temperature monitoring & scanning

In Beijing, China, subway passengers are being screened for symptoms of coronavirus, but not by health authorities. Instead, artificial intelligence is in charge.

Two Chinese AI giants, Megvii and Baidu, have introduced temperature-scanning. They have implemented scanners to detect body temperature and send alerts to company workers if a person’s body temperature is high enough to constitute a fever.

Megvii’s AI system detects body temperatures for up to 15 people per second and at distances of up to 16 feet. It monitors as many as 16 checkpoints in a single station. The system integrates body detection, face detection, and dual sensing via infrared cameras and visible light. The system can accurately detect and flag high body temperature even when people are wearing masks or hats, or covering their faces with other items. Megvii’s system also sends alerts to an on-site staff member.

Baidu, one of the largest search-engine companies in China, screens subway passengers at the Qinghe station with infrared scanners. It also uses a facial-recognition system, taking photographs of passengers’ faces. If the Baidu system detects a body temperature of at least 99 degrees Fahrenheit, it sends an alert to a staff member for another screening. The technology can scan the temperatures of more than 200 people per minute.

  • AI based Social Media Monitoring

An international team is using machine learning to scour social media posts, news reports, data from official public health channels, and information supplied by doctors for warning signs of the virus across geographies. The program is looking for social media posts that mention specific symptoms, like respiratory problems and fever, from a geographic area where doctors have reported potential cases. Natural language processing is used to parse the text posted on social media, for example, to distinguish between someone discussing the news and someone complaining about how they feel.

The approach has proven capable of spotting a coronavirus needle in a haystack of big data. This technique could help experts learn how the virus behaves. It may be possible to determine the age, gender, and location of those most at risk quicker than using official medical sources.

PREDICTING

Data from hospitals, airports, and other public locations are being used to predict disease spread and risk. Hospitals can also use the data to plan for the impact of an outbreak on their operations.

Kalman Filter

The Kalman filter was pioneered by Rudolf Emil Kalman in 1960, and was originally designed and developed to solve the navigation problem in the Apollo Project. Since then, it has been applied to numerous cases such as guidance, navigation, and control of vehicles, object tracking in computer vision, trajectory optimization, time series analysis in signal processing, econometrics and more.

The Kalman filter is a recursive algorithm that takes a series of measurements observed over time, containing statistical noise, and produces estimates of unknown variables.

The Kalman filter can be used for the one-day prediction, while the long-term forecast uses a linear model whose main features are the Kalman predictions, the infection rate relative to population, time-dependent features, and weather history and forecasts.

The one-day Kalman prediction is very accurate and powerful, while prediction over a longer period is more challenging but provides a future trend. Long-term prediction does not guarantee full accuracy but provides a fair estimate following the recent trend. The model should be re-run daily to obtain better results.
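To make the recursive predict-and-update idea concrete, here is a minimal sketch of a scalar Kalman filter in Python. It does not reproduce the model in the repository linked below. It assumes the simplest possible state model (a random walk), and the sample counts and the two variance values are invented for illustration.

```python
def kalman_one_step(measurements, process_var=1.0, measurement_var=4.0):
    """Scalar Kalman filter with a random-walk state model.

    Returns the filtered estimates and the one-step-ahead prediction.
    process_var and measurement_var are illustrative guesses, not tuned values.
    """
    estimate = float(measurements[0])  # initial state guess
    error_var = measurement_var        # initial uncertainty of that guess
    filtered = [estimate]
    for z in measurements[1:]:
        # Predict: a random walk keeps the estimate, but uncertainty grows.
        error_var += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= (1.0 - gain)
        filtered.append(estimate)
    # Under a random walk, the next-step prediction equals the last estimate.
    return filtered, estimate


daily_counts = [10, 12, 11, 13, 15, 14]  # made-up noisy daily observations
smoothed, next_day = kalman_one_step(daily_counts)
print(round(next_day, 2))
```

A larger measurement_var makes the filter trust new observations less and smooth more. A larger process_var makes it follow recent changes more closely.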

GitHub Link: https://github.com/Rank23/COVID19

ANALYZING

The Center for Systems Science and Engineering at Johns Hopkins University has developed an interactive, web-based dashboard that tracks the status of COVID-19 around the world. The resource provides a visualization of the location and number of confirmed COVID-19 cases, deaths and recoveries for all affected countries.

The primary data source for the tool is DXY, a Chinese platform that aggregates local media and government reports to provide COVID-19 cumulative case totals in near real-time at the province level in China and country level otherwise. Additional data comes from Twitter feeds, online news services and direct communication sent through the dashboard. Johns Hopkins then confirms the case numbers with regional and local health departments. This kind of Data analytics platform plays a pivotal role in addressing the coronavirus outbreak.

All data from the dashboard is also freely available in the following GitHub repository.

GitHub Link: https://bit.ly/2Wmmbp8

Mobile version: https://bit.ly/2WjyK4d

Web version: https://bit.ly/2xLyT6v

Conclusion

One of AI’s core strengths when working on identifying and limiting the effects of virus outbreaks is its incredibly persistent nature. AI systems never tire, can sift through enormous amounts of data, and can identify possible correlations and causations that humans can’t.

However, there are limits to AI’s ability to both identify virus outbreaks and predict how they will spread. Perhaps the best-known example comes from the neighboring field of big data analytics. At its launch, Google Flu Trends was heralded as a great leap forward in identifying and estimating the spread of the flu, until it overestimated the peak of the 2013 flu season by a whopping 140 percent and was quietly put to rest. Poor data quality was identified as one of the main reasons Google Flu Trends failed. Unreliable or faulty data can wreak havoc on the predictive power of AI.

About the Author:

Bargunan Somasundaram

Bargunan is a Big Data Engineer and a programming enthusiast. His passion is to share his knowledge by writing about his experiences. He believes “Gaining knowledge is the first step to wisdom and sharing it is the first step to humanity.”

Understanding Reinforcement Learning in five minutes

Reinforcement learning (RL) is an area of Machine Learning (ML) concerned with taking suitable actions to maximize reward in a given situation. The goal of reinforcement learning algorithms is to find the best possible action to take in a specific situation. Just like the human brain, an RL agent is rewarded for good choices, penalized for bad choices, and learns from each choice. RL tries to mimic the way that humans learn new things, not from a teacher but via interaction with the environment. In the end, the RL agent learns to achieve a goal in an uncertain, potentially complex environment.

Understanding Reinforcement Learning

How does one learn cycling? How does a baby learn to walk? How do we become better at doing something with more practice? Let us explore learning to cycle to illustrate the idea behind RL.

Did somebody tell you how to cycle or give you steps to follow? Or did you learn it by spending hours watching videos of people cycling? All these will surely give you an idea about cycling; but will it be enough to actually get you cycling? The answer is no. You learn to cycle only by cycling (action). You go through trial and error (practice), with all its positive experiences (positive rewards) and negative experiences (negative rewards or punishments), before getting your balance and control right (maximum reward or best outcome). This analogy of how our brain learns cycling applies to reinforcement learning. Through trials, errors, and rewards, it finds the best course of action.

Components of Reinforcement Learning

The major components of RL are as detailed below:

  • Agent: The agent is the part of RL which takes actions, receives rewards for those actions and gets a new environment state as a result of the action taken. In the cycling analogy, the agent is the human brain that decides what action to take and gets rewarded (falling is negative and riding is positive).
  • Environment: The environment represents the outside world (only the relevant part of the world which the agent needs to know about to take actions) that interacts with the agent. In the cycling analogy, the environment is the cycling track and the objects as seen by the rider.
  • State: The state is the condition or position the agent is currently in. In the cycling analogy, it will be the speed of the cycle, the tilt of the handle, the tilt of the cycle, etc.
  • Action: What the agent does while interacting with the environment is referred to as an action. In the cycling analogy, it will be to pedal harder (if the decision is to increase speed), apply brakes (if the decision is to reduce speed), tilt the handle, tilt the body, etc.
  • Rewards: The reward is an indicator to the agent of how good or bad the action taken was. In the cycling analogy, it can be +1 for not falling, -10 for hitting obstacles and -100 for falling. The rewards for outcomes (+1, -10, -100) are defined while building the RL agent. Since the agent wants to maximize rewards, it avoids hitting obstacles and always tries to avoid falling.
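The five components above can be seen working together in a small tabular Q-learning sketch. The toy environment below is invented for this illustration: the state is the bicycle's tilt, the actions are lean left, hold, or lean right, a random gust disturbs the tilt at every step, and the rewards reuse the +1 (still riding) and -100 (falling) values from the analogy. The -10 obstacle reward is left out to keep the example short. Q-learning is one common RL algorithm, chosen here as an example.

```python
import random

ACTIONS = (-1, 0, 1)  # lean left, hold, lean right
MAX_TILT = 2          # |tilt| beyond this means the rider falls
STATES = range(-MAX_TILT, MAX_TILT + 1)


def step(tilt, action, rng):
    """Environment: apply the action plus a random gust; return (next_tilt, reward, done)."""
    next_tilt = tilt + action + rng.choice((-1, 0, 1))
    if abs(next_tilt) > MAX_TILT:
        return next_tilt, -100, True  # falling
    return next_tilt, 1, False        # still riding


def train(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Agent: learn action values Q(state, action) by trial and error."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        tilt = rng.randint(-MAX_TILT, MAX_TILT)  # each ride starts at a random tilt
        for _ in range(50):                      # cap the length of a ride
            if rng.random() < epsilon:           # explore occasionally
                action = rng.choice(ACTIONS)
            else:                                # otherwise exploit what is known
                action = max(ACTIONS, key=lambda a: q[(tilt, a)])
            next_tilt, reward, done = step(tilt, action, rng)
            best_next = 0.0 if done else max(q[(next_tilt, a)] for a in ACTIONS)
            # Q-learning update: move the estimate towards reward + discounted future value.
            q[(tilt, action)] += alpha * (reward + gamma * best_next - q[(tilt, action)])
            if done:
                break
            tilt = next_tilt
    return q


q_table = train()
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in STATES}
print(policy)
```

Nobody tells the agent that leaning against the tilt is the right move. It discovers this only because falling is penalized heavily and staying upright is rewarded.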

Characteristics of Reinforcement Learning

Instead of simply scanning the datasets to find a mathematical equation that can reproduce historical outcomes like other Machine Learning techniques, reinforcement learning is focused on discovering the optimal actions that will lead to the desired outcome.

There are no supervisors to guide the model on how well it is doing. The RL agent gets a scalar reward and tries to figure out how good the action was.

Feedback is delayed. The agent gets an instant reward for an action; however, the long-term effect of an action is known only later. A move in chess, for example, may seem good at the time it is made, but may turn out to be a bad long-term move as the game progresses.

Time matters (sequential). People who are familiar with supervised and unsupervised learning will know that the sequence in which data is used for training does not matter for the outcome. However, for RL, since the action and reward at the current state influence future states and actions, the time and sequence of data matter.

Actions affect the subsequent data the RL agent receives.
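The point about delayed feedback can be made concrete with the discounted return, a standard RL quantity that sums the rewards following an action, with each later reward weighted by a further factor of gamma. The short Python sketch below uses made-up reward sequences and an illustrative gamma of 0.9.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Two rides that look identical after the first two steps (+1 each),
# but one of them ends in a fall on the third step.
safe_ride = [1, 1, 1]
risky_ride = [1, 1, -100]
print(discounted_return(safe_ride), discounted_return(risky_ride))
```

The instant rewards of the first two steps are the same in both rides. Only the return over the whole sequence reveals that the second ride was a bad one.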

Why Reinforcement Learning

The types of problems that reinforcement learning solves are often beyond human capabilities, and can even be beyond the capabilities of other ML techniques. Besides, RL reduces the need for pre-collected training data, as the agent learns by interacting with the environment. This is a great advantage for problems where data availability or data collection is an issue.

Reinforcement Learning applications

RL is the darling of ML researchers now. It is advancing at an incredible pace to solve business and industrial problems, and is garnering a lot of attention due to its potential. Going forward, RL will be core to organizations’ AI strategies.

Reinforcement Learning at GAVS

Reinforcement Learning is core to GAVS’ AI strategy and is being actively pursued to power the IP-led AIOps platform, Zero Incident Framework™ (ZIF). We had our first success with RL by developing an RL agent for automated log rotation in servers.

References:

Reinforcement Learning: An Introduction second edition by Richard S. Sutton and Andrew G. Barto

https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf

About the Author:

Gireesh Sreedhar KP

Gireesh is a part of the projects run in collaboration with IIT Madras for developing AI solutions and algorithms. His interests include Data Science, Machine Learning, Financial markets, and Geo-politics. He believes that he is competing against himself to become better than who he was yesterday. He aspires to become a well-recognized subject matter expert in the field of Artificial Intelligence.

AI in Healthcare

The Healthcare Industry is going through a quiet revolution. Factors like disease trends, doctor demographics, regulatory policies, environment, technology etc. are forcing the industry to turn to emerging technologies like AI, to help adapt to the pace of change. Here, we take a look at some key use cases of AI in Healthcare.

Medical Imaging

The application of Machine Learning (ML) in Medical Imaging is showing highly encouraging results. ML is a subset of AI, where algorithms and models are used to help machines imitate the cognitive functions of the human brain and to also self-learn from their experiences.

AI can be gainfully used in the different stages of medical imaging: acquisition, image reconstruction, processing, interpretation, storage, data mining & beyond. The performance of ML computational models improves tremendously as they are exposed to more & more data, and this foundation on colossal amounts of data enables them to gradually better humans at interpretation. They begin to detect anomalies not perceptible to the human eye & not discernible to the human brain!

What goes hand-in-hand with data is noise. Noise creates artifacts in images and reduces their quality, leading to inaccurate diagnosis. AI systems work through the clutter and aid noise reduction, leading to better precision in diagnosis, prognosis, staging, segmentation and treatment.

At the forefront of this use case is radiogenomics: correlating cancer imaging features and gene expression. Needless to say, this will play a pivotal role in cancer research.

Drug Discovery

Drug Discovery is an arduous process that takes several years from the start of research to obtaining approval to market. Research involves laboring through copious amounts of medical literature to identify the dynamics between genes, molecular targets, pathways, and candidate compounds. Sifting through all of this complex data to arrive at conclusions is an enormous challenge. When this voluminous data is fed to ML computational models, relationships can be established far more reliably. AI powered by domain knowledge is slashing the time & cost involved in new drug development.

Cybersecurity in Healthcare

Data security is of paramount importance to Healthcare providers, who need to ensure confidentiality, integrity, and availability of patient data. With cyberattacks increasing in number and complexity, these formidable threats are giving security teams sleepless nights! The main strength of AI is its ability to curate massive quantities of data (here, threat intelligence), nullify the noise, provide instant insights & self-learn in the process. The Predictive & Prescriptive capabilities of these computational models drastically reduce response time.

Virtual Health assistants

Virtual Health assistants like Chatbots give patients 24/7 access to critical information, in addition to offering services like scheduling health check-ups or setting up appointments. AI-based platforms for wearable health devices and health apps come armed with loads of features to monitor health signs, daily activities, diet, sleep patterns, etc. and provide alerts for immediate action or suggest personalized plans to enable healthy lifestyles.

AI for Healthcare IT Infrastructure

Healthcare IT Infrastructure, running the critical applications that enable patient care, is the heart of a Healthcare provider. With dynamically changing IT landscapes that are distributed, hybrid & on-demand, IT Operations teams are finding it hard to keep up. Artificial Intelligence for IT Ops (AIOps) is poised to fundamentally transform the Healthcare Industry. It is powering Healthcare Providers across the globe, who are adopting it to Automate, Predict, Remediate & Prevent Incidents in their IT Infrastructure. GAVS’ Zero Incident Framework™ (ZIF), an AIOps Platform, is a pure-play AI platform based on unsupervised Machine Learning and comes with the full suite of tools an IT Infrastructure team would need.

Analyze

Have you heard of AIOps?

Artificial intelligence for IT operations (AIOps) is an umbrella term for the application of Big Data Analytics, Machine Learning (ML) and other Artificial Intelligence (AI) technologies to automate the identification and resolution of common Information Technology (IT) problems. The systems, services and applications in a large enterprise produce immense volumes of log and performance data. AIOps uses this data to monitor the assets and gain visibility into the working behaviour and dependencies between these assets.

According to a Gartner study, the adoption of AIOps by large enterprises would rise to 30% by 2023.

ZIF – The ideal AIOps platform of choice

Zero Incident Framework™ (ZIF) is an AIOps-based TechOps platform that enables proactive detection and remediation of incidents, helping organizations drive towards a Zero Incident Enterprise™.

ZIF comprises 5 modules, as outlined below.

At the heart of ZIF lie its Analyze and Predict (A&P) modules, which are powered by Artificial Intelligence and Machine Learning techniques. From the business perspective, the primary goal of A&P is 100% availability of applications and business processes.

Come, let us understand more about the Analyze function of ZIF.

With Analyze having a Big Data platform under its hood, volumes of raw monitoring data, both structured and unstructured, can be ingested and grouped to build linkages and identify failure patterns.

Data Ingestion and Correlation of Diverse Data

The module processes a wide range of data from varied data sources to break silos while providing insights, exposing anomalies and highlighting risks across the IT landscape. It increases productivity and efficiency through actionable insights.

  • 100+ connectors for leading tools, environments and devices
  • Correlation and aggregation methods uncover patterns and relationships in the data

Noise Nullification

The module eliminates duplicate incidents, false positives and any alerts that are insignificant. This also helps reduce the Mean-Time-To-Resolution and the event-to-incident ratio.

  • Deep learning algorithms isolate events that have the potential to become incidents along with their potential criticality
  • Correlation and Aggregation methods group alerts and incidents that are related and need a common remediation
  • Reinforcement learning techniques are applied to find and eliminate false positives and duplicates
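As a much simpler stand-in for the learning-based techniques listed above, the Python sketch below shows the basic idea of duplicate suppression with a plain time-window rule. It is not ZIF's method. The alert fields (host, message, time) and the five-minute window are assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def deduplicate(alerts, window=timedelta(minutes=5)):
    """Keep an alert only if no alert with the same (host, message) was kept within `window`."""
    last_kept = {}  # (host, message) -> time of the last alert that was kept
    kept = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        key = (alert["host"], alert["message"])
        previous = last_kept.get(key)
        if previous is None or alert["time"] - previous > window:
            kept.append(alert)
            last_kept[key] = alert["time"]
    return kept


t0 = datetime(2020, 1, 1, 9, 0)
raw_alerts = [
    {"host": "web-1", "message": "disk full", "time": t0},
    {"host": "web-1", "message": "disk full", "time": t0 + timedelta(minutes=1)},
    {"host": "db-1", "message": "disk full", "time": t0 + timedelta(minutes=2)},
    {"host": "web-1", "message": "disk full", "time": t0 + timedelta(minutes=10)},
]
unique_alerts = deduplicate(raw_alerts)
print(len(raw_alerts), "->", len(unique_alerts))
```

A fixed rule like this cannot tell a harmless repeat from a genuinely new occurrence, which is the gap the learning-based methods described above are meant to close.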

Event Correlation

Data from various sources are ingested real-time into ZIF either by push or pull mechanism. As the data is ingested, labelling algorithms are run to label the data based on identifiers. The labelled data is passed through the correlation engine where unsupervised algorithms are run to mine the patterns. Sub-sequence mining algorithms help in identifying unique patterns from the data.

Unique patterns identified are clustered using clustering algorithms to form cases. Every case that is generated is marked by a unique case id. As part of the clustering process, seasonality aspects are checked from historical transactions to derive higher accuracy of correlation.

Correlation is done based on pattern recognition, helping to eliminate the need for a relational CMDB in the enterprise. The accuracy of the correlation increases as patterns recur. The algorithms can also unlearn patterns based on feedback provided through actions taken on the correlations. As these are unsupervised algorithms, the patterns are learnt with zero human intervention.
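The sub-sequence mining step can be illustrated with a small Python sketch that counts how often short runs of event labels recur across event streams. It is a simplification under stated assumptions: the event labels are invented, only contiguous subsequences of a fixed length are considered, and the later clustering into cases is not shown.

```python
from collections import Counter


def frequent_subsequences(sequences, length=2, min_support=2):
    """Find contiguous label subsequences that occur in at least `min_support` sequences."""
    support = Counter()
    for seq in sequences:
        # Use a set so that each pattern is counted once per sequence.
        seen = {tuple(seq[i:i + length]) for i in range(len(seq) - length + 1)}
        support.update(seen)
    return {pattern: count for pattern, count in support.items() if count >= min_support}


# Hypothetical labelled event streams, one list per time window.
event_windows = [
    ["disk_high", "db_slow", "app_timeout"],
    ["cpu_high", "disk_high", "db_slow"],
    ["db_slow", "app_timeout"],
    ["net_flap"],
]
patterns = frequent_subsequences(event_windows)
print(patterns)
```

Patterns that keep recurring, such as a disk alert followed by a slow database, are candidates for being grouped into a single case rather than raised as separate incidents.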

Accelerated Root Cause Analysis (RCA)

The Analyze module helps in identifying the root causes of incidents even when they occur in different silos. The combination of correlation algorithms with unsupervised deep learning techniques aids in accurately nailing down the root causes of incidents/problems. Learnings from historical incidents are also applied to find root causes in real time. The platform retraces the user journeys step-by-step to identify the exact point where an error occurs.

Customer Success Story – How ZIF’s A&P transformed IT Operations of a Manufacturing Giant

  • Seamless end-to-end monitoring – OS, DB, Applications, Networks
  • Helped achieve more than 50% noise reduction in 6 months
  • Reduced P1 incidents by ~30% through dynamic and deep monitoring
  • Achieved declining trend of MTTR and an increasing trend of Availability
  • Resulted in optimizing command centre/operations head count by ~50%
  • Resulted in ~80% reduction in operations TCO

For more detailed information on GAVS’ Analyze, or to request a demo, please visit zif.ai/products/analyze

References: www.gartner.com/smarterwithgartner/how-to-get-started-with-aiops

ABOUT THE AUTHOR

Vasudevan Gopalan


Vasu heads the Engineering function for A&P. He is a Digital Transformation leader with ~20 years of IT industry experience spanning Product Engineering, Portfolio Delivery, Large Program Management, etc. Vasu has designed and delivered Open Systems, Core Banking, Web / Mobile Applications, etc.

Outside of his professional role, Vasu enjoys playing badminton and focusses on fitness routines.

Monitoring for Success

Do you know if your end users are happy?

(In the context of users of Applications (desktop, web or cloud-based), Services, Servers and components of IT environment, directly or indirectly.)

The question may sound trivial, but it has a significant impact on the success of a company. The user experience is a journey, from the time they use the application or service, till after they complete the interaction. Experience can be determined based on factors like Speed, Performance, Flawlessness, Ease of use, Security, Resolution time, among others. Hence, monitoring the ‘Wow’ & ‘Woe’ moments of the users is vital.

Monitor is a component of GAVS’ AIOps Platform, Zero Incident Framework™ (ZIF). One of the key objectives of the Monitor platform is to measure and improve end-user experience. This component monitors, in real time, all the layers that are involved in the user experience (including but not limited to application, database, server, APIs, end-points, and network devices). Ultimately, this helps to drive the environment towards Zero Incidents.

This figure shows the capability of ZIF monitoring, which cuts across all layers from end-user to storage, and how it is linked to the other components of the platform.

Key features of ZIF Monitor are:

  • Unified solution for all IT environment monitoring needs: The platform covers the end-to-end monitoring of an IT landscape. The key focus is to ensure all verticals of IT are brought under thorough monitoring. The deeper the monitoring, the closer an organization is to attaining a Zero Incident Enterprise™.
  • Agents with self-intelligence: The intelligent agents capture various health parameters about the environment. When the target environment is already running low on resources, the agent will not burden it with more load. It will collect the health-related metrics and communicate them through the telemetry channel efficiently and effectively. The intelligence is applied in terms of the parameters to be collected, the period of collection and more.
  • Depth of monitoring: The core strength of Monitor is that it comes with a list of performance counters which are defined by SMEs across all layers of the IT environment. This is a key differentiator; the monitoring parameters can be dynamically configured for the target environment. Parameters can be added or removed on a need basis.
  • Agent & Agentless (Remote): Customers can choose between Agent & Agentless options for the solution. The remote solution is called the Centralized Remote Monitoring Solution (CRMS). Each monitoring parameter can be remotely controlled and defined from the CRMS. Even the agents that are running in the target environment can be controlled from the server console.
  • Compliance: Monitor plays a key role in the compliance of the environment. Compliance ranges from ensuring the availability of necessary services and processes in the target environment to defining the standard for which Application, Make, Version, Provider, Size, etc. are allowed in the target environment.
  • Auto discovery: Monitor can auto-discover the newer elements (servers, endpoints, databases, devices, etc.) that are getting added to the environment. It can automatically add those newer elements into the purview of monitoring.
  • Auto scale: Centralized Remote Monitoring Solution (CRMS) can auto-scale on its own when newer elements are added for monitoring through auto-discovery. The auto scale includes various aspects, like load on channel, load on individual polling engine, and load on each agentless solution.
  • Real time user & Synthetic Monitoring: Real-time user monitoring monitors the environment when the user is active. Synthetic monitoring works through simulated techniques. It doesn’t wait for the user to make a transaction or use the system. Instead, it simulates the scenario and provides insights to make decisions proactively.
  • Availability & status of devices connected: Monitor also includes the monitoring of availability and control of USB and COM port devices that are connected.
  • Black box monitoring: It is not always possible to instrument the application to get insights. Hence, the Black Box technique is used. Here the application is treated as a black box and it is monitored in terms of its interaction with the Kernel & OS through performance counters.

High-level overview of Monitor’s components:

  • Agents, Agentless: These are the means through which monitoring is done at the target environment, like user devices, servers, network devices, load balancers, virtualized environment, API layers, databases, replications, storage devices, etc.
  • ZIF Telemetry Channel: The performance telemetry collected from source to target is passed through this channel to the big data platform.
  • Telemetry Data: Refers to the performance data and other metrics collected from all over the environment.
  • Telemetry Database: This is the big data platform in which the telemetry data from all sources is captured and stored.
  • Intelligence Engine: This parses the telemetry data in near real time and raises notifications based on rule-based thresholds as well as dynamic thresholds.
  • Dashboard & Alerting Mechanism: These are the means through which the results of monitoring are conveyed, as metrics in the dashboard as well as notifications.
  • Integration with Analyze, Predict & Remediate components: The Monitor module communicates the telemetry to the Analyze & Predict components of the ZIF platform, which use the data for analysis and apply Machine Learning for prediction. Both the Monitor & Predict components communicate with the Remediate platform to trigger remediation.
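
The difference between a rule-based threshold and a dynamic threshold, as mentioned for the Intelligence Engine, can be illustrated with a short Python sketch. This is an assumption about one common approach (a rolling mean plus k standard deviations), not a description of how ZIF computes its thresholds; the function names and sample values are made up for illustration.

```python
from statistics import mean, stdev
from typing import List, Sequence


def static_breaches(values: Sequence[float], limit: float) -> List[int]:
    """Rule-based threshold: flag every index whose value exceeds a fixed limit."""
    return [i for i, v in enumerate(values) if v > limit]


def dynamic_breaches(values: Sequence[float], window: int = 5, k: float = 3.0) -> List[int]:
    """Dynamic threshold: flag values more than k standard deviations
    above the mean of the preceding `window` samples."""
    breaches = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        threshold = mean(history) + k * stdev(history)
        if values[i] > threshold:
            breaches.append(i)
    return breaches


if __name__ == "__main__":
    cpu = [20, 22, 21, 23, 22, 21, 60, 22, 23, 21]  # illustrative CPU % samples
    print(static_breaches(cpu, limit=90))  # the fixed limit misses the spike
    print(dynamic_breaches(cpu))           # the rolling baseline catches it
```

The fixed limit of 90% never fires on this data, while the rolling baseline flags the jump to 60% because it is far outside recent behaviour.
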

The Monitor component works in tandem with the Analyze, Predict and Remediate components of the ZIF platform to achieve an incident-free IT environment. Implementation of ZIF is the right step towards driving an enterprise to Zero Incidents. ZIF is the only platform in the industry that comes from a single product platform owner who owns the end-to-end IP of the solution, with products developed from scratch.

For more detailed information on GAVS’ Monitor, or to request a demo please visit zif.ai/products/monitor/

(To be continued…)

About the Author

Suresh Kumar Ramasamy


Suresh heads the Monitor component of ZIF at GAVS. He has 20 years of experience in Native Applications, Web, Cloud and Hybrid platforms, from Engineering to Product Management. He has designed and hosted monitoring solutions, and has been instrumental in bringing components together to structure the Environment Performance Management suite of ZIF Monitor.

Suresh enjoys playing badminton with his children. He is passionate about gardening, especially medicinal plants.

READ ALSO OUR NEW UPDATES

AIOps Demystified

IT Infrastructure has been on an incredibly fascinating journey from the days of mainframes housed in big rooms just a few decades ago, to mini computers, personal computers, client-servers, enterprise & mobile networks, virtual machines and the cloud! While mobile technologies have made computing omnipresent, the cloud coupled with technologies like virtual computing and containers has changed the traditional IT industry in unimaginable ways and has fuelled the rise of service-oriented architectures where everything is offered as a service and on-demand: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), DBaaS, MBaaS, SaaS and so on.

As companies try to grapple with this technology explosion, it is very clear that the first step has to be optimization of the IT infrastructure & operations. Efficient ITOps has become the foundation not just to aid transformational business initiatives, but even for basic survival in this competitive world.

The term AIOps was first coined by Gartner based on their research on Algorithmic IT Operations. Now, it refers to the use of Artificial Intelligence (AI) for IT Operations (Ops), which is the use of Big Data Analytics and AI technologies to optimize, automate and supercharge all aspects of IT Operations.

Why AI in IT operations?

The promise behind bringing AI into the picture has been to do what humans have been doing, but better, faster and at a much larger scale. Let’s delve into the different aspects of IT operations and see how AI can make a difference.

Visibility

The first step to effectively managing the IT landscape is to get complete visibility into it. Why is that so difficult? The sheer variety and volume of applications, users and environments make it extremely challenging to get a full 360 degree view of the landscape. Most organizations use applications that are web-based, virtually delivered, vendor-built, custom-made, synchronous/asynchronous/batch processing, written using different programming languages and/or for different operating systems, SaaS, running in public/private/hybrid cloud environments, multi-tenant, multiple instances of the same applications, multi-tiered, legacy, running in silos! Adding to this complexity is the rampant issue of shadow IT, which is the use of applications outside the purview of IT, triggered by the easy availability of and access to applications and storage on the cloud. And, that’s not all! After all the applications have been discovered, they need to be mapped to the topology, their performances need to be baselined and tracked, all users in the system have to be found and their user experiences captured.

The enormity of this challenge is now evident. AI powers auto-discovery of all applications, topology mapping, baselining response times and tracking all users of all these applications. Machine Learning algorithms aid in self-learning, unlearning and auto-correction to provide a highly accurate view of the IT landscape.
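
As a rough illustration of what "baselining response times" can mean in practice, the Python sketch below derives a baseline from historical samples using a nearest-rank percentile and flags a current measurement that exceeds it by a tolerance factor. The percentile choice, the tolerance and the function names are assumptions for illustration only.

```python
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile (pct is an integer from 1 to 100)."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct*n/100
    return ordered[rank - 1]


def is_degraded(current_ms: float, history_ms: Sequence[float],
                pct: int = 95, tolerance: float = 1.2) -> bool:
    """True if the current response time exceeds the historical baseline
    (the chosen percentile) by more than the tolerance factor."""
    baseline = percentile(history_ms, pct)
    return current_ms > baseline * tolerance


if __name__ == "__main__":
    history = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
    print(percentile(history, 90))
    print(is_degraded(400, history))
```
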

Monitoring

When the IT landscape has been completely discovered, the next step is to monitor the infrastructure and application stacks. Monitoring tools provide real-time data on their availability and performance based on relevant metrics.

The problem here is two-fold. Firstly, IT organizations typically need to rely on several monitoring tools that cater to the different environments/domains in the landscape. Since these tools work in silos, they give a very fractured view of the entire system, necessitating data correlation before it can be gainfully used for Root Cause Analysis (RCA) or actionable insights.

Pattern recognition-based learning from current and historical data helps correlate these seemingly independent events and thereby recognize and alert on deviations, performance degradations or capacity utilization bottlenecks in real time. This enables effective Root Cause Analysis (RCA) and reduces an important KPI, Mean Time to Identify (MTTI).

Secondly, there are colossal amounts of data in the form of logs, events and metrics pouring in at high velocity from all these monitoring tools, creating alert fatigue. This makes it almost impossible for the IT support team to check each event, correlate it with other events, tag and prioritize them, and plan remedial action.

Inherently, machines handle volume with ease and, when programmed with ML algorithms, learn to sift through all the noise and zero in on what is relevant. Noise nullification is achieved by the use of Deep Learning algorithms that isolate events that have the potential to become incidents, and Reinforcement Learning algorithms that find and eliminate duplicates and false positives. These capabilities help organizations bring dramatic improvements to another critical ITOps metric, Mean Time to Resolution (MTTR).
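
The learning-based noise reduction described above is beyond a short example, but the underlying goal of suppressing duplicate alerts can be shown with a much simpler, rule-based Python sketch. It is explicitly not Deep Learning or Reinforcement Learning: it only suppresses repeats of the same (host, check) pair that arrive within a time window. The `Alert` fields and the 300-second window are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since some reference point
    host: str
    check: str


def suppress_duplicates(alerts: Iterable[Alert], window_s: float = 300.0) -> List[Alert]:
    """Keep an alert only if the same (host, check) pair has not fired
    within the last `window_s` seconds."""
    last_seen: Dict[Tuple[str, str], float] = {}
    kept: List[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.host, alert.check)
        previous = last_seen.get(key)
        if previous is None or alert.timestamp - previous > window_s:
            kept.append(alert)
        # Update on every occurrence so a continuous storm stays suppressed.
        last_seen[key] = alert.timestamp
    return kept


if __name__ == "__main__":
    storm = [Alert(0, "db1", "cpu"), Alert(60, "db1", "cpu"),
             Alert(120, "db1", "cpu"), Alert(30, "web1", "disk"),
             Alert(1000, "db1", "cpu")]
    for alert in suppress_duplicates(storm):
        print(alert)
```

A learning-based system would go further, for example by grouping alerts from different hosts that share a root cause, but the effect on the operator is similar: fewer, more meaningful notifications.
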

Other areas of ITOps where AI brings a lot of value are Advanced Analytics (Predictive & Prescriptive) and Remediation.

Advanced Analytics

Unplanned IT outages result in huge financial losses for companies and, even worse, a sharp dip in customer confidence. One of the biggest value-adds of AI for ITOps, then, is in driving proactive operations that deliver superior user experiences with predictable uptime. Advanced Analytics on historical incident data identifies patterns, causes and situations in the entire stack (infrastructure, networks, services and applications) that lead to an outage. Multivariate predictive algorithms drive predictions of incident and service request volumes, spikes and lulls well in advance. AIOps tools forecast usage patterns and capacity requirements to enable planning, just-in-time procurement and staffing to optimize resource utilization. Reactive purchases after the fact can be very disruptive & expensive.
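
Capacity forecasting can be far more sophisticated than what fits here, but a minimal Python sketch conveys the idea: fit a straight line to recent daily usage and estimate how many days remain before capacity is reached. The article mentions multivariate predictive algorithms; this example deliberately swaps in a single-variable least-squares trend, and the function names and data are illustrative assumptions.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(usage_pct: Sequence[float],
                    capacity_pct: float = 100.0) -> Optional[float]:
    """Estimate days from the last sample until usage reaches capacity.
    Returns None when usage is flat or shrinking (needs at least two samples)."""
    days = list(range(len(usage_pct)))
    slope, intercept = fit_line(days, usage_pct)
    if slope <= 0:
        return None
    day_full = (capacity_pct - intercept) / slope
    return max(0.0, day_full - days[-1])


if __name__ == "__main__":
    disk_usage = [50, 52, 54, 56, 58]  # one sample per day, in percent
    print(days_until_full(disk_usage))
```

An estimate like this is what allows procurement to be planned ahead of time instead of after the outage.
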

Remediation

AI-powered remediation automates remedial workflows & service actions, saving a lot of manual effort and reducing errors, incidents and cost of operations. Use of chatbots provides round-the-clock customer support, guiding users to troubleshoot standard problems, and auto-assigns tickets to appropriate IT staff. Dynamic capacity orchestration based on predicted usage patterns and capacity needs induces elasticity and eliminates performance degradation caused by inefficient capacity planning.
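
The simplest form of automated remediation is a lookup from a detected issue type to a predefined runbook, with a fallback to a human when no runbook exists. The Python sketch below shows that pattern only; the issue names, runbook functions and the ticket fallback are hypothetical and do not describe any specific product.

```python
from typing import Callable, Dict

Runbook = Callable[[str], str]


def restart_service(host: str) -> str:
    # Placeholder: a real runbook would call an orchestration tool here.
    return f"restarted service on {host}"


def clear_temp_files(host: str) -> str:
    # Placeholder for a disk clean-up workflow.
    return f"cleared temp files on {host}"


# Hypothetical mapping of issue types to automated runbooks.
RUNBOOKS: Dict[str, Runbook] = {
    "service_down": restart_service,
    "disk_full": clear_temp_files,
}


def remediate(issue: str, host: str) -> str:
    """Run the matching runbook, or fall back to opening a ticket for IT staff."""
    runbook = RUNBOOKS.get(issue)
    if runbook is None:
        return f"ticket opened for {issue} on {host}"
    return runbook(host)


if __name__ == "__main__":
    print(remediate("disk_full", "web1"))
    print(remediate("unknown_error", "db1"))
```
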

Conclusion

The beauty of AIOps is that it gets better with age as the learning matures on exposure to more and more data. While AIOps is definitely a blessing for IT Ops teams, it is only meant to augment the human workforce and not to replace it entirely. And importantly, there is no one-size-fits-all approach to AIOps. Understanding current pain points and future goals and finding an AIOps vendor with relevant offerings is the cornerstone of a successful implementation.

GAVS’ Zero Incident Framework™ (ZIF) is an AIOps-based TechOps Platform that enables organizations to trend towards a Zero Incident Enterprise™. ZIF comes with an end-to-end suite of tools for ITOps needs. It is a pure-play AI Platform powered entirely by Unsupervised Pattern-based Machine Learning! You can learn more about ZIF or request a demo here.

READ ALSO OUR NEW UPDATES

Optimizing ITOps for Digital Transformation

The key focus of Digital Transformation is removing procedural bottlenecks and bending the curve on productivity. As the Chief Insights Officer of Forbes Media says, Digital Transformation is now “essential for corporate survival”.

Emerging technologies are enabling dramatic innovations in IT infrastructure and operations. It is no longer just about hardware, software, data centers, the cloud or the service desk; it is about backing business strategies. So, here are some reasons why companies should think about redesigning their IT services to embrace digital disruption.

DevOps for Agility

As companies move away from the traditional Waterfall model of software development and adopt Agile methodologies, IT infrastructure and operations also need to become agile and malleable. Agility has become indispensable to stay competitive in this era of dynamism and constant change. What started off as a set of software development methodologies has now permeated all aspects of an organization, ITOps being one of them. Development, QA and IT teams need to come out of their silos and work in tandem for constant productive collaboration, in what is termed DevOps.

Shorter development & deployment cycles have necessitated overall ITOps efficiency and, among other things, on-demand and self-service provisioning of IT environments. Provisioning needs to be automated and built into the CI/CD pipeline.

Downtime Mitigation

With agility being the org-wide mantra, predictable IT uptime becomes a mandate. Outages incur a very high cost and adversely affect the pace of innovation. The average cost of unplanned application downtime for Fortune 1000 companies is between $1.25 billion and $2.5 billion, says a report by DevOps.com. It goes on to say that infrastructure failure can cost the bottom line $100,000/hr and that the cost of critical application failure is $500,000 to $1 million/hr.

ITOps must stay ahead of the game by eliminating outdated legacy systems, tools, technologies and workflows. End-to-end automation is key. IT needs to modernize its stack by zeroing-in on tools for Discovery of the complete IT landscape, Monitoring of devices, Analytics for noise reduction and event correlation, AI-based tools for RCA, incident Prediction and Auto-Remediation. All of this intelligent automation will help proactive response rather than a reactive response after the fact, when the damage has already been done.

Moving away from the shadows

Shadow IT, the use of technology outside the IT purview, is becoming a tacitly approved aspect of most modern enterprises. It is a result of proliferation of technology and the cloud offering easy access to applications and storage. Users of Shadow IT systems bypass the IT approval and provisioning process to use unauthorized technology, without the consent of the IT department. There are huge security and compliance risks waiting to happen if this sprawling syndrome is not reined in. To bring Shadow IT under control, the IT dept must first know about it. This is where automated Discovery tools bring in a lot of value by automating the process of application discovery and topology mapping.

Moving towards Hybrid IT

Hybrid IT means the use of an optimal, cost-effective mix of public & private clouds and on-premise systems that enables an infrastructure that is dynamic, on-demand, scalable, and composable. IT spend on datacenters is seeing a downward trend. Most organizations are thinking beyond traditional datacenters to options in the cloud. Colocation is an important consideration since it delivers better availability, energy and time savings, and scalability, and reduces the impact of network latency. Organizations are keeping on-premise only those mission-critical processes that require close monitoring & control.

Edge computing

Gartner defines edge computing as solutions that facilitate data processing at or near the source of data generation. With huge volumes of data being churned out at rapid rates, for instance by monitoring or IoT devices, it is highly inefficient to stream all this data to a centralized datacenter or cloud for processing. Organizations now understand the value in a decentralized approach to address modern digital infrastructure needs. Edge computing serves as the decentralized extension of the datacenter/cloud and addresses the need for localized computing power.
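
One way to picture the efficiency argument is a device that summarizes raw readings locally and forwards only the summaries. The Python sketch below is illustrative only; the batch size, the summary fields and the function name are assumptions, and a real edge node would also handle buffering, transport and failures.

```python
from statistics import mean
from typing import Dict, List, Sequence


def summarize_at_edge(readings: Sequence[float], batch_size: int) -> List[Dict[str, float]]:
    """Reduce raw readings to one small summary per batch, so that only
    the summaries need to be sent to the central datacenter or cloud."""
    summaries = []
    for start in range(0, len(readings), batch_size):
        batch = readings[start:start + batch_size]
        summaries.append({
            "count": len(batch),
            "min": min(batch),
            "max": max(batch),
            "mean": mean(batch),
        })
    return summaries


if __name__ == "__main__":
    sensor = [float(v) for v in range(1, 11)]  # ten raw readings
    for summary in summarize_at_edge(sensor, batch_size=5):
        print(summary)  # two summaries are sent instead of ten readings
```
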

CyberSecurity

Cyber attacks are on the rise, and securing networks and protecting data pose big challenges. With Hybrid IT, IoT, Edge computing, etc., the extension of the IT footprint beyond secure enterprise boundaries has increased the number of attack target points manifold. IT teams need to be well versed in the nuances of security set-up in different cloud vendor environments. There is a lot of ambiguity in ownership of data integrity, with data being spread across on-premise systems, cloud environments, shared workstations and virtual machines. With Hybrid IT deployments, a comprehensive security plan regardless of the data’s location has gained paramount importance.

Upskilling IT Teams

With blurring lines between Dev and IT, there is increasing demand for IT professionals equipped with a broad range of cross-functional skills in addition to core IT competencies. With constant emergence of new technologies, there is usually not much clarity on the exact skillsets required by the IT team in an organization. More than expertise in one specific area, IT teams need to be open to continuous learning to adapt to changing IT environments, to close the skills gap and support their organization’s Digital Transformation goals.

READ ALSO OUR NEW UPDATES

What you need to know about AIOps?

Emergence of AIOps

AIOps has seen enormous growth in the last two years. It has successfully transitioned from an emergent category to an inevitability. Companies adopted AIOps to automate and improve IT operations by applying big data and machine learning (ML). Adoption of such technologies compelled IT operations to adapt to a multi-cloud infrastructure. According to Infoholic Research, the AIOps market is expected to grow at a CAGR of 33.08% during the forecast period 2018–2024.
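
To put a CAGR figure in perspective, the compounding arithmetic can be written out in a few lines of Python. The sketch assumes 2018 is the base year and that the rate compounds over six annual periods to 2024; the report's own convention may differ, and the function names are illustrative.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Total growth factor after compounding `cagr` for `years` periods."""
    return (1.0 + cagr) ** years


def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Inverse calculation: the annual rate implied by a start and end value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    multiple = growth_multiple(0.3308, 6)
    print(f"Market size multiple over the period: {multiple:.2f}x")
```

Under these assumptions, a 33.08% CAGR implies a market between five and six times its starting size by the end of the period.
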

What is AIOps?

AIOps broadly stands for Artificial Intelligence for IT Operations. With a combination of big data and ML, an AIOps platform improves IT operations and also takes over certain tasks, including tracking availability, event correlation, performance monitoring, IT service management and automation. Most of these technologies are well-defined and mature.

AIOps data originates from log files, metrics, monitoring tools, helpdesk ticketing and other sources. The platform sorts, manages and assimilates this data to provide insight into problem areas. The goal of AIOps is to analyze data and discover patterns that can predict potential incidents in the future.

Focus areas of AIOps

  • AIOps helps with open data access without letting organizational silos play a part in it.
  • AIOps upgrades data handling ability, which in turn widens the scope of data analysis.
  • It has a unique ability to stay aligned to organizational goals.
  • AIOps increases the scope of risk prediction.
  • It also reduces response time.

Impact of AI in IT operations

  • Capacity planning: AIOps can help in understanding workloads and planning configuration appropriately, leaving no scope for speculation.
  • Resource utilization: AIOps allows predictive scaling, where the auto-scale feature of cloud IaaS can adjust itself based on historical data.
  • Storage: AIOps helps in storage activity through disk calibration, reconfiguration and allocation of new storage volumes.
  • Anomaly detection: It can detect anomalies and critical issues faster and more accurately than humans, reducing potential threats and system downtime.
  • Threat management: It helps to analyze breaches in both internal and external environments.
  • Root-cause analysis: AIOps is effective in root-cause analysis, through which it reduces response time and applies a remedy once the issue is located.
  • Forecasting outages: Outage prediction is essential for the growth of IT operations. In fact, the market for forecasting outages through AIOps is expected to grow from $493.7 million to $1.14 billion between 2016 and 2021, based on industry reports.
  • Future innovation: AIOps has played a key role in automating a major chunk of IT operations in a massive way. It frees resources to focus on crucial things aligned to strategy and organizational goals.
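
The predictive scaling item in the list above can be sketched as follows: build an average load profile per hour of day from historical samples, then size the fleet for the expected load plus some headroom. This is a simplified assumption about how such a policy might look, not how any particular cloud auto-scaler works; the 20% headroom, the per-instance capacity and the function names are made up for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def hourly_profile(samples: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Average historical load (e.g. requests/s) for each hour of the day."""
    by_hour: Dict[int, List[float]] = defaultdict(list)
    for hour, load in samples:
        by_hour[hour].append(load)
    return {hour: sum(loads) / len(loads) for hour, loads in by_hour.items()}


def instances_needed(expected_load: float, per_instance_capacity: float,
                     headroom: float = 0.2) -> int:
    """Instances required to serve the expected load with spare headroom."""
    required = expected_load * (1.0 + headroom) / per_instance_capacity
    return max(1, math.ceil(required))


if __name__ == "__main__":
    history = [(9, 100.0), (9, 140.0), (3, 10.0)]  # (hour of day, requests/s)
    profile = hourly_profile(history)
    for hour in sorted(profile):
        print(hour, instances_needed(profile[hour], per_instance_capacity=50.0))
```

Scaling ahead of the expected peak, rather than reacting to it, is what distinguishes this from ordinary threshold-based auto-scaling.
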

Problems AIOps solved

The common issues AIOps solves to enable IT operations’ adoption of digitization are as follows:

  • It can gain access to large data sets across environments while maintaining data reliability for comprehensive analysis.
  • It simplifies data analysis through automation empowered by ML.
  • Through an accurate prediction mechanism, it can avoid costly downtime and improve customer satisfaction.
  • Through implementation of automation, manual tasks can be eliminated.
  • AIOps can improve teamwork and workflow activities between IT groups and other business units.

Peeping into the future

An AIOps platform acts as a foundation stone in projecting the future endeavors of organizations. It uses real-time analysis of data to provide insights that impact business decisions. Successful implementation of AIOps depends on key performance indicators (KPIs). It can also deliver predictive and proactive IT operations by reducing failures and the time spent on detection, investigation and resolution.

READ ALSO OUR NEW UPDATES