Understanding Reinforcement Learning in five minutes

Reinforcement learning (RL) is an area of Machine Learning (ML) concerned with taking suitable actions to maximize reward in a given situation. The goal of reinforcement learning algorithms is to find the best possible action to take in a specific situation. Just like the human brain, an RL agent is rewarded for good choices and penalized for bad ones, and it learns from each choice. RL tries to mimic the way humans learn new things: not from a teacher, but through interaction with the environment. Ultimately, an RL agent learns to achieve a goal in an uncertain, potentially complex environment.

Understanding Reinforcement Learning

How does one learn cycling? How does a baby learn to walk? How do we become better at doing something with more practice? Let us explore learning to cycle to illustrate the idea behind RL.

Did somebody tell you how to cycle, or give you steps to follow? Or did you learn by spending hours watching videos of people cycling? All of these will surely give you an idea of cycling, but will they be enough to actually get you cycling? The answer is no. You learn to cycle only by cycling (action). You go through trial and error (practice), with positive experiences (positive rewards) and negative experiences (negative rewards or punishments), until you get your balance and control right (maximum reward or best outcome). This analogy of how our brain learns cycling applies to reinforcement learning: through trials, errors, and rewards, the agent finds the best course of action.

Components of Reinforcement Learning

The major components of RL are as detailed below:

  • Agent: The agent is the part of an RL system that takes actions, receives rewards for those actions, and observes the new state of the environment that results. In the cycling analogy, the agent is the rider's brain, which decides what action to take and gets rewarded (falling is negative and riding is positive).
  • Environment: The environment represents the outside world (only the part of the world the agent needs to know about to take actions) that the agent interacts with. In the cycling analogy, the environment is the cycling track and the objects seen by the rider.
  • State: The state is the current situation of the agent. In the cycling analogy, it will be the speed of the cycle, the tilt of the handlebar, the tilt of the cycle, etc.
  • Action: What the agent does while interacting with the environment is referred to as an action. In the cycling analogy, it will be to pedal harder (if the decision is to increase speed), apply the brakes (if the decision is to reduce speed), tilt the handlebar, tilt the body, etc.
  • Rewards: A reward is a signal to the agent on how good or bad the action taken was. In the cycling analogy, it can be +1 for not falling, -10 for hitting an obstacle and -100 for falling; these reward values (+1, -10, -100) are defined by the designer while building the RL agent. Since the agent wants to maximize its total reward, it learns to avoid obstacles and, above all, to avoid falling.
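
The interaction between these components can be sketched as a loop: the agent picks an action, and the environment returns a new state and a reward. The toy environment below is a hypothetical illustration, not a real cycling model. The state is the cycle's tilt as an integer, a random wobble stands in for bumps on the track, a tilt of 3 or more in either direction counts as a fall, and the obstacle reward is omitted to keep it short. The policy is hand-written rather than learned; it only shows how state, action and reward flow between agent and environment.

```python
import random

# Reward values taken from the cycling analogy; the obstacle reward (-10) is omitted here.
REWARD_RIDING = 1
REWARD_FALL = -100


class ToyCyclingEnv:
    """Hypothetical toy environment: the state is the cycle's tilt, an integer.

    A tilt of 3 or more in either direction means the rider has fallen.
    """

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.tilt = 0

    def reset(self):
        self.tilt = 0
        return self.tilt

    def step(self, action):
        # action: -1 = lean left, 0 = hold steady, +1 = lean right
        wobble = self.rng.choice([-1, 0, 1])  # random disturbance, e.g. a bump
        self.tilt += action + wobble
        if abs(self.tilt) >= 3:
            return self.tilt, REWARD_FALL, True   # fell: the episode ends
        return self.tilt, REWARD_RIDING, False    # still riding


def balancing_policy(tilt):
    """Hand-written policy: lean against the current tilt."""
    if tilt > 0:
        return -1
    if tilt < 0:
        return 1
    return 0


def run_episode(env, policy, max_steps=50):
    """Run one episode and return the total reward collected."""
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = policy(state)                  # agent chooses an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
        if done:
            break
    return total_reward


print(run_episode(ToyCyclingEnv(seed=0), balancing_policy))  # 50: never falls
```

The balancing policy keeps the tilt between -1 and 1, so it earns +1 on every step, while a policy that always leans to one side eventually falls and collects the -100 penalty.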

Characteristics of Reinforcement Learning

Unlike other Machine Learning techniques, which scan datasets to find a mathematical function that can reproduce historical outcomes, reinforcement learning focuses on discovering the optimal actions that lead to the desired outcome.

There is no supervisor to tell the model how well it is doing. The RL agent receives only a scalar reward and has to figure out how good its action was.

Feedback is delayed. The agent receives an immediate reward for each action, but the long-term effect of an action becomes known only later. For example, a move in chess may seem good at the time it is made, but may turn out to be a bad move as the game progresses.

Time matters (sequential). People familiar with supervised and unsupervised learning will know that these methods typically treat training examples as independent, so the order in which data is used for training does not fundamentally change the outcome. In RL, however, the action and reward in the current state influence future states and actions, so the timing and sequence of data matter.

Actions affect the subsequent data the RL agent receives.
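
Delayed, sequential feedback is usually handled by having the agent maximize the return, the discounted sum of future rewards, rather than the immediate reward. In the standard formulation (Sutton and Barto), G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + …, where the discount factor γ, between 0 and 1, controls how much later rewards count. The sketch below computes this quantity; the two reward sequences and γ = 0.9 are made-up values for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Return G = r1 + gamma*r2 + gamma^2*r3 + ... for a list of future rewards."""
    g = 0.0
    for r in reversed(rewards):  # work backwards: G_t = r_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g


# Two hypothetical trajectories: small immediate gains followed by a fall,
# versus no immediate reward but no fall either.
risky = [1, 1, -100]
cautious = [0, 0, 0]
print(discounted_return(risky))     # about -79.1
print(discounted_return(cautious))  # 0.0
```

Even though the risky trajectory earns more reward in its first two steps, its return is far lower, which is how a delayed penalty feeds back into the value of earlier decisions.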

Why Reinforcement Learning

Reinforcement learning targets problems that are often beyond what humans can solve by hand-crafting rules, and beyond what other ML techniques are designed for: making a sequence of decisions in a complex, changing environment. Besides, RL does not require a pre-collected training dataset, because the agent generates its own experience by interacting with the environment. This is a great advantage for problems where data availability or data collection is an issue.
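
As a rough illustration of learning purely from interaction, here is a minimal tabular Q-learning sketch. Q-learning is a standard RL algorithm described by Sutton and Barto; it is used here only as an example, and the article does not say which algorithm any particular system uses. The environment is a hypothetical five-cell corridor with a reward of +1 for reaching the last cell. No dataset is supplied: the agent starts with all Q-values at zero, explores with an ε-greedy rule, and updates its estimates from the rewards it experiences. The hyperparameters (α = 0.5, γ = 0.9, ε = 0.1, 200 episodes) are arbitrary choices for this toy problem.

```python
import random

N_STATES = 5        # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, 1)   # move left or move right


def step(state, action):
    """Toy environment: move along the corridor; reaching the goal gives +1 and ends the episode."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}  # no prior data: all zeros

    def greedy(state):
        best = max(q[(state, a)] for a in ACTIONS)
        return rng.choice([a for a in ACTIONS if q[(state, a)] == best])  # break ties randomly

    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # epsilon-greedy: usually exploit current estimates, sometimes explore
            action = rng.choice(ACTIONS) if rng.random() < epsilon else greedy(state)
            next_state, reward, done = step(state, action)
            target = reward if done else reward + gamma * max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (target - q[(state, action)])  # move estimate toward target
            state = next_state
            if done:
                break
    return q


q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)  # expected: {0: 1, 1: 1, 2: 1, 3: 1}, i.e. always move right
```

The agent is never told that "right" is correct; it discovers this from the +1 reward at the goal, which the update rule propagates backwards to earlier cells.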

Reinforcement Learning applications

RL is the darling of ML researchers now. It is advancing at an incredible pace to solve business and industrial problems, and it is garnering a lot of attention due to its potential. Going forward, RL is expected to be core to organizations’ AI strategies.

Reinforcement Learning at GAVS

Reinforcement Learning is core to GAVS’ AI strategy and is being actively pursued to power the IP-led AIOps platform, Zero Incident Framework™ (ZIF). Our first success with RL was developing an RL agent for automated log rotation on servers.

About the Author:

Gireesh Sreedhar KP

Gireesh is a part of the projects run in collaboration with IIT Madras for developing AI solutions and algorithms. His interests include Data Science, Machine Learning, Financial markets, and Geo-politics. He believes that he is competing against himself to become better than who he was yesterday. He aspires to become a well-recognized subject matter expert in the field of Artificial Intelligence.

AI in Healthcare

The Healthcare Industry is going through a quiet revolution. Factors like disease trends, doctor demographics, regulatory policies, environment, technology etc. are forcing the industry to turn to emerging technologies like AI, to help adapt to the pace of change. Here, we take a look at some key use cases of AI in Healthcare.

Medical Imaging

The application of Machine Learning (ML) in Medical Imaging is showing highly encouraging results. ML is a subset of AI in which algorithms and models learn from data, helping machines imitate some cognitive functions of the human brain and improve with experience.

AI can be gainfully used in the different stages of medical imaging: acquisition, image reconstruction, processing, interpretation, storage, data mining & beyond. The performance of ML models improves tremendously as they are exposed to more & more data, and training on such colossal amounts of data enables them to gradually match or exceed humans at some interpretation tasks. They can detect anomalies that are hard for the human eye to perceive or the human brain to discern.

Noise goes hand-in-hand with data. Noise creates artifacts in images and reduces their quality, which can lead to inaccurate diagnosis. AI systems work through the clutter and aid noise reduction, leading to better precision in diagnosis, prognosis, staging, segmentation and treatment.

At the forefront of this use case is radiogenomics: correlating cancer imaging features with gene expression. It is expected to play a pivotal role in cancer research.

Drug Discovery

Drug Discovery is an arduous process that takes several years from the start of research to obtaining approval to market. Research involves laboring through copious amounts of medical literature to identify the dynamics between genes, molecular targets, pathways and candidate compounds. Sifting through all of this complex data to arrive at conclusions is an enormous challenge. When this voluminous data is fed to ML models, they can surface candidate relationships far faster than manual review. AI powered by domain knowledge is cutting the time & cost involved in new drug development.

Cybersecurity in Healthcare

Data security is of paramount importance to Healthcare providers, who need to ensure the confidentiality, integrity, and availability of patient data. With cyberattacks increasing in number and complexity, these formidable threats are giving security teams sleepless nights! The main strength of AI is its ability to curate massive quantities of data (here, threat intelligence), filter out the noise, provide instant insights & self-learn in the process. The predictive & prescriptive capabilities of these models drastically reduce response time.

Virtual Health assistants

Virtual Health assistants like chatbots give patients 24/7 access to critical information, in addition to offering services like scheduling health check-ups or setting up appointments. AI-based platforms for wearable health devices and health apps come armed with loads of features to monitor health signs, daily activities, diet, sleep patterns etc., and provide alerts for immediate action or suggest personalized plans to enable healthy lifestyles.

AI for Healthcare IT Infrastructure

Healthcare IT Infrastructure, running the critical applications that enable patient care, is the heart of a Healthcare provider. With dynamically changing IT landscapes that are distributed, hybrid & on-demand, IT Operations teams are finding it hard to keep up. Artificial Intelligence for IT Operations (AIOps) is poised to fundamentally transform the Healthcare Industry. Healthcare Providers across the globe are adopting it to Automate, Predict, Remediate & Prevent Incidents in their IT Infrastructure. GAVS’ Zero Incident Framework™ (ZIF), an AIOps Platform, is a pure-play AI platform based on unsupervised Machine Learning and comes with the full suite of tools an IT Infrastructure team would need.

Analyze

Have you heard of AIOps?

Artificial intelligence for IT operations (AIOps) is an umbrella term for the application of Big Data Analytics, Machine Learning (ML) and other Artificial Intelligence (AI) technologies to automate the identification and resolution of common Information Technology (IT) problems. The systems, services and applications in a large enterprise produce immense volumes of log and performance data. AIOps uses this data to monitor the assets and gain visibility into the working behaviour and dependencies between these assets.

According to a Gartner study, the adoption of AIOps by large enterprises would rise to 30% by 2023.

ZIF – The ideal AIOps platform of choice

Zero Incident Framework™ (ZIF) is an AIOps-based TechOps platform that enables proactive detection and remediation of incidents, helping organizations drive towards a Zero Incident Enterprise™.

ZIF comprises five modules: Discover, Monitor, Analyze, Predict and Remediate.

At the heart of ZIF lie its Analyze and Predict (A&P) modules, which are powered by Artificial Intelligence and Machine Learning techniques. From a business perspective, the primary goal of A&P is 100% availability of applications and business processes.

Come, let us understand more about the Analyze function of ZIF.

With a Big Data platform under its hood, Analyze can ingest large volumes of raw monitoring data, both structured and unstructured, and group them to build linkages and identify failure patterns.

Data Ingestion and Correlation of Diverse Data

The module processes a wide range of data from varied data sources to break down silos while providing insights, exposing anomalies and highlighting risks across the IT landscape. It increases productivity and efficiency through actionable insights.

  • 100+ connectors for leading tools, environments and devices
  • Correlation and aggregation methods uncover patterns and relationships in the data

Noise Nullification

Eliminates duplicate incidents, false positives and any alerts that are insignificant. This also helps reduce the Mean-Time-To-Resolution and event-to-incident ratio.

  • Deep learning algorithms isolate events that have the potential to become incidents along with their potential criticality
  • Correlation and Aggregation methods group alerts and incidents that are related and need a common remediation
  • Reinforcement learning techniques are applied to find and eliminate false positives and duplicates
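
ZIF's own noise nullification relies on deep learning and reinforcement learning, and its internals are not described here. For intuition only, the sketch below shows a much simpler, rule-based form of deduplication: alerts are fingerprinted by host and check name (hypothetical fields), and repeats of the same fingerprint within a time window are suppressed, even when they come from different monitoring tools. The tool names and the 300-second window are arbitrary assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    tool: str         # which monitoring tool raised the alert (hypothetical field)
    host: str
    check: str
    timestamp: float  # seconds since some reference point


def deduplicate(alerts, window_seconds=300):
    """Keep an alert only if no alert with the same (host, check) was kept in the last window_seconds."""
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        fingerprint = (alert.host, alert.check)
        previous = last_kept.get(fingerprint)
        if previous is not None and alert.timestamp - previous < window_seconds:
            continue  # duplicate of a recently kept alert: suppress it
        last_kept[fingerprint] = alert.timestamp
        kept.append(alert)
    return kept


alerts = [
    Alert("tool_a", "web01", "cpu_high", 0),
    Alert("tool_b", "web01", "cpu_high", 30),   # same issue seen by a second tool
    Alert("tool_a", "db01", "disk_full", 45),
    Alert("tool_a", "web01", "cpu_high", 400),  # outside the window: kept again
]
print(len(deduplicate(alerts)))  # 3
```

A learned approach would replace the fixed fingerprint and window with patterns learned from data, but the input and output (many raw alerts in, fewer meaningful ones out) are the same.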

Event Correlation

Data from various sources is ingested into ZIF in real time, by either a push or a pull mechanism. As the data is ingested, labelling algorithms are run to label it based on identifiers. The labelled data is passed through the correlation engine, where unsupervised algorithms are run to mine patterns. Sub-sequence mining algorithms help identify unique patterns in the data.
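
As a simplified illustration of sub-sequence mining (not ZIF's actual algorithm), the sketch below treats each labelled event stream as a list of label strings and counts contiguous label sequences of a fixed length. A pattern's support is the number of streams it appears in, so only patterns that recur across streams survive the `min_support` filter. Dedicated sequential-pattern algorithms such as PrefixSpan also find non-contiguous subsequences; the labels, pattern length and threshold here are hypothetical.

```python
from collections import Counter


def frequent_subsequences(event_streams, length=2, min_support=2):
    """Count contiguous label subsequences of a given length across event streams.

    Support is counted once per stream, so a pattern must recur across different streams.
    """
    support = Counter()
    for stream in event_streams:
        seen = {tuple(stream[i:i + length]) for i in range(len(stream) - length + 1)}
        support.update(seen)
    return {pattern: count for pattern, count in support.items() if count >= min_support}


streams = [
    ["disk_full", "db_slow", "app_timeout"],
    ["cpu_high", "disk_full", "db_slow"],
    ["disk_full", "db_slow", "app_timeout", "login_fail"],
]
print(frequent_subsequences(streams))
```

Here ("disk_full", "db_slow") appears in all three streams and ("db_slow", "app_timeout") in two, so both are reported; the one-off sequences are dropped.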

Unique patterns identified are clustered using clustering algorithms to form cases. Every case that is generated is marked by a unique case id. As part of the clustering process, seasonality aspects are checked from historical transactions to derive higher accuracy of correlation.
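
The clustering step can be pictured with the sketch below, which groups mined patterns into cases and tags each case with a unique ID. It uses a deliberately simple greedy grouping based on the Jaccard similarity of label sets, with an assumed threshold of 0.5. The source does not say which clustering algorithm ZIF uses, and the seasonality check is not modelled here.

```python
import uuid


def jaccard(a, b):
    """Similarity of two label sets: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_into_cases(patterns, threshold=0.5):
    """Greedy single pass: a pattern joins the first case holding a similar enough member."""
    cases = []
    for pattern in patterns:
        labels = frozenset(pattern)
        for case in cases:
            if any(jaccard(labels, member) >= threshold for member in case):
                case.append(labels)
                break
        else:  # no similar case found: start a new one
            cases.append([labels])
    # Every case gets a unique case ID
    return {str(uuid.uuid4()): case for case in cases}


patterns = [
    ("disk_full", "db_slow"),
    ("db_slow", "disk_full", "app_timeout"),
    ("cpu_high", "login_fail"),
]
cases = cluster_into_cases(patterns)
print(sorted(len(members) for members in cases.values()))  # [1, 2]
```

The first two patterns share two of three labels (similarity about 0.67), so they form one case; the third shares nothing and becomes a case of its own.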

Correlation is done based on pattern recognition, which helps eliminate the need for a relational CMDB in the enterprise. The accuracy of the correlation increases as patterns recur. The algorithms can also unlearn patterns based on feedback from the actions taken on the correlations. As these are unsupervised algorithms, the patterns are learnt with zero human intervention.

Accelerated Root Cause Analysis (RCA)

The Analyze module helps identify the root causes of incidents even when they occur in different silos. A combination of correlation algorithms and unsupervised deep learning techniques helps accurately nail down the root causes of incidents/problems. Learnings from historical incidents are also applied to find root causes in real time. The platform retraces user journeys step by step to identify the exact point where an error occurs.

Customer Success Story – How ZIF’s A&P transformed IT Operations of a Manufacturing Giant

  • Seamless end-to-end monitoring – OS, DB, Applications, Networks
  • Helped achieve more than 50% noise reduction in 6 months
  • Reduced P1 incidents by ~30% through dynamic and deep monitoring
  • Achieved declining trend of MTTR and an increasing trend of Availability
  • Resulted in optimizing command centre/operations head count by ~50%
  • Resulted in ~80% reduction in operations TCO

For more detailed information on GAVS’ Analyze, or to request a demo please visit zif.ai/products/analyze

References: www.gartner.com/smarterwithgartner/how-to-get-started-with-aiops

ABOUT THE AUTHOR

Vasudevan Gopalan


Vasu heads the Engineering function for A&P. He is a Digital Transformation leader with ~20 years of IT industry experience spanning Product Engineering, Portfolio Delivery, Large Program Management etc. Vasu has designed and delivered Open Systems, Core Banking, Web / Mobile Applications etc.

Outside of his professional role, Vasu enjoys playing badminton and focusses on fitness routines.

Monitoring for Success

Do you know if your end users are happy?

(In the context of users of Applications (desktop, web or cloud-based), Services, Servers and components of IT environment, directly or indirectly.)

The question may sound trivial, but it has a significant impact on the success of a company. The user experience is a journey, from the time users start using the application or service until after they complete the interaction. Experience can be determined by factors like Speed, Performance, Flawlessness, Ease of use, Security and Resolution time, among others. Hence, monitoring the ‘Wow’ & ‘Woe’ moments of the users is vital.

Monitor is a component of GAVS’ AIOps Platform, Zero Incident Framework™ (ZIF). One of the key objectives of the Monitor platform is to measure and improve end-user experience. This component monitors, in real time, all the layers involved in the user experience (including but not limited to application, database, server, APIs, end-points, and network devices). Ultimately, this helps drive the environment towards Zero Incidents.

This figure shows the capability of ZIF monitoring, which cuts across all layers from the end user to storage, and how it is linked to the other components of the platform.

Key features of ZIF Monitor:

  • Unified solution for all IT environment monitoring needs: The platform covers the end-to-end monitoring of an IT landscape. The key focus is to ensure all verticals of IT are brought under thorough monitoring. The deeper the monitoring, the closer an organization is to attaining a Zero Incident Enterprise™.
  • Agents with self-intelligence: The intelligent agents capture various health parameters about the environment. When the target environment is already running low on resources, the agent does not add more load to it. It collects the health-related metrics and communicates them through the telemetry channel efficiently and effectively. The intelligence is applied to the parameters to be collected, the collection period and more.
  • Depth of monitoring: The core strength of Monitor is that it comes with a list of performance counters, defined by SMEs, across all layers of the IT environment. This is a key differentiator; the monitoring parameters can be dynamically configured for the target environment. Parameters can be added or removed on a need basis.
  • Agent & Agentless (Remote): Customers can choose between Agent & Agentless options. The remote solution is called the Centralized Remote Monitoring Solution (CRMS). Each monitoring parameter can be remotely controlled and defined from the CRMS. Even the agents running in the target environment can be controlled from the server console.
  • Compliance: Monitor plays a key role in the compliance of the environment. Compliance checks range from ensuring the availability of necessary services and processes in the target environment to defining the standard for which Application, Make, Version, Provider, Size, etc. are allowed in the target environment.
  • Auto discovery: Monitor can auto-discover the newer elements (servers, endpoints, databases, devices, etc.) that are getting added to the environment. It can automatically add those newer elements into the purview of monitoring.
  • Auto scale: Centralized Remote Monitoring Solution (CRMS) can auto-scale on its own when newer elements are added for monitoring through auto-discovery. The auto scale includes various aspects, like load on channel, load on individual polling engine, and load on each agentless solution.
  • Real time user & Synthetic Monitoring: Real-time user monitoring monitors the environment while users are active. Synthetic monitoring uses simulation instead: it doesn’t wait for a user to make a transaction or use the system. It simulates the scenario and provides insights for proactive decision-making.
  • Availability & status of devices connected: Monitor also includes the monitoring of availability and control of USB and COM port devices that are connected.
  • Black box monitoring: It is not always possible to instrument the application to get insights. Hence, the Black Box technique is used. Here the application is treated as a black box and is monitored in terms of its interaction with the Kernel & OS through performance counters.
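
The self-throttling behaviour described under "Agents with self-intelligence" could look something like the sketch below, which lengthens the collection interval when the host is busy. It is a hypothetical policy, not ZIF's implementation: the 80% CPU threshold and the 60 to 600 second range are made-up values, and in practice the CPU reading might come from a library such as psutil.

```python
def next_collection_interval(cpu_percent, base_interval=60, max_interval=600, busy_threshold=80.0):
    """Hypothetical back-off policy: collect less often when the target host is busy.

    Below busy_threshold the agent uses base_interval (seconds). Above it, the interval
    grows linearly up to max_interval as CPU usage approaches 100%.
    """
    if cpu_percent < busy_threshold:
        return base_interval
    fraction = min((cpu_percent - busy_threshold) / (100.0 - busy_threshold), 1.0)
    return int(base_interval + fraction * (max_interval - base_interval))


for cpu in (50, 90, 100):
    print(cpu, next_collection_interval(cpu))  # 60, 330 and 600 seconds respectively
```

A linear back-off keeps the behaviour predictable; a real agent might also drop low-priority counters under load instead of only slowing down collection.
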
High-level overview of Monitor’s components:

  • Agents, Agentless: These are the means through which monitoring is done at the target environment, like user devices, servers, network devices, load balancers, virtualized environment, API layers, databases, replications, storage devices, etc.
  • ZIF Telemetry Channel: The performance telemetry collected at the source is passed through this channel to the big data platform.
  • Telemetry Data: Refers to the performance data and other metrics collected from all over the environment.
  • Telemetry Database: This is the big data platform, in which the telemetry data from all sources is captured and stored.
  • Intelligence Engine: This parses the telemetry data in near real time and raises notifications based on rule-based thresholds as well as dynamic thresholds.
  • Dashboard & Alerting Mechanism: These are the means through which the results of monitoring are conveyed, as metrics on dashboards and as notifications.
  • Integration with Analyze, Predict & Remediate components: The Monitor module communicates the telemetry to the Analyze & Predict components of the ZIF platform, which use the data for analysis and apply Machine Learning for prediction. Both the Monitor & Predict components communicate with the Remediate platform to trigger remediation.
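
The difference between the rule-based and dynamic thresholds used by the Intelligence Engine can be sketched as follows. The static rule compares a value against a fixed limit; the dynamic rule flags values that lie more than k standard deviations from the recent mean. Both rules, the 200 ms limit, k = 3 and the sample response times are illustrative assumptions; the source does not specify how ZIF computes its dynamic thresholds.

```python
from statistics import mean, stdev


def static_breach(value, limit=200.0):
    """Rule-based threshold: breach if the value exceeds a fixed limit."""
    return value > limit


def dynamic_breach(history, value, k=3.0):
    """Dynamic threshold: breach if the value is more than k standard deviations from the recent mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > k * sigma


history = [100, 102, 98, 101, 99]  # recent response times in ms (made-up)
print(static_breach(120), dynamic_breach(history, 120))  # False True
```

A 120 ms response is well under the fixed 200 ms limit, but it is far outside this service's normal range, which is the kind of deviation a dynamic threshold is meant to catch.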

The Monitor component works in tandem with the Analyze, Predict and Remediate components of the ZIF platform to achieve an incident-free IT environment. Implementing ZIF is the right step towards driving an enterprise to Zero Incidents. ZIF is the only platform in the industry that comes from a single product platform owner who owns the end-to-end IP of the solution, with products developed from scratch.

For more detailed information on GAVS’ Monitor, or to request a demo please visit zif.ai/products/monitor/

(To be continued…)

About the Author

Suresh Kumar Ramasamy


Suresh heads the Monitor component of ZIF at GAVS. He has 20 years of experience in Native Applications, Web, Cloud and Hybrid platforms from Engineering to Product Management. He has designed & hosted the monitoring solutions. He has been instrumental in conglomerating components to structure the Environment Performance Management suite of ZIF Monitor.

Suresh enjoys playing badminton with his children. He is passionate about gardening, especially medicinal plants.

AIOps Demystified

IT Infrastructure has been on an incredibly fascinating journey from the days of mainframes housed in big rooms just a few decades ago, to minicomputers, personal computers, client-servers, enterprise & mobile networks, virtual machines and the cloud! While mobile technologies have made computing omnipresent, the cloud coupled with technologies like virtual computing and containers has changed the traditional IT industry in unimaginable ways and has fuelled the rise of service-oriented architectures where everything is offered as a service and on demand: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), DBaaS, MBaaS, SaaS and so on.

As companies try to grapple with this technology explosion, it is very clear that the first step has to be optimization of the IT infrastructure & operations. Efficient ITOps has become the foundation not just to aid transformational business initiatives, but even for basic survival in this competitive world.

The term AIOps was first coined by Gartner, based on its research on Algorithmic IT Operations. Now, it refers to Artificial Intelligence (AI) for IT Operations (Ops): the use of Big Data Analytics and AI technologies to optimize, automate and supercharge all aspects of IT Operations.

Why AI in IT operations?

The promise behind bringing AI into the picture has been to do what humans have been doing, but better, faster and at a much larger scale. Let’s delve into the different aspects of IT operations and see how AI can make a difference.

Visibility

The first step to effectively managing the IT landscape is to get complete visibility into it. Why is that so difficult? The sheer variety and volume of applications, users and environments make it extremely challenging to get a full 360 degree view of the landscape. Most organizations use applications that are web-based, virtually delivered, vendor-built, custom-made, synchronous/asynchronous/batch processing, written using different programming languages and/or for different operating systems, SaaS, running in public/private/hybrid cloud environments, multi-tenant, multiple instances of the same applications, multi-tiered, legacy, running in silos! Adding to this complexity is the rampant issue of shadow IT, which is the use of applications outside the purview of IT, triggered by the easy availability of and access to applications and storage on the cloud. And, that’s not all! After all the applications have been discovered, they need to be mapped to the topology, their performances need to be baselined and tracked, all users in the system have to be found and their user experiences captured.

The enormity of this challenge is now evident. AI powers auto-discovery of all applications, topology mapping, baselining response times and tracking all users of all these applications. Machine Learning algorithms aid in self-learning, unlearning and auto-correction to provide a highly accurate view of the IT landscape.
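
One way to picture automated topology mapping is to build a dependency graph from observed network connections and then walk it to find everything an application depends on. The sketch below assumes the connections are already available as hypothetical (client, server) pairs; how they are collected, and the AI-driven parts of discovery, are out of scope.

```python
from collections import defaultdict


def build_topology(connections):
    """Build a directed dependency graph from observed (client, server) connection pairs."""
    graph = defaultdict(set)
    for client, server in connections:
        graph[client].add(server)
    return graph


def dependencies(graph, node):
    """All components `node` depends on, directly or transitively (iterative depth-first walk)."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for nxt in graph.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


connections = [
    ("web01", "app01"),
    ("web02", "app01"),
    ("app01", "db01"),
    ("app01", "cache01"),
]
topology = build_topology(connections)
print(sorted(dependencies(topology, "web01")))  # ['app01', 'cache01', 'db01']
```

With such a map, an alert on db01 can be related to the web tier that ultimately depends on it, which is the starting point for the correlation discussed below.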

Monitoring

When the IT landscape has been completely discovered, the next step is to monitor the infrastructure and application stacks. Monitoring tools provide real-time data on their availability and performance based on relevant metrics.

The problem here is two-fold. First, IT organizations typically need to rely on several monitoring tools that cater to the different environments/domains in the landscape. Since these tools work in silos, they give a very fractured view of the entire system, necessitating data correlation before the data can be gainfully used for Root Cause Analysis (RCA) or actionable insights.

Pattern recognition-based learning from current and historical data helps correlate these seemingly independent events, and therefore recognize & alert on deviations, performance degradations or capacity utilization bottlenecks in real time. This enables effective Root Cause Analysis (RCA) and reduces an important KPI, Mean Time to Identify (MTTI).

Secondly, there are colossal amounts of data in the form of logs, events and metrics pouring in at high velocity from all these monitoring tools, creating alert fatigue. This makes it almost impossible for the IT support team to check each event, correlate it with the other events, tag and prioritize them and plan remedial action.

Inherently, machines handle volume with ease, and when programmed with ML algorithms, they learn to sift through all the noise and zero in on what is relevant. Noise nullification is achieved by the use of Deep Learning algorithms that isolate events that have the potential to become incidents, and Reinforcement Learning algorithms that find and eliminate duplicates and false positives. These capabilities help organizations bring dramatic improvements to another critical ITOps metric, Mean Time to Resolution (MTTR).
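
Since MTTR (and, similarly, MTTI) is the KPI these capabilities aim to improve, it helps to be precise about the arithmetic: MTTR is the total time from an incident being opened to its resolution, divided by the number of incidents. The sketch below computes it from hypothetical (opened, resolved) timestamps; organizations differ on whether the clock starts at detection, ticket creation or assignment.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """incidents: list of (opened, resolved) datetime pairs; returns the mean duration."""
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


start = datetime(2020, 1, 1, 9, 0)
incidents = [
    (start, start + timedelta(minutes=30)),
    (start, start + timedelta(minutes=90)),
    (start, start + timedelta(minutes=60)),
]
print(mean_time_to_resolution(incidents))  # 1:00:00
```

MTTI can be computed the same way by using the time at which the root cause was identified in place of the resolution time.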

Other areas of ITOps where AI brings a lot of value are Advanced Analytics (Predictive & Prescriptive) and Remediation.

Advanced Analytics

Unplanned IT Outages result in huge financial losses for companies and, even worse, a sharp dip in customer confidence. One of the biggest value-adds of AI for ITOps, then, is driving proactive operations that deliver superior user experiences with predictable uptime. Advanced Analytics on historical incident data identifies patterns, causes and situations across the entire stack (infrastructure, networks, services and applications) that lead to an outage. Multivariate predictive algorithms predict incident and service request volumes, spikes and lulls well in advance. AIOps tools forecast usage patterns and capacity requirements to enable planning, just-in-time procurement and staffing to optimize resource utilization. Reactive purchases after the fact can be very disruptive & expensive.
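
To make capacity forecasting concrete, the sketch below fits a straight-line trend to daily disk usage and estimates how many days remain before the disk is full. This is a deliberately simple, univariate least-squares model swapped in for illustration; the multivariate predictive algorithms mentioned above are more involved, and the usage figures are made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept (needs at least two distinct x values)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def days_until_full(days, used_percent, capacity=100.0):
    """Days after the last observation until the fitted trend reaches capacity, or None if not growing."""
    slope, intercept = fit_line(days, used_percent)
    if slope <= 0:
        return None
    return (capacity - intercept) / slope - days[-1]


days = [0, 1, 2, 3, 4]
used = [50, 52, 54, 56, 58]  # % of disk used each day (made-up)
print(days_until_full(days, used))  # 21.0
```

Usage grows by 2 percentage points a day from 50%, so the trend hits 100% on day 25, which is 21 days after the last observation on day 4.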

Remediation

AI-powered remediation automates remedial workflows & service actions, saving a lot of manual effort and reducing errors, incidents and the cost of operations. Chatbots provide round-the-clock customer support, guide users through troubleshooting standard problems, and auto-assign tickets to the appropriate IT staff. Dynamic capacity orchestration based on predicted usage patterns and capacity needs induces elasticity and eliminates performance degradation caused by inefficient capacity planning.

Conclusion

The beauty of AIOps is that it gets better with age as the learning matures on exposure to more and more data. While AIOps is definitely a blessing for IT Ops teams, it is only meant to augment the human workforce and not to replace it entirely. And importantly, there is no one-size-fits-all approach to AIOps. Understanding current pain points and future goals, and finding an AIOps vendor with relevant offerings, is the cornerstone of a successful implementation.

GAVS’ Zero Incident Framework™ (ZIF) is an AIOps-based TechOps Platform that enables organizations to trend towards a Zero Incident Enterprise™. ZIF comes with an end-to-end suite of tools for ITOps needs. It is a pure-play AI Platform powered entirely by Unsupervised Pattern-based Machine Learning! You can learn more about ZIF or request a demo at zif.ai.

Optimizing ITOps for Digital Transformation

The key focus of Digital Transformation is removing procedural bottlenecks and bending the curve on productivity. As the Chief Insights Officer of Forbes Media says, Digital Transformation is now “essential for corporate survival”.

Emerging technologies are enabling dramatic innovations in IT infrastructure and operations. It is no longer just about hardware, software, data centers, the cloud or the service desk; it is about backing business strategies. So, here are some reasons why companies should think about redesigning their IT services to embrace digital disruption.

DevOps for Agility

As companies move away from the traditional Waterfall model of software development and adopt Agile methodologies, IT infrastructure and operations also need to become agile and malleable. Agility has become indispensable to stay competitive in this era of dynamism and constant change. What started off as a set of software development methodologies has now permeated all aspects of an organization, ITOps being one of them. Development, QA and IT teams need to come out of their silos and work in tandem for constant productive collaboration, in what is termed DevOps.

Shorter development & deployment cycles demand greater overall ITOps efficiency and, among other things, on-demand, self-service provisioning of IT environments. Provisioning needs to be automated and built into the CI/CD pipeline.

Downtime Mitigation

With agility being the org-wide mantra, predictable IT uptime becomes a mandate. Outages incur a very high cost and adversely affect the pace of innovation. According to a report by DevOps.com, the average annual cost of unplanned application downtime for Fortune 1000 companies is between $1.25 billion and $2.5 billion. The report further says that infrastructure failure can cost the bottom line $100,000/hr, and that the cost of critical application failure is $500,000 to $1 million/hr.
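
Applying these hourly figures to a hypothetical four-hour outage shows how quickly the cost adds up. The outage duration is an assumption; the per-hour rates are the ones quoted from the DevOps.com report.

```python
# Hourly cost figures quoted from the DevOps.com report; outage durations are made-up examples.
INFRA_FAILURE_COST_PER_HOUR = 100_000
CRITICAL_APP_COST_PER_HOUR = (500_000, 1_000_000)  # (low, high)


def outage_cost_range(hours, critical_app=True):
    """Return the (low, high) estimated cost in USD of an outage lasting `hours`."""
    if critical_app:
        low, high = CRITICAL_APP_COST_PER_HOUR
        return hours * low, hours * high
    cost = hours * INFRA_FAILURE_COST_PER_HOUR
    return cost, cost


print(outage_cost_range(4))                      # (2000000, 4000000)
print(outage_cost_range(4, critical_app=False))  # (400000, 400000)
```

Even at the low end, a four-hour critical application outage costs $2 million, which is the economic case for the proactive tooling described below.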

ITOps must stay ahead of the game by eliminating outdated legacy systems, tools, technologies and workflows. End-to-end automation is key. IT needs to modernize its stack by zeroing in on tools for Discovery of the complete IT landscape, Monitoring of devices, Analytics for noise reduction and event correlation, and AI-based tools for RCA, incident Prediction and Auto-Remediation. All of this intelligent automation enables a proactive response, rather than a reactive response after the fact when the damage has already been done.

Moving away from the shadows

Shadow IT, the use of technology outside the IT purview, is becoming a tacitly approved aspect of most modern enterprises. It is a result of the proliferation of technology and of the cloud offering easy access to applications and storage. Users of Shadow IT systems bypass the IT approval and provisioning process to use unauthorized technology without the consent of the IT department. There are huge security and compliance risks waiting to happen if this sprawling syndrome is not reined in. To bring Shadow IT under control, the IT department must first know about it. This is where automated Discovery tools bring in a lot of value by automating the process of application discovery and topology mapping.

Moving towards Hybrid IT

Hybrid IT means using an optimal, cost-effective mix of public and private clouds and on-premises systems to build an infrastructure that is dynamic, on-demand, scalable and composable. IT spend on datacenters is trending downward, and most organizations are thinking beyond traditional datacenters to options in the cloud. Colocation is an important consideration, since it delivers better availability, energy and time savings, and scalability, and reduces the impact of network latency. Organizations are keeping on-premises only the mission-critical processes that require close monitoring and control.

Edge computing

Gartner defines edge computing as solutions that facilitate data processing at or near the source of data generation. With huge volumes of data being churned out at rapid rates, for instance by monitoring or IoT devices, it is highly inefficient to stream all this data to a centralized datacenter or cloud for processing. Organizations now understand the value in a decentralized approach to address modern digital infrastructure needs. Edge computing serves as the decentralized extension of the datacenter/cloud and addresses the need for localized computing power.
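As a minimal sketch of processing "at or near the source", the hypothetical Python function below (not taken from any edge platform) collapses a stream of raw sensor readings into per-window summaries, so only a handful of records travel to the datacenter or cloud instead of every sample. The fixed window size and the choice of min/max/mean are assumptions for illustration.

```python
from statistics import mean


def summarize_at_edge(readings, window_size):
    """Collapse raw readings into one summary per fixed-size window so that
    only the summaries, not every raw point, are sent to the central site."""
    summaries = []
    for start in range(0, len(readings), window_size):
        window = readings[start:start + window_size]
        summaries.append({
            "count": len(window),
            "min": min(window),
            "max": max(window),
            "mean": mean(window),
        })
    return summaries


raw = [20.0, 21.0, 22.0, 23.0, 30.0, 31.0]  # e.g. temperature samples from one IoT sensor
summaries = summarize_at_edge(raw, window_size=3)
```

Six raw samples become two summary records; at real sensor rates the reduction in upstream traffic is what makes the edge worthwhile.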

Cybersecurity

Cyber attacks are on the rise, and securing networks and protecting data pose big challenges. Hybrid IT, IoT, edge computing and similar trends extend the IT footprint beyond secure enterprise boundaries, multiplying the number of potential attack points. IT teams need to be well versed in the nuances of security set-up in different cloud vendor environments. With data spread across on-premises and cloud environments, shared workstations and virtual machines, ownership of data integrity is often ambiguous. In Hybrid IT deployments, a comprehensive security plan that applies regardless of the data’s location has become paramount.

Upskilling IT Teams

As the lines between Dev and IT blur, there is increasing demand for IT professionals with a broad range of cross-functional skills in addition to core IT competencies. With new technologies constantly emerging, there is usually little clarity on the exact skillsets an organization’s IT team requires. More than expertise in one specific area, IT teams need to be open to continuous learning, so that they can adapt to changing IT environments, close the skills gap and support their organization’s Digital Transformation goals.

What you need to know about AIOps

Emergence of AIOps

AIOps has grown enormously in the last two years, successfully transitioning from an emergent category to an inevitability. Companies adopted AIOps to automate and improve IT operations by applying big data and machine learning (ML), and adopting such technologies has also pushed IT operations toward multi-cloud infrastructure. According to Infoholic Research, the AIOps market is expected to grow at a CAGR of 33.08% during the forecast period 2018–2024.

What is AIOps?

AIOps stands for Artificial Intelligence for IT Operations. By combining big data and ML, an AIOps platform improves IT operations and partially replaces a range of tasks, including availability tracking, event correlation, performance monitoring, IT service management and automation. Most of these technologies are well defined and mature.

AIOps data comes from log files, metrics, monitoring tools, helpdesk ticketing and other sources. The platform sorts, manages and assimilates this data to provide insight into problem areas. The goal of AIOps is to analyze data and discover patterns that can predict potential incidents in the future.

Focus areas of AIOps

  • AIOps opens up data access across the organization instead of letting silos restrict it.
  • AIOps improves data-handling capacity, which in turn widens the scope of data analysis.
  • It has a unique ability to stay aligned to organizational goals.
  • AIOps increases the scope of risk prediction.
  • It also reduces response time.

Impact of AI in IT operations

  • Capacity planning: AIOps can help teams understand workloads and plan configurations appropriately, leaving less room for guesswork.
  • Resource utilization: AIOps enables predictive scaling, in which the auto-scaling feature of cloud IaaS adjusts itself based on historical data.
  • Storage: AIOps helps manage storage through disk calibration, reconfiguration and allocation of new storage volumes.
  • Anomaly detection: AIOps can detect anomalies and critical issues faster and more accurately than humans, reducing potential threats and system downtime.
  • Threat management: It helps to analyze breaches in both internal and external environments.
  • Root-cause analysis: AIOps is effective at root-cause analysis; once it locates the issue, it can produce a remedy, reducing response time.
  • Forecasting outages: Outage prediction is essential for the growth of IT operations. In fact, based on industry reports, the market for forecasting outages through AIOps is expected to grow from $493.7 million in 2016 to $1.14 billion in 2021.
  • Future innovation: AIOps has automated a major share of IT operations, freeing resources to focus on crucial work aligned to strategy and organizational goals.
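Root-cause analysis is often topology-driven. The Python sketch below assumes a hand-written dependency map (`topology`, a hypothetical example) and applies one simple heuristic: an alerting component is a root-cause candidate only if nothing it depends on, directly or transitively, is also alerting; everything else is treated as a downstream symptom. Real platforms typically build the topology from discovery data and weigh many more signals.

```python
def has_alerting_dependency(component, depends_on, alerting, seen=None):
    """True if anything `component` depends on, directly or transitively, is alerting."""
    if seen is None:
        seen = set()
    for dep in depends_on.get(component, []):
        if dep in seen:
            continue  # guards against cycles in the dependency map
        seen.add(dep)
        if dep in alerting or has_alerting_dependency(dep, depends_on, alerting, seen):
            return True
    return False


def root_cause_candidates(depends_on, alerting):
    """Alerting components with no alerting dependency are likely root causes;
    the rest are treated as downstream symptoms."""
    return sorted(
        c for c in alerting
        if not has_alerting_dependency(c, depends_on, alerting)
    )


# Hypothetical service topology: each key depends on the listed components.
topology = {
    "web": ["app"],
    "app": ["db", "cache"],
    "db": ["storage"],
    "cache": [],
    "storage": [],
}

print(root_cause_candidates(topology, {"web", "app", "db"}))  # prints: ['db']
```

Because the check is transitive, an alert on `storage` still explains an alert on `web` even when the components in between stay quiet.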

Problems AIOps solves

The common issues AIOps solves to help IT operations adopt digitization are as follows:

  • It can access large data sets across environments while maintaining data reliability for comprehensive analysis.
  • It simplifies data analysis through ML-powered automation.
  • Through accurate prediction, it can help avoid costly downtime and improve customer satisfaction.
  • By automating manual tasks, it can eliminate them.
  • AIOps can improve teamwork and workflows between IT groups and other business units.

Peeping into the future

An AIOps platform acts as a foundation stone for organizations’ future endeavors. It uses real-time data analysis to provide insights that inform business decisions. Successful implementation of AIOps depends on key performance indicators (KPIs). AIOps can also deliver predictive, proactive IT operations by reducing failures and the time taken to detect, investigate and resolve them.

Out of the trenches to AIOps – the Peacekeeper

The last thing an IT team wants to hear is ‘there is an issue’. It usually sends them rushing into ‘battle zones’ asking ‘Is it the apps?’ or ‘Is it the network?’, desperately trying to kill the problem while it grows larger within the Enterprise. There are no credits for crumbling SLAs, and the firefighting sometimes goes on long and hard.

IT Operations teams are often battling heavy volumes of alerts and hundreds of incident tickets raised by the environment and by the performance of its apps and infrastructure. They are constantly overwhelmed trying to manage and respond to every alert in order to avoid outages and heavy losses.

Infrastructure keeps gaining components: today a single stack can have more than 10,000 metrics, and that complexity multiplies the points of failure. Add the faster change cycles enabled by DevOps, cloud computing and so on, and there is very little time to take control or act. Under such circumstances, AIOps is fast emerging as a powerful solution, bringing the efficiency of AI and ML to this constant battle. We are looking more and more into unsupervised methods to read data and make it coherent, to ‘see the unknown unknowns’, and to bring problems into focus and remediate them before they impact customers. Adopting AI in IT Operations provides increased visibility into operations through machine learning, fewer incidents and false alarms, and predictive warnings that can prevent outages. Insights are then acted on through automation tools, saving the teams concerned time and effort.

With AIOps gathering and processing data, little or no manual intervention is required: algorithms automate the work, due diligence gets done, and rich business insights are provided. AIOps becomes the much sought-after solution to the many problems of complex IT Enterprises.

“The global AIOps Platform market is expected to generate revenue of US$ 20.428 billion, growing at a CAGR of 36.2%, by 2025.” – reports Coherent Market Insights

Gartner recommends adopting AIOps in phases. Early adopters typically start by applying machine learning to monitoring, operations and infrastructure data, before progressing to using deep neural networks for service and help desk automation.

The greatest strength of AIOps is that it can find potential risks and outages in the environment that humans cannot detect or anticipate, and it does so with greater consistency and faster time to value.

The complexity of an IT Enterprise is so great that it makes an ideal case for ML, Data Science and Artificial Intelligence: specific machine learning algorithms can capture patterns that humans could never reduce to simple instructions and remediations. AIOps becomes the real answer for tackling critical issues while filtering out the false positives that usually make up a large percentage of the ‘events’ reported by monitoring tools.

Gartner predicted that by this year about 25% of enterprises globally would implement an AIOps platform. That means increasing complexity and huge data volumes, but also deep insights and more intelligence within the environment. Experts say this implies that AI will reach all the way from the device or environment to the customer.

ChatOps

AIOps is moving fast. It is believed that in the next decade the majority of large Enterprises will take to ‘multi-system automations’ and host digital colleagues: virtual engineers that attend to queries and tasks. IT service desks will be ‘manned’ by digital colleagues, who will take care of frequent and mundane tasks with minimal human intervention. It is predicted that this year will see the emergence of ChatOps, where enterprises introduce “AI based digital colleagues into chat-based IT Operations”, and digital colleagues will make a major impact on how IT operations function.

Establishing digital service desk bots brings speed and agility to the service. Reports say that actions which previously took up to 20 steps can now be accomplished with just one phrase and a couple of clarifications from the digital colleague. With mundane, frequent tasks such as password resets, catalogue requests and access requests handled by digital colleagues, human labor hours are saved and people’s skills can be channeled to more important areas. Digital colleagues can be entrusted with all incoming requests; those they cannot process are automatically escalated to the right human engineers. Even L3 and L4 issues are expected to be resolved by digital colleagues, with workflows created by them and approved by human engineers. AI will keep recommending better and deeper automations, and we will see the true power of human/machine collaboration.
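The "handle it or escalate it" behaviour described above can be sketched in a few lines of Python. This is a deliberately naive assumption: keyword matching stands in for the natural-language understanding a real digital colleague would use, and `AUTOMATED_INTENTS`, the workflow names and `human_engineer_queue` are all hypothetical.

```python
# Hypothetical mapping from phrases a digital colleague can handle on its own
# to the (equally hypothetical) automated workflow that fulfils them.
AUTOMATED_INTENTS = {
    "password reset": "reset_password",
    "access request": "grant_access",
    "catalogue request": "order_catalogue_item",
}


def route_request(message):
    """Return ("automated", workflow) for requests the bot can fulfil,
    otherwise ("escalated", queue) so a human engineer picks it up."""
    text = message.lower()
    for phrase, workflow in AUTOMATED_INTENTS.items():
        if phrase in text:
            return ("automated", workflow)
    return ("escalated", "human_engineer_queue")


print(route_request("Hi, I need a Password Reset for my VPN account"))
# prints: ('automated', 'reset_password')
print(route_request("Checkout service is returning 500 errors"))
# prints: ('escalated', 'human_engineer_queue')
```

The important design point is the fallback: anything the bot does not recognize goes to people rather than being guessed at.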

Humans will collaborate more and more with digital colleagues: change requests will be created on a simple command and either resolved within minutes or assigned to human colleagues. Algorithms are expected to integrate operations more and more. Life with AI will also simplify tasks such as identifying and inviting the right people into root cause analysis sessions and holding post-resolution meetings to ensure continuous learning.

With AIOps, IT operations will rebuild most tasks around AI and automation. It is reported that 38.4% of organizations take at least 30 minutes to resolve incidents, and adopting AIOps is key to bringing that down. We may be looking at a future with the luxury of an autonomous data center, where IT staff can truly spend their time on strategic decisions, business growth and innovation, and contribute more visibly to the organization’s growth.

Reference
https://www.coherentmarketinsights.com/market-insight/aiops-platform-market-2073

The future of AIOps

AIOps, or Artificial Intelligence based IT operations, is the buzzword capturing the interest of CXOs in organizations worldwide. Why? Because the data explosion is here, and traditional tools and processes cannot fully handle the creation, storage, analysis and management of this data. Likewise, humans cannot analyze it thoroughly enough to obtain meaningful insights. IT teams also face the challenging task of providing speed, security and reliability in an increasingly mobile and connected world.

Add to this the complex, manual and siloed processes of legacy IT solutions. As a result, IT productivity remains low, because teams cannot find the exact root cause of incidents. Plus, business leaders lack a 360-degree view of all their IT and business services across the organization.

AIOps is the Future for IT Operations

AIOps platforms are the foundation on which organizations will build their future endeavors. Advanced machine learning and analytics are the building blocks for enhancing IT operations through a proactive approach to the service desk, monitoring and automation. Using effective data collection methods and real-time analytics technologies, AIOps provides insights that inform business decisions.

Successful AIOps implementations depend on key performance indicators (KPIs), whose impact can be seen in performance variation, service degradation, revenue, customer satisfaction and brand image.

All of these affect the organization’s services, including but not limited to supply chain, online and digital services. One way in which AIOps can deliver predictive, proactive IT is by increasing MTBF (mean time between failures) while decreasing MTTD (mean time to detect), MTTR (mean time to resolve) and MTTI (mean time to investigate).
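The Python sketch below shows how these metrics can be computed from incident records. It assumes each (hypothetical) record carries `started`, `detected` and `resolved` timestamps. Definitions vary between teams: here MTTR is measured from detection to resolution, and MTBF as the uptime between one resolution and the next failure. MTTI would need an additional "investigation started" timestamp, which this example leaves out.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average gap, in minutes, between two timestamps recorded on each incident."""
    return mean([(i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents])


def mtbf_minutes(incidents):
    """Average uptime between one incident being resolved and the next one starting."""
    ordered = sorted(incidents, key=lambda i: i["started"])
    return mean([
        (nxt["started"] - prev["resolved"]).total_seconds() / 60
        for prev, nxt in zip(ordered, ordered[1:])
    ])


incidents = [
    {"started": datetime(2024, 1, 1, 10, 0), "detected": datetime(2024, 1, 1, 10, 10),
     "resolved": datetime(2024, 1, 1, 11, 0)},
    {"started": datetime(2024, 1, 1, 14, 0), "detected": datetime(2024, 1, 1, 14, 20),
     "resolved": datetime(2024, 1, 1, 14, 50)},
]
mttd = mean_minutes(incidents, "started", "detected")
mttr = mean_minutes(incidents, "detected", "resolved")
mtbf = mtbf_minutes(incidents)
```

For the two sample incidents this gives an MTTD of 15 minutes, an MTTR of 40 minutes and an MTBF of 180 minutes; the goal is to push the first two down and the last one up.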

The future of AIOps is already taking shape in the use cases below. These only scratch the surface, and many more use cases will be added in the future.

Capacity planning

Enterprise workloads are moving to the cloud, with providers such as AWS, Google and Azure offering various configurations for running them. Complexity increases as architects add new configurations involving parameters like disk types, memory, network and storage resources.

Through recommendations, AIOps can reduce the guesswork in matching network, storage and memory resources to the right server and VM configurations.
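A recommendation of this kind can be as simple as right-sizing against observed usage. In the Python sketch below, the instance catalogue, the 95th-percentile target and the 20% headroom are all hypothetical assumptions; the function returns the smallest catalogue entry whose vCPUs and memory cover near-peak usage, or `None` if nothing fits.

```python
import math

# Hypothetical instance catalogue: name -> (vCPUs, memory in GiB).
# Real providers publish their own sizes and prices.
CATALOGUE = {
    "small": (2, 4),
    "medium": (4, 16),
    "large": (8, 32),
    "xlarge": (16, 64),
}


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def recommend_instance(cpu_used, mem_used_gib, headroom=1.2):
    """Smallest catalogue entry whose vCPUs and memory cover the 95th-percentile
    observed usage plus headroom; None if nothing in the catalogue fits."""
    need_cpu = percentile(cpu_used, 95) * headroom
    need_mem = percentile(mem_used_gib, 95) * headroom
    fitting = [
        (cpu, mem, name)
        for name, (cpu, mem) in CATALOGUE.items()
        if cpu >= need_cpu and mem >= need_mem
    ]
    return min(fitting)[2] if fitting else None


print(recommend_instance([1.0, 1.5, 2.0, 2.5, 3.0], [6, 7, 8, 9, 10]))  # prints: medium
```

Sizing to a high percentile rather than the absolute maximum avoids paying for a one-off spike, while the headroom factor leaves room for growth.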

Optimal resource utilization

Enterprises are leveraging cloud elasticity to scale their applications in or out automatically. With AIOps, IT administrators can rely on predictive scaling to take cloud auto-scaling to the next level: based on historical data, the system monitors the workload and automatically determines the resources it will require.
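A minimal Python sketch of the idea, assuming load history is available as (hour of day, requests per second) pairs and that each replica can serve a known, hypothetical `rps_per_replica`. The forecast here is just the historical average for that hour, a simple seasonal baseline rather than the model any particular cloud provider uses. The result is clamped between minimum and maximum replica counts so that a bad forecast can neither scale to zero nor run away.

```python
import math
from collections import defaultdict
from statistics import mean


def hourly_profile(history):
    """Average observed load for each hour of the day.
    history: iterable of (hour_of_day, requests_per_second) pairs."""
    by_hour = defaultdict(list)
    for hour, rps in history:
        by_hour[hour].append(rps)
    return {hour: mean(values) for hour, values in by_hour.items()}


def replicas_for_hour(profile, hour, rps_per_replica, min_replicas=1, max_replicas=20):
    """Replicas to provision ahead of `hour`, based on its historical average load."""
    expected_rps = profile.get(hour, 0.0)
    needed = math.ceil(expected_rps / rps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


history = [(9, 400), (9, 600), (3, 50), (3, 70)]  # two days of samples at 09:00 and 03:00
profile = hourly_profile(history)
print(replicas_for_hour(profile, 9, rps_per_replica=100))  # prints: 5
```

Because the replica count is computed before the hour begins, capacity is already in place when the morning peak arrives, instead of reacting after latency has risen.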

Data store management

AIOps can also monitor the network and storage resources that affect running applications, and notify the admin when performance degradation is seen. By using AI for both network and storage management, mundane tasks such as reconfiguration and recalibration can be automated. Through predictive analytics, storage capacity can be adjusted automatically by proactively adding new volumes.
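The proactive volume decision can be sketched as a linear trend fitted to daily usage. In this Python example the least-squares fit, the 14-day provisioning lead time and the function names are assumptions for illustration; real capacity forecasting usually also accounts for seasonality and bursts.

```python
def days_until_full(daily_used_gb, capacity_gb):
    """Fit a least-squares line to daily usage and extrapolate to capacity.
    Returns None when usage is flat or shrinking."""
    n = len(daily_used_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")
    x_mean = (n - 1) / 2
    y_mean = sum(daily_used_gb) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_used_gb))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den  # growth in GB per day
    if slope <= 0:
        return None
    return (capacity_gb - daily_used_gb[-1]) / slope


def should_add_volume(daily_used_gb, capacity_gb, lead_time_days=14):
    """Provision a new volume if the disk is forecast to fill within the lead time."""
    remaining = days_until_full(daily_used_gb, capacity_gb)
    return remaining is not None and remaining < lead_time_days


usage = [100, 110, 120, 130, 140]  # GB used on each of the last five days
print(days_until_full(usage, 200), should_add_volume(usage, 200))  # prints: 6.0 True
```

Growing by 10 GB a day with 60 GB left, the disk is about six days from full, well inside the two-week lead time, so a volume is added before anyone is paged.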

Anomaly detection

Anomaly detection is the most important application of AIOps, because it can prevent potential outages and disruptions. As anomalies can occur in any part of the technology stack, pinpointing them in real time using advanced analytics and machine learning is crucial. AIOps can accurately detect the actual source, helping IT teams perform efficient root cause analysis almost in real time.
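One common baseline technique for metric anomaly detection is a rolling z-score: compare each new point with the mean and standard deviation of the points just before it. The Python sketch below implements that baseline; the window of 5 and threshold of 3 are arbitrary assumptions, and it is not presented as the method any specific AIOps product uses.

```python
from statistics import mean, stdev


def detect_anomalies(series, window=5, threshold=3.0):
    """Return indices of points whose z-score against the preceding
    `window` points exceeds `threshold` in absolute value."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is unusual.
            if series[i] != mu:
                anomalies.append(i)
            continue
        z = (series[i] - mu) / sigma
        if abs(z) > threshold:
            anomalies.append(i)
    return anomalies


latency_ms = [100, 102, 98, 101, 99, 100, 250, 101]
print(detect_anomalies(latency_ms))  # prints: [6]
```

Note that the spike at index 6 then sits inside the next few windows and inflates their standard deviation, which is why production detectors often use robust statistics (such as the median) or exclude already-flagged points.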

Threat detection & analysis

Along with anomaly detection, AIOps will play a critical role in enhancing the security of IT infrastructure. Security systems can use ML algorithms and AI’s self-learning capabilities to help IT teams detect data breaches and violations. By correlating internal sources such as log files, network and event logs with external information on malicious IPs and domains, AI can detect anomalies and risk events. Advanced machine learning algorithms can also identify unexpected, potentially unauthorized or malicious activity within the infrastructure.
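Correlating internal logs with external threat intelligence can start as a simple lookup. The Python sketch below assumes log entries have already been parsed into (timestamp, source IP, event) tuples and that the blocklist holds single addresses or CIDR ranges; the sample addresses come from the reserved documentation ranges, not real threat feeds, and `flag_malicious` is a hypothetical name. It uses only the standard-library `ipaddress` module.

```python
import ipaddress


def flag_malicious(log_entries, blocklist):
    """Return log entries whose source IP falls inside any blocklisted address or range.

    log_entries: iterable of (timestamp, source_ip, event) tuples from internal logs.
    blocklist: external threat-intelligence entries such as "203.0.113.7" or "198.51.100.0/24".
    """
    networks = [ipaddress.ip_network(entry, strict=False) for entry in blocklist]
    hits = []
    for timestamp, source_ip, event in log_entries:
        address = ipaddress.ip_address(source_ip)
        if any(address in network for network in networks):
            hits.append((timestamp, source_ip, event))
    return hits


logs = [
    (1, "10.0.0.5", "login"),
    (2, "203.0.113.7", "login_failed"),
    (3, "198.51.100.42", "port_scan"),
]
threat_feed = ["203.0.113.7", "198.51.100.0/24"]
suspicious = flag_malicious(logs, threat_feed)
```

Matches like these are only a starting point; the ML-driven part the text describes comes from scoring such hits together with behavioural anomalies.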

Although still early in deployment, companies are taking advantage of AI and machine learning to improve tech support and manage infrastructure.  AIOps, the convergence of AI and IT ops, will change the face of infrastructure management.

AIOps – IT Infrastructure Services for the Digital Age

The IT infrastructure services landscape is undergoing a significant shift, driven by digitalization. As focus shifts from cost efficiency to digital enablement, organizations need to re-imagine the IT infrastructure services model to deliver the necessary back-end agility, flexibility, and fluidity. Automation, analytics, and Artificial Intelligence (AI), which together comprise the “codifying elements” for driving AIOps, help drive this desired level of adaptability within IT infrastructure services. Intelligent automation, leveraging analytics and ML, embeds powerful, real-time business and user context and autonomy into IT infrastructure services. Intelligent automation has made inroads in enterprises in the last two to three years, backed by a rapid proliferation and maturation of solutions in the market.

Artificial Intelligence Operations (AIOps) | Everest Group 2018 Report | IT Infrastructure

Benefits of codification of IT infrastructure services

Progressive leverage of analytics and AI, to drive an AIOps strategy, enables the introduction of a broader and more complex set of operational use cases into IT infrastructure services automation. As adoption levels scale and processes become orchestrated, the benefits potentially expand beyond cost savings to offer exponential value around user experience enrichment, services agility and availability, and operations resilience. Intelligent automation helps maximize value from IT infrastructure services by:

  1. Improving the end-user experience through contextual and personalized support
  2. Driving faster resolution of known/identified incidents leveraging existing knowledge, intelligent diagnosis, and reusable, automated workflows
  3. Avoiding potential incidents and improving business systems performance through contextual learning (i.e., based on relationships among systems), proactive health monitoring and anomaly detection, and preemptive healing

Although the benefits of intelligent automation are manifold, enterprises have yet to realize a commensurate advantage from their investments in infrastructure services codification. Siloed adoption, a lack of well-defined change management processes, and poor governance are some of the key barriers to achieving the expected value. The design should involve an optimal level of human effort/intervention targeted primarily at training, governing, and enhancing the system, rather than executing routine, voluminous tasks. A phased adoption of automation, analytics, and AI within IT infrastructure services has the potential to offer exponential business value. However, to realize the full potential of codification, enterprises need to embrace a lean operating model, underpinned by a technology-agnostic platform. The platform should embed the codifying elements within a tightly integrated infrastructure services ecosystem with end-to-end workflow orchestration and resolution.

The market today has a wide choice of AIOps solutions, but the onus is on enterprises to select the right set of tools / technologies that align with their overall codification strategy.
