Why is AIOps an Industrial Benchmark for Organizations to Scale in this Economy?

Business Environment Overview

In this pandemic economy, the topmost priority for most companies is to make sure that operational costs and business processes are optimized and streamlined. Organizations must be more proactive than ever and identify gaps that need to be acted upon at the earliest.

The industry has been striving towards efficiency and effectiveness in its operations day in and day out. As a reliability check to ensure operational standards, many organizations consider the following levers:

  1. High Application Availability & Reliability
  2. Optimized Performance Tuning & Monitoring
  3. Operational gains & Cost Optimization
  4. Generation of Actionable Insights for Efficiency
  5. Workforce Productivity Improvement

Organizations that have prioritized the above levers in their daily operations require dedicated teams to analyze different silos and implement solutions that provide the result. Running projects of this complexity affects the scalability and monitoring of these systems. This is where AIOps platforms come in to provide customized solutions for the growing needs of all organizations, regardless of the size.

Deep Dive into AIOps

Artificial Intelligence for IT Operations (AIOps) is a platform that provides multiple layers of functionality that leverage machine learning and analytics. Gartner defines AIOps as a combination of big data and machine learning functionalities that empower IT functions, enabling scalability and robustness of the entire ecosystem.

These systems transform the existing landscape to analyze and correlate historical and real-time data to provide actionable intelligence in an automated fashion.

AIOps platforms are designed to handle large volumes of data. The tools offer various data collection methods, integrate multiple data sources, and generate visual analytical intelligence. These tools are centralized and flexible across directly and indirectly coupled IT operations for data insights.

The platform aims to bring an organization’s infrastructure monitoring, application performance monitoring, and IT systems management process under a single roof to enable big data analytics that give correlation and causality insights across all domains. These functionalities open different avenues for system engineers to proactively determine how to optimize application performance, quickly find the potential root causes, and design preventive steps to avoid issues from ever happening.

AIOps has transformed the culture of IT war rooms from reactive to proactive firefighting.

Industrial Inclination to Transformation

The pandemic economy has challenged the traditional way companies choose their transformational strategies. Machine learning powered automation for creating an autonomous IT environment is no longer a luxury. The use of mathematical and logical algorithms to derive solutions and forecasts for issues has a direct correlation with the overall customer experience. In this pandemic economy, customer attrition has a serious impact on annual recurring revenue. Hence, organizations must reposition their strategies to be more customer-centric in everything they do. Thus, providing customers with best-in-class service coupled with continuous availability and enhanced reliability has become an industry standard.

As reliability and scalability are crucial factors for any company’s growth, cloud technologies have seen a growing demand. This shift of demand for cloud premises for core businesses has made AIOps platforms more accessible and easier to integrate. With the handshake between analytics and automation, AIOps has become a transformative technology investment that any organization can make.

As organizations scale in size, so do the workforce and the complexity of the processes. The increase in size often burdens organizations with time-pressed teams, high delivery pressure, and reactive housekeeping strategies. An organization must be ready to meet present and future demands with systems and processes that scale seamlessly. This is why AIOps platforms serve as a multilayered functional solution that integrates the existing systems to manage and automate tasks with efficiency and effectiveness. When scaling results in process complexity, AIOps platforms convert the complexity into effort savings and productivity enhancements.

Across the industry, many organizations have implemented AIOps platforms as transformative solutions to help them embrace their present and future demand. Various studies have been conducted by different research groups that have quantified the effort savings and productivity improvements.

The AIOps Organizational Vision

As the digital transformation race has been in full throttle during the pandemic, AIOps platforms have also evolved. The industry did venture into traditional event correlation and operations analytics tools that helped organizations reduce incidents and the overall MTTR. AIOps is relatively new in the market, as Gartner coined the term in 2016. Today, AIOps has attracted a lot of attention from multiple industries, which are analyzing the feasibility of its implementation and the return on investment from the overall transformation. Google Trends shows a significant increase in searches for AIOps during the last couple of years.

While taking a well-informed decision to include AIOps into the organization’s vision of growth, we must analyze the following:

  1. Understanding the feasibility and concerns for its future adoption
  2. Classification of business processes and use cases for AIOps intervention
  3. Quantification of operational gains from incident management using the functional AIOps tools

AIOps is envisioned to provide tools that turn system engineers into reliability engineers, moving systems towards zero incidents.

Because above all, Zero is the New Normal.

About the Author –

Ashish Joseph

Ashish Joseph is a Lead Consultant at GAVS working for a healthcare client in the Product Management space. His areas of expertise lie in branding and outbound product management.

He runs a series called #BizPective on LinkedIn and Instagram focusing on contemporary business trends from a different perspective. Outside work, he is very passionate about basketball, music and food.

Kappa (κ) Architecture – Streaming at Scale

We are in the era of stream processing-as-a-service, and for any data-driven organization, stream-based computing has become the norm. In the last three parts (https://bit.ly/2WgnILP, https://bit.ly/3a6ij2k, https://bit.ly/3gICm88), I explored Lambda Architecture and its variants. In this article, let’s discover streaming in big data. ‘Real-time analytics’, ‘real-time data’ and ‘streaming data’ have become mandatory in any big data platform. The aspiration to extend data analysis (predictive, descriptive, or otherwise) to streaming event data is common across enterprises, and there is growing interest in real-time big data architectures. Kappa (κ) Architecture is one that deals with streaming. Let’s see why real-time analytics matters more than ever and mandates data streaming, how a streaming architecture like Kappa works, and whether Kappa is an alternative to Lambda.

“You and I are streaming data engines.” – Jeff Hawkins

Questioning Lambda

Lambda Architecture fits very well in many real-time use cases, mainly those that re-compute algorithms. At the same time, it has inherent development and operational complexities: every algorithm must be implemented twice, once in the cold path (the batch layer) and again in the hot path (the real-time layer). Apart from this dual execution path, the Lambda Architecture makes debugging harder, because operating two distributed multi-node services is more complex than operating one.

Given these drawbacks of Lambda Architecture, Jay Kreps, CEO of Confluent and co-creator of Apache Kafka, started the discussion on the need for a new architecture paradigm that uses fewer code resources and performs well in certain enterprise scenarios. This gave rise to Kappa (κ) Architecture. The real need for Kappa Architecture isn’t about efficiency at all, but rather about allowing people to develop, test, debug, and operate their systems on top of a single processing framework. In fact, Kappa is not seen as a competitor to Lambda Architecture (LA); on the contrary, it is seen as an alternative.

What is Streaming & Streaming Architecture?

Modern business requirements necessitate a paradigm shift from the traditional approach of batch processing to real-time data streams. Data-centric organizations mandate the stream-first approach. Real-time data streaming, or the stream-first approach, means processing data at the very moment it arrives. So real-time analytics, whether on-demand or continuous, is the capability to process data right when it arrives in the system, without waiting for a batch to be processed. It improves the ability to make better decisions and take meaningful action on a timely basis. Real-time analytics combines and analyzes data at the right place and at the right time, and thus generates value from disparate data.

Typically, most of the streaming architectures will have the following 3 components:

  • an aggregator that gathers event streams and batch files from a variety of data sources,
  • a broker that makes data available for consumption,
  • an analytics engine that analyzes the data, correlates values and blends streams together.
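
The three components listed above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production design: the `Broker` class, the `aggregate` and `analyze` functions, the topic name, and the sample events are all assumptions made for this example. A real deployment would use one of the platforms discussed later instead of a Python deque.

```python
from collections import defaultdict, deque


class Broker:
    """Holds events per topic until a consumer reads them."""

    def __init__(self):
        self._topics = defaultdict(deque)

    def publish(self, topic, event):
        self._topics[topic].append(event)

    def consume(self, topic):
        queue = self._topics[topic]
        while queue:
            yield queue.popleft()


def aggregate(sources, broker, topic):
    """Aggregator: gather events from several sources into one topic."""
    for source in sources:
        for event in source:
            broker.publish(topic, event)


def analyze(broker, topic):
    """Analytics engine: blend the streams and total the value per key."""
    totals = defaultdict(int)
    for event in broker.consume(topic):
        totals[event["key"]] += event["value"]
    return dict(totals)


# Made-up inputs: one event stream and one batch file.
sensor_stream = [{"key": "cpu", "value": 70}, {"key": "disk", "value": 40}]
batch_file = [{"key": "cpu", "value": 20}]

broker = Broker()
aggregate([sensor_stream, batch_file], broker, "metrics")
result = analyze(broker, "metrics")
```

Here `result` holds one total per key, computed across both sources, which is the "blending" step in its simplest form.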

Kappa (κ) Architecture for the Big Data Era

Kappa (κ) Architecture is one of the new software architecture patterns for the new data era. It’s mainly used for processing streaming data. Kappa Architecture gets its name from the Greek letter κ, and Jay Kreps is credited with introducing it.

The main idea behind the Kappa Architecture is that both real-time and batch processing can be carried out, especially for analytics, with a single technology stack. The data from IoT, streaming, and static/batch sources, or near real-time sources like change data capture, is ingested into messaging/pub-sub platforms like Apache Kafka.

An append-only immutable log store is used in the Kappa Architecture as the canonical store. The following pub/sub systems, message buses, or log databases can be used for ingestion:

  • Amazon Quantum Ledger Database (QLDB)
  • Apache Kafka
  • Apache Pulsar
  • Amazon Kinesis
  • Amazon DynamoDB Streams
  • Azure Cosmos DB Change Feed
  • Azure EventHub
  • DistributedLog
  • EventStore
  • Chronicle Queue
  • Pravega

Distributed stream processing engines like Apache Spark, Apache Flink, etc. read the data from the streaming platform, transform it into an analyzable format, and then store it in an analytics database in the serving layer. The following are some of the distributed streaming computation systems:

  • Amazon Kinesis
  • Apache Flink
  • Apache Samza
  • Apache Spark
  • Apache Storm
  • Apache Beam
  • Azure Stream Analytics
  • Hazelcast Jet
  • Kafka Streams
  • Onyx
  • Siddhi

In short, any query in the Kappa Architecture is defined by the following functional equation.

Query = λ (Complete data) = λ (live streaming data) * λ (Stored data)

The equation means that all queries can be served by applying the Kappa function to the live streams of data at the speed layer. It also signifies that stream processing occurs on the speed layer in Kappa Architecture.
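
To make the single-code-path idea concrete, here is a small sketch of an append-only log that is read by one processing function and replayed from the beginning when the logic changes. `EventLog`, `count_errors_v1`, and `count_errors_v2` are hypothetical names for this illustration, and the in-memory list only stands in for a log store such as Apache Kafka.

```python
class EventLog:
    """Append-only log: events are added at the end and never modified."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read_from(self, offset=0):
        return iter(self._events[offset:])


def count_errors_v1(events):
    """First version of the job: count events with status 'error'."""
    return sum(1 for e in events if e["status"] == "error")


def count_errors_v2(events):
    """Changed logic: timeouts now count as errors too."""
    return sum(1 for e in events if e["status"] in ("error", "timeout"))


log = EventLog()
for status in ("ok", "error", "timeout", "ok", "error"):
    log.append({"status": status})

# The same log serves the current job and a replay with new code.
live_view = count_errors_v1(log.read_from(0))
recomputed_view = count_errors_v2(log.read_from(0))
```

Because the log is immutable, re-running the new function from offset 0 is all that is needed to rebuild the view. No separate batch implementation of the logic exists.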

Pros and Cons of Kappa architecture

Pros

  • Any data system that doesn’t need a batch layer, such as online learning or a real-time monitoring & alerting system, can use Kappa Architecture.
  • If the computations and analysis done in the batch and streaming layers are identical, then using Kappa is likely the best solution.
  • Re-computation or re-iteration is required only when the code changes.
  • It can be deployed with fixed memory.
  • It can be used for horizontally scalable systems.
  • Fewer resources are required as the machine learning is done on a real-time basis.

Cons

The absence of a batch layer might result in errors during data processing or while updating the database, which requires an exception manager to reprocess or reconcile the data.

Many considerations go into finding the right architecture for a data-driven organization. As with most successful analytics projects that take a streaming-first approach, the key is to start small in scope with well-defined deliverables, then iterate. The reason for considering a distributed systems architecture (generic Lambda, unified Lambda, or Kappa) is to minimize time to value.

About the Author

Bargunan Somasundaram

Bargunan is a Big Data Engineer and a programming enthusiast. His passion is to share his knowledge by writing his experiences about them. He believes “Gaining knowledge is the first step to wisdom and sharing it is the first step to humanity.”

Customize Business Outcomes with ZIF™

Zero Incident Framework™ (ZIF) is the only AIOps platform that is powered with true machine learning algorithms with the capability to self-learn and adapt to today’s modern IT infrastructure.

ZIF’s goal has always been to deliver the right business outcomes for the stakeholders. Return on investment can be measured based on the outcomes the platform has delivered. Users get to choose what business outcomes are expected from the platform and the respective features are deployed in the enterprise to deliver the chosen outcome.

Single Pane of Action – Unified View across IT Enterprise

The biggest challenge IT Operations teams have been trying to tackle over the years is to get a bird’s eye view on what is happening across their IT landscape. The more complex the enterprise becomes the harder it becomes for the IT Operations team to understand what is happening across their enterprise. ZIF solves this issue with ease.

The capability to ingest data from any source monitoring or ITSM tool has helped IT organizations have a real-time view of what is happening across their landscape. ZIF’s unified view saves considerable time for IT engineers, who would otherwise be switching between multiple monitoring tools.

ZIF can integrate with 100+ tools to ingest (static/dynamic) data in real-time via ZIF Universal Connector. This is a low code component of ZIF and dataflows within the connector can also be templatized for reuse. 

Intelligence – Reduction in MTTR – Correlation of Alerts/Events

Approximately 80% of the time is lost by IT engineers in identifying the problem statement for an incident, which has been costing enterprises billions of dollars. With the help of Artificial Intelligence, ZIF can reduce the mean time to identify the probable root cause of an incident to seconds. The high-performance correlation engine that runs under the hood of the platform processes millions of patterns that the platform has learned from historical data, correlates the sequences that are happening in real-time, and creates cases. These cases are then assigned to IT engineers with the probable root cause for them to fix the issue. This increases the productivity of the IT engineers, resulting in better revenue for organizations.
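
As a rough illustration of what "correlating alerts into cases" can mean, the sketch below groups alerts that arrive close together in time and treats the earliest alert as the probable cause. This is a deliberately naive time-window heuristic, not ZIF's correlation engine, which the text describes as learning patterns from historical data. The function names, the 300-second window, and the sample alerts are assumptions.

```python
def correlate(alerts, window_seconds=300):
    """Group alerts into cases when each follows the previous within the window."""
    cases = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        if cases and alert["ts"] - cases[-1][-1]["ts"] <= window_seconds:
            cases[-1].append(alert)
        else:
            cases.append([alert])
    return cases


def probable_root_cause(case):
    """Naive heuristic: the earliest alert in a case is the probable cause."""
    return case[0]["source"]


# Made-up alerts; "ts" is a timestamp in seconds.
alerts = [
    {"ts": 1000, "source": "db-01", "msg": "disk latency high"},
    {"ts": 1060, "source": "api-02", "msg": "query timeout"},
    {"ts": 1090, "source": "web-03", "msg": "HTTP 500 spike"},
    {"ts": 9000, "source": "mail-01", "msg": "queue backlog"},
]
cases = correlate(alerts)
```

With this input, the three alerts that occur within a few minutes of each other form one case, and the unrelated mail alert forms another.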

Intelligence – Predictive Analytics

AIOps platforms are incomplete without the Predictive Analytics capability. ZIF has adopted unsupervised machine learning algorithms to perform predictive analytics on the utilization data that is ingested into the platform. These algorithms can learn trends and understand the symptoms of an incident by analyzing the large volumes of data that the platform has consumed over a period. Based on the analysis, the platform generates opportunity cards that help IT engineers take proactive measures on the forecasted incident. These opportunity cards are generated a minimum of 60 minutes in advance, which gives the engineers lead time to fix an issue before it strikes the landscape.
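
The notion of lead time can be illustrated with a much simpler technique than the unsupervised algorithms mentioned above: fit a straight line to recent utilization samples and estimate when it will cross a threshold. The sketch below uses ordinary least squares as a stand-in, and the function names, the 90% threshold, and the sample values are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def minutes_until_breach(minutes, utilization, threshold=90.0):
    """Minutes from the last sample until the trend crosses the threshold.

    Returns None when utilization is flat or falling.
    """
    slope, intercept = fit_line(minutes, utilization)
    if slope <= 0:
        return None
    breach_at = (threshold - intercept) / slope
    return max(0.0, breach_at - minutes[-1])


# Disk utilization (%) sampled every 10 minutes (made-up values).
minutes = [0, 10, 20, 30, 40]
utilization = [50.0, 55.0, 60.0, 65.0, 70.0]
lead_time = minutes_until_breach(minutes, utilization)
```

If `lead_time` comes out at 60 minutes or more, an alert raised now would give engineers the kind of head start the opportunity cards aim for.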

Visibility – Auto-Discovery of IT Assets & Applications

ZIF agentless discovery is a seamless discovery component that helps in identifying all the IP assets that are available in an enterprise. Beyond discovering the assets, the component also plots a physical topology and a logical map for easier consumption by IT engineers. This gives a very detailed view of every asset in the IT landscape. The logical topology gives in-depth insights into the workload metrics that can be utilized for deep analytics.

Visibility – Cloud Monitoring

In today’s digital transformation journey, cloud is inevitable. To have better control over cloud-orchestrated applications, enterprises must depend on the monitoring tools provided by cloud providers. The lack of insights often leads to the unavailability of applications for end-users. More than monitoring, insights that help enterprises take better-informed decisions are the need of the hour.

ZIF’s cloud monitoring components can monitor any cloud instance. Data generated by the providers’ own monitoring tools is ingested into ZIF for further analysis. ZIF can connect to Azure, AWS & Google Cloud to derive data-driven insights.

Optimization – Remediation – Autonomous IT Operations

ZIF does not stop at providing insights. The platform deploys the right automation bot to remediate the incident.

ZIF has 250+ automation bots that can be deployed to fast-track the resolution process by a minimum of 90%. Faster resolutions result in increased uptime of applications and better revenue for the enterprise.

Sample ZIF bots:

  • Service Restart / VM Restart
  • Disk Space Clean-up
  • IIS Monitoring App Pool
  • Dynamic Resource Allocation
  • Process Monitoring & Remediation
  • DL & Security Group Management
  • Windows Event Log Monitoring
  • Automated phishing control based on threat score
  • Service request automation like password reset, DL mapping, etc.
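
Conceptually, picking "the right automation bot" is a lookup from incident type to a remediation routine, with a fallback to a human when no bot applies. The sketch below shows that dispatch pattern only. The incident types, bot functions, and returned messages are hypothetical and do not represent ZIF's bot catalogue or API.

```python
def restart_service(incident):
    return f"restarted {incident['target']}"


def clean_disk(incident):
    return f"cleaned temp files on {incident['target']}"


# Incident type -> remediation bot (all names are hypothetical).
BOTS = {
    "service_down": restart_service,
    "disk_full": clean_disk,
}


def remediate(incident):
    """Run the matching bot, or escalate to an engineer if none exists."""
    bot = BOTS.get(incident["type"])
    if bot is None:
        return f"escalated {incident['type']} on {incident['target']}"
    return bot(incident)
```

For example, `remediate({"type": "disk_full", "target": "vm-7"})` runs the clean-up routine, while an unknown incident type is escalated instead of being silently dropped.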

For more information on ZIF, please visit www.zif.ai

About the Author –

Anoop Aravindakshan

An evangelist of Zero Incident Framework™, Anoop has been a part of the product engineering team for long and has recently forayed into product marketing. He has over 14 years of experience in Information Technology across various verticals, which include Banking, Healthcare, Aerospace, Manufacturing, CRM, Gaming, and Mobile.

Addressing Web Application Performance Issues

With the use of hybrid technologies and distributed components, the applications are becoming increasingly complex. Irrespective of the complexity, it is quite important to ensure the end-user gets an excellent experience in using the application. Hence, it is mandatory to monitor the performance of an application to provide greater satisfaction to the end-user.

External factors

When the web applications face performance issues, here are some questions you need to ask:

  • Does the application always face performance issues or just during a specific period?
  • Does a particular user or group of users face the issue, or is the problem present for all users?
  • Are you treating your production environment as a real production environment, or have you loaded it with applications, services, and background processes running without proper consideration?
  • Was there a recent release to any layer of the application stack (Web, Middle Tier, API, DB, etc.), and how was the performance before this release?
  • Have there been any hardware or software upgrades recently?

Action items on the ground

Answering the above set of questions should bring you closer to the root cause. If not, given below are some steps you can take to troubleshoot the performance issue:

  • Look at the number of incoming requests: is the application facing unusual load?
  • Identify how many requests are taking longer than usual, say more than 5000 milliseconds to serve a request or a web page.
  • Is the load being generated by a specific user or group of users? Is someone trying to create intentional load?
  • Look at the web pages/methods/functions in the source code which are taking more time. These can be identified from the web server logs, provided the application does that level of custom logging.
  • Identify whether any 3rd party links or APIs used in the application are causing slowness.
  • Check whether the database queries are taking more time.
  • Identify whether the problem is related to a certain browser.
  • Check if the server side or client side is facing any uncaught exceptions which are impacting the performance.
  • Check the performance of the CPU, Memory, and Disk of the server(s) in which the application is hosted.
  • Check the sibling processes which are consuming more Memory/CPU/Disk in all servers and take appropriate action depending on whether those background processes need to be in that server or can be moved somewhere or can be removed totally.
  • Look at the web server performance to fine tune the Cache, Session time out, Pool size, and Queue-length.
  • Check for deadlock, buffer hit ratio, IO Busy, etc. to fine tune the performance.
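
Several of the first checks in this list reduce to simple queries over request logs. The sketch below assumes the logs have already been parsed into dictionaries with `user`, `path`, and `duration_ms` fields. That record shape, the function names, and the sample data are assumptions for this example, since real web server log formats vary.

```python
from collections import Counter

SLOW_MS = 5000  # threshold taken from the checklist above


def slow_requests(requests, threshold_ms=SLOW_MS):
    """Requests that took longer than the threshold."""
    return [r for r in requests if r["duration_ms"] > threshold_ms]


def load_by_user(requests):
    """Request count per user, busiest first."""
    return Counter(r["user"] for r in requests).most_common()


def slowest_pages(requests, top=3):
    """Average duration per page, slowest first."""
    totals, counts = Counter(), Counter()
    for r in requests:
        totals[r["path"]] += r["duration_ms"]
        counts[r["path"]] += 1
    averages = {p: totals[p] / counts[p] for p in totals}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Made-up, already-parsed log records.
requests = [
    {"user": "u1", "path": "/report", "duration_ms": 7200},
    {"user": "u1", "path": "/home", "duration_ms": 300},
    {"user": "u2", "path": "/report", "duration_ms": 6400},
    {"user": "u1", "path": "/search", "duration_ms": 900},
]
```

Running the three functions on this sample shows two slow requests, one user generating most of the load, and `/report` as the slowest page on average.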

Challenges 

  • Doing all these steps exactly when there is a performance issue may not be practical all the time. By the time you collect some of these, you may lose important data for the rest of the items unless the history data is collected and stored for reference.
  • Even if the data is collected, correlating it to arrive at the exact root cause is not an easy task.
  • You need to be tech savvy across all layers to know what parameters to collect and how to collect them.

And the list of challenges goes on…

Think of an ideal situation where you have the metrics for all the action items described above right in front of you. Is there such a magic bullet available? Yes: Zero Incident Framework™ Application Performance Monitoring (ZIF APM) puts the above details at your fingertips, making troubleshooting a simple task.

ZIF APM has more to offer than a regular APM. The APM Engine has built-in AI features. It monitors the application across all layers, from the end-user, web application, web server, API layers, and databases to the underlying infrastructure, including the OS and performance factors, irrespective of whether these layers are hosted on cloud, on-premise, or both. It also applies AI to monitor, map, trace, and analyze patterns to provide observability and insights. Given below is a typical representation of a distributed application and its components. The rest of the section covers how ZIF APM provides this deep level of insight.

ZIF APM

Once the APM Engine is installed/run on portfolio servers, the built-in AI engine does the following automatically:

  1. Monitors the performance of the application (Web) layer, Service Layer, API, and Middle tier, and maps the insights from User <–> Web <–> API <–> Database for every application. There is no need to manually link Application 1 in Web Server A with API1 in Middle Tier B and so on.
  2. Traces the end-to-end user transaction journey for all transactions with a unique ID.
  3. Monitors the performance of 3rd party calls (e.g. web service, API calls, etc.) without any need to map them.
  4. Monitors the End User Experience through RUM (Real User Monitoring) without any end-user agent.

<A reference screenshot of how APM maps the user transaction journey across different nodes. The screenshot also gives the Method level performance insights>
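
The general idea behind tracing a transaction with a unique ID can be sketched as follows: generate one ID per transaction, attach it to a timed record for every layer the transaction touches, and then look for the slowest layer. This is a generic illustration, not ZIF APM's implementation. The `Trace` class, the layer names, and the simulated slow query are assumptions.

```python
import time
import uuid


class Trace:
    """Collects timed spans that share one transaction ID."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    def record(self, layer, func, *args):
        """Run func, store how long the layer took, and return its result."""
        start = time.perf_counter()
        result = func(*args)
        elapsed_ms = (time.perf_counter() - start) * 1000
        self.spans.append({"trace_id": self.trace_id, "layer": layer, "ms": elapsed_ms})
        return result

    def slowest_layer(self):
        return max(self.spans, key=lambda s: s["ms"])["layer"]


def query_database(user_id):
    time.sleep(0.02)  # stand-in for a slow query
    return {"user_id": user_id, "name": "demo"}


def handle_request(trace, user_id):
    """One user transaction passing through three layers in sequence."""
    token = trace.record("web", lambda: f"session-{user_id}")
    row = trace.record("database", query_database, user_id)
    return trace.record("api", lambda: {"token": token, **row})


trace = Trace()
profile = handle_request(trace, 42)
```

Every span carries the same `trace_id`, so the records can be stitched back into one journey, and `trace.slowest_layer()` points at the layer where the transaction spent the most time.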

Why choose ZIF APM? Key Features and Benefits

  1. All-in-One – Provides the complete insight of the underlying Web Server, API server, DB server related infrastructure metrics like CPU, Memory, Disk, and others.
  2. End-user experience (RUM) – Captures performance issues and anomalies faced by end-user at the browser side.
  3. Anomaly detection – Offers deeper insights on the exceptions faced by the application, including the line number in the source code where the issue has occurred.
  4. Code-level insights – Gives details about which method and function calls within the source code are taking more time or slowing down the application.
  5. 3rd Party and DB Layer visibility – Provides the details about 3rd party APIs or Database calls and Queries which are delaying the web application response.
  6. AHI – Application Health Index is a scorecard based on A) End User Experience, B) Application Anomalies, C) Server Performance and D) Database performance factors that are applicable in the given environment or application. Weightage and number of components A, B, C, D are variables. For instance, if ‘Web server performance’ or ‘Network Performance’ needs to be brought in as new variable ‘E’, then accordingly the weightage will be adjusted/calculated against 100%.
  7. Pattern Analysis – Analyzes unusual spikes through pattern matching and alerts are provided.
  8. GTrace – Provides the transaction journey of the user transaction and the layers it is passing through and where the transaction slows down, by capturing the performance of each transaction of all users.
  9. JVM and CLR – Provides the Performance of the underlying operating system, Web server, and run time (JVM, CLR).
  10. LOG Monitoring – Provides deeper insight on the application logs.
  11. Problem isolation – ZIF APM helps in problem isolation by comparing the performance with another user in the same location at the same time.
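
The weightage adjustment described for the Application Health Index (AHI) can be illustrated with a weighted average whose weights are rescaled to 100%. The exact AHI formula is not given here, so the sketch below is only one plausible reading. The function name, the component scores, and the weights are made-up values.

```python
def health_index(scores, weights):
    """Weighted average of component scores (0-100), weights rescaled to 100%."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Components A to D with illustrative scores and weights.
scores = {"end_user": 90, "anomalies": 80, "server": 70, "database": 60}
weights = {"end_user": 40, "anomalies": 20, "server": 20, "database": 20}
ahi = health_index(scores, weights)

# Adding a fifth component 'network' (E): weights are rescaled automatically.
scores_e = {**scores, "network": 50}
weights_e = {**weights, "network": 25}
ahi_e = health_index(scores_e, weights_e)
```

Dividing by the total weight means a new component can be introduced without manually re-balancing the existing weights so that they add up to 100.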

Visit www.zif.ai for more details.

About the Author –

Suresh Kumar Ramasamy

Suresh heads the Monitor component of ZIF at GAVS. He has 20 years of experience in Native Applications, Web, Cloud, and Hybrid platforms from Engineering to Product Management. He has designed & hosted the monitoring solutions. He has been instrumental in conglomerating components to structure the Environment Performance Management suite of ZIF Monitor. Suresh enjoys playing badminton with his children. He is passionate about gardening, especially medicinal plants.

Cloud Adoption, Challenges, and Solution Through Monitoring, AI & Automation

Cloud Adoption

Cloud computing is the delivery of computing services including Servers, Database, Storage, Networking & others over the internet. Public, Private & Hybrid clouds are different ways of deploying cloud computing.  

  • In a public cloud, the cloud resources are owned by a 3rd party cloud service provider.
  • A private cloud consists of computing resources used exclusively by one business or organization.
  • A hybrid cloud provides the best of both worlds, combining on-premises infrastructure and private cloud with public cloud.

Microsoft, Google, Amazon, Oracle, IBM, and others provide cloud platforms on which users can host and run practical business solutions. The worldwide public cloud services market is forecast to grow 17% in 2020 to total $266.4 billion, and to reach $354.6 billion in 2022, up from $227.8 billion in 2019, per Gartner, Inc.
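
As a quick sanity check on these figures, the growth rates implied by the quoted market sizes can be computed directly. The helper names below are made up for this example, and the compound annual rate for 2020 to 2022 is derived from the quoted totals rather than stated in the forecast.

```python
def growth_rate(start, end):
    """Percentage change from start to end."""
    return (end - start) / start * 100


def cagr(start, end, years):
    """Compound annual growth rate in percent."""
    return ((end / start) ** (1 / years) - 1) * 100


# Market size in billions of dollars, as quoted above.
growth_2020 = growth_rate(227.8, 266.4)          # 2019 -> 2020
annual_2020_2022 = cagr(266.4, 354.6, years=2)   # implied yearly rate
```

The first value comes out at about 17%, matching the quoted growth, and the implied yearly rate for the following two years is roughly 15%.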

There are various types of Instances, workloads & options available as part of cloud ecosystem, i.e. IaaS, PaaS, SaaS, Multi-cloud, Serverless.

Challenges

When a very large, large, or medium enterprise decides to move its IT environment from on-premise to cloud, it typically moves some or most of its on-premises workloads into the cloud and keeps the rest under its control on-premise. Various factors impact the decision, to name a few:

  1. ROI vs Cost of Cloud Instance, Operation cost
  2. Architecture dependency of the application, i.e. whether it is monolithic or multi-tier or polyglot or hybrid cloud
  3. Requirement and need for elasticity and scalability
  4. Availability of right solution from the cloud provider
  5. Security of some key data

After clearing all these hurdles, once the IT environment is cloud-enabled, the challenge lies in monitoring the cloud-enabled IT environment. Here are some of the business and IT challenges:

1. How to ensure the various workloads & Instances are working as expected?

While the cloud provider may give high availability & up time depending on the tier we choose, it is important that our IT team monitors the environment, as in the case of IaaS and to some extent in PaaS as well.

2. How to ensure the Instances are optimally used in terms of compute and storage?

Cloud providers give most of the metrics around the Instances, though they may not provide all the metrics we need to make decisions in all scenarios.

The disadvantages of this model are cost, latency, and complexity. For example, the log analytics that comes with Azure involves a cost for every MB/GB of data that is stored, and there is latency in getting the right metrics at the right time. If the metrics are delayed, you may not get the right result.

3. How to ensure the Application or the components of a single solution that are spread across on-premise and Cloud environments are working as expected?

Some cloud providers give tools for integrating the metrics from on-premise to cloud environment to have a shared view.

The disadvantage of this model is that it is not possible to bring all sorts of data together to get the insights directly. That is, observability is always a question. The ownership of achieving observability lies with the IT team that handles the data.

4. How to ensure the Multi-Cloud + On-Premise environment is effectively monitored & utilized to ensure the best End-user experience?

Multi-Cloud environment – With the rapidly growing Microservices Architecture & Container based cloud enabled model, it is quite natural that the Enterprise may choose the best from different cloud providers like Azure, AWS, Google & others.

There is little support from cloud providers in this space. In fact, some cloud providers do not even support this scenario.

5. How to get a single panel of view for troubleshooting & root cause analysis?

Especially when a problem occurs in Application, Database, Middle Tier, Network & 3rd party layers that are spread across a multi-cluster, multi-cloud, elastic environment, it is very important to get a unified view of the entire environment.

ZIF (Zero Incident Framework™) provides a single platform for Cloud Monitoring.

ZIF has Discovery, Monitoring, Prediction & Remediation capabilities that fit seamlessly into a cloud-enabled solution. ZIF provides a unified dashboard with insights across all layers of IT infrastructure distributed across On-premise hosts, Cloud Instances & Containers.

Core features & benefits of ZIF for Cloud Monitoring are:

1. Discovery & Topology

  • Discovers and provides dynamic mapping of resources across all layers.
  • Provides real-time mapping of applications and their dependent layers, irrespective of whether the components live on-premise, on cloud, or containerized in cloud.
  • Builds the topology of all layers dynamically, which helps in making effective decisions.
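
One reason a dependency topology helps with decisions is that it makes impact analysis a graph traversal: given a failed component, walk the "depends on" edges backwards to find everything affected. The sketch below shows that traversal on a hand-written example. The component names and the `impacted_by` function are hypothetical, and a discovery tool would build the map automatically instead of hard-coding it.

```python
from collections import deque

# component -> components it depends on (hypothetical example topology)
DEPENDS_ON = {
    "web-app": ["api"],
    "api": ["database", "cache"],
    "reporting": ["database"],
    "database": [],
    "cache": [],
}


def impacted_by(failed, depends_on):
    """Every component that directly or indirectly depends on `failed`."""
    # Invert the map: component -> components that depend on it.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(name)

    # Breadth-first walk up the dependency chain.
    impacted, queue = set(), deque([failed])
    while queue:
        for parent in dependents[queue.popleft()]:
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted
```

For instance, `impacted_by("database", DEPENDS_ON)` returns the API, the web application that calls it, and the reporting service, while a failure of `web-app` impacts nothing else in this map.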

2. Observability across Multi-Cloud, Hybrid-Cloud & On-Premise tiers

  • It is not just about collecting metrics; it is very important to analyze the monitored data and provide meaningful insights.
  • When the IT infrastructure is spread across multiple cloud platforms like Azure, AWS, Google Cloud, and others, it is important to get a unified view of your entire environment along with the on-premise servers.
  • The health of each layer is represented in topology format, which helps in understanding the impact and taking the necessary actions.

3. Prediction driven decision for resource optimization

  • The prediction engine analyses the metrics of cloud resources and predicts resource usage. This helps the resource owner to take proactive action rather than being reactive.
  • Provides meaningful insights and alerts in terms of the surge in the load, the growth in number of VMs, containers, and the usage of resource across other workloads.
  • Authorize the Elasticity & Scalability through real-time metrics.

4. Container & Microservice support

  • Understand the resource utilization of your containers that are hosted in Cloud & On-Premise.
  • Know the bottlenecks around the Microservices and tune your environment for the spikes in load.
  • Provides full support for monitoring applications distributed across your local host & containers in cloud in a multi-cluster setup.

5. Root cause analysis made simple

  • Quick root cause analysis by analysing the various causes captured by ZIF Monitor instead of going through the environment layer by layer. This frees up time to focus on resolving and containing the problem instead of spending effort on identifying the root cause.
  • Provides insights across your workload including the impact due to 3rd party layers as well.

6. Automation

  • Irrespective of whether the workload and instance are on-premise, on Azure, on AWS, or with another provider, the ZIF automation module can automate basic to complex activities.

7. Ensure End User Experience

  • Helps to improve the experience of the end users who are served by the workload from the cloud.
  • ZIF tracing follows every request of every user, which allows ZIF to unearth performance bottlenecks across all layers. This in turn helps to address the problem and thereby improve the User Experience.

Cloud and Container Platform Support

ZIF seamlessly integrates with the following Cloud & Container environments:

  • Microsoft Azure
  • AWS
  • Google Cloud
  • Grafana Cloud
  • Docker
  • Kubernetes

About the Author

Suresh Kumar Ramasamy


Suresh heads the Monitor component of ZIF at GAVS. He has 20 years of experience in Native Applications, Web, Cloud, and Hybrid platforms from Engineering to Product Management. He has designed & hosted the monitoring solutions. He has been instrumental in conglomerating components to structure the Environment Performance Management suite of ZIF Monitor.

Suresh enjoys playing badminton with his children. He is passionate about gardening, especially medicinal plants.

Machine Learning: Building Clustering Algorithms

Gireesh Sreedhar KP


Clustering is a widely used Machine Learning (ML) technique. It is an Unsupervised ML technique that learns patterns from input data without labeled training data, and it is capable of processing high-dimensional data. This makes clustering the method of choice for a wide variety of ML problems.

Since clustering is so widely used, it is critical for Data Scientists and ML Engineers to understand how to build clustering algorithms in practice, even though many of us already have a high-level understanding of clustering. Let us walk through the approach to building a clustering algorithm from scratch.

What is Clustering and how does it work?

Clustering is finding groups of objects (data) such that objects in the same group will be similar (related) to one another and different from (unrelated to) objects in other groups.

Clustering works on the concept of Similarity/Dissimilarity between data points. The higher the similarity between data points, the more likely they are to belong to the same cluster; the higher the dissimilarity, the more likely they are to be kept in different clusters.

Similarity is the numerical measure of how alike two data objects are. Similarity is higher when objects are more alike. Dissimilarity is the numerical measure of how different two data objects are. Dissimilarity is lower when objects are more alike.

We create a ‘Dissimilarity Matrix’ (also called a Distance Matrix) as an input to a clustering algorithm, where the dissimilarity matrix gives the algorithm the notion of dissimilarity between objects. We build a dissimilarity matrix for each attribute of the data considered for clustering and then combine these attribute-level matrices to form an overall dissimilarity matrix. The dissimilarity matrix is an NxN square matrix, where N is the number of data points considered for clustering, and each element of the matrix gives the dissimilarity between two objects.
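To make the role of the dissimilarity matrix concrete, here is a minimal sketch of how a clustering routine might consume one. It is only an illustration, not the algorithm used by the author: it groups points whose pairwise dissimilarity is at or below a threshold (a connectivity-style grouping, equivalent to cutting a single-linkage hierarchy). The function name `threshold_clusters`, the threshold value, and the sample matrix are all assumptions made for this example.

```python
from typing import Dict, List


def threshold_clusters(dist: List[List[float]], threshold: float) -> List[int]:
    """Assign a cluster label to each point of an NxN dissimilarity matrix.

    Two points end up in the same cluster when they are linked, directly or
    through other points, by dissimilarities <= threshold.
    """
    n = len(dist)
    parent = list(range(n))  # union-find forest: each point starts alone

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # The matrix is symmetric, so only the upper triangle is inspected.
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    # Convert roots into consecutive labels 0, 1, 2, ...
    root_to_label: Dict[int, int] = {}
    labels = []
    for i in range(n):
        root = find(i)
        labels.append(root_to_label.setdefault(root, len(root_to_label)))
    return labels


# Hypothetical 4x4 dissimilarity matrix: points 0 and 1 are alike, 2 and 3 are alike.
dissimilarity = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
labels = threshold_clusters(dissimilarity, threshold=0.3)
print(labels)
```

Note that the routine never sees the raw attributes, only the NxN matrix, which is why the way the matrix is built matters so much.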

Building Clustering Algorithm

Building a clustering algorithm involves the following:

  • Selection of the most suitable clustering techniques and algorithms to solve the problem. This step needs close collaboration among SMEs, business users, data scientists, and ML engineers. Based on inputs and a study of the data, a list of one or more candidate algorithms is selected for modeling and development, and the tuning parameters are decided (to give the algorithm more flexibility for tuning and learning from SMEs).
  • Selection of data attributes for the formulation of the dissimilarity matrix, and the methodology for forming the dissimilarity matrix (discussed later).
  • Building the algorithms and running a Design of Experiments to select the best-suited algorithm and algorithm parameters for implementation.
  • Implementation of the algorithm and fine-tuning of parameters as required.

Building a Dissimilarity matrix:

There are different approaches to building a dissimilarity matrix. Here we consider building a dissimilarity matrix containing the distance between data objects (called a Distance Matrix); an alternative approach is to feed in coordinate points and let the algorithm compute the distances. Let us consider a group of N data objects to be clustered based on three data attributes of each data object. The steps for building a Distance Matrix are:

Build a Distance Matrix for individual data attributes. Here we build three individual distance matrices (one for each attribute), each containing the distance between data objects calculated for that attribute. The data is always scaled to the range [0, 1] using one of the standard normalization methods such as the Min-Max Scaler. Each attribute-level distance matrix has the properties listed below.

Properties of Distance Matrix:

  1. The Distance Matrix is an NxN square matrix (N is the number of objects in the clustering space).
  2. The matrix is symmetric with a zero diagonal (the distance of an object from itself is zero).
  3. For categorical data, the distance between two points is 0 if both values are the same, and 1 otherwise.
  4. For numeric/ordered data, the distance between two points is the absolute difference between the scaled attribute values of the two points.

Build the Complete Distance Matrix. Here we build a complete distance matrix by combining the distance matrices of the individual attributes; this forms the input for the clustering algorithm.

Complete distance matrix = (element-wise sum of the individual attribute-level matrices)/3

Generalized complete distance matrix = (element-wise sum of the individual attribute-level matrices)/M, where M is the number of attribute-level matrices formed.
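The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`min_max_scale`, `numeric_distance_matrix`, `categorical_distance_matrix`, `combine`) and the three sample attributes (age, income, segment) are hypothetical and chosen only to show two numeric attributes and one categorical attribute being combined.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Scale numeric values to the range [0, 1] (Min-Max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: every pairwise distance will be zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def numeric_distance_matrix(values: Sequence[float]) -> Matrix:
    """Distance = absolute difference between scaled attribute values."""
    scaled = min_max_scale(values)
    return [[abs(a - b) for b in scaled] for a in scaled]


def categorical_distance_matrix(values: Sequence[str]) -> Matrix:
    """Distance = 0 if both categories are the same, 1 otherwise."""
    return [[0.0 if a == b else 1.0 for b in values] for a in values]


def combine(matrices: Sequence[Matrix]) -> Matrix:
    """Element-wise sum of the M attribute-level matrices, divided by M."""
    m = len(matrices)
    n = len(matrices[0])
    return [
        [sum(mat[i][j] for mat in matrices) / m for j in range(n)]
        for i in range(n)
    ]


# Three hypothetical data objects described by three attributes.
ages = [20, 30, 40]
incomes = [10.0, 20.0, 50.0]
segments = ["retail", "retail", "corporate"]

complete = combine([
    numeric_distance_matrix(ages),
    numeric_distance_matrix(incomes),
    categorical_distance_matrix(segments),
])

for row in complete:
    print(row)
```

Because every attribute-level distance lies in [0, 1], dividing the element-wise sum by M keeps the combined distances in [0, 1] as well, so no single attribute dominates purely because of its units.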

Considerations for the selection of clustering algorithms:

Before the selection of a clustering algorithm, the following considerations need to be evaluated to identify the right clustering algorithms for the given problem.

  • Partition criteria: Single-level vs hierarchical partitioning
  • Separation of clusters: Exclusive (one data point belongs to only one class) vs non-exclusive (one data point can belong to more than one class)
  • Similarity measures: Distance-based vs Connectivity-based
  • Clustering space: Full space (used when low dimension data is processed) vs Subspace (used when high dimension data is processed, where only subspace can be processed and interesting clustering can be formed)
  • Attributes processing: Ability to deal with different types of attributes: Numerical, Categorical, Text, Media, a combination of data types in inputs
  • Discovery of clusters: Ability to form a predefined number of clusters or an arbitrary number of clusters
  • Ability to deal with noise in data
  • Scalability to deal with huge volumes of data, high dimensionality, incremental, or streaming data.
  • Ability to deal with constraints on user preference and domain requirements.

Application of Clustering

There are broadly two applications of clustering.

As an ML tool to get insight into data, for example building Recommendation Systems or Customer Segmentation by clustering like-minded users or similar products, Social Network Analysis, and Biological data analysis such as Gene/Protein sequence analysis.

As a pre-processing or intermediate step for other classes of algorithms. For example, some pattern-mining algorithms use clustering to group the mined patterns and select the most representative ones instead of using all of the patterns mined.

Conclusion

Building an ML algorithm is teamwork, with a team consisting of SMEs, users, data scientists, and ML engineers each playing their part for success. This article gives the steps to build a clustering algorithm and can be used as reference material when you attempt to build your own.

About the Author:

Gireesh is a part of the projects run in collaboration with IIT Madras for developing AI solutions and algorithms. His interest includes Data Science, Machine Learning, Financial markets, and Geo-politics. He believes that he is competing against himself to become better than who he was yesterday. He aspires to become a well-recognized subject matter expert in the field of Artificial Intelligence.

Assess Your Organization’s Maturity in Adopting AIOps

Artificial Intelligence for IT operations (AIOps) is adopted by organizations to deliver tangible Business Outcomes. These business outcomes have a direct impact on companies’ revenue and customer satisfaction.

A survey from AIOps Exchange 2019 reports that 84% of the Business Owners who took the survey confirmed that they are actively evaluating AIOps for adoption in their organizations.

So, is AIOps just automation? Absolutely NOT!!

Artificial Intelligence for IT operations implies the implementation of true Autonomous Artificial Intelligence in ITOps, which needs to be adopted as an organization-wide strategy. Organizations will have to assess their existing landscape, processes, and decide where to start. That is the only way to achieve the true implementation of AIOps.

Every organization trying to evaluate AIOps as a strategy should read through this article to understand their current maturity, and then move forward to reach the pinnacle of Artificial Intelligence in IT Operations.

The primary Success Factor in adopting AIOps is derived from the Business Outcomes the organization is trying to achieve by implementing AIOps; that is the only way to calculate ROI.

There are 4 levels of Maturity in AIOps adoption. Based on our experience in developing an AIOps platform and implementing the platform across multiple industries, we have arrived at these 4 levels. Assessing an organization against each of these levels helps in achieving the goal of TRUE Artificial Intelligence in IT Operations.

Level 1: Knee-jerk

Events and logs are generated in silos and collected from various applications and devices in the infrastructure. These are used to generate alerts that are sent to command centres, which escalate them as per the SOPs (standard operating procedures) defined. The engineering teams work in silos, unaware of the business impact that these alerts could potentially create. Here, operations are very reactive, which could cost the organization millions of dollars.

Level 2: Unified

All events, logs, and alerts have been integrated into one central location. The ITSM process has been unified. This helps in breaking silos, and engineering teams are better prepared to tackle business impacts. SOPs have been adjusted since the process is unified, but this is still reactive incident management.

Level 3: Intelligent

Machine Learning algorithms (either supervised or unsupervised) have been implemented on the unified data to derive insights. There are baseline metrics that are calibrated and used as a reference for future events. With more data, the metrics get richer. The IT operations team can correlate incidents/events with business impacts by leveraging AI & ML. If the Mean Time To Resolve (MTTR) for incidents has been reduced by automated identification of the root cause, then the organization has attained level 3 maturity in AIOps.

Level 4: Predictive & Autonomous

The pinnacle of AIOps is level 4. If incidents and performance degradation of applications can be predicted by leveraging Artificial Intelligence, it implies improved application availability. Autonomous remediation bots can be triggered spontaneously based on the predictive insights, to fix incidents that are prone to happen in the enterprise. Level 4 is a paradigm shift in IT operations – moving operations entirely from being reactive, to becoming proactive.

Conclusion:

As IT operations teams move up each level, the essential goal to keep in mind is the long-term strategy to be attained by adopting AIOps. Artificial Intelligence has matured over the past few decades, and it is up to AIOps platforms to embrace it effectively. While choosing an AIOps platform, measure the maturity of the platform’s artificial intelligence coefficient.

About the Author:

Anoop Aravindakshan (Principal Consultant Manager) at GAVS Technologies.


An evangelist of Zero Incident Framework™, Anoop has been a part of the product engineering team for a long time and has recently forayed into product marketing. He has over 14 years of experience in Information Technology across various verticals, which include Banking, Healthcare, Aerospace, Manufacturing, CRM, Gaming, and Mobile.

Prediction for Business Service Assurance

Artificial Intelligence for IT operations or AIOps has exploded over the past few years. As more and more enterprises set about their digital transformation journeys, AIOps becomes imperative to keep their businesses running smoothly. 

AIOps uses several technologies like Machine Learning and Big Data to automate the identification and resolution of common Information Technology (IT) problems. The systems, services, and applications in a large enterprise produce volumes of log and performance data. AIOps uses this data to monitor the assets and gain visibility into the behaviour and dependencies among these assets.

According to a Gartner publication, the adoption of AIOps by large enterprises would rise to 30% by 2023.

ZIF – The ideal AIOps platform of choice

Zero Incident Framework™ (ZIF) is an AIOps-based TechOps platform that enables proactive detection and remediation of incidents, helping organizations drive towards a Zero Incident Enterprise™.

ZIF comprises 5 modules.

At the heart of ZIF lie its Analyze and Predict (A&P) modules, which are powered by Artificial Intelligence and Machine Learning techniques. From the business perspective, the primary goal of A&P is 100% availability of applications and business processes.

Let us understand more about the Predict module of ZIF.

Predictive Analytics is one of the main USPs of the ZIF platform. ZIF encompasses Supervised, Unsupervised and Reinforcement Learning algorithms for the realization of various business use cases.

How does the Predict Module of ZIF work?

Through its data ingestion capabilities, the ZIF platform can receive and process all types of data (both structured and unstructured) from various tools in the enterprise. The data can relate to alerts, events, logs, performance of devices, relations of devices, workload topologies, network topologies, etc. By analyzing all this data, the platform predicts the anomalies that can occur in the environment. These anomalies are presented as ‘Opportunity Cards’ so that suitable action can be taken ahead of time to prevent undesired incidents from occurring. Since this is ‘Proactive’ and not ‘Reactive’, it brings about a paradigm shift in any organization’s endeavour to achieve 100% availability of its enterprise systems and platforms. Predictions are done at multiple levels: application level, business process level, device level, etc.

Sub-functions of Prediction Module

How does the Predict module manifest to enterprise users of the platform?

The Predict module categorizes the opportunity cards into three swim lanes.

  1. Warning swim lane – Opportunity Cards that have an “Expected Time of Impact” (ETI) beyond 60 minutes.
  2. Critical swim lane – Opportunity Cards that have an ETI within 60 minutes.
  3. Processed / Lost – Opportunity Cards that have been processed or lost without taking any action.
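The swim-lane rules above amount to a small classification on the Expected Time of Impact. The sketch below is only an illustration of that rule; the `OpportunityCard` class, its fields, and the `swim_lane` function are hypothetical and do not represent ZIF's actual data model or API. It assumes that an ETI of exactly 60 minutes counts as "within 60 minutes".

```python
from dataclasses import dataclass

# Boundary between the Critical and Warning lanes, taken from the text.
CRITICAL_WINDOW_MINUTES = 60


@dataclass
class OpportunityCard:
    """Hypothetical stand-in for a predicted anomaly."""
    title: str
    eti_minutes: float   # Expected Time of Impact, in minutes from now
    closed: bool = False  # True once the card was processed or lost


def swim_lane(card: OpportunityCard) -> str:
    """Return the swim lane a card belongs to."""
    if card.closed:
        return "Processed / Lost"
    if card.eti_minutes <= CRITICAL_WINDOW_MINUTES:
        return "Critical"
    return "Warning"


cards = [
    OpportunityCard("Disk filling up on db-01", eti_minutes=240),
    OpportunityCard("Memory leak in payment service", eti_minutes=45),
    OpportunityCard("Certificate expiry", eti_minutes=10, closed=True),
]

for card in cards:
    print(f"{card.title}: {swim_lane(card)}")
```

A card would naturally move from the Warning lane to the Critical lane as its ETI shrinks, so in practice the classification would be re-evaluated as time passes.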

A few of the enterprises that have realized the power of ZIF’s Prediction Module

  • A manufacturing giant in the US
  • A large non-profit mental health and social service provider in New York
  • A large mortgage loan service provider in the US
  • Two of the largest private sector banks in India

For more detailed information on GAVS’ Analyze, or to request a demo, please visit https://zif.ai/products/predict/

References: https://www.gartner.com/smarterwithgartner/how-to-get-started-with-aiops/

About the Author:

Vasudevan Gopalan

Vasu heads the Engineering function for A&P. He is a Digital Transformation leader with ~20 years of IT industry experience spanning Product Engineering, Portfolio Delivery, Large Program Management, etc. Vasu has designed and delivered Open Systems, Core Banking, Web / Mobile Applications, etc.

Outside of his professional role, Vasu enjoys playing badminton and focusses on fitness routines.

Discover, Monitor, Analyze & Predict COVID-19

“Uber, the world’s largest taxi company, owns no vehicles. Facebook, the world’s most popular media owner, creates no content. Alibaba, the most valuable retailer, has no inventory. Netflix, the world’s largest movie house, owns no cinemas. And Airbnb, the world’s largest accommodation provider, owns no real estate. Something interesting is happening.”

– Tom Goodwin, an executive at the French media group Havas.

This new breed of companies is the fastest growing in history because they own the customer interface layer. It is the platform where all the value and profit is. “Platform business” is a more wholesome term for this model, for which data is the fuel; Big Data & AI/ML technologies are the harbingers of new waves of productivity growth and innovation.

With Big Data and AI/ML making a big difference in the area of public health, let’s see how they are helping us tackle the global emergency of the coronavirus disease, formally known as COVID-19.

“With rapidly spreading disease, a two-week lag is an eternity.”

DISCOVERING/ DETECTING

Chinese technology giant Alibaba has developed an AI system for detecting COVID-19 in CT scans of patients’ chests with 96% accuracy against viral pneumonia cases. It only takes 20 seconds for the AI to decide, whereas humans generally take about 15 minutes to diagnose the illness as there can be upwards of 300 images to evaluate. The system was trained on images and data from 5,000 confirmed coronavirus cases and has been tested in hospitals throughout China. Per a report, at least 100 healthcare facilities are currently employing Alibaba’s AI to detect COVID-19.

Ping An Insurance (Group) Company of China, Ltd (Ping An) aims to address the issue of lack of radiologists by introducing the COVID-19 smart image-reading system. This image-reading system can read the huge volumes of CT scans in epidemic areas.

Ping An Smart Healthcare uses clinical data to train the AI model of the COVID-19 smart image-reading system. The AI analysis engine conducts a comparative analysis of multiple CT scan images of the same patient and measures the changes in lesions. It helps in tracking the development of the disease, evaluating the treatment, and in the prognosis of patients. Ultimately, it assists doctors in diagnosing, triaging and evaluating COVID-19 patients swiftly and effectively.

Ping An Smart Healthcare’s COVID-19 smart image-reading system also supports remote AI image-reading by medical professionals outside the epidemic areas. Since its launch, the smart image-reading system has provided services to more than 1,500 medical institutions. More than 5,000 patients have received smart image-reading services for free.

The more solutions the better. At least when it comes to helping overwhelmed doctors provide better diagnoses and, thus, better outcomes.

MONITORING

  • AI based Temperature monitoring & scanning

In Beijing, China, subway passengers are being screened for symptoms of coronavirus, but not by health authorities. Instead, artificial intelligence is in charge.

Two Chinese AI giants, Megvii and Baidu, have introduced temperature-scanning. They have implemented scanners to detect body temperature and send alerts to company workers if a person’s body temperature is high enough to constitute a fever.

Megvii’s AI system detects body temperatures for up to 15 people per second and at distances of up to 16 feet. It monitors as many as 16 checkpoints in a single station. The system integrates body detection, face detection, and dual sensing via infrared cameras and visible light. The system can accurately detect and flag high body temperature even when people are wearing masks, hats, or covering their faces with other items. Megvii’s system also sends alerts to an on-site staff member.

Baidu, one of the largest search-engine companies in China, screens subway passengers at the Qinghe station with infrared scanners. It also uses a facial-recognition system, taking photographs of passengers’ faces. If the Baidu system detects a body temperature of at least 99 degrees Fahrenheit, it sends an alert to a staff member for another screening. The technology can scan the temperatures of more than 200 people per minute.

  • AI based Social Media Monitoring

An international team is using machine learning to scour social media posts, news reports, data from official public health channels, and information supplied by doctors for warning signs of the virus across geographies. The program looks for social media posts that mention specific symptoms, like respiratory problems and fever, from a geographic area where doctors have reported potential cases. Natural language processing is used to parse the text posted on social media, for example, to distinguish between someone discussing the news and someone complaining about how they feel.

The approach has proven capable of spotting a coronavirus needle in a haystack of big data. This technique could help experts learn how the virus behaves. It may be possible to determine the age, gender, and location of those most at risk quicker than using official medical sources.

PREDICTING

Data from hospitals, airports, and other public locations are being used to predict disease spread and risk. Hospitals can also use the data to plan for the impact of an outbreak on their operations.

Kalman Filter

The Kalman filter was pioneered by Rudolf Emil Kalman in 1960, and one of its first major applications was the navigation problem in the Apollo Project. Since then, it has been applied to numerous cases such as guidance, navigation, and control of vehicles, object tracking in computer vision, trajectory optimization, time series analysis in signal processing, econometrics and more.

The Kalman filter is a recursive algorithm that takes a series of measurements observed over time, containing statistical noise, and produces estimates of unknown variables.
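To show what "recursive" means here, the following is a minimal one-dimensional Kalman filter sketch. It is not the model from the linked repository: it assumes the simplest possible setting, a scalar state that is expected to stay roughly constant between steps, and the function name `kalman_1d` and the noise variances are illustrative assumptions.

```python
from typing import List, Sequence


def kalman_1d(
    measurements: Sequence[float],
    process_var: float,
    measurement_var: float,
    initial_estimate: float,
    initial_var: float,
) -> List[float]:
    """Filter noisy scalar measurements and return the estimate after each one.

    process_var     -- how much the true value is assumed to drift per step
    measurement_var -- how noisy each measurement is assumed to be
    """
    estimate, var = initial_estimate, initial_var
    estimates = []
    for z in measurements:
        # Predict: the state is assumed unchanged, but uncertainty grows.
        var = var + process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = var / (var + measurement_var)
        estimate = estimate + gain * (z - estimate)
        var = (1.0 - gain) * var
        estimates.append(estimate)
    return estimates


# Hypothetical noisy readings scattered around a true value of 10.
readings = [9.2, 10.8, 10.1, 9.7, 10.4, 9.9, 10.2]
smoothed = kalman_1d(
    readings,
    process_var=0.01,
    measurement_var=1.0,
    initial_estimate=0.0,
    initial_var=1.0,
)
for z, x in zip(readings, smoothed):
    print(f"measured {z:5.2f} -> estimate {x:5.2f}")
```

Each step needs only the previous estimate and its variance, not the full history, which is what makes the filter cheap enough to re-run as every new data point arrives.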

The Kalman filter can be used for the one-day prediction, while the long-term forecast uses a linear model whose main features are the Kalman predictors, the infection rate relative to population, time-dependent features, and weather history and forecasts.

The one-day Kalman prediction is very accurate and powerful, while a longer-period prediction is more challenging but provides a future trend. Long-term prediction does not guarantee full accuracy but provides a fair estimation following the recent trend. The model should be re-run daily to obtain better results.

GitHub Link: https://github.com/Rank23/COVID19

ANALYZING

The Center for Systems Science and Engineering at Johns Hopkins University has developed an interactive, web-based dashboard that tracks the status of COVID-19 around the world. The resource provides a visualization of the location and number of confirmed COVID-19 cases, deaths and recoveries for all affected countries.

The primary data source for the tool is DXY, a Chinese platform that aggregates local media and government reports to provide COVID-19 cumulative case totals in near real-time at the province level in China and country level otherwise. Additional data comes from Twitter feeds, online news services and direct communication sent through the dashboard. Johns Hopkins then confirms the case numbers with regional and local health departments. This kind of Data analytics platform plays a pivotal role in addressing the coronavirus outbreak.

All data from the dashboard is also freely available in the following GitHub repository.

GitHub Link: https://bit.ly/2Wmmbp8

Mobile version: https://bit.ly/2WjyK4d

Web version: https://bit.ly/2xLyT6v

Conclusion

One of AI’s core strengths when working on identifying and limiting the effects of virus outbreaks is its incredibly persistent nature. AI systems never tire, can sift through enormous amounts of data, and can identify possible correlations and causations that humans can’t.

However, there are limits to AI’s ability to both identify virus outbreaks and predict how they will spread. Perhaps the best-known example comes from the neighboring field of big data analytics. At its launch, Google Flu Trends was heralded as a great leap forward in identifying and estimating the spread of the flu, until it overestimated the 2013 flu season by a whopping 140 percent and was quietly put to rest. Poor data quality was identified as one of the main reasons Google Flu Trends failed. Unreliable or faulty data can wreak havoc on the prediction power of AI.

About the Author:

Bargunan Somasundaram

Bargunan is a Big Data Engineer and a programming enthusiast. His passion is to share his knowledge by writing his experiences about them. He believes “Gaining knowledge is the first step to wisdom and sharing it is the first step to humanity.”

AI in Healthcare

The Healthcare Industry is going through a quiet revolution. Factors like disease trends, doctor demographics, regulatory policies, environment, technology etc. are forcing the industry to turn to emerging technologies like AI, to help adapt to the pace of change. Here, we take a look at some key use cases of AI in Healthcare.

Medical Imaging

The application of Machine Learning (ML) in Medical Imaging is showing highly encouraging results. ML is a subset of AI, where algorithms and models are used to help machines imitate the cognitive functions of the human brain and to also self-learn from their experiences.

AI can be gainfully used in the different stages of medical imaging: in acquisition, image reconstruction, processing, interpretation, storage, data mining & beyond. The performance of ML computational models improves tremendously as they get exposed to more & more data, and this foundation on colossal amounts of data enables them to gradually better humans at interpretation. They begin to detect anomalies not perceptible to the human eye & not discernible to the human brain!

What goes hand-in-hand with data is noise. Noise creates artifacts in images and reduces their quality, leading to inaccurate diagnosis. AI systems work through the clutter and aid noise reduction, leading to better precision in diagnosis, prognosis, staging, segmentation and treatment.

At the forefront of this use case is radiogenomics: correlating cancer imaging features and gene expression. Needless to say, this will play a pivotal role in cancer research.

Drug Discovery

Drug Discovery is an arduous process that takes several years from the start of research to obtaining approval to market. Research involves laboring through copious amounts of medical literature to identify the dynamics between genes, molecular targets, pathways, and candidate compounds. Sifting through all of this complex data to arrive at conclusions is an enormous challenge. When this voluminous data is fed to ML computational models, relationships are reliably established. AI powered by domain knowledge is slashing the time & cost involved in new drug development.

Cybersecurity in Healthcare

Data security is of paramount importance to Healthcare providers, who need to ensure confidentiality, integrity, and availability of patient data. With cyberattacks increasing in number and complexity, these formidable threats are giving security teams sleepless nights! The main strength of AI is its ability to curate massive quantities of data (here, threat intelligence), nullify the noise, provide instant insights & self-learn in the process. The Predictive & Prescriptive capabilities of these computational models drastically reduce response time.

Virtual Health assistants

Virtual Health assistants like Chatbots give patients 24/7 access to critical information, in addition to offering services like scheduling health check-ups or setting up appointments. AI-based platforms for wearable health devices and health apps come armed with loads of features to monitor health signs, daily activities, diet, sleep patterns, etc. and provide alerts for immediate action or suggest personalized plans to enable healthy lifestyles.

AI for Healthcare IT Infrastructure

Healthcare IT Infrastructure running critical applications that enable patient care is the heart of a Healthcare provider. With dynamically changing IT landscapes that are distributed, hybrid & on-demand, IT Operations teams are finding it hard to keep up. Artificial Intelligence for IT Ops (AIOps) is poised to fundamentally transform the Healthcare Industry. It is powering Healthcare Providers across the globe, who are adopting it to Automate, Predict, Remediate & Prevent Incidents in their IT Infrastructure. GAVS’ Zero Incident Framework™ (ZIF), an AIOps Platform, is a pure-play AI platform based on unsupervised Machine Learning and comes with the full suite of tools an IT Infrastructure team would need.
