To survive in the current climate, most enterprises have already embarked on their digital transformation journeys. This is leading to uncertainty in the way applications and services supporting the applications are being monitored and managed. Inadequate information is leading to downtime in service availability for end-users eventually resulting in unhappy users and revenue loss.
Zero Incident Framework™ has been architected to address the IT Ops issues of today and tomorrow.
Leveraging the power of Artificial Intelligence on telemetry data ingested in real-time, ZIF can provide insights and resolve forecasted issues – resulting in the availability of application service when end-user wants the service at the right time.
Business Value delivered to customers from ZIF
Minimum 40% reduction in capital expenses and a minimum 50% reduction in IT operational cost
Faster resolution by 60% (MTTR)
Service availability of 99.99%
ZIF bots to increase productivity by a minimum of 80%
Increased user experience measured by metrics (UEI) User Experience Index
ICEBERG STATE IN ITOps
Many IT operations are in an ‘ICEBERG’ state even today. Do not be surprised if your organization is also one of them. Issues and incidents that surfaces to the top are the ones that are known to the team. But the unknown issues are not uncovered.
Therefore, enterprises have started to embark on artificial intelligence to help them identify and track the unknown issues within the complex IT landscape.
OBSERVABILITY USING ZIF
ZIF, architected and developed on the premise of observability, not only helps with visibility but also enables discovering deeper insights, thus freeing up more time for more strategic initiatives. This becomes critical to the overall success of Site Reliability Engineering (SRE) in enterprises.
Externalizing the internal state of systems, services, and application to the maximum, helps in complete observability.
Monitoring Vs. Observability?
Pillars of Observability – Events | Metrics | Traces
Ensure SERVICE RELIABILITY
“Reliability is defined as the probability that an application, system, or service will perform its intended function adequately for a specified period or will operate in a defined environment without failure.”
ZIF has mastered the art of predicting device, application & service failure, or performance degradation. This unique proposition from ZIF gives IT engineers the edge on service reliability of all applications, systems, or services that they are responsible for. ZIF’s auto-remediation bots can resolve predicted issues to make sure the intended function performs as and when expected by users.
Availability is measured as the percentage of time your service or system or application is available.
A small variation in availability percentage will have to be addressed on priority. A 99.999% availability allows only 5.26 minutes of downtime a year, whereas 99% availability allows downtime of 3.65 days a year.
ZIF helps IT engineers achieve the agreed-upon availability of application or system by learning the usage of the system and application from the metrics that are collected from the environment. Collecting the right metrics helps in getting the right availability. With the help of unsupervised algorithms, patterns are learned which helps in discovering when the application or system is required the most and then predicting any potential downtime. With above 95% accuracy in prediction, ZIF can achieve 99.99% availability for application and devices which allows 52.56 minutes downtime a year.
ZIF’s goal has always been to deliver the right business outcomes for the stakeholders. Users have the privilege to choose what business outcomes are expected from the platform and the respective features are deployed in the enterprise to deliver the chosen outcome.
About the Author –
An evangelist of Zero Incident FrameworkTM, Anoop has been a part of the product engineering team for long and has recently forayed into product marketing. He has over 14 years of experience in Information Technology across various verticals, which include Banking, Healthcare, Aerospace, Manufacturing, CRM, Gaming, and Mobile.
Cloud computing is the delivery of computing services including Servers, Database, Storage, Networking & others over the internet. Public, Private & Hybrid clouds are different ways of deploying cloud computing.
In public cloud, the cloud resources are owned by 3rd party cloud service provider
A private cloud consists of computing resources exclusively by one business or organization
Hybrid provides the best of both worlds, combines on-premises infrastructure, private cloud with public cloud
Microsoft, Google, Amazon, Oracle, IBM, and others are providing cloud platform to users to host and experience practical business solution. The worldwide public cloud services market is forecast to grow 17% in 2020 to total $266.4 billion and $354.6 billion in 2022, up from $227.8 billion in 2019, per Gartner, Inc.
There are various types of Instances, workloads & options available as part of cloud ecosystem, i.e. IaaS, PaaS, SaaS, Multi-cloud, Serverless.
When very large, large and medium enterprise decides to move their IT environment from on-premise to cloud, they try to move some/most of their on-premises into cloud and keep the rest under their control on-premise. There are various factors that impact the decision, to name a few,
ROI vs Cost of Cloud Instance, Operation cost
Architecture dependency of the application, i.e. whether it is monolithic or multi-tier or polyglot or hybrid cloud
Requirement and need for elasticity and scalability
Availability of right solution from the cloud provider
Security of some key data
After crossing all, once the IT environment is cloud-enabled, the challenge comes in ensuring the monitoring of the Cloud-enabled IT environment. Here are some of the business and IT challenges
1. How to ensure the various workloads & Instances are working as expected?
While the cloud provider may give high availability & up time depending on the tier we choose, it is important that our IT team monitors the environment, as in the case of IaaS and to some extent in PaaS as well.
2. How to ensure the Instances are optimally used in terms of compute and storage?
Cloud providers give most of the metrics around the Instances, though it may not provide all metrics that we may need to make decision in all scenarios.
The disadvantage with this model is, cost, latency & not straight forward, e.g. the LOG analytics which comes in Azure involves cost for every MB/GB of data that is stored and the latency in getting the right metrics at right time, if there is latency/delay, you may not get a right result
3. How to ensure the Application or the components of a single solution that are spread across on-premise and Cloud environment is working as expected?
Some cloud providers give tools for integrating the metrics from on-premise to cloud environment to have a shared view.
The disadvantage with this model is, it is not possible to bring in all sorts of data together to get the insights straight. That is, observability is always a question. The ownership of getting the observability lies with the IT team who handles the data.
4. How to ensure the Multi-Cloud + On-Premise environment is effectively monitored & utilized to ensure the best End-user experience?
Multi-Cloud environment – With rapid growing Microservices Architecture & Container based cloud enabled model, it is quite natural that the Enterprise may choose the best from different cloud providers like Azure, AWS, Google & others.
There is little support from cloud provider on this space. In fact, some cloud providers do not even support this scenario.
5. How to get a single panel of view for troubleshooting & root cause analysis?
Especially when problem occurs in Application, Database, Middle Tier, Network & 3rd party layers that are spread across multi-cluster, multi-cloud, elastic environment, it is very important to get a Unified view of entire environment.
ZIF (Zero Incident FrameworkTM), provides a single platform for Cloud Monitoring.
ZIF has Discovery, Monitoring, Prediction & Remediate that seamlessly fits for a cloud enabled solution. ZIF provides the unified dashboard with insights across all layers of IT infrastructure that is distributed across On-premise host, Cloud Instance & Containers.
Core features & benefits of ZIF for Cloud Monitoring are,
1. Discovery & Topology
Discovers and provides dynamic mapping of resources across all layers.
Provides real-time mapping of applications and its dependent layers irrespective of whether the components live on-premise, or on cloud or containerized in cloud.
Dynamically built topology of all layers which helps in taking effective decisions.
2. Observability across Multi-Cloud, Hybrid-Cloud & On-Premise tiers
It is not just about collecting metrics; it is very important to analyze the monitored data and provide meaningful insights.
When the IT infrastructure is spread across multiple cloud platform like Azure, AWS, Google Cloud, and others, it is important to get a unified view of your entire environment along with the on-premise servers.
Health of each layers are represented in topology format, this helps to understand the impact and take necessary actions.
3. Prediction driven decision for resource optimization
Prediction engine analyses the metrics of cloud resources and predicts the resource usage. This helps the resource owner to make proactive action rather than being reactive.
Provides meaningful insights and alerts in terms of the surge in the load, the growth in number of VMs, containers, and the usage of resource across other workloads.
Authorize the Elasticity & Scalability through real-time metrics.
4. Container & Microservice support
Understand the resource utilization of your containers that are hosted in Cloud & On-Premise.
Know the bottlenecks around the Microservices and tune your environment for the spikes in load.
Provides full support for monitoring applications distributed across your local host & containers in cloud in a multi-cluster setup.
5. Root cause analysis made simple
Quick root cause analysis by analysing various causes captured by ZIF Monitor instead of going through layer by layer. This saves time to focus on problem-solving and arresting instead of spending effort on identifying the root cause.
Provides insights across your workload including the impact due to 3rd party layers as well.
Irrespective of whether the workload and instance is on-premise or on Azure or AWS or other provider, the ZIF automation module can automate the basics to complex activities
7. Ensure End User Experience
Helps to improve the end-user experience who gets served by the workload from cloud.
The ZIF tracing helps to trace each & every request of each & every user, thereby it is quite natural for ZIF to unearth the performance bottleneck across all layers, which in turn helps to address the problem and thereby improve the User Experience
Cloud and Container Platform Support
ZIF Seamlessly integrates with following Cloud & Container environments,
About the Author –
Suresh Kumar Ramasamy
Suresh heads the Monitor component of ZIF at GAVS. He has 20 years of experience in Native Applications, Web, Cloud, and Hybrid platforms from Engineering to Product Management. He has designed & hosted the monitoring solutions. He has been instrumental in conglomerating components to structure the Environment Performance Management suite of ZIF Monitor.
Suresh enjoys playing badminton with his children. He is passionate about gardening, especially medicinal plants.
Business services are a set of business activities delivered to an outside party, such as a customer or a partner. Successful delivery of business services often depends on one or more IT services. For example, an IT business service that would support “order to cash”, as an example could be “supply chain service”. The supply chain service could be delivered by an application such as SAP, with the customer of that service being an employee in finance/accounting using the application to perform customer-facing services such as accounts receivable, or the collection of cash from an outside party. A business service is not simply the application that the end-user sees – it is the entire chain that supports the delivery of the service, including physical and virtualized servers, databases, middleware, storage, and networks. A failure in any of these can affect the service – and so it is crucial that IT organizations have an integrated, accurate, and up-to-date view of these components and of how they work together to provide the service.
The technologies for Social Networking, Mobile Applications, Analytics, Cloud (SMAC), and Artificial Intelligence (AI) are redefining the business and the services that businesses provide. Their widespread usage is changing the business landscape, increasing reliability and availability to levels that were unimaginable even a few years ago.
Availability versus Reliability
At first glance, it might seem that if a service has a high availability then it should also have high reliability. However, this is not necessarily the case. Availability and Reliability have different meanings, serve different purposes, and require different strategies to maintain desired standards of service levels. Reliability is the measure of how long a business service performs its intended function, whereas availability is the measure of the percentage of time a business service is operable. For example, a business service may be available 90% of the time, but reliable only 75% of the time from a performance standpoint. Service reliability can be seen as:
Probability of success
Quality over time
Availability to perform a
Merely having a service available isn’t sufficient. When a business service is available, it should actually serve the intended purpose under varying and unexpected conditions. One way to measure this performance is to evaluate the reliability of the service that is available to consume. The performance of a business service is now rated not by its availability, but by how consistently reliable it is. Take the example of mobile services – 4 bars of signal strength on your smartphone does not guarantee that the quality of the call you received or going to make. Organizations need to measure how well the service fulfills the necessary business performance needs.
Recognizing the importance of reliability,
Google initiated Site Reliability Engineering (SRE) practices with a mission to
protect, provide for, and progress the software and systems behind all of
Google’s public services — Google Search, Ads, Gmail, Android, YouTube, and App
Engine, to name just a few — with an ever-watchful eye on their availability,
latency, performance, and capacity.
Zero Incident FrameworkTM (ZIF)
GAVS Technologies developed an AIOps based
TechOps platform – Zero Incident FrameworkTM (ZIF) that enables
proactive detection and remediation of incidents. The ZIF Platform is,
available in two versions for our customers to evaluate
and experience the power of AI-driven Business Service Reliability:
Business Xpress: ZIF Business Xpress has been
engineered for enterprises to evaluate AIOps before adoption. 10 to 40 devices
can be connected to ZIFBusiness Xpress, to experiment with the value
ZIF Business: Targeted for enterprise-wide adoption.
Infrastructure today has grown beyond the physical confines of the traditional data center, has spread its wings to the cloud, and is increasingly distributed, virtual, and abstract. With the cloud gaining wide acceptance, most enterprises have their workloads spread across data centers, colocations, multi-cloud, and edge locations. On-premise infrastructure is also being replaced by Hyperconverged Infrastructure (HCI) where software-defined, virtualized compute, storage, and network are in one single system, greatly simplifying IT operations. Infrastructure is also becoming increasingly elastic, scales & shrinks on demand and doesn’t have to be provisioned upfront.
Let’s look at a few interesting technologies that are steering the modern IT landscape.
Containers and Serverless
Traditional application deployment on physical servers comes with the overhead of managing the infrastructure, middleware, development tools, and everything in between. Application developers would rather have this grunt work be handled by someone else, so they could focus on just their applications. This is where containers and serverless technologies come into picture. Both are cloud-based offerings and provide different levels of abstraction, in a way that hides layers beyond the front end, from the developer. They typically deploy smaller components of monolithic applications, microservices, and functions.
A Container is like an all-in-one-box, containing the app, and all its dependencies like libraries, executables & config files. The containerized application is highly portable, will run anywhere the container runtime is installed, and behave the same regardless of the OS or hardware it is deployed on. Containers give developers great flexibility and control since they cater to specific application requirements like the OS, S/W versions. The flip side is that there is still a need for manual maintenance of the runtime environment, like security patches, software updates, etc. Secondly, the flexibility it affords translates into high operational costs, since it lacks agility in scaling.
Serverless technologies provide much greater abstraction of the OS and infrastructure. ‘Serverless’ though, does not imply that there are no servers, it just means application developers do not have to worry about the underlying OS, the server environment, or the infra that their applications will be deployed on. Serverless is event-driven and is based on the premise that the application is split into functions that get executed based on events. The developer only needs to deploy function code and define the event(s) that will trigger them! The rest of the magic is done by the cloud service provider (with the help of third parties).
The biggest advantage of serverless is that consumers are billed only for the running time of the function instances or the number of times the function gets executed, depending on the provider. Since it has zero administrative overhead, it guarantees rapid iterative deployment and faster time to market. Since the architecture is intrinsically auto-scaling, it is a perfect fit for applications with undefinable usage patterns. The other side of the coin is that developers need to deal with a black box back-end environment, so, holistic testing, debugging of the application becomes a challenge. Vendor lock-in is a real problem since the consumer is restricted by the technology stack supported by the vendor. Since serverless best practices dictate light, isolated functions with limited scope, building complex applications can get difficult. Function as a Service (FaaS) is a subset of serverless computing.
Internet of Things (IoT)
IoT is about connecting everyday things – beyond just computing devices or smartphones – to the internet. It is possible to convert practically anything into an IoT device, with a computer chip installation & internet access, and have it communicate independently with the internet – without any human intervention. But why would we want everyday things like for instance a watch or a light bulb, to become IoT devices? It’s in a bid to bridge the chasm between the physical and digital worlds and make the environment around us more intelligent, communicative, and responsive to our needs.
IoT’s use cases are just about everywhere; in personal devices, self-driving cars, smart homes, smart workspaces, smart cities, and industries across all verticals. For instance, live data from sensors in products while in use, gives good visibility into their operations on the ground, helps remediate issues proactively & aids improvements in design/manufacturing processes.
The Industrial Internet of Things (IIoT) is the use of IoT data in business, in tandem with Big Data, AI, Analytics, Cloud, and High-speed networks, with the primary goal of finding efficient business models to improve productivity & optimize expenditure. The need for real-time response to sensor data and advanced analytics to power insights has increased the demand for 5G networks for speed, cloud technologies for storage and computing, edge computing to reduce latency, and hyper-scale data centers for rapid scaling.
With IoT devices extending an organization’s infrastructure landscape, and the likelihood that IT staff may not even be aware of all the IoT devices in it is a security nightmare that could open corporate networks & sensitive data for attacks. Global standards and regulations for IoT device security are in the works. Until then, it is up to the enterprise security team to safeguard against IoT-related vulnerabilities.
The ability of infrastructure to rapidly scale out on a massive level is called hyperscaling.
Unprecedented needs for high-power computing and on-demand massive scalability has given rise to a new breed of hyperscale computing architectures, where traditional elements are replaced by hyper-converged, software-defined infrastructure with a high degree of virtualization. These hyperscale environments are characterized by high-density server racks, with software designed and specifically built for scale-out environments. Since high-density implies heavy power consumption, heating problems need to be handled by specialized cooling solutions like liquid cooling. Hyperscale data centre operators usually look for renewable energy options to save on power & cooling.
Today, there are several
hundred hyperscale data centers in the world,
with the dominant players being Microsoft, Google, Apple, Amazon &
Edge computing as the name indicates means moving data processing away from distant servers or the cloud, closer to the source of data. This is to reduce latency and network bandwidth used for back & forth communication between the data source and the server. Edge, also called the network edge refers to where the data source connects to the internet. The explosive growth of IoT and applications like self-driving cars, virtual reality, smart cities for instance, that require real-time computing and analytics are paving the way for edge computing. Most cloud providers now provide geographically distributed edge servers. As with IoT devices, data at the edge can be a ticking security time bomb necessitating appropriate security mechanisms.
The evolution of IT technologies continuously raises the bar for the IT team. IT personnel have been forced to move beyond legacy practices and mindsets & constantly up-skill themselves to be able to ride the wave. For customers pampered by sophisticated technologies, round the clock availability of systems and immersive experiences have become baseline expectations. With more & more digitalization, there is increasing reliance on IT infrastructure and hence lesser tolerance for outages. The responsibilities of maintaining a high-performing IT infrastructure with near-zero downtime fall on the shoulders of the IT operations team.
This has underscored the importance of AI in IT operations since IT needs have now surpassed human capabilities. Gavs’ AI-powered Platform for IT operations, ZIF, caters to the entire ITOps spectrum, right from automated discovery of the landscape, monitoring, to predictive and prescriptive analytics that proactively drive the organization towards zero incidents. For more details, please visit https://zif.ai
About the Author:
Priya is part of the Marketing team at GAVS. She is passionate about Technology, Indian Classical Arts, Travel, and Yoga. She aspires to become a Yoga Instructor someday!
It is important to have the right monitoring solution for an enterprise’s IT environment. More than that, it is imperative to leverage the right solution and deploy it for the appropriate requirements. In this context, the IT environment includes but is not limited to Applications, Servers, Services, End-User Devices, Network devices, APIs, Databases, etc. Towards that, let us understand the need and importance of Proactive Monitoring. This has a direct role in achieving the journey towards Zero Incident EnterpriseTM. Let us unravel the difference between reactive and proactive monitoring.
Monitoring – When a problem occurs in an IT environment,
it gets notified through monitoring and the concerned team acts on it to
resolve the issue.The problem could be as simple as slowness/poor performance,
or as extreme as the unavailability of services like web site going down or
server crashing leading to loss of business and revenue.
Monitoring – There are two levels of proactive
Symptom-based proactive monitoring is all about identifying the signals and symptoms of an issue in advance and taking appropriate and immediate action to nip the root-cause in the bud.
Synthetic-based proactive monitoring is achieved through Synthetic Transactions. Performance bottlenecks or failures are identified much in advance; even before the actual user or the dependent layer encounters the situation
Symptom-based proactive monitoring
is a USP of the ZIF Monitor module. For example, take the case of CPU
related monitoring. It is common to monitor the CPU utilization and act based
on that. But Monitor doesn’t just focus on CPU utilization, there
are a lot of underlying factors which causes the CPU utilization to go high. To
name a few,
Processor queue length
Processor context switches
Processes that are contributing to high CPU utilization
important to arrest these brewing factors at the right time, i.e., in the case
of Processor Queue length, continuous or sustained queue of greater than 2
threads is generally an indication of congestion at processor level.Of course,
in a multiple processor environment, we need to divide the queue length by the number
of processors that are servicing the workload. As a remedy, the following can
1) the number
of threads can be limited at the application level
processes can be killed to help close the queued items
the processor will help in keeping the queue length under control, which
eventually will control the CPU utilization.
Above is a sample demonstration of finding the symptom and signal and arrest them proactively. ZIF’s Monitor not only monitors these symptoms, but also suggests the remedy through the recommendation from SMEs.
(SM) is done by
simulating the transactions through the tool without depending on the end-user
to do the transactions. The advantages of synthetic monitoring are,
it uses automated transaction simulation technology
it helps to monitor the environment round-the-clock
it helps to validate from across different geographic locations
it provides options to choose the number of flows/transactions to be verified
it is proactive – identifies performance bottlenecks or failures much in advance even before the actual user or the dependent layer encounters the situation
How does Synthetic Monitoring(SM) work?
It works through 3 simple steps,
1) Record key
transactions – Any number of transactions can be recorded, if required, all
the functional flows can be recorded. An example of transaction in an e-commerce
website could be, as simple as login and view the product catalogue, or,as
elaborate as login, view product catalogue, move item to cart, check-out, make-payment
and logout. For simulation purpose, dummy credit cards are used during payment
2) Schedule the
transactions – Whether it should run every 5 minutes or x hours or minutes.
3) Choose the
location from which thesetransactions need to be triggered – The SM is
available as on-premise or cloud options. Cloud SM provides the options to
choose the SM engines available across globe (refer to the green dots in the
This is applicable mainly for web based applications, but
can also be used for the underlying APIs as well.
SM solution has engines which run the recorded transactions against the target application. Once scheduled, the SM engine hosted either on-premise or remotely (refer to the green dots in the figure shown as sample representation), will run the recorded transactions at a predefined interval. The SM dashboard provides insights as detailed under the benefits section below.
Benefits of SM
As the SM does the synthetic transactions, it provides
various insights like,
The latency in the transactions, i.e. the speed
at which the transaction is happening. This also gives a trend analysis of how
the application is performing over a period.
If there are any failures during the
transaction, SM provides the details of the failure including the stack trace
of the exception. This makes fixing the failure simpler, by avoiding the time
spent in debugging.
In case of failure, SM provides insights into
the parameter details that triggered the failure.
Unlike real user monitoring, there is the flexibility
to test all flows or at least all critical flows without waiting for the user
to trigger or experience it.
This not only unearths the problem at the
application tier but also provides deeper insights while combining it with
Application, Server, Database, Network Monitoring which are part of the ZIF
Applications working fine under one geography
may fail in a different geography due to various factors like network,
connectivity, etc. SM will exactly pinpoint the availability and performance
Suresh heads the Monitor component of ZIF at GAVS. He has 20 years of experience in Native Applications, Web, Cloud and Hybrid platforms from Engineering to Product Management. He has designed & hosted the monitoring solutions. He has been instrumental in conglomerating components to structure the Environment Performance Management suite of ZIF Monitor.
Suresh enjoys playing badminton with his children. He is passionate about gardening, especially medicinal plants.
Monitoring applications and infrastructure is a critical part of IT Operations. Among other things, monitoring provides alerts on failures, alerts on deteriorations that could potentially lead to failures, and performance data that can be analysed to gain insights. AI-led IT Ops Platforms like ZIF use such data from their monitoring component to deliver pattern recognition-based predictions and proactive remediation, leading to improved availability, system performance and hence better user experience.
The shift away from monolith applications towards microservices has posed a formidable challenge for monitoring tools. Let’s first take a quick look at what microservices are, to understand better the complications in monitoring them.
Monoliths vs Microservices
A single application(monolith) is split into a number of modular services called microservices, each of which typically caters to one capability of the application. These microservices are loosely coupled, can communicate with each other and can be deployed independently.
Quite likely the trigger for this architecture was the need for agility. Since microservices are stand-alone modules, they can follow their own build/deploy cycles enabling rapid scaling and deployments. They usually have a small codebase which aids easy maintainability and quick recovery from issues. The modularity of these microservices gives complete autonomy over the design, implementation and technology stack used to build them.
Microservices run inside containers that provide their execution environment. Although microservices could also be run in virtual machines(VMs), containers are preferred since they are comparatively lightweight as they share the host’s operating system, unlike VMs. Docker and CoreOS Rkt are a couple of commonly used container solutions while Kubernetes, Docker Swarm, and Apache Mesos are popular container orchestration platforms. The image below depicts microservices for hiring, performance appraisal, rewards & recognition, payroll, analytics and the like linked together to deliver the HR function.
Challenges in Monitoring Microservices and Containers
Since all good things come at a cost, you are probably wondering what it is here… well, the flip side to this evolutionary architecture is increased complexity! These are some contributing factors:
Exponential increase in the number of objects: With each application replaced by multiple microservices, 360-degree visibility and observability into all the services, their interdependencies, their containers/VMs, communication channels, workflows and the like can become very elusive. When one service goes down, the environment gets flooded with notifications not just from the service that is down, but from all services dependent on it as well. Sifting through this cascade of alerts, eliminating noise and zeroing in on the crux of the problem becomes a nightmare.
Shared Responsibility: Since processes are fragmented and the responsibility for their execution, like for instance a customer ordering a product online, is shared amongst the services, basic assumptions of traditional monitoring methods are challenged. The lack of a simple linear path, the need to collate data from different services for each process, inability to map a client request to a single transaction because of the number of services involved make performance tracking that much more difficult.
Design Differences: Due to the design/implementation autonomy that microservices enjoy, they could come with huge design differences, and implemented using different technology stacks. They might be using open source or third-party software that makes it difficult to instrument their code, which in turn affects their monitoring.
Elasticity and Transience: Elastic landscapes where infrastructure scales or collapses based on demand, instances appear & disappear dynamically, have changed the game for monitoring tools. They need to be updated to handle elastic environments, be container-aware and stay in-step with the provisioning layer. A couple of interesting aspects to handle are: recognizing the difference between an instance that is down versus an instance that is no longer available; data of instances that are no longer alive continue to have value for analysis of operational efficiency or past performance.
Mobility: This is another dimension of dynamic infra where objects don’t necessarily stay in the same place, they might be moved between data centers or clouds for better load balancing, maintenance needs or outages. The monitoring layer needs to arm itself with new strategies to handle moving targets.
Resource Abstraction: Microservices deployed in containers do not have a direct relationship with their host or the underlying operating system. This abstraction is what helps seamless migration between hosts but comes at the expense of complicating monitoring.
Communication over the network: The many moving parts of distributed applications rely completely on network communication. Consequently, the increase in network traffic puts a heavy strain on network resources necessitating intensive network monitoring and a focused effort to maintain network health.
What needs to be measured
This is a high-level laundry list of what needs to be done/measured while monitoring microservices and their containers.
Auto-discovery of containers and microservices:
As we’ve seen, monitoring microservices in a containerized world is a whole new ball game. In the highly distributed, dynamic infra environment where ephemeral containers scale, shrink and move between nodes on demand, traditional monitoring methods using agents to get information will not work. The monitoring system needs to automatically discover and track the creation/destruction of containers and explore services running in them.
Availability and performance of individual services
Host and infrastructure metrics
APIs and API transactions
Ensure API transactions are available and stable
Isolate problematic transactions and endpoints
Dependency mapping and correlation
Features relating to traditional APM
Detailed information relating to each container
Health of clusters, master and slave nodes
Number of clusters
Nodes per cluster
Containers per cluster
Performance of core Docker engine
Performance of container instances
Things to consider while adapting to the new IT landscape
Granularity and Aggregation: With the increase in the number of objects in the system, it is important to first understand the performance target of what’s being measured – for instance, if a service targets 99% uptime(yearly), polling it every minute would be an overkill. Based on this, data granularity needs to be set prudently for each aspect measured, and can be aggregated where appropriate. This is to prevent data inundation that could overwhelm the monitoring module and drive up costs associated with data collection, storage, and management.
Monitor Containers: The USP of containers is the abstraction they provide to microservices, encapsulating and shielding them from the details of the host or operating system. While this makes microservices portable, it makes them hard to reach for monitoring. Two recommended solutions for this are to instrument the microservice code to generate stats and/or traces for all actions (can be used for distributed tracing) and secondly to get all container activity information through host operating system instrumentation.
Track Services through the Container Orchestration Platform: While we could obtain container-level data from the host kernel, it wouldn’t give us holistic information about the service since there could be several containers that constitute a service. Container-native monitoring solutions could use metadata from the container orchestration platform by drilling into appropriate layers of the platform to obtain service-level metrics.
Adapt to dynamic IT landscapes: As mentioned earlier, today’s IT landscape is dynamically provisioned, elastic and characterized by mobile and transient objects. Monitoring systems themselves need to be elastic and deployable across multiple locations to cater to distributed systems and leverage native monitoring solutions for private clouds.
API Monitoring: Monitoring APIs can provide a wealth of information in the black box world of containers. Tracking API calls from the different entities – microservices, container solution, container orchestration platform, provisioning system, host kernel can help extract meaningful information and make sense of the fickle environment.
Watch this space for more on Monitoring and other IT Ops topics. You can find our blog on Monitoring for Success here, which gives an overview of the Monitorcomponent of GAVS’ AIOps Platform, Zero Incident FrameworkTM (ZIF). You can Request a Demo or Watch how ZIF workshere.
About the Author:
Bio – Siva is a long timer at Gavs and has been with the company for close to 15 years. He started his career as a developer and is now an architect with a strong technology background in Java, Big Data, DevOps, Cloud Computing, Containers and Micro Services. He has successfully designed & created a stable Monitoring Platform for ZIF, and designed & driven cloud assessment and migration, enterprise BRMS and IoT based solutions for many of our customers. He is currently focused on building ZIF 4.0, a new gen business-oriented TechOps platform.
Bio – Priya is part of the Marketing team at GAVS. She is passionate about Technology, Indian Classical Arts, Travel and Yoga. She aspires to become a Yoga Instructor some day!