Balancing Management Styles for a Remote Workforce

Operational Paradigm Shift

The pandemic has compelled organizations to rethink the way they approach traditional business operations. The market forced businesses to adapt to the changing environment and optimize their costs. For the past couple of months, nearly every organization has mandated work from home. This shift in operations has had both highs and lows in terms of productivity. Almost a year into the pandemic, the impacts are yet to be fully understood. The productivity realized from remote workers, month on month, has shaped policies and led to investments in tools that aid collaboration between teams.

Impact on Delivery Centers

Technology companies have been leading the charge towards remote working, as many have adopted permanent work-from-home options for their employees. Among the avenues for cost optimization, office space allocation and commuting costs stand out as places where redundant operational cash flow can be redirected to other areas for scaling.

The availability and speed of internet connections across geographies have aided the transformation of office spaces for better utilization of the budget. Considering the current economy, office spaces are becoming expensive and inefficient. The annual survey by JLL Enterprises in 2020 reveals that organizations spend, on average, close to $10,000 per employee per year on global office real estate. As offices have adopted social distancing policies, the need for more space per employee would result in even higher costs during the pandemic. To optimize their budgets, companies have reduced their office space allocation and introduced regional contractual sub-offices to reduce the commute expenses of their employees in the big cities.

With this, the notion of a 9-5 job is slowly fading, and people are being paid based on their function rather than the time they spend at work. Flexible working hours, with performance linked to delivery, have gained momentum in terms of productivity per resource. An interesting fact that arose out of this pandemic economy is that the share of jobs that can be done remotely in a country correlates with the country’s GDP. A work from home survey undertaken by The Economist in 2020 finds that only 11% of jobs can be done from home in Cambodia, compared with 37% in America and 45% in Switzerland.

The fact of the matter is that a privileged minority has been enjoying work from home for the past couple of months, while a vast majority of the semi-urban and rural population doesn’t have the infrastructure to support their functional roles. For better optimization and resource utilization, India would need to invest heavily in this infrastructure to make up the GDP deficit of the past couple of quarters.

Long-term work from home options challenge the foundational fabric of our industrial operations. They can alter the shape and purpose of cities and change workplace gender distribution and equality. Above all, they can change how we perceive time, especially while estimating delivery.

Overall Pulse Analysis

Many employees prefer to work from home as they can devote extra time to their family. However, this option has been found to have a detrimental impact on organizational culture, creativity, and networking. Making decisions based on skewed information would have an adverse effect on culture, productivity, and attrition.

To gather sufficient input for decisions, PwC conducted a remote work survey in 2020 called “When everyone can work from home, what’s the office for”. Here are some insights from the report.

Many businesses have aligned themselves to accommodate both on-premise and remote working models. Organizations need to figure out how to better collaborate and network with employees in ways that elevate the organizational culture.

As offices are slowly transitioning to a hybrid model, organizations have decentralized how they operate. They have shifted from working in a common centralized office to contractual office spaces as per employee role and function, to better allocate their operational budget. The survey found that 72% of the workers would like to work remotely at least 2 days a week. This showcases the need for a hybrid workspace in the long run. 

Maintaining & Sustaining Productivity

During the transition, keeping a check on the efficiency of remote workers was paramount. The absence of these checks would jeopardize delivery, resulting in a severe impact on customer satisfaction and retention.

This number, however, could be far lower if the survey had been conducted on a larger scale. This in turn signifies that productivity is not uniform and requires corrective action to maintain delivery. Approaching the problem first from the employee’s standpoint would yield better results. The measures to help remote workers be more productive were found to be as follows.

Many employees point out that greater flexibility of working hours and better equipment would help increase work productivity.

Most of the productivity hindrances can be solved by effective employee management. How a particular manager supervises their team members correlates directly with the team’s productivity and its satisfaction with project delivery.

Theory X & Theory Y

Theory X and Theory Y were introduced by Douglas McGregor in his book, “The Human Side of Enterprise”. He describes two styles of management in his research: Authoritarian (Theory X) and Participative (Theory Y). The theory holds that employee beliefs directly influence their behavior in the organization. The approach taken by the organization will have a significant impact on the ability to manage team members.

For Theory X, McGregor speculates that “Without active intervention by management, people would be passive, even resistant to organizational needs. They must therefore be persuaded, rewarded, punished, controlled and their activities must be directed.”

Work under this style of management tends to be repetitive, and motivation follows a carrot-and-stick approach. Performance appraisals and remuneration are directly tied to tangible results and are often used to control staff and keep tabs on them. Organizations with several tiers of managers and supervisors tend to use this style. Here authority is rarely delegated, and control remains firmly centralized.

Even though this style of management may seem outdated, big organizations find it unavoidable to adopt due to the sheer number of employees on the payroll and tight delivery deadlines.

When it comes to Theory Y, McGregor firmly believes that objectives should be arranged so that individuals can achieve their own goals and happily accomplish the organization’s goal at the same time.

Organizations that follow this style of management would have an optimistic and positive approach to people and problems. Here the team management is decentralized and participative.

Working under such organizational styles bestows greater responsibilities on employees, and managers encourage them to develop skills and suggest areas of improvement. Appraisals in Theory Y organizations encourage open communication rather than exercising control. This style of management has become popular these days, as employees increasingly want a meaningful career and look forward to things beyond money.

Balancing X over Y

Even though McGregor suggests that Theory Y is better than Theory X, there are instances where managers would need to balance the styles depending on how the team functions even after certain management strategies have been implemented. This is especially important in a remote working context, where an intervention may come too late to prevent an impact on delivery. Even though Theory Y has creativity and discussion in its DNA, it has its limitations in terms of consistency and uniformity. An environment with varying rules and practices could be detrimental to the quality and operational standards of an organization. Hence maintaining a balance is important.

When we look at a typical cycle of Theory X, we find that the foundational beliefs result in controlling practices, which provoke employee resistance, which in turn delivers poor results. The poor results then cause the entire cycle to repeat, making the work monotonous and pointless.

Once the resources that require course correction and supervision are identified, understanding the root cause and then adjusting your management style to solve the problem would be more beneficial in the long run. Theory X must only be used in dire circumstances requiring a course correction. The balance to maintain lies in how far we can establish control without provoking resistance, so that the end goal is not affected.

Theory X and Theory Y can be directly correlated to Maslow’s Hierarchy of Needs. The reason why Theory Y is superior to Theory X is that it focuses on the higher needs of the employee rather than their foundational needs. Theory Y managers gravitate towards making a connection with their team members on a personal level by creating a healthier atmosphere in the workplace. Theory Y brings in a pseudo-democratic environment, where employees can design, construct and publish their work in accordance with their personal and organizational goals.

When it comes to Theory X and Theory Y, no balance between the two will be perfect. The American psychologist Bruce J. Avolio, in his paper titled “Promoting more integrative strategies for leadership theory-building,” speculates, “Managers who choose the Theory Y approach have a hands-off style of management. An organization with this style of management encourages participation and values an individual’s thoughts and goals. However, because there is no optimal way for a manager to choose between adopting either Theory X or Theory Y, it is likely that a manager will need to adopt both approaches depending on the evolving circumstances and levels of internal and external locus of control throughout the workplace.”

The New Normal 3.0

As circumstances keep changing by the day, organizations need to keep pace with the market and envision new working models that take human interactions into account as well. The crises of 2020 pushed organizations to build up the workforce capabilities that are critical for growth. Organizations must take a fresh look at their workforce and reskill it in different areas of digital expertise as well as emotional, cognitive, and adaptive skills to push forward in our changing world.

About the Author

Ashish Joseph

Ashish Joseph is a Lead Consultant at GAVS working for a healthcare client in the Product Management space. His areas of expertise lie in branding and outbound product management.

He runs two independent series called BizPective & The Inside World, focusing on breaking down contemporary business trends and growth strategies for independent artists, on his website www.ashishjoseph.biz.

Outside work, he is very passionate about basketball, music, and food.

Ensure Service Availability and Reliability with ZIF

To survive in the current climate, most enterprises have already embarked on their digital transformation journeys. This is leading to uncertainty in the way applications, and the services supporting them, are monitored and managed. Inadequate information leads to downtime in service availability for end-users, eventually resulting in unhappy users and revenue loss.

Zero Incident Framework™ has been architected to address the IT Ops issues of today and tomorrow.

Leveraging the power of Artificial Intelligence on telemetry data ingested in real-time, ZIF can provide insights and resolve forecasted issues, so that application services are available whenever end-users need them.

Business Value delivered to customers from ZIF

  • Minimum 40% reduction in capital expenses and a minimum 50% reduction in IT operational cost
  • Faster resolution by 60% (MTTR)
  • Service availability of 99.99%
  • ZIF bots to increase productivity by a minimum of 80%
  • Improved user experience, measured by the User Experience Index (UEI)

ICEBERG STATE IN ITOps

Many IT operations are in an ‘ICEBERG’ state even today. Do not be surprised if your organization is also one of them. The issues and incidents that surface are the ones known to the team, while the unknown issues below the surface remain undiscovered.

Therefore, enterprises have started to adopt artificial intelligence to help them identify and track the unknown issues within the complex IT landscape.

OBSERVABILITY USING ZIF

ZIF, architected and developed on the premise of observability, not only helps with visibility but also enables discovering deeper insights, thus freeing up more time for more strategic initiatives. This becomes critical to the overall success of Site Reliability Engineering (SRE) in enterprises.

Externalizing the internal state of systems, services, and applications as far as possible helps achieve complete observability.

Monitoring Vs. Observability?

Pillars of Observability – Events | Metrics | Traces

Ensure SERVICE RELIABILITY

“Reliability is defined as the probability that an application, system, or service will perform its intended function adequately for a specified period or will operate in a defined environment without failure.”
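The definition above treats reliability as a probability over a time window. As a minimal sketch, assuming a constant failure rate (the exponential model, which this article does not specify) and a hypothetical mean time between failures, that probability could be estimated like this:

```python
import math


def reliability(hours: float, mtbf_hours: float) -> float:
    """Probability of running `hours` without failure.

    Assumes a constant failure rate of 1 / mtbf_hours (exponential model).
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if hours < 0:
        raise ValueError("hours must not be negative")
    return math.exp(-hours / mtbf_hours)


# Hypothetical service: MTBF of 1,000 hours, observed over a 30-day period.
print(f"{reliability(30 * 24, 1000):.3f}")
```

The MTBF value is illustrative only; real systems rarely have a perfectly constant failure rate, so this is a starting point rather than a measurement method.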

ZIF has mastered the art of predicting device, application & service failure, or performance degradation. This unique proposition from ZIF gives IT engineers the edge on service reliability of all applications, systems, or services that they are responsible for. ZIF’s auto-remediation bots can resolve predicted issues to make sure the intended function performs as and when expected by users.

SERVICE AVAILABILITY

Availability is measured as the percentage of time your service or system or application is available.

A small variation in availability percentage will have to be addressed on priority. A 99.999% availability allows only 5.26 minutes of downtime a year, whereas 99% availability allows downtime of 3.65 days a year.

ZIF helps IT engineers achieve the agreed-upon availability of an application or system by learning how the system and application are used from the metrics collected from the environment. Collecting the right metrics helps in getting the right availability. With the help of unsupervised algorithms, patterns are learned, which help in discovering when the application or system is required the most and then predicting any potential downtime. With above 95% accuracy in prediction, ZIF can achieve 99.99% availability for applications and devices, which allows 52.56 minutes of downtime a year.
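The downtime figures quoted in this section follow directly from the availability percentage. A short sketch of the arithmetic, assuming a 365-day year (the function name is illustrative, not part of ZIF):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Yearly downtime budget, in minutes, for an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.0, 99.99, 99.999):
    minutes = allowed_downtime_minutes(pct)
    days = minutes / (24 * 60)
    print(f"{pct}% -> {minutes:.2f} minutes/year ({days:.2f} days)")
```

Each additional "nine" cuts the downtime budget by a factor of ten, which is why small changes in the percentage matter so much.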

ZIF’s goal has always been to deliver the right business outcomes for the stakeholders. Users have the privilege to choose what business outcomes are expected from the platform and the respective features are deployed in the enterprise to deliver the chosen outcome.

About the Author

Anoop Aravindakshan

An evangelist of Zero Incident Framework™, Anoop has been a part of the product engineering team for a long time and has recently forayed into product marketing. He has over 14 years of experience in Information Technology across various verticals, which include Banking, Healthcare, Aerospace, Manufacturing, CRM, Gaming, and Mobile.

Cloud Adoption, Challenges, and Solution Through Monitoring, AI & Automation

Cloud Adoption

Cloud computing is the delivery of computing services, including servers, databases, storage, networking & others, over the internet. Public, private & hybrid clouds are different ways of deploying cloud computing.

  • In a public cloud, the cloud resources are owned by a third-party cloud service provider
  • A private cloud consists of computing resources used exclusively by one business or organization
  • A hybrid cloud provides the best of both worlds, combining on-premises infrastructure and private cloud with public cloud

Microsoft, Google, Amazon, Oracle, IBM, and others provide cloud platforms on which users can host and run practical business solutions. The worldwide public cloud services market is forecast to grow 17% in 2020 to a total of $266.4 billion, up from $227.8 billion in 2019, and to reach $354.6 billion in 2022, per Gartner, Inc.
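As a quick check on the Gartner figures quoted above, the year-on-year growth and the implied compound annual growth rate can be worked out as follows. This is a small sketch; the helper names are illustrative.

```python
def growth_rate(start: float, end: float) -> float:
    """Simple growth from start to end, as a fraction."""
    return end / start - 1


def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate over the given number of years."""
    return (end / start) ** (1 / years) - 1


# Market size in billions of US dollars, as quoted in the text.
size_2019, size_2020, size_2022 = 227.8, 266.4, 354.6

print(f"2019 to 2020 growth: {growth_rate(size_2019, size_2020):.1%}")
print(f"2019 to 2022 CAGR:   {cagr(size_2019, size_2022, 3):.1%}")
```

The first figure rounds to the 17% growth cited for 2020.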

There are various types of instances, workloads & options available as part of the cloud ecosystem, e.g. IaaS, PaaS, SaaS, multi-cloud, and serverless.

Challenges

When very large, large, and medium enterprises decide to move their IT environment from on-premise to cloud, they typically move some or most of it into the cloud and keep the rest under their control on-premise. Various factors impact the decision. To name a few:

  1. ROI vs Cost of Cloud Instance, Operation cost
  2. Architecture dependency of the application, i.e. whether it is monolithic or multi-tier or polyglot or hybrid cloud
  3. Requirement and need for elasticity and scalability
  4. Availability of right solution from the cloud provider
  5. Security of some key data

Once all of these are addressed and the IT environment is cloud-enabled, the challenge lies in monitoring the cloud-enabled IT environment. Here are some of the business and IT challenges:

1. How to ensure the various workloads & Instances are working as expected?

While the cloud provider may offer high availability & uptime depending on the tier we choose, it is important that our IT team monitors the environment, particularly in the case of IaaS and, to some extent, PaaS as well.

2. How to ensure the Instances are optimally used in terms of compute and storage?

Cloud providers give most of the metrics around the instances, though they may not provide all the metrics we need to make decisions in every scenario.

The disadvantages of this model are cost, latency, and complexity. For example, the log analytics that comes with Azure incurs a cost for every MB/GB of data stored, and there is latency in getting the right metrics at the right time; if the metrics are delayed, you may not get the right result.

3. How to ensure the application, or the components of a single solution that are spread across on-premise and cloud environments, is working as expected?

Some cloud providers give tools for integrating metrics from the on-premise environment into the cloud environment to have a shared view.

The disadvantage of this model is that it is not possible to bring all sorts of data together to get straightforward insights. That is, observability is always in question. The ownership of achieving observability lies with the IT team that handles the data.

4. How to ensure the Multi-Cloud + On-Premise environment is effectively monitored & utilized to ensure the best End-user experience?

Multi-cloud environment: with the rapidly growing adoption of microservices architectures and container-based, cloud-enabled models, it is quite natural for an enterprise to choose the best offerings from different cloud providers such as Azure, AWS, Google & others.

There is little support from cloud providers in this space. In fact, some cloud providers do not even support this scenario.

5. How to get a single-pane view for troubleshooting & root cause analysis?

Especially when problems occur in the application, database, middle tier, network & third-party layers that are spread across a multi-cluster, multi-cloud, elastic environment, it is very important to get a unified view of the entire environment.

ZIF (Zero Incident Framework™) provides a single platform for cloud monitoring.

ZIF has Discovery, Monitoring, Prediction & Remediation capabilities that fit seamlessly into a cloud-enabled solution. ZIF provides a unified dashboard with insights across all layers of IT infrastructure distributed across on-premise hosts, cloud instances & containers.

Core features & benefits of ZIF for cloud monitoring are:

1. Discovery & Topology

  • Discovers and provides dynamic mapping of resources across all layers.
  • Provides real-time mapping of applications and their dependent layers, irrespective of whether the components live on-premise, in the cloud, or in containers in the cloud.
  • Dynamically builds a topology of all layers, which helps in making effective decisions.

2. Observability across Multi-Cloud, Hybrid-Cloud & On-Premise tiers

  • It is not just about collecting metrics; it is very important to analyze the monitored data and provide meaningful insights.
  • When the IT infrastructure is spread across multiple cloud platforms like Azure, AWS, Google Cloud, and others, it is important to get a unified view of your entire environment along with the on-premise servers.
  • The health of each layer is represented in topology format, which helps to understand the impact and take necessary actions.
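To illustrate why a topology view helps with impact analysis, here is a minimal sketch that walks a dependency graph to find everything downstream of an unhealthy component. The component names and graph are hypothetical, and this is not how ZIF is implemented; it only shows the general idea.

```python
from collections import deque

# Hypothetical topology: each component maps to the components that depend on it.
DEPENDENTS = {
    "storage": ["database"],
    "database": ["order-api"],
    "network": ["order-api", "web-frontend"],
    "order-api": ["web-frontend"],
    "web-frontend": [],
}


def impacted_by(component: str, dependents: dict) -> set:
    """Return every component downstream of `component` (breadth-first walk)."""
    seen = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# If storage becomes unhealthy, which layers could be affected?
print(sorted(impacted_by("storage", DEPENDENTS)))
```

The same walk works regardless of whether a node lives on-premise, in a cloud instance, or in a container, as long as the dependency edges are known.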

3. Prediction driven decision for resource optimization

  • The prediction engine analyses the metrics of cloud resources and predicts the resource usage. This helps the resource owner to take proactive action rather than being reactive.
  • Provides meaningful insights and alerts in terms of the surge in the load, the growth in the number of VMs and containers, and the usage of resources across other workloads.
  • Authorizes elasticity & scalability actions through real-time metrics.
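The article does not describe ZIF's prediction algorithm. As a simple stand-in for prediction-driven scaling in general, the sketch below fits a straight-line trend to recent usage samples and extrapolates it. The sample values and the 80% threshold are hypothetical.

```python
def linear_forecast(samples: list, steps_ahead: int) -> float:
    """Fit y = intercept + slope * x by least squares and extrapolate.

    Samples are assumed to be evenly spaced in time (x = 0, 1, 2, ...).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)


# Hypothetical hourly CPU utilization (%) for one cloud instance.
cpu_percent = [52.0, 55.0, 58.0, 61.0, 64.0]
predicted = linear_forecast(cpu_percent, steps_ahead=6)

SCALE_OUT_THRESHOLD = 80.0  # hypothetical policy value
if predicted > SCALE_OUT_THRESHOLD:
    print(f"Predicted {predicted:.1f}% in 6 hours: plan a scale-out now")
```

A real prediction engine would account for seasonality and noise; the point here is only that acting on a forecast lets the owner respond before the threshold is actually crossed.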

4. Container & Microservice support

  • Understand the resource utilization of your containers that are hosted in Cloud & On-Premise.
  • Know the bottlenecks around the Microservices and tune your environment for the spikes in load.
  • Provides full support for monitoring applications distributed across your local host & containers in cloud in a multi-cluster setup.

5. Root cause analysis made simple

  • Enables quick root cause analysis by analysing the various causes captured by ZIF Monitor instead of going through the environment layer by layer. This frees up time to focus on solving and containing the problem instead of spending effort on identifying the root cause.
  • Provides insights across your workload including the impact due to 3rd party layers as well.

6. Automation

  • Irrespective of whether the workload and instance are on-premise, on Azure, on AWS, or with another provider, the ZIF automation module can automate activities ranging from basic to complex.

7. Ensure End User Experience

  • Helps to improve the experience of end-users who are served by workloads in the cloud.
  • ZIF tracing follows every request of every user, which lets ZIF unearth performance bottlenecks across all layers. This in turn helps to address the problem and thereby improve the user experience.
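To show how request traces can point to a bottleneck, here is a minimal sketch that totals the time spent in each layer across traced requests and reports the slowest one. The `Span` structure, layer names, and timings are hypothetical and do not reflect ZIF's data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    request_id: str
    layer: str
    duration_ms: float


def slowest_layer(spans: list) -> tuple:
    """Return (layer, total_ms) for the layer with the most accumulated time."""
    if not spans:
        raise ValueError("no spans to analyse")
    totals = defaultdict(float)
    for span in spans:
        totals[span.layer] += span.duration_ms
    return max(totals.items(), key=lambda item: item[1])


# Hypothetical spans from two traced user requests.
spans = [
    Span("req-1", "web", 12.0),
    Span("req-1", "api", 30.0),
    Span("req-1", "database", 140.0),
    Span("req-2", "web", 10.0),
    Span("req-2", "api", 35.0),
    Span("req-2", "database", 95.0),
]

layer, total_ms = slowest_layer(spans)
print(f"Bottleneck: {layer} ({total_ms:.0f} ms across all requests)")
```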

Cloud and Container Platform Support

ZIF seamlessly integrates with the following cloud & container environments:

  • Microsoft Azure
  • AWS
  • Google Cloud
  • Grafana Cloud
  • Docker
  • Kubernetes

About the Author

Suresh Kumar Ramasamy


Suresh heads the Monitor component of ZIF at GAVS. He has 20 years of experience in Native Applications, Web, Cloud, and Hybrid platforms from Engineering to Product Management. He has designed & hosted monitoring solutions. He has been instrumental in bringing together the components that make up the Environment Performance Management suite of ZIF Monitor.

Suresh enjoys playing badminton with his children. He is passionate about gardening, especially medicinal plants.

Growing Importance of Business Service Reliability

Business services are a set of business activities delivered to an outside party, such as a customer or a partner. Successful delivery of business services often depends on one or more IT services. For example, an IT business service that supports “order to cash” could be a “supply chain service”. The supply chain service could be delivered by an application such as SAP, with the customer of that service being an employee in finance/accounting using the application to perform customer-facing services such as accounts receivable, or the collection of cash from an outside party. A business service is not simply the application that the end-user sees; it is the entire chain that supports the delivery of the service, including physical and virtualized servers, databases, middleware, storage, and networks. A failure in any of these can affect the service, so it is crucial that IT organizations have an integrated, accurate, and up-to-date view of these components and of how they work together to provide the service.

The technologies for Social Networking, Mobile Applications, Analytics, Cloud (SMAC), and Artificial Intelligence (AI) are redefining the business and the services that businesses provide. Their widespread usage is changing the business landscape, increasing reliability and availability to levels that were unimaginable even a few years ago.

Availability versus Reliability

At first glance, it might seem that if a service has a high availability then it should also have high reliability. However, this is not necessarily the case. Availability and Reliability have different meanings, serve different purposes, and require different strategies to maintain desired standards of service levels. Reliability is the measure of how long a business service performs its intended function, whereas availability is the measure of the percentage of time a business service is operable. For example, a business service may be available 90% of the time, but reliable only 75% of the time from a performance standpoint. Service reliability can be seen as:

  • Probability of success
  • Durability
  • Dependability
  • Quality over time
  • Availability to perform a function

Merely having a service available isn’t sufficient. When a business service is available, it should actually serve the intended purpose under varying and unexpected conditions. One way to measure this performance is to evaluate the reliability of the service that is available to consume. The performance of a business service is now rated not by its availability, but by how consistently reliable it is. Take the example of mobile services: 4 bars of signal strength on your smartphone do not guarantee the quality of the call you receive or are about to make. Organizations need to measure how well the service fulfills the necessary business performance needs.
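One way to make the distinction concrete is to record, for each periodic check, both whether the service was reachable and whether it met its performance target. The sketch below is one possible operationalization, not a formula given in this article; the `Probe` structure and the sample data are hypothetical, chosen to reproduce the 90% versus 75% example mentioned earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    reachable: bool   # did the service respond at all?
    met_target: bool  # did it also meet its performance target?


def availability(probes: list) -> float:
    """Share of checks in which the service was reachable."""
    if not probes:
        raise ValueError("no probes recorded")
    return sum(p.reachable for p in probes) / len(probes)


def reliability(probes: list) -> float:
    """Share of checks in which the service was reachable and performed as intended."""
    if not probes:
        raise ValueError("no probes recorded")
    return sum(p.reachable and p.met_target for p in probes) / len(probes)


# Hypothetical results of 20 periodic checks.
probes = (
    [Probe(True, True)] * 15      # reachable and fast enough
    + [Probe(True, False)] * 3    # reachable but too slow to be useful
    + [Probe(False, False)] * 2   # unreachable
)

print(f"Availability: {availability(probes):.0%}")
print(f"Reliability:  {reliability(probes):.0%}")
```

The three "reachable but too slow" checks are the signal bars that do not translate into a good call: they count towards availability but not towards reliability.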

Recognizing the importance of reliability, Google initiated Site Reliability Engineering (SRE) practices with a mission to protect, provide for, and progress the software and systems behind all of Google’s public services — Google Search, Ads, Gmail, Android, YouTube, and App Engine, to name just a few — with an ever-watchful eye on their availability, latency, performance, and capacity.

Zero Incident Framework™ (ZIF)

GAVS Technologies developed an AIOps-based TechOps platform, Zero Incident Framework™ (ZIF), that enables proactive detection and remediation of incidents. The ZIF platform is available in two versions for our customers to evaluate and experience the power of AI-driven Business Service Reliability:

ZIF Business Xpress: ZIF Business Xpress has been engineered for enterprises to evaluate AIOps before adoption. 10 to 40 devices can be connected to ZIF Business Xpress to experiment with the value proposition.

ZIF Business: Targeted for enterprise-wide adoption.

For more details, please visit https://zif.ai

About the Author:

Sri Chaganty


Sri is a serial entrepreneur with over 30 years’ experience delivering creative, client-centric, value-driven solutions for bootstrapped and venture-backed startups.

Modern IT Infrastructure

Infrastructure today has grown beyond the physical confines of the traditional data center, has spread its wings to the cloud, and is increasingly distributed, virtual, and abstract. With the cloud gaining wide acceptance, most enterprises have their workloads spread across data centers, colocations, multi-cloud, and edge locations. On-premise infrastructure is also being replaced by Hyperconverged Infrastructure (HCI), where software-defined, virtualized compute, storage, and network are in one single system, greatly simplifying IT operations. Infrastructure is also becoming increasingly elastic: it scales & shrinks on demand and doesn’t have to be provisioned upfront.

Let’s look at a few interesting technologies that are steering the modern IT landscape.

Containers and Serverless

Traditional application deployment on physical servers comes with the overhead of managing the infrastructure, middleware, development tools, and everything in between. Application developers would rather have this grunt work handled by someone else, so they can focus on just their applications. This is where containers and serverless technologies come into the picture. Both are cloud-based offerings and provide different levels of abstraction, hiding the layers beyond the front end from the developer. They are typically used to deploy smaller components of monolithic applications, microservices, and functions.

A container is like an all-in-one box, containing the app and all its dependencies such as libraries, executables & config files. The containerized application is highly portable, will run anywhere the container runtime is installed, and behaves the same regardless of the OS or hardware it is deployed on. Containers give developers great flexibility and control since they cater to specific application requirements such as the OS and software versions. The flip side is that there is still a need for manual maintenance of the runtime environment, such as security patches and software updates. Secondly, the flexibility they afford translates into high operational costs, since they lack agility in scaling.

Serverless technologies provide much greater abstraction of the OS and infrastructure. ‘Serverless’, though, does not imply that there are no servers; it just means application developers do not have to worry about the underlying OS, the server environment, or the infrastructure that their applications will be deployed on. Serverless is event-driven and is based on the premise that the application is split into functions that get executed based on events. The developer only needs to deploy function code and define the event(s) that will trigger it! The rest of the magic is done by the cloud service provider (with the help of third parties).
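The event-driven model can be illustrated with a small local simulation. This is not any cloud provider's API: the `on` decorator and `emit` function below are hypothetical stand-ins for the registration and runtime that a provider would supply, and the event name is made up.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], str]

# Stand-in for the provider's registry of functions per event type.
_handlers: Dict[str, List[Handler]] = {}


def on(event_type: str) -> Callable[[Handler], Handler]:
    """Register a function to run whenever `event_type` is emitted."""
    def register(func: Handler) -> Handler:
        _handlers.setdefault(event_type, []).append(func)
        return func
    return register


def emit(event_type: str, payload: dict) -> List[str]:
    """Stand-in for the provider's runtime: invoke every registered function."""
    return [handler(payload) for handler in _handlers.get(event_type, [])]


# The developer's part: the function code and the event that triggers it.
@on("file.uploaded")
def make_thumbnail(event: dict) -> str:
    return f"thumbnail created for {event['name']}"


print(emit("file.uploaded", {"name": "photo.png"}))
```

Everything outside `make_thumbnail` and its trigger declaration is what the developer would not have to write or operate in a real serverless setup.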

The biggest advantage of serverless is that consumers are billed only for the running time of the function instances or the number of times the function gets executed, depending on the provider. Since it has close to zero administrative overhead, it enables rapid iterative deployment and faster time to market. Since the architecture is intrinsically auto-scaling, it is a perfect fit for applications with unpredictable usage patterns. The other side of the coin is that developers need to deal with a black-box back-end environment, so holistic testing and debugging of the application become a challenge. Vendor lock-in is a real problem since the consumer is restricted by the technology stack supported by the vendor. Since serverless best practices dictate light, isolated functions with limited scope, building complex applications can get difficult. Function as a Service (FaaS) is a subset of serverless computing.
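The pay-per-use billing described above can be sketched as a simple calculation. The prices below are made-up placeholders, not any provider's actual rates; set either rate to zero to model a provider that bills only by running time or only by invocation count.

```python
def pay_per_use_cost(
    invocations: int,
    avg_seconds: float,
    price_per_second: float,
    price_per_invocation: float,
) -> float:
    """Cost when billed for running time and/or number of executions."""
    return invocations * (avg_seconds * price_per_second + price_per_invocation)


# Hypothetical workload and prices, for illustration only.
monthly_cost = pay_per_use_cost(
    invocations=1_000_000,
    avg_seconds=0.2,
    price_per_second=0.00002,
    price_per_invocation=0.0000002,
)
print(f"Estimated monthly cost: ${monthly_cost:.2f}")
```

Because the cost scales with actual invocations, a month with no traffic costs nothing under this model, unlike a server that is provisioned upfront.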

Internet of Things (IoT)

IoT is about connecting everyday things – beyond just computing devices or smartphones – to the internet. It is possible to convert practically anything into an IoT device, with a computer chip installation & internet access, and have it communicate independently with the internet – without any human intervention. But why would we want everyday things like for instance a watch or a light bulb, to become IoT devices? It’s in a bid to bridge the chasm between the physical and digital worlds and make the environment around us more intelligent, communicative, and responsive to our needs.

IoT’s use cases are just about everywhere: in personal devices, self-driving cars, smart homes, smart workspaces, smart cities, and industries across all verticals. For instance, live data from sensors in products while in use gives good visibility into their operations on the ground, helps remediate issues proactively & aids improvements in design/manufacturing processes.

The Industrial Internet of Things (IIoT) is the use of IoT data in business, in tandem with Big Data, AI, Analytics, Cloud, and High-speed networks, with the primary goal of finding efficient business models to improve productivity & optimize expenditure. The need for real-time response to sensor data and advanced analytics to power insights has increased the demand for 5G networks for speed, cloud technologies for storage and computing, edge computing to reduce latency, and hyper-scale data centers for rapid scaling.

IoT devices extend an organization’s infrastructure landscape, and the likelihood that IT staff may not even be aware of all the IoT devices in it is a security nightmare that could open corporate networks & sensitive data to attacks. Global standards and regulations for IoT device security are in the works. Until then, it is up to the enterprise security team to safeguard against IoT-related vulnerabilities.

Hyperscaling

The ability of infrastructure to rapidly scale out on a massive level is called hyperscaling.

Unprecedented needs for high-power computing and on-demand massive scalability have given rise to a new breed of hyperscale computing architectures, where traditional elements are replaced by hyper-converged, software-defined infrastructure with a high degree of virtualization. These hyperscale environments are characterized by high-density server racks, with software designed and specifically built for scale-out environments. Since high density implies heavy power consumption, heating problems need to be handled by specialized cooling solutions like liquid cooling. Hyperscale data center operators usually look for renewable energy options to save on power & cooling.

Today, there are several hundred hyperscale data centers in the world, with the dominant players being Microsoft, Google, Apple, Amazon & Facebook.

Edge Computing

Edge computing, as the name indicates, means moving data processing away from distant servers or the cloud, closer to the source of data. This is to reduce latency and the network bandwidth used for back & forth communication between the data source and the server. The edge, also called the network edge, refers to where the data source connects to the internet. The explosive growth of IoT and of applications that require real-time computing and analytics, such as self-driving cars, virtual reality and smart cities, is paving the way for edge computing. Most cloud providers now provide geographically distributed edge servers. As with IoT devices, data at the edge can be a ticking security time bomb necessitating appropriate security mechanisms.

The evolution of IT technologies continuously raises the bar for the IT team. IT personnel have been forced to move beyond legacy practices and mindsets & constantly up-skill themselves to be able to ride the wave. For customers pampered by sophisticated technologies, round the clock availability of systems and immersive experiences have become baseline expectations. With more & more digitalization, there is increasing reliance on IT infrastructure and hence lesser tolerance for outages. The responsibilities of maintaining a high-performing IT infrastructure with near-zero downtime fall on the shoulders of the IT operations team.

This has underscored the importance of AI in IT operations since IT needs have now surpassed human capabilities. Gavs’ AI-powered Platform for IT operations, ZIF, caters to the entire ITOps spectrum, right from automated discovery of the landscape, monitoring, to predictive and prescriptive analytics that proactively drive the organization towards zero incidents. For more details, please visit https://zif.ai

About the Author:

Padmapriya Sridhar

Priya is part of the Marketing team at GAVS. She is passionate about Technology, Indian Classical Arts, Travel, and Yoga. She aspires to become a Yoga Instructor someday!

Proactive Monitoring

Is your IT environment proactively monitored?

It is important to have the right monitoring solution for an enterprise’s IT environment. More than that, it is imperative to leverage the right solution and deploy it for the appropriate requirements. In this context, the IT environment includes but is not limited to Applications, Servers, Services, End-User Devices, Network devices, APIs, Databases, etc. Towards that, let us understand the need and importance of Proactive Monitoring. This has a direct role in achieving the journey towards Zero Incident Enterprise™. Let us unravel the difference between reactive and proactive monitoring.

Reactive Monitoring – When a problem occurs in an IT environment, it gets notified through monitoring and the concerned team acts on it to resolve the issue. The problem could be as simple as slowness/poor performance, or as extreme as the unavailability of services, like a website going down or a server crashing, leading to loss of business and revenue.

Proactive Monitoring – There are two levels of proactive monitoring:

  • Symptom-based proactive monitoring is all about identifying the signals and symptoms of an issue in advance and taking appropriate and immediate action to nip the root-cause in the bud.
  • Synthetic-based proactive monitoring is achieved through Synthetic Transactions. Performance bottlenecks or failures are identified much in advance, even before the actual user or the dependent layer encounters the situation.

Symptom-based proactive monitoring is a USP of the ZIF Monitor module. For example, take the case of CPU related monitoring. It is common to monitor the CPU utilization and act based on that. But Monitor doesn’t just focus on CPU utilization; there are a lot of underlying factors which cause the CPU utilization to go high. To name a few:

  • Processor queue length 
  • Processor context switches
  • Processes that are contributing to high CPU utilization

It is important to arrest these brewing factors at the right time. For example, in the case of Processor Queue length, a continuous or sustained queue of greater than 2 threads is generally an indication of congestion at the processor level. Of course, in a multiple processor environment, we need to divide the queue length by the number of processors that are servicing the workload. As a remedy, the following can be done:

1) the number of threads can be limited at the application level

2) unwanted processes can be killed to help close the queued items

3) the processor can be upgraded to help keep the queue length under control, which eventually will control the CPU utilization.

The above is a sample demonstration of finding the symptoms and signals and arresting them proactively. ZIF’s Monitor not only monitors these symptoms, but also suggests the remedy through recommendations from SMEs.
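The processor queue length rule of thumb can be sketched in a few lines. This is only an illustration and not how ZIF Monitor implements it: the function name, the sample format and the choice of three consecutive samples as the meaning of "sustained" are all assumptions.

```python
from typing import Sequence


def is_processor_congested(
    queue_lengths: Sequence[float],
    processor_count: int,
    threshold: float = 2.0,       # "greater than 2 threads" from the rule of thumb
    sustained_samples: int = 3,   # assumed definition of "sustained"
) -> bool:
    """Return True if the per-processor queue length stays above `threshold`
    for at least `sustained_samples` consecutive samples."""
    if processor_count < 1:
        raise ValueError("processor_count must be at least 1")
    streak = 0
    for length in queue_lengths:
        # Normalize by the number of processors servicing the workload.
        if length / processor_count > threshold:
            streak += 1
            if streak >= sustained_samples:
                return True
        else:
            streak = 0
    return False
```

The same samples can mean congestion on a 2-processor host and a healthy queue on a 4-processor host, which is why the division comes before the comparison.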

Synthetic monitoring (SM) is done by simulating the transactions through the tool without depending on the end-user to perform them. The advantages of synthetic monitoring are:

  • it uses automated transaction simulation technology
  • it helps to monitor the environment round-the-clock 
  • it helps to validate from across different geographic locations 
  • it provides options to choose the number of flows/transactions to be verified
  • it is proactive – identifies performance bottlenecks or failures much in advance even before the actual user or the dependent layer encounters the situation

How does Synthetic Monitoring (SM) work?

It works through 3 simple steps:

1) Record key transactions – Any number of transactions can be recorded; if required, all the functional flows can be recorded. An example of a transaction in an e-commerce website could be as simple as logging in and viewing the product catalogue, or as elaborate as logging in, viewing the product catalogue, moving an item to the cart, checking out, making the payment and logging out. For simulation purposes, dummy credit cards are used during payment gateway transactions.

2) Schedule the transactions – Decide how often each transaction should run, for example every 5 minutes or every x hours.

3) Choose the location from which these transactions need to be triggered – The SM is available as on-premise or cloud options. Cloud SM provides the option to choose from the SM engines available across the globe (refer to the green dots in the figure below).

This is applicable mainly to web-based applications, but can also be used for the underlying APIs.

The SM solution has engines which run the recorded transactions against the target application. Once scheduled, the SM engine, hosted either on-premise or remotely (refer to the green dots in the figure, shown as a sample representation), will run the recorded transactions at a predefined interval. The SM dashboard provides insights as detailed under the benefits section below.
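To make the idea of an engine replaying a recorded transaction more concrete, here is a minimal sketch. It is not the ZIF implementation; `run_transaction`, `TransactionResult` and the representation of a recorded flow as a list of named callables are assumptions made for illustration. A scheduler would call `run_transaction` at the chosen interval from each chosen location.

```python
import time
import traceback
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# A recorded step: a name plus a callable that performs it (hypothetical shape).
Step = Tuple[str, Callable[[], None]]


@dataclass
class TransactionResult:
    location: str
    succeeded: bool
    latency_seconds: float
    failed_step: Optional[str] = None
    stack_trace: Optional[str] = None
    step_latencies: Dict[str, float] = field(default_factory=dict)


def run_transaction(steps: List[Step], location: str) -> TransactionResult:
    """Replay the steps in order, timing each one and stopping at the first failure."""
    started = time.perf_counter()
    step_latencies: Dict[str, float] = {}
    for name, action in steps:
        step_started = time.perf_counter()
        try:
            action()
        except Exception:
            # Keep the stack trace so the failure can be diagnosed later.
            return TransactionResult(
                location=location,
                succeeded=False,
                latency_seconds=time.perf_counter() - started,
                failed_step=name,
                stack_trace=traceback.format_exc(),
                step_latencies=step_latencies,
            )
        step_latencies[name] = time.perf_counter() - step_started
    return TransactionResult(
        location=location,
        succeeded=True,
        latency_seconds=time.perf_counter() - started,
        step_latencies=step_latencies,
    )
```

Storing one `TransactionResult` per run and per location is enough to chart latency trends over time and to compare availability across geographies.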

Benefits of SM

As the SM does the synthetic transactions, it provides various insights like,

  • The latency in the transactions, i.e. the speed at which the transaction is happening. This also gives a trend analysis of how the application is performing over a period.
  • If there are any failures during the transaction, SM provides the details of the failure including the stack trace of the exception. This makes fixing the failure simpler, by avoiding the time spent in debugging.
  • In case of failure, SM provides insights into the parameter details that triggered the failure.
  • Unlike real user monitoring, there is the flexibility to test all flows or at least all critical flows without waiting for the user to trigger or experience it.
  • This not only unearths the problem at the application tier but also provides deeper insights while combining it with Application, Server, Database, Network Monitoring which are part of the ZIF Monitor suite.
  • Applications working fine under one geography may fail in a different geography due to various factors like network, connectivity, etc. SM will exactly pinpoint the availability and performance across geographies.

For more detailed information on GAVS’ Monitor, or to request a demo, please visit https://zif.ai/products/monitor/

About the Author

Suresh Kumar Ramasamy


Suresh heads the Monitor component of ZIF at GAVS. He has 20 years of experience in Native Applications, Web, Cloud and Hybrid platforms from Engineering to Product Management. He has designed & hosted the monitoring solutions. He has been instrumental in conglomerating components to structure the Environment Performance Management suite of ZIF Monitor.

Suresh enjoys playing badminton with his children. He is passionate about gardening, especially medicinal plants.


Monitoring Microservices and Containers

Monitoring applications and infrastructure is a critical part of IT Operations. Among other things, monitoring provides alerts on failures, alerts on deteriorations that could potentially lead to failures, and performance data that can be analysed to gain insights. AI-led IT Ops Platforms like ZIF use such data from their monitoring component to deliver pattern recognition-based predictions and proactive remediation, leading to improved availability, system performance and hence better user experience.

The shift away from monolith applications towards microservices has posed a formidable challenge for monitoring tools. Let’s first take a quick look at what microservices are, to understand better the complications in monitoring them.

Monoliths vs Microservices

A single application (monolith) is split into a number of modular services called microservices, each of which typically caters to one capability of the application. These microservices are loosely coupled, can communicate with each other and can be deployed independently.

Quite likely the trigger for this architecture was the need for agility. Since microservices are stand-alone modules, they can follow their own build/deploy cycles enabling rapid scaling and deployments. They usually have a small codebase which aids easy maintainability and quick recovery from issues. The modularity of these microservices gives complete autonomy over the design, implementation and technology stack used to build them.

Microservices run inside containers that provide their execution environment. Although microservices could also be run in virtual machines (VMs), containers are preferred since they are comparatively lightweight as they share the host’s operating system, unlike VMs. Docker and CoreOS Rkt are a couple of commonly used container solutions while Kubernetes, Docker Swarm, and Apache Mesos are popular container orchestration platforms. The image below depicts microservices for hiring, performance appraisal, rewards & recognition, payroll, analytics and the like linked together to deliver the HR function.

Challenges in Monitoring Microservices and Containers

Since all good things come at a cost, you are probably wondering what it is here… well, the flip side to this evolutionary architecture is increased complexity! These are some contributing factors:

Exponential increase in the number of objects: With each application replaced by multiple microservices, 360-degree visibility and observability into all the services, their interdependencies, their containers/VMs, communication channels, workflows and the like can become very elusive. When one service goes down, the environment gets flooded with notifications not just from the service that is down, but from all services dependent on it as well. Sifting through this cascade of alerts, eliminating noise and zeroing in on the crux of the problem becomes a nightmare.
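One simple way to cut through such a cascade is to use the dependency map: an alerting service whose own dependencies are all healthy is a more likely origin than one that merely sits downstream of another alerting service. The sketch below shows that heuristic only; the function name and the shape of the dependency map are assumptions, and it does not handle circular dependencies or failures that leave no alert.

```python
from typing import Dict, Set


def probable_root_causes(
    depends_on: Dict[str, Set[str]], alerting: Set[str]
) -> Set[str]:
    """Return the alerting services none of whose direct dependencies are alerting."""
    return {
        service
        for service in alerting
        if not (depends_on.get(service, set()) & alerting)
    }


# Hypothetical example: checkout -> payment -> ledger, checkout -> cart
dependencies = {
    "checkout": {"payment", "cart"},
    "payment": {"ledger"},
    "cart": set(),
    "ledger": set(),
}
suspects = probable_root_causes(dependencies, {"checkout", "payment", "ledger"})
```

Here `checkout` and `payment` are both alerting, but each depends on another alerting service, so only `ledger` is left as the suspect.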

Shared Responsibility: Since processes are fragmented and the responsibility for their execution (for instance, a customer ordering a product online) is shared amongst the services, basic assumptions of traditional monitoring methods are challenged. The lack of a simple linear path, the need to collate data from different services for each process, and the inability to map a client request to a single transaction because of the number of services involved make performance tracking that much more difficult.

Design Differences: Due to the design/implementation autonomy that microservices enjoy, they could come with huge design differences and be implemented using different technology stacks. They might be using open source or third-party software that makes it difficult to instrument their code, which in turn affects their monitoring.

Elasticity and Transience: Elastic landscapes, where infrastructure scales or collapses based on demand and instances appear & disappear dynamically, have changed the game for monitoring tools. They need to be updated to handle elastic environments, be container-aware and stay in-step with the provisioning layer. A couple of interesting aspects to handle are recognizing the difference between an instance that is down and an instance that is no longer available, and the fact that data from instances that are no longer alive continues to have value for analysis of operational efficiency or past performance.
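The distinction between "down" and "no longer available" can be sketched by cross-checking monitoring data against the provisioning layer. This is an illustrative sketch only: `classify_instance`, the `InstanceState` values and the idea of receiving the two ID sets as inputs are assumptions, not part of any specific tool.

```python
from enum import Enum
from typing import Set


class InstanceState(Enum):
    UP = "up"
    DOWN = "down"        # still provisioned but not responding: raise an alert
    RETIRED = "retired"  # removed by the provisioning layer: keep history, no alert


def classify_instance(
    instance_id: str, provisioned: Set[str], responding: Set[str]
) -> InstanceState:
    """Classify an instance using the provisioning layer's current inventory."""
    if instance_id in responding:
        return InstanceState.UP
    if instance_id in provisioned:
        return InstanceState.DOWN
    return InstanceState.RETIRED
```

Only the `DOWN` state should page someone; metrics already collected for `RETIRED` instances would be kept for later analysis rather than discarded.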

Mobility: This is another dimension of dynamic infra where objects don’t necessarily stay in the same place; they might be moved between data centers or clouds for better load balancing, maintenance needs or outages. The monitoring layer needs to arm itself with new strategies to handle moving targets.

Resource Abstraction: Microservices deployed in containers do not have a direct relationship with their host or the underlying operating system. This abstraction is what helps seamless migration between hosts but comes at the expense of complicating monitoring.

Communication over the network: The many moving parts of distributed applications rely completely on network communication. Consequently, the increase in network traffic puts a heavy strain on network resources necessitating intensive network monitoring and a focused effort to maintain network health.

What needs to be measured

This is a high-level laundry list of what needs to be done/measured while monitoring microservices and their containers.

Auto-discovery of containers and microservices:

As we’ve seen, monitoring microservices in a containerized world is a whole new ball game. In the highly distributed, dynamic infra environment where ephemeral containers scale, shrink and move between nodes on demand, traditional monitoring methods using agents to get information will not work. The monitoring system needs to automatically discover and track the creation/destruction of containers and explore services running in them.

Microservices:

  • Availability and performance of individual services
  • Host and infrastructure metrics
  • Microservice metrics
  • APIs and API transactions
    • Ensure API transactions are available and stable
    • Isolate problematic transactions and endpoints
  • Dependency mapping and correlation
  • Features relating to traditional APM

Containers:

  • Detailed information relating to each container
  • Health of clusters, master and slave nodes
    • Number of clusters
    • Nodes per cluster
    • Containers per cluster
  • Performance of core Docker engine
  • Performance of container instances

Things to consider while adapting to the new IT landscape

Granularity and Aggregation: With the increase in the number of objects in the system, it is important to first understand the performance target of what’s being measured – for instance, if a service targets 99% uptime (yearly), polling it every minute would be overkill. Based on this, data granularity needs to be set prudently for each aspect measured, and data can be aggregated where appropriate. This is to prevent data inundation that could overwhelm the monitoring module and drive up costs associated with data collection, storage, and management.
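The arithmetic behind this is straightforward: a 99% yearly uptime target leaves a downtime budget of about 87.6 hours (roughly 3.65 days) per year. The sketch below computes that budget and shows one simple way to aggregate fine-grained samples into coarser averages; both function names are hypothetical and the averaging scheme is just one possible choice.

```python
from statistics import mean
from typing import List, Sequence


def downtime_budget_hours(uptime_target: float, period_hours: float = 365 * 24) -> float:
    """Hours of downtime allowed in the period for a target such as 0.99."""
    if not 0 < uptime_target <= 1:
        raise ValueError("uptime_target must be in the range (0, 1]")
    return (1 - uptime_target) * period_hours


def aggregate(samples: Sequence[float], bucket_size: int) -> List[float]:
    """Replace each run of `bucket_size` samples with its average."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    return [
        mean(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]
```

Comparing the downtime budget with the polling interval gives a quick sanity check on whether the chosen granularity is finer than the target really needs.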

Monitor Containers: The USP of containers is the abstraction they provide to microservices, encapsulating and shielding them from the details of the host or operating system. While this makes microservices portable, it makes them hard to reach for monitoring. Two recommended solutions for this are to instrument the microservice code to generate stats and/or traces for all actions (which can be used for distributed tracing) and to get all container activity information through host operating system instrumentation.

Track Services through the Container Orchestration Platform: While we could obtain container-level data from the host kernel, it wouldn’t give us holistic information about the service since there could be several containers that constitute a service. Container-native monitoring solutions could use metadata from the container orchestration platform by drilling into appropriate layers of the platform to obtain service-level metrics. 
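As a sketch of what rolling container-level data up to the service level could look like, the example below groups per-container CPU readings by a service label of the kind an orchestration platform attaches to containers. The record shape (`labels`, `service`, `cpu_cores`) is an assumption made for illustration and does not correspond to a specific platform's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def service_cpu_totals(containers: Iterable[Mapping]) -> Dict[str, float]:
    """Sum per-container CPU usage by the service label from the orchestrator."""
    totals: Dict[str, float] = defaultdict(float)
    for container in containers:
        # Containers without a service label are grouped so they stay visible.
        service = container.get("labels", {}).get("service", "unlabelled")
        totals[service] += container["cpu_cores"]
    return dict(totals)


# Hypothetical container records enriched with orchestrator metadata
snapshot = [
    {"id": "c1", "labels": {"service": "payroll"}, "cpu_cores": 0.25},
    {"id": "c2", "labels": {"service": "payroll"}, "cpu_cores": 0.5},
    {"id": "c3", "labels": {"service": "hiring"}, "cpu_cores": 1.0},
    {"id": "c4", "cpu_cores": 0.125},
]
totals = service_cpu_totals(snapshot)
```

The host kernel alone could report the four containers, but only the orchestrator's metadata says that `c1` and `c2` together make up the payroll service.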

Adapt to dynamic IT landscapes: As mentioned earlier, today’s IT landscape is dynamically provisioned, elastic and characterized by mobile and transient objects. Monitoring systems themselves need to be elastic and deployable across multiple locations to cater to distributed systems and leverage native monitoring solutions for private clouds.

API Monitoring: Monitoring APIs can provide a wealth of information in the black box world of containers. Tracking API calls from the different entities – microservices, container solution, container orchestration platform, provisioning system, host kernel – can help extract meaningful information and make sense of the fickle environment.

Watch this space for more on Monitoring and other IT Ops topics. You can find our blog on Monitoring for Success here, which gives an overview of the Monitor component of GAVS’ AIOps Platform, Zero Incident Framework™ (ZIF). You can Request a Demo or Watch how ZIF works here.

About the Author:

Sivaprakash Krishnan


Bio – Siva is a long timer at Gavs and has been with the company for close to 15 years. He started his career as a developer and is now an architect with a strong technology background in Java, Big Data, DevOps, Cloud Computing, Containers and Micro Services. He has successfully designed & created a stable Monitoring Platform for ZIF, and designed & driven cloud assessment and migration, enterprise BRMS and IoT based solutions for many of our customers. He is currently focused on building ZIF 4.0, a new gen business-oriented TechOps platform.

Padmapriya Sridhar


Bio – Priya is part of the Marketing team at GAVS. She is passionate about Technology, Indian Classical Arts, Travel and Yoga. She aspires to become a Yoga Instructor some day!