Enterprise Observability
By Eric Tice
Observability, or the capacity to understand systems based on their outputs, is a crucial need across enterprises to make problem resolution and process improvement more efficient.
Organizations often struggle to scale and maintain the implementation and delivery of their enterprise applications and infrastructure because they lack information. This is certainly not a revelation to most, but COVID has opened the eyes of many leaders to the need for more proactive information gathering. Reactive project monitoring tools and inadequate metrics have given way to a more proactive approach to monitoring that provides insights not only into technical deficiencies, but also into process and resource issues. This shift has been toward providing deeper visibility into distributed applications across the enterprise, allowing for faster, automated identification of issues and their resolution. This is known as observability.
This article provides a perspective on observability and an approach to implementation using leading-edge open-source solution blueprints. This approach can dramatically improve resource efficiency and accuracy of system scaling for enterprise modernization, cloud, or digital transformation projects.
Observability has become a common topic in the industry of late. Its key consideration is the need to produce a more holistic view of end-to-end application performance and system stability. Observability is very different from simply monitoring applications or infrastructure; it provides true visibility into enterprise processes and tooling, not just error resolution through log monitoring. Building a holistic observability solution involves:
- Retrieving all data from across the enterprise and storing it in a single, indexed data set.
- Providing a flexible, extensible, and user-friendly dashboard to access and manipulate relevant metrics.
- Enabling business insights as well as traditional operations insights, to support key business decisions on future scaling and resource allocation.
- Automating all data collection and integration with key third-party components for detailed analysis and faster issue resolution.
- Continuously innovating to improve data ingestion and aggregation by incorporating capabilities such as AIOps, event correlation, anomaly detection, and other AI/ML advancements.
Multiple tools seek to fill part or all of these gaps, ranging from log monitoring tools, like Splunk, to APM tools, like Dynatrace, New Relic, or AppDynamics. These tools can often provide a fair amount of data by themselves, but they require integration with one another, and key enhancements to the platform involve longer lead times. Elastic (https://www.elastic.co/observability) has taken observability a step further by building out an offering initially based on its BELK (Beats, Elasticsearch, Logstash, Kibana) tooling, incorporating user experience data, log data, metrics, APM, uptime, and other data into a unified solution.
This solution is offered primarily as part of the Elastic commercial enterprise offering and is fairly new, but it can reduce the need to integrate multiple proprietary tools while achieving a similar result.
Leveraging Open-Source
Wipro has built a production-ready observability solution that continues to gain traction for its flexibility, high quality, innovation, cost savings, and standards-based integration using open-source tooling. Leveraging Elastic’s community release of Elasticsearch as a foundation, the solution provides a blueprint for building an AIOps-based contextual solution.
In the last year, Wipro has designed and implemented custom observability solutions for clients in the travel, banking, and telecom verticals. Each of these industries has experienced challenges with prolonged issue resolution and instability of their applications or infrastructure, due to either a lack of monitoring or a scope too limited to identify the issues as they occur.
Wipro’s Observability Solution (Unified Monitoring) has helped these organizations:
- Achieve faster issue resolution and ticket automation,
- Increase visibility into all tiers of their application ecosystems, and
- Optimize their customer experience by incorporating advanced features, like anomaly detection.
The Wipro Observability Solution is a holistic solution that incorporates complete end-to-end data collection with fully customizable components to build a contextualized solution.
The observability solution is broken into four layers:
Monitored Assets
The solution can pull data from any asset, including networks, IoT devices, containers, database components, and many others. Using multiple agents and APIs (such as Elastic Beats, Zabbix Agents, Prometheus Exporters, etc.), the data is collected and pushed to the data acquisition layer.
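Because each agent reports in its own format, a collector typically maps readings to a common shape before forwarding them. The sketch below illustrates that idea only; the `Event` class, the two adapter functions, and all field names are hypothetical and are not the actual Beats, Zabbix, or Prometheus payload formats.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Hypothetical common envelope handed to the data acquisition layer."""
    source: str          # which kind of agent produced the reading
    asset: str           # host, container, or device name
    metric: str
    value: float
    timestamp: datetime


def from_beat(doc: dict) -> Event:
    # Illustrative Beats-style document with an ISO 8601 timestamp.
    return Event(
        source="beat",
        asset=doc["host"],
        metric=doc["metric"],
        value=float(doc["value"]),
        timestamp=datetime.fromisoformat(doc["@timestamp"]),
    )


def from_exporter(sample: dict) -> Event:
    # Illustrative exporter-style sample with a Unix epoch timestamp.
    return Event(
        source="exporter",
        asset=sample["instance"],
        metric=sample["name"],
        value=float(sample["value"]),
        timestamp=datetime.fromtimestamp(sample["ts"], tz=timezone.utc),
    )
```

With a shared envelope like this, the downstream layers can treat a CPU reading from a server and a reading from an IoT device in the same way, regardless of which agent collected it.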
Data Acquisition Layer
The Data Acquisition layer processes or parses incoming data and stores it in transit, so that the pipeline has no single point of failure. Building on proven event-driven queueing with Kafka, the solution handles continuous high throughput without loss of data and segregates the data so that the processing layer can manage it more efficiently.
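The buffering and segregation described here can be illustrated with a small in-memory stand-in. This is only a sketch of the pattern, not Kafka itself: a real Kafka cluster persists and replicates messages across brokers, which this class does not do. The `TopicBuffer` class and the `observability.<type>` topic naming are assumptions made for illustration.

```python
from collections import defaultdict, deque


class TopicBuffer:
    """In-memory stand-in for a topic-based queue (hypothetical, not Kafka)."""

    def __init__(self) -> None:
        self._topics: dict[str, deque] = defaultdict(deque)

    def publish(self, event: dict) -> str:
        # Segregate by data type so each consumer reads only what it handles.
        topic = f"observability.{event.get('type', 'unknown')}"
        self._topics[topic].append(event)
        return topic

    def consume(self, topic: str, max_events: int = 100) -> list[dict]:
        # Events wait in the buffer until a consumer pulls them, so a slow
        # processing layer delays delivery instead of forcing producers to drop data.
        queue = self._topics[topic]
        batch = []
        while queue and len(batch) < max_events:
            batch.append(queue.popleft())
        return batch
```

Producers call `publish` and move on; the processing layer calls `consume` per topic at its own pace, which is the decoupling that lets collection continue while a downstream component is slow or restarting.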
Data Processing Layer
The Data Processing layer ingests, parses, and correlates the data using custom filters built with microservices, Logstash, and Fluentd, and streaming analytics with tools like Apache Spark. By incorporating AIOps capabilities such as event correlation, self-healing, and anomaly detection, the solution’s blueprint allows for more predictive, real-time data aggregation to improve insights and stability. Data is stored in Elasticsearch for optimal data retention and retrieval.
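To make the anomaly detection idea concrete, here is a minimal sketch that flags a reading when it deviates strongly from a rolling window of recent values (a z-score test). It is one simple, common technique chosen for illustration, not the method used in the solution; the class name, window size, and threshold are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flag values far from the mean of a sliding window (illustrative only)."""

    def __init__(self, window: int = 30, threshold: float = 3.0) -> None:
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Return True if value lies more than `threshold` std devs from the window mean."""
        is_anomaly = False
        if len(self.window) >= 5:  # wait for a minimal baseline first
            mu = mean(self.window)
            sigma = pstdev(self.window)
            if sigma > 0:  # a perfectly flat window gives no scale to compare against
                is_anomaly = abs(value - mu) / sigma > self.threshold
        self.window.append(value)
        return is_anomaly
```

Fed a stream of response times that hover around 100 ms, the detector stays quiet; a sudden 500 ms reading is flagged. Production-grade approaches usually also account for seasonality and trend, which this sketch ignores.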
Visualization Layer
The visualization layer provides customizable dashboards, using Kibana or Grafana, which allow business and operations users to build comprehensive views of end-to-end operations across the enterprise.
The image below shows an example of a Kibana Observability dashboard displaying data from across systems, application logs, APM, and aggregating business metrics.
Case
An example customization from a project for one of our travel industry clients included:
- BELK (Beats/Elasticsearch/Logstash/Kibana),
- ElastAlert for alerting, and
- Elastic’s APM capabilities to gather data from the Tomcat application server logs, the system, ActiveMQ, and several Java applications.
As the image below shows, this customization is a subset of the capabilities offered in the complete solution, contextualized to the client’s strategic goals.
The team also defined detailed metrics and designed custom dashboards to visualize the data in Kibana to provide a more holistic view into the production application ecosystem stability.
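The alerting component in a stack like this is typically driven by rules such as "raise an alert when a given number of matching events occur within a time window." The sketch below reproduces that logic in plain Python to show how such a rule behaves; it does not use ElastAlert's code or configuration format, and the `FrequencyRule` class is hypothetical. It assumes timestamps arrive in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class FrequencyRule:
    """Fire when at least `num_events` matches fall within `timeframe`."""

    def __init__(self, num_events: int, timeframe: timedelta) -> None:
        self.num_events = num_events
        self.timeframe = timeframe
        self._hits: deque = deque()

    def add_match(self, ts: datetime) -> bool:
        self._hits.append(ts)
        # Drop matches that have aged out of the sliding window.
        while self._hits and ts - self._hits[0] > self.timeframe:
            self._hits.popleft()
        if len(self._hits) >= self.num_events:
            self._hits.clear()  # reset so that one burst raises one alert
            return True
        return False
```

For example, a rule created as `FrequencyRule(3, timedelta(minutes=5))` would fire on the third error logged within five minutes, while the same three errors spread over an hour would not trigger it.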
Beyond Application to Process and Resource Observability
Organizations are typically more interested in the application, user experience, and system health data that allows them to understand issues as they occur after production implementation. This strategy often neglects one of the key considerations in building a stable enterprise: the resources engaged on projects and their processes during the SDLC. While the observability tooling mentioned can be enhanced or leveraged to solve some of the challenges, such as slow time to market or recurring issues during CI/CD, it is important to look at what data is being tracked and how it differs from production application data. Using open-source tooling such as Hygieia, organizations can provide a holistic view of the complete development and delivery process.
Hygieia provides a fully customizable, next-generation approach to end-to-end DevSecOps observability visualization. Hygieia is open-source and is maintained by a growing community that allows companies to monitor performance and efficiency through the entire development and delivery lifecycle, using a single pane of glass.
Hygieia provides the ability to collect data from any DevSecOps tooling, exposing the data via APIs and configurable dashboards, allowing you to visualize key metrics for your organization. These targeted dashboards provide insights into the resource and system performance and stability to both developers and executives.
The developer dashboard is focused on providing operations staff and developers with real-time metrics. It provides key metrics on the status of each step in the CI/CD process that can improve time to market and developer efficiency. Using one comprehensive view, technical users can more proactively manage system uptime and stability, which improves future scalability planning.
The executive dashboard provides broader insights into the SDLC process, focusing on resource activities and efficiency that are relevant to senior leadership. The dashboard provides enterprise-level metrics that help leaders plan the strategic scaling of resources and infrastructure needed for the growth and stability of DevSecOps processes.
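As an illustration of the kind of per-step metric such a dashboard might surface, the sketch below aggregates success rate and average duration for each pipeline stage. It is not Hygieia's API or data model; the `summarize_stages` function and the record fields (`stage`, `status`, `duration_s`) are hypothetical.

```python
from collections import defaultdict


def summarize_stages(runs: list[dict]) -> dict[str, dict]:
    """Aggregate per-stage success rate and mean duration in seconds."""
    totals = defaultdict(lambda: {"count": 0, "passed": 0, "seconds": 0.0})
    for run in runs:
        stage = totals[run["stage"]]
        stage["count"] += 1
        stage["passed"] += 1 if run["status"] == "success" else 0
        stage["seconds"] += run["duration_s"]
    return {
        name: {
            "success_rate": t["passed"] / t["count"],
            "avg_duration_s": t["seconds"] / t["count"],
        }
        for name, t in totals.items()
    }
```

A stage with a falling success rate or a rising average duration is a natural candidate for attention, since either one slows time to market.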
Many large organizations such as Verizon, Walmart, American Airlines, Wells Fargo, Ford, and others have implemented Hygieia to provide greater insights into their DevSecOps processes across the SDLC.
The growing demand for observability across the industry opens up opportunities to build new solutions that meet the enterprise’s need for visibility into its own data. Providing these key insights across the enterprise has become more important for application portfolios and DevSecOps alike. Wipro and many of our clients have already taken an active role in adopting and contributing to open-source solutions in observability.