How to control costs of your Observability stack
Observability is mandatory but really expensive. Commercial tools provide convenience at a price. Open source tools are now mature enough to challenge those commercial offerings and provide considerable savings.
Observability is the triumvirate of metrics, logs and traces produced by infrastructure, components and services of an application. Even a moderate application environment can produce a lot of data: one million or more metrics, hundreds of gigabytes of logs and traces. Processing all that data in near real time requires abundant compute resources and utilises a chunk of storage for retention. For these reasons many organisations choose to use a commercial observability service. However, with great convenience comes great cost.
Do It Yourself
Open source observability tools have matured considerably in recent years. Prometheus became a graduated CNCF project in 2018, followed by Jaeger in 2019. Running open source observability tools on Kubernetes makes installation and continuous operation easier.
Prometheus has become the de facto standard for time series metrics in cloud native environments. Many components in the CNCF Landscape already provide Prometheus metrics and there is an extensive set of exporters for those that do not. A single instance of Prometheus can easily manage a few million metrics with 4 CPUs and 32GB of memory, plus 500GB of storage.
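As a rough sanity check on that sizing, the Prometheus documentation offers a rule of thumb for disk usage: retention time multiplied by ingested samples per second multiplied by bytes per sample, with roughly 1 to 2 bytes per sample after compression. The sketch below rearranges it to estimate retention. The 2 million series and the 15 second scrape interval are illustrative assumptions, not measurements from a real deployment.

```python
def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Each active series yields one sample per scrape."""
    return active_series / scrape_interval_s


def retention_days(disk_bytes: float, active_series: int,
                   scrape_interval_s: float, bytes_per_sample: float) -> float:
    """Rearranged rule of thumb: disk = retention_s * samples/s * bytes/sample."""
    rate = samples_per_second(active_series, scrape_interval_s)
    return disk_bytes / (rate * bytes_per_sample) / 86_400


if __name__ == "__main__":
    days = retention_days(
        disk_bytes=500e9,         # the 500GB volume mentioned above
        active_series=2_000_000,  # assumption: "a few million" series
        scrape_interval_s=15,     # assumption: a common scrape interval
        bytes_per_sample=2,       # assumption: upper end of the 1-2 byte range
    )
    print(f"approx. {days:.1f} days of retention")
```

Under these assumptions the volume holds roughly three weeks of raw samples, which fits Prometheus's role as a short term store.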
For logs, the ELK (Elasticsearch / OpenSearch, Logstash, Kibana) stack is the traditional solution. However, Grafana Loki is the new kid on the block, offering the advantages of simplicity and cloud storage. Loki has a lighter compute requirement because it only indexes the log metadata (labels), so it needs resources similar to those of a Prometheus instance. The index and log data chunks are then stored on object storage such as AWS S3, MinIO, Google Cloud Storage or Azure Blob Storage, resulting in a very cost efficient solution.
Tracing is the problem child, being the most resource intensive of the three; it is equivalent to structured logging at DEBUG level. It produces a high volume stream of data which requires significant processing and storage.
The first decision is whether you really need tracing at all. Entity relationship information can be extracted from metric labels provided by service meshes (Istio, Linkerd) or an eBPF probe (Asserts). Logging can be improved to record slow queries and third party API calls.
If you decide to implement tracing, the question of sampling will soon pop up. The overhead of tracing every request is too great for any application with more than trivial traffic volume. OpenTelemetry instrumentation libraries provide simple ratio based sampling; for example, one in one hundred requests is traced.
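To make ratio based sampling concrete, here is a minimal standard library sketch of the idea: the decision is derived from the trace ID itself, so every service that sees the same trace reaches the same verdict. This is an illustration, not the OpenTelemetry SDK code; the function name `should_sample` is hypothetical. In the OpenTelemetry Python SDK the built-in sampler for this purpose is `TraceIdRatioBased`.

```python
import secrets

TRACE_ID_LIMIT = 1 << 64  # the decision uses the low 64 bits of the trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * TRACE_ID_LIMIT)
    return (trace_id & (TRACE_ID_LIMIT - 1)) < bound


if __name__ == "__main__":
    ratio = 0.01  # one in one hundred
    trace_ids = [secrets.randbits(128) for _ in range(100_000)]
    kept = sum(should_sample(t, ratio) for t in trace_ids)
    print(f"kept {kept} of {len(trace_ids)} traces")
```

Because trace IDs are random, the kept fraction hovers around the configured ratio, and no coordination between services is needed.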
Jaeger is the most mature open source tracing engine. It receives spans from OpenTelemetry, then indexes and stores them in Cassandra or Elasticsearch / OpenSearch. For high volume implementations, Kafka can be used for the ingestion pipeline.
Grafana Tempo does for traces what Loki does for logs. The OpenTelemetry trace labels are indexed and stored with the trace data on cloud storage, providing a low complexity and cost effective tracing solution.
Dashboards & Alerting
All that observability data requires a means to visualise it and to notify on abnormal or out of bounds values. Grafana is the de facto standard dashboarding tool, providing close integration with Prometheus. Prometheus evaluates alert rules written in PromQL and sends notifications via Alertmanager. Anomaly detection is not easily achieved with PromQL, and choosing appropriate thresholds for alert rule triggers is a delicate balancing act between too much alert noise and failure to notify before it is too late.
Asserts - The Extra Layer
The various open source projects mentioned above provide excellent point solutions for the triumvirate of metrics, logs and traces. However, they still leave numerous challenges to overcome in order to provide a complete solution.
Most organisations have more than one Kubernetes cluster plus legacy VMs, resulting in multiple Prometheus instances to provide complete coverage. Aggregation of metric data across those multiple instances is tricky to achieve on dashboards. Prometheus is also designed as a short term metric store, so a separate solution is required for metric roll up and storage for long term reporting.
Asserts provides a layer of automation and intelligence on top of open source observability tooling. Asserts Metric Intelligence™ queries your existing Prometheus instances, normalises the metric data, calculates baselines and persists the result in its own Prometheus compatible metric storage. The normalised data needs 10 to 20 times less storage than the raw data, meaning that keeping it for 18 months to provide long term reporting is not a problem.
Kibana and Grafana provide the ability to search for logs via their indexed tags / labels. Automatic query generation for direct navigation from component metrics to log entries would eliminate the need for engineers to learn yet another query language.
Asserts provides deep link integration to popular log aggregation tools, going straight from Asserts root cause insights to the relevant log lines without the need to manually enter a log search query.
Sampling is required to reduce the volume of trace data to manageable levels. However, a simple ratio based algorithm is a blunt instrument. It would be better to capture slow and erroneous requests as they are more valuable during root cause analysis.
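One way to picture such a smarter policy is a decision made after the request has finished, when its duration and error status are known. The sketch below is illustrative only: the `TraceSummary` type, the `keep_trace` function, the per-endpoint baseline dictionary and the `slow_factor` of 1.5 are all assumptions made for the example, not a description of any particular product's algorithm.

```python
from dataclasses import dataclass


@dataclass
class TraceSummary:
    endpoint: str
    duration_ms: float
    has_error: bool


def keep_trace(trace: TraceSummary, baseline_ms: dict[str, float],
               slow_factor: float = 1.5) -> bool:
    """Keep errors always; keep other traces only when slower than baseline."""
    if trace.has_error:
        return True
    baseline = baseline_ms.get(trace.endpoint)
    if baseline is None:
        return True  # no baseline yet: keep the trace rather than lose it
    return trace.duration_ms > baseline * slow_factor


if __name__ == "__main__":
    baselines = {"/checkout": 200.0}  # hypothetical normal latency in ms
    traces = [
        TraceSummary("/checkout", 180.0, False),  # normal: dropped
        TraceSummary("/checkout", 450.0, False),  # slow: kept
        TraceSummary("/checkout", 90.0, True),    # error: kept
    ]
    for t in traces:
        print(t, "->", "keep" if keep_trace(t, baselines) else "drop")
```

Unlike ratio based sampling, which can decide when the request starts, this kind of decision needs the completed trace, so spans have to be buffered until the outcome is known.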
Asserts Trace Intelligence™ uses the baseline data from Asserts Metric Intelligence™ to capture only slower than normal traces. Traces with the error flag set are also captured, and entity relationship data is provided as well. Finally, every trace is inspected to extract call count, error count and latency metrics for each endpoint (the RED metrics: rate, errors, duration). With Asserts Trace Intelligence™ the volume of trace data is minimised, reducing the amount of compute and storage resources required and therefore the cost.
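For readers unfamiliar with RED metrics, the following generic sketch shows how rate, errors and duration can be derived per endpoint from a batch of spans. It is not the Asserts implementation; the `Span` type, the `red_metrics` function and the fixed time window are hypothetical names and simplifications chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    endpoint: str
    duration_ms: float
    has_error: bool


def red_metrics(spans: list[Span], window_s: float) -> dict[str, dict[str, float]]:
    """Aggregate spans seen in one time window into per-endpoint RED metrics."""
    grouped: dict[str, list[Span]] = defaultdict(list)
    for span in spans:
        grouped[span.endpoint].append(span)

    result = {}
    for endpoint, items in grouped.items():
        result[endpoint] = {
            "rate_per_s": len(items) / window_s,       # R: calls per second
            "errors": sum(s.has_error for s in items),  # E: failed calls
            "avg_duration_ms": sum(s.duration_ms for s in items) / len(items),  # D
        }
    return result


if __name__ == "__main__":
    spans = [
        Span("/cart", 100.0, False),
        Span("/cart", 300.0, True),
        Span("/login", 50.0, False),
    ]
    for endpoint, metrics in red_metrics(spans, window_s=60).items():
        print(endpoint, metrics)
```

Once these aggregates exist as metrics, the individual spans for normal requests no longer need to be retained, which is where the storage saving comes from.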
Dashboards & Alerting
Creating and maintaining a library of dashboards and alert rules is a nontrivial task requiring expertise on every component in the application stack and excellent knowledge of PromQL. Using a third party library of dashboards and alert rules covering all popular components would significantly reduce the level of toil.
Asserts provides a curated library of dashboards and alert rules that dynamically act on the normalised metric data, saving you the considerable effort of learning PromQL and creating and maintaining your own set of dashboards and alert rules.
See for yourself how Asserts enables you to do more with less data: sign up for the free plan, install the Helm chart on your own Kubernetes cluster and start saving on your observability costs today.