Five years of evolution in open-source distributed tracing

It has been 5 years since I started working on open-source (OSS) distributed tracing projects. That is certainly a long time, so in this article I would like to summarize what I have learned, show you the evolution of the OSS distributed tracing space and peek into the future. We will talk about projects like Hawkular-APM, Zipkin, Jaeger, OpenTracing, OpenTelemetry, SigNoz and Hypertrace. Quite a lot, right!? So let’s start!

First let’s have a look at the timeline when these projects were open-sourced:

Note that the timeline and the article do not cover all OSS distributed tracing projects, and all the content represents my subjective view of this space. OpenCensus, Expedia Haystack and many more are missing. The first pull request to OpenCensus dates to 2016, but the project was based on Google’s existing instrumentation-java project.

2016: The beginning with Hawkular and Zipkin

For me it all started in June 2016 when I submitted my first pull request to the Hawkular APM project. Many of you probably don’t know the project, so let me give you a quick introduction. Hawkular APM is a performance monitoring tool. It is a “lightweight” Java EE application deployed on WildFly, and it heavily uses JMS and Infinispan, with Elasticsearch as a storage backend. The major differentiator from Zipkin and other similar tools is that the project focuses more on data aggregation. For instance, it provides service metrics visualisations, flow diagrams, the ability to compare two services (e.g. a canary deployment) and much more.

The project also provided native and OpenTracing-based instrumentation libraries. Building instrumentation clients is an enormous effort because there are a lot of languages, frameworks and application runtimes to support. Hence we started looking at existing solutions; more about this in the Zipkin section.

For me the Hawkular project was quite a short party: the project was sunset in mid 2017 for several reasons. To make a long story short, we decided to join forces with the Jaeger tracing project, which seemed like a better fit for the Kubernetes ecosystem that we were pursuing.

2016: Zipkin

In 2016 Zipkin was the only robust OSS tracing tool with a solid and proven ecosystem. The project was open-sourced in 2012 by Twitter. The original version was written in Scala and later, in 2016, rewritten in Java (Spring Boot). The design of the system was inspired by Google’s Dapper paper released in 2010.

The ecosystem was robust and had a lot of instrumentation libraries, to name just a few: brave, zipkin-js, zipkin-ruby, zipkin-go and many more. The beauty of the Zipkin project was its technical excellence in Java, its simple and stable data model, its own B3 context propagation (B3 stands for Big Brother Bird, the original Zipkin name at Twitter) and great API-level documentation. All these aspects made it clear that Zipkin was set to create a rich ecosystem. This was all great, so in the Hawkular project we decided to expose a Zipkin ingestion endpoint that would allow existing Zipkin users to try out Hawkular and, vice versa, Hawkular users to leverage Zipkin instrumentation libraries.

I did a proof of concept: a polyglot application written in Java, Ruby, JavaScript and Python to test instrumentation interoperability. I was very surprised to find out that some instrumentations did not propagate or decode the context properly. This was a huge problem because it resulted in broken traces. The other issue was that the instrumentation libraries were not using tags coherently. For instance, some libraries were recording full URLs and others only the HTTP path with query parameters. Data quality and consistency are very important for an observability platform to work properly.

Ultimately these issues were fixed over time, but they stress how important it is to standardize interoperability APIs like context propagation and the data specification.
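
To make the context propagation problem more concrete, here is a minimal Python sketch of B3 multi-header propagation. The header names are the standard B3 ones, but the helper functions (new_id, inject, extract, child_of) and the dictionary-based span context are my own illustration and not the API of any Zipkin library. The sketch also treats a missing sampling header as "not sampled", which is a simplification.

```python
import secrets

# B3 multi-header names used by Zipkin instrumentation.
TRACE_ID = "X-B3-TraceId"
SPAN_ID = "X-B3-SpanId"
PARENT_SPAN_ID = "X-B3-ParentSpanId"
SAMPLED = "X-B3-Sampled"


def new_id(num_bytes=8):
    """Return a random lower-case hex ID (8 bytes give 16 hex characters)."""
    return secrets.token_hex(num_bytes)


def inject(context, headers):
    """Write the current span context into outgoing request headers."""
    headers[TRACE_ID] = context["trace_id"]
    headers[SPAN_ID] = context["span_id"]
    if context.get("parent_span_id"):
        headers[PARENT_SPAN_ID] = context["parent_span_id"]
    headers[SAMPLED] = "1" if context["sampled"] else "0"
    return headers


def extract(headers):
    """Read a span context from incoming headers, or return None.

    HTTP header names are case-insensitive, so names are lower-cased
    before the lookup. Skipping this step is an easy way to break traces.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    trace_id = lowered.get(TRACE_ID.lower())
    span_id = lowered.get(SPAN_ID.lower())
    if not trace_id or not span_id:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "parent_span_id": lowered.get(PARENT_SPAN_ID.lower()),
        # Simplification: a missing header is treated as "not sampled".
        "sampled": lowered.get(SAMPLED.lower()) == "1",
    }


def child_of(parent):
    """Create a child span context that stays in the parent's trace."""
    return {
        "trace_id": parent["trace_id"],
        "span_id": new_id(),
        "parent_span_id": parent["span_id"],
        "sampled": parent["sampled"],
    }
```

If any service on the request path fails to extract these headers, or injects them under different names, the downstream spans start a new trace and the original trace appears broken.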

2016: OpenTracing

OpenTracing was created at the end of 2015 and started gaining significant traction at the end of 2016 when it joined the Cloud Native Computing Foundation. The promise was great: a vendor-neutral instrumentation API and a specification that defines data collection. Clearly a huge win on paper, because instrumentation is the hardest part of deploying a distributed tracing system. The project also included a contrib organization with instrumentation libraries for many frameworks.

One of the main goals of the project was to incorporate the OpenTracing API directly into RPC frameworks. This again sounds great, but it didn’t pan out the way we wanted. RPC framework authors are very sensitive about using third-party dependencies. Some frameworks (e.g. Vert.x) decided to create their own tracing API and only provide an OpenTracing bridge. The other significant issue was that application authors (e.g. Envoy) didn’t know which OpenTracing implementation should be included in the final compiled distribution. Including multiple implementations created unnecessary maintenance overhead and complicated the final distribution.

Despite many issues, OpenTracing achieved great things and moved OSS distributed tracing forward into a better place. It paved the way for OpenTelemetry to be even more successful. Personally, I would summarize the outcome of the project as follows.

Pros:

  • specification for data collection
  • it helped to promote distributed tracing

Cons:

  • API only, no standard implementation. This caused huge confusion for end users and duplicated effort for vendors, who each had to create an implementation.
  • switching the implementation requires recompiling and redeploying applications
  • no reporting format definition
  • no context-propagation format
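
The "API only" point can be illustrated with a small Python sketch. All names here (Tracer, NoopTracer, RecordingTracer, set_global_tracer, handle_request) are hypothetical and much simpler than the real OpenTracing API; the sketch only shows the shape of the problem.

```python
from abc import ABC, abstractmethod


class Tracer(ABC):
    """Hypothetical vendor-neutral API: all that instrumented code sees."""

    @abstractmethod
    def start_span(self, operation_name):
        ...


class NoopTracer(Tracer):
    """Default when no implementation is registered: records nothing."""

    def start_span(self, operation_name):
        return None


class RecordingTracer(Tracer):
    """Stand-in for a vendor implementation shipped as a separate library."""

    def __init__(self):
        self.started = []

    def start_span(self, operation_name):
        self.started.append(operation_name)
        return operation_name


_global_tracer = NoopTracer()


def set_global_tracer(tracer):
    """Register the implementation that the application was built with."""
    global _global_tracer
    _global_tracer = tracer


def handle_request(path):
    """Instrumented code depends only on the API, never on a vendor."""
    _global_tracer.start_span("GET " + path)
    return "ok"
```

The instrumented code is portable, but somebody still has to pick a concrete implementation and ship it with the application. How that implementation reports data and propagates context was left to each vendor.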

2017: Open-sourced Jaeger

In early 2017 Uber engineering open-sourced Jaeger. The first version was very similar to Zipkin, with a couple of key differences:

  • written in Golang
  • native support for the full OpenTracing API. For instance, Zipkin for a long time didn’t support span logs
  • first-class support for OpenTracing implementations in multiple languages. All the clients were hosted directly in the Jaeger organization and properly tested.
  • support for endpoint-based sampling and remotely controlled sampling
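
As a rough idea of what endpoint-based sampling means, here is a minimal Python sketch. The rates, the operation names and the should_sample function are assumptions made for illustration and are not Jaeger's implementation. With remotely controlled sampling, a table like ENDPOINT_RATES would be fetched from the tracing backend instead of being hard-coded.

```python
# Hypothetical sampling probabilities per endpoint (operation name).
DEFAULT_RATE = 0.01
ENDPOINT_RATES = {
    "GET /health": 0.0,      # never sample health checks
    "POST /checkout": 1.0,   # always sample the critical path
}


def should_sample(trace_id_hex, operation, rates=ENDPOINT_RATES, default=DEFAULT_RATE):
    """Decide whether a trace is sampled, using the rate of its endpoint.

    The decision is derived from the trace ID instead of a random number,
    so every evaluation for the same trace and endpoint gives the same answer.
    """
    rate = rates.get(operation, default)
    # Use the low 64 bits of the trace ID as a value in [0, 2**64).
    value = int(trace_id_hex, 16) & (2**64 - 1)
    # Integer threshold avoids float rounding at the upper boundary.
    threshold = int(rate * 2**64)
    return value < threshold
```

Noisy endpoints can be turned down without losing visibility into rare but important ones.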

Over the years the Jaeger project evolved and many features were added. We have always made sure the tool is super easy to use, no matter whether it runs directly as a compiled binary or is deployed on Kubernetes. When building an OSS ecosystem it’s important not to say no and rather to find a generic solution that works for everybody.

Looking to the future, Jaeger will embrace OpenTelemetry and the V2 version will be based on the OpenTelemetry collector. I would also like to highlight that the Jaeger project is expanding with new “APM” capabilities like metrics support.

2019: OpenTelemetry

The OpenTelemetry project was announced in 2019 as a merger of the OpenTracing and OpenCensus projects. The project promised to combine the best of both:

  • API and official standard implementation (SDK)
  • data collection specification
  • mandatory support for W3C Trace Context
  • coming support for metrics and logs
  • standardized data model (OTLP)
  • a local collector that accepts the most popular tracing formats (OTLP, Zipkin, Jaeger) and exports data to vendor systems
  • auto instrumentation technology for various languages

The features mentioned above make it clear that OpenTelemetry promises to deliver a robust end-to-end data collection solution. The major differentiators over OpenTracing are the standardized data model, the official SDKs and the collector. All these combined mean that users do not have to touch their workloads when changing observability vendors. This is a huge win as it avoids vendor lock-in!
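
To show what a standardized context propagation format buys us, here is a minimal Python sketch that parses and formats the W3C Trace Context traceparent header. The header layout (version, trace ID, parent ID and flags as lower-case hex fields separated by dashes) comes from the W3C specification. The function names are my own, and the sketch handles only the four-field layout of version 00, not tracestate or future versions.

```python
import re

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value):
    """Return the parsed context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid according to the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        # The lowest bit of the flags field is the "sampled" flag.
        "sampled": bool(int(flags, 16) & 0x01),
    }


def format_traceparent(trace_id, parent_id, sampled):
    """Build a version 00 traceparent header value."""
    return "00-{}-{}-{}".format(trace_id, parent_id, "01" if sampled else "00")
```

For example, parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01") yields the trace ID 4bf92f3577b34da6a3ce929d0e0e4736, the parent ID 00f067aa0ba902b7 and sampled set to True. Because every compliant library reads and writes this one format, services instrumented by different vendors can still participate in the same trace.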

I am pretty confident that OpenTelemetry will be a successful project and that it will take OSS distributed tracing to the next level. Note that the project is the second most active CNCF project after Kubernetes. Having a lot of active members and vendors takes its toll as well: the project velocity is impacted and getting things done takes more time.

2020/2021: Hypertrace, Grafana Tempo and SigNoz

The years 2020 and 2021 were really interesting. In 2020 the API observability project Hypertrace was open-sourced. The major differentiator from Jaeger is that it makes APIs and databases/backends first-class entities in the system and provides a powerful slice-and-dice explorer. For instance, with only a few clicks a user can see the performance of database calls or of a third-party service. The project also includes OpenTelemetry-compatible agents that extend the upstream project by capturing 100% of traffic (e.g. payloads and headers). The clients or agents have a pluggable filter mechanism that allows users to run arbitrary code to analyze requests, which is pretty cool! The clients are licensed under Apache 2.0 and the server has a hybrid license model with Apache 2.0 and the proprietary Traceable Community License 1.0.

In 2021 a similar tool called SigNoz was born. Like Hypertrace, its goal is to be a full observability system with metrics, logs and traces. The platform also provides rich aggregates on filtered traces, which is possible thanks to its OLAP database. The project uses the OSS Apache 2 license.

Another important project that was open-sourced in 2020 is Grafana Tempo. Although it is not a full observability system, it brings nice innovation to the ecosystem. It’s a local trace data store highly inspired by Prometheus. Its biggest downside is its query capabilities: it only allows returning a trace by ID.

Conclusion and peek into the {bright} future

As we have seen, open-source tracing projects have made a big step forward since 2016. At the beginning Zipkin was the only major OSS tracing system that developers could use. It definitely paved the way and had a great design, including a stable data model and context propagation.

Now developers can choose from multiple OSS tracing systems, use robust OpenTelemetry libraries to collect data from their applications and rely on standardized context propagation with W3C Trace Context.

So what is next? I wish I had the answer, but I can certainly express what I think. The OpenTelemetry project will be around and it will become more robust and provide well-tested instrumentation libraries to collect various signals including metrics, logs and traces. We will also see more instrumentation for capturing profiling data. The tracing systems will be adding more APM-like capabilities — e.g. better metrics support and profiling. The cloud vendors will integrate with OpenTelemetry and will embed instrumentation capabilities directly in their platforms. Last but not least we will see runtime security projects being integrated into observability.

Software engineer working in observability space. Working on Hypertrace, OpenTelemetry, Jaeger, OpenTracing, MicroProfile projects.