Booking.com arguably runs the largest Graphite metrics setup in the world. We write around 50 million data points per second via relays and store about 500 million metrics in total. The data comes from our event system, applications, and other sources. We used to route our metrics with carbon-c-relay, but its complexity made maintenance hard, and resiliency and performance suffered. So we designed Nanotube (github.com/bookingcom/nanotube), a relay built in Go with a focus on simplicity, performance, and concurrency. Even though Go is a higher-level language, we achieved 3x better performance in our production setup. We also made the relay more resilient and easier to use and support.
Furthermore, we added support for submitting metrics from Kubernetes in a well-governed manner. We also added support for the OpenTelemetry gRPC protocol, which allows Nanotube to serve as a bridge in migrations and hybrid systems.
This talk will cover how we solve the problem of writing tens of millions of metric data points per second. We will cover the design, how we achieve resiliency, and how we manage high loads on hundreds of boxes.