We have an unusual case. We tried to use OTel but what we wanted to trace was not "normal." In the end, we removed that kind of tracing.

This was for a large email platform. We receive billions of API calls to send mail. A single call may send to a thousand recipients. Recipient inboxes may not respond for 3 days, and we let you schedule your send 3 days out. We wanted to keep every trace so that when any customer contacted us, we could debug their issue. OTel-style tracing is a poor fit for this problem: it prefers short time windows, works best when one request maps to one response, and often requires sampling.

We made some trade-offs and managed to instrument a lot of the system. We cut retries, delayed responses, multiplexed requests, and the like out of the instrumentation. Even so, running it was wildly expensive, so we sampled more and more aggressively. It was an uphill battle the entire time, both to implement and to get data out of the traces. Multiple teams had to work together to connect traces from start to finish across different services and stacks. It didn't help that management would only give us token amounts of time to keep implementing it, as other projects were prioritized.

We ended up keeping our older system. Each message has metadata, including a list of times and/or durations of internal actions. This metadata is recorded to our metric systems after we are done handling the request. We don't get cool flamegraphs, exposed relationships between services, or a whole lot of the other cool things that OTel can do. But we know which parts of our system cause a bottleneck, when, and for whom, though sometimes we have to dig to see it.
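
A minimal sketch of that pattern, for illustration only. It is not the platform's actual code: the `MessageMetadata` class, the `timed` helper, the metric names, the tags, and the `emit` callback are all hypothetical stand-ins for whatever the real message metadata and metrics client look like. The idea is that each message carries its own list of timings, and the list is written to the metrics system once, after handling is finished.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class MessageMetadata:
    """Hypothetical per-message record of internal action timings."""
    message_id: str
    customer_id: str
    # Each entry: (action name, wall-clock start in epoch seconds, duration in seconds)
    timings: list = field(default_factory=list)

    @contextmanager
    def timed(self, action):
        """Time one internal action and append it to this message's metadata."""
        started_at = time.time()      # when it happened
        start = time.monotonic()      # how long it took
        try:
            yield
        finally:
            self.timings.append((action, started_at, time.monotonic() - start))


def flush_to_metrics(meta, emit):
    """Write all recorded timings once the message has been fully handled.

    `emit` is an assumed callback standing in for a real metrics client.
    """
    for action, started_at, duration in meta.timings:
        emit(
            name=f"mail.{action}.duration_seconds",
            value=duration,
            tags={"customer_id": meta.customer_id},
            timestamp=started_at,
        )


# Usage: collect metrics in a list instead of a real metrics backend.
recorded = []


def emit(**metric):
    recorded.append(metric)


meta = MessageMetadata(message_id="msg-1", customer_id="cust-42")
with meta.timed("render"):
    pass  # e.g. build the message body
with meta.timed("smtp_handoff"):
    pass  # e.g. hand the message to the outbound queue
flush_to_metrics(meta, emit)
```

Because the timings travel with the message rather than with a live span, a delivery that takes days or fans out to many recipients does not need a trace held open; the cost is that relationships between services have to be reconstructed by querying the metrics.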