vrosas a day ago

While the libraries and the documentation for otel are bloated messes, I maintain that any platform that isn’t using some sort of tracing system is practically negligent in its engineering duty. If you’re still out there querying logs with giant SQL statements you’re missing out. Being able to click on an HTTP request and see every service it touched, every application log it emitted, every database query it ran, and the timing of each of those is magical.

  • KronisLV a day ago

    > I maintain that any platform that isn’t using some sort of tracing system is practically negligent in their engineering duty.

    For some, it's difficult because many of the self-hostable options out there are rather complex and have high requirements, like https://github.com/getsentry/self-hosted/blob/master/docker-...

    Personally I found Apache Skywalking to be something that you can set up without too many issues https://skywalking.apache.org/ but it's not exactly ideal either.

    I wonder what other good options are out there: something you can have up and running on a $5 VPS within an hour or two, without much friction.

    Where's the OpenTelemetry equivalent of launching an (opinionated) Docker Compose stack that has everything you need on the server side, running against SQLite, MariaDB, PostgreSQL, ClickHouse, ElasticSearch or another data store?

    Of course, when SaaS is an option, many will just go for that.

Ciantic 2 days ago

The utility of tracing is great; I've been using Azure Application Insights with NodeJS (and of course in .NET). This is relatively simple because it monkey patches itself everywhere if you go through the "classic" SDK route. Then adding your own data to the logs is just a few simple function calls: trackTrace, trackException, trackEvent, etc.
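
For anyone who hasn't used it, the classic SDK surface looks roughly like this (a rough sketch; the connection string is a placeholder and the exact setup options vary by SDK version):

    import * as appInsights from "applicationinsights";

    // "Classic" SDK: one setup call wires up auto-collection by monkey
    // patching http, console, popular DB drivers, etc.
    appInsights.setup("<your-connection-string>").start();

    const client = appInsights.defaultClient;

    // Hand-rolled telemetry on top of the auto-collected requests/dependencies.
    client.trackTrace({ message: "user lookup started" });
    client.trackEvent({ name: "UserLookup", properties: { plan: "free" } });
    try {
      // ... do the actual work ...
    } catch (err) {
      client.trackException({ exception: err as Error });
      throw err;
    }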

However, if you want to figure out how it works you might be scared; it is not lightweight. I just spent a few days digging through the Azure Application Insights NodeJS code base, which integrates with the OpenTelemetry packages. It's an utter mess, a huge pile of abstractions. Adding it to the project brought in 100 MB and around 40 extra packages.

  • the_duke 2 days ago

    This isn't just a problem in JS.

    In every language I looked at, the otel libraries were a bloated, over-abstracted and resource-hungry mess.

    I think that's partially because it is actually difficult and complex to implement, and partially because the implementations are often written by devs without a long history of implementing similar systems.

    • mcronce a day ago

      It's been a bit since I've added it to an existing project, but at least as of a year or so ago, the Rust implementation (tracing + tracing-opentelemetry + opentelemetry-jaeger specifically for that project) was similar.

      The impact on compile time and code size wasn't bad (for a project that was large and already pulling in a lot of crates), but it had a huge runtime cost - mostly allocator pressure in the form of temp allocations from what I could see. For a mostly I/O bound workload, it more than doubled CPU utilization and ballooned OS-measured memory consumption by >30%.

    • deathanatos a day ago

      The OpenTelemetry spec is a mess. There's so much … abstract blah blah blah? … and very little actual detail.

      If I actually go to the part of the spec that I think gets down to "here is how to concretely write OpenTelemetry stuff" [1], it seems to have the various attributes camelCased, for example, whereas the article has named them "spanID" and "traceID".

      AFAICT the "spec" also just links you to the implementation. "Just" read this protobuf definition, translate that to JSON in your mind's eye. I "POST" this to a hard-coded path tacked onto a URL… but do I post individual traces/logs? Can I batch them? I'm sure there's a gRPC thing I could start guessing from…

      But it seems like the JSON stuff is a second-class citizen compared to the gRPC interface. Unless that's just as bad, too…
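
      For reference, a hand-rolled OTLP/HTTP JSON export ends up looking roughly like this (a sketch pieced together from the protobuf; as far as I can tell the path is fixed at /v1/traces on the collector's 4318 port, IDs are hex in the JSON flavor, and a single request can batch as many spans as you like; assumes Node 18+ for built-in fetch):

          // Hand-rolled OTLP/HTTP JSON export (no SDK): POST to the fixed
          // /v1/traces path on the collector's HTTP port.
          const payload = {
            resourceSpans: [{
              resource: {
                attributes: [
                  { key: "service.name", value: { stringValue: "checkout" } },
                ],
              },
              scopeSpans: [{
                scope: { name: "manual" },
                spans: [{
                  // trace/span IDs are lowercase hex in the JSON encoding.
                  traceId: "5b8efff798038103d269b633813fc60c",
                  spanId: "eee19b7ec3c1b174",
                  name: "GET /users",
                  kind: 2, // SPAN_KIND_SERVER
                  startTimeUnixNano: "1700000000000000000",
                  endTimeUnixNano: "1700000000120000000",
                }],
              }],
            }],
          };

          await fetch("http://localhost:4318/v1/traces", {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify(payload),
          });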

      Actually getting set up in Python isn't too terrible, though there are a few classes where you're like "what's the point of this?" and most of them are apparently just undocumented. (E.g., [2], ^F TraceProvider, get nothing.)

      It is a bit depressing how this seems to be becoming The Chosen Spec.

      I also sort of hate the 64-bit integers for span IDs (TFA never mentions it, but AFAICT this is required by the spec). I'd much rather have "/span/ids/are/a/tree" span IDs, as this integrates much better with any logging system: I can easily ask my log viewer to filter to a specific span (span_id == "/spans/a/b/c") or to a subtree (span_id regex-matches /^\/spans\/a\/.*/).

      (And the spec bizarrely focuses on some sort of language-abstract API, instead of … actual data types / media types?)

      [1]: https://opentelemetry.io/docs/specs/otlp/#otlphttp

      [2]: https://opentelemetry-python.readthedocs.io/en/latest/api/tr...

    • snuxoll 2 days ago

      The .NET implementation is about as clean as it can get, but a lot of that has to do with Microsoft caring very deeply about this kind of performance data for a very long time (thus having the entire System.Diagnostics namespace).

      There’s certainly some abstraction that is gratuitous still, but it’s better than most of the architect astronaut code I’ve seen targeting the CLR.

  • chucklenorris 2 days ago

    Yes, this is exactly my impression too... the code for opentelemetry-js is over-engineered and adds a scary amount of dependency code. There are quite a few libraries where I'm not sure what they do or in which scenarios I might need them. The documentation is not very helpful either. I look forward to someone implementing an opentelemetry-nano package with only the minimum stuff needed, letting me opt into extra support for my dependencies or add my own wrappers easily.
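
    In the meantime, the smallest manual setup I know of is roughly this (a sketch assuming the 1.x-style Node SDK; package names and the provider API shift between versions):

        import { trace } from "@opentelemetry/api";
        import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
        import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
        import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

        // No auto-instrumentation: just a provider, a processor and an exporter,
        // and you wrap your own dependencies by hand.
        const provider = new NodeTracerProvider();
        provider.addSpanProcessor(
          new BatchSpanProcessor(
            new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" })
          )
        );
        provider.register(); // sets the global tracer provider and propagators

        const tracer = trace.getTracer("my-app");
        tracer.startActiveSpan("handle-request", (span) => {
          // ... your own wrapper around whatever you care about ...
          span.end();
        });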

    • pimeys 2 days ago

      Also badly documented. If you try to implement something non-standard with it, good luck. I once needed to write code where a trace started in node and continued inside a Node-API native library. Getting those two traces to connect must be one of the most frustrating things I've worked on.

      At least on the Rust side you have types to help you out, but it is still quite complex, and the crates have had bugs open for years that are impossible to solve with the current architecture.

  • lastartes 2 days ago

    I had a lot of fun wading through that mess in the past trying to determine why something wasn't working. A fun fact I just learned is that the node sdk is now just a shim over https://www.npmjs.com/package/@azure/monitor-opentelemetry. It seems like the future is just using that package directly, which hopefully improves the situation. One benefit is you can extend it with OTel instrumentation packages.

  • wordofx 2 days ago

    What are your plans for Application Insights sunsetting?

    • MuffinFlavored 2 days ago

      https://azure.microsoft.com/en-us/updates/we-re-retiring-cla...

      Do you have a link for what you are speaking of?

      • wordofx a day ago

        There's no public announcement yet, but from what reps say to customers and what people working on Azure say, App Insights is more or less being wound down in favor of building out open source solutions, because that's less maintenance and dev work than building out their own. Think more OTEL/Grafana. Basically, word on the inside is MS doesn't want to pay to build out App Insights.

thewisenerd 2 days ago

the "true spec is the data" is very powerful.

for example, we translate our loosely OTEL-based telemetry into a format which is consumable by any otel collector.

shim a few fields, et voila! can be read by Jaeger-UI. free trace tree visualization.

  • alisonatwork 2 days ago

    I agree. The hardest work on OpenTelemetry (and OpenCensus/OpenTracing before it) was looking at all the different vendors and trying to come up with a common set of semantic conventions[0].

    If your team is new to metrics or tracing - or even just structured logging - it's worth starting to add fields following the general structure of the otel semantic conventions, because then whichever third party service you eventually decide to push to, it won't take much of a shim to adapt your data to get there. And if you just stick with JSON logs pushed to ELK (or whatever), you at least build up a useful set of fields to query on.
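
    Concretely, a structured log line that borrows the semconv field names might look something like this (a sketch; the exact attribute keys evolve between semconv versions):

        // One JSON log line using OTel semantic-convention style field names,
        // so it maps onto whichever backend you eventually adopt.
        console.log(JSON.stringify({
          timestamp: new Date().toISOString(),
          severity_text: "INFO",
          body: "user lookup finished",
          trace_id: "5b8efff798038103d269b633813fc60c",
          span_id: "eee19b7ec3c1b174",
          attributes: {
            "service.name": "checkout",
            "http.request.method": "GET",
            "http.response.status_code": 200,
            "url.path": "/users/42",
          },
        }));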

    [0] https://opentelemetry.io/docs/concepts/semantic-conventions/

sandelz a day ago

While otel is really nice and easy to integrate into software (at least on .NET and node), the collector/UI side seems to be overly complex.

I have used Application Insights on Azure at my day job, but I was wondering: is there a simple self-hosted collector/UI to use?

krashidov 2 days ago

It's so easy in node. I miss node.

Setting this up in the mess that is Python/Gunicorn/asgi/wsgi/celery/Django has not been as easy.

  • viraptor 2 days ago

    You can do almost the same thing 1:1 (compared to the post) in Python and use it as a wsgi/whatever middleware. It's really not any different. The callback changes to a context manager, but that's about it.

  • etimberg a day ago

    Agreed. There are a ton of edge cases and auto-instrumentation basically doesn't work. I love Honeycomb, but the setup for DataDog is 100x easier.

  • tempest_ 2 days ago

    Sentry has been nice.

    I do not know what unholy monkey patching they do with that sentry_sdk.init call, and I try not to think about it, but for web apps it is fire and forget.

  • rtpg 2 days ago

    Honeycomb's old tooling for Python was miles better than the Otel nonsense. It's embarrassing to see so many companies drop their well-designed libs for Otel instead of doing something like wrapping Otel so that their better libs can still be used at the top level.

  • jgalt212 2 days ago

    Is that because js is inherently async, and Python is synchronous?

    • phillipcarter a day ago

      The python problems with opentelemetry are mostly due to python being a bit of a mess. Yes, the async model is weird and that in turn makes it harder to instrument compared to something like .NET. But then the combination of libraries and environments changes out from under you, you update your autoinstrumentation agent, and now the app crashes. And you find some GitHub issue saying that flask brings in this new dependency that somehow breaks something else and so now you can't instrument flask apps.

      ^^^ most of the above is just what I've dealt with as a maintainer in OTel, and it's maddening. FWIW that's all been finally dealt with, but python just feels like something ready to keel over at a moment's notice to me. Far more fragile a language and ecosystem than anything else.

      • krashidov a day ago

        Agreed. This could be a personal skill issue, but I've started a new job where Django is the main codebase and getting decent OTel has been a struggle.

        In fact, everything with Python has been a struggle

h1fra 2 days ago

I have one beef with otel: it's not possible to stream a trace. You have to build a big object in memory, which is not suitable for a long-running process. And because of that it's not possible to start it somewhere and finish it somewhere else.

  • thephyber 2 days ago

    I'm under the impression this is wrong. A trace is made from one or more spans. So long as the context is propagated[1] from one service to another, any number of spans referring to the same trace can be generated anywhere at any time. The trace+spans don't have to be created in the order of a stack.
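
    With the JS API that looks roughly like this (a sketch assuming an SDK with the default W3C traceparent propagator registered; the span names are made up):

        import { context, propagation, trace } from "@opentelemetry/api";

        const tracer = trace.getTracer("svc");

        // Service A: start a span and inject its context into outgoing headers.
        const headers: Record<string, string> = {};
        tracer.startActiveSpan("checkout", (span) => {
          propagation.inject(context.active(), headers); // adds `traceparent`
          // ... send the request carrying `headers` ...
          span.end();
        });

        // Service B (another process/machine, any time later): extract the
        // context and emit more spans belonging to the same trace.
        const parentCtx = propagation.extract(context.active(), headers);
        const child = tracer.startSpan("charge-card", undefined, parentCtx);
        child.end();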

    [1] https://opentelemetry.io/docs/concepts/context-propagation/

    • phillipcarter 2 days ago

      This is correct. Traces are made up of spans that can be created within the same process, different processes, different machines, etc. and all emitted asynchronously.

  • eterm 2 days ago

    I don't think that's a limitation of the specification. You can create spans and emit events for those spans without holding objects in memory.

  • malkia 2 days ago

    There is no concept of a streaming trace. You are propagating a random ID and expecting other NODES (yours, or outside of your control) to re-emit it and, if needed, to create new ones.

    The agreement is that these nodes would either emit all these events to (eventually) a common place (push), or something is going to gather them (pull).

    I think at Google some of these used to still live on the machines, and the tools would pull them directly, and back then it was possible to mark certain ones for preservation (that was a long time ago - 2014 - so I'm sure things have changed).

    Also, I was on the over-excited side, because I had no idea what this was, but I was on call (a small team in ads) and had to page up to the Bigtable/Megastore (or was it Spanner?) team, and they simply asked me to bump some tracing bits up for like 30 seconds, then something magically showed up - and I was like, wtf!

    I think that was when it clicked with me how useful this is... but also how wasteful (in terms of resources) it could be if you don't end up looking at it.

caseyw 2 days ago

OpenTelemetry (OTel) and the OpenTelemetry Protocol (OTLP) are immensely powerful tools. The ability to emit telemetry data from any source, coupled with a receiver that can sample, filter, pipe, and potentially reshape the data to suit any need, is a game changer. This flexibility revolutionizes how we approach observability and monitoring across diverse systems.