Hacker News

OpenTelemetry is a marketing-driven project, designed by committee, implemented naively and inefficiently, and guided by the primary goal of allowing Fortune X00 CTOs to tick off some boxes on their strategy roadmap documents.

It's not something that anyone with a choice in the matter should be using.



It looks like every other comment in this thread is favorable to very positive, can you go into more detail about what specifically isn't good about it?


Not the previous poster, but I had to implement it a few years ago, and I found it unbelievably complex, with dense and difficult-to-read specifications. I've implemented plenty of protocols and formats from scratch using just the specification, but rarely have I had as much difficulty as with OpenTelemetry.

I guess this is something you don't notice as merely a "user", but IMHO it's horribly overengineered for what it does and I'm absolutely not a fan.

I also disliked the Go tooling for it, which is "badly written Java in Go syntax", or something along these lines.

This was 2 years ago. Maybe it's better now, but I doubt it.

In our case it was 100% a "tick off some boxes on their strategy roadmap documents" project too and we had much much better solutions.

OTel is one of those "yeah, it works ... I guess" but also "ewwww".


I'd recommend trying it out today. In 2021, very few things in OTel were GA and there wasn't nearly as much automatic instrumentation. One of the reasons why you had to dive into the spec was because there was also very little documentation, too, indicative of a heavily in-progress project. All of these things are now different.


I'll be happy to take your word that some implementation issues are now improved, but things like "overengineered" and "way too complex for what it needs to do" really are foundational, and can't just be "fixed" without starting from scratch (and presumably this is all by design in the first place).


That's fair. I find that to be a bit subjective anyway, so I don't have much to comment on there. Most languages are pretty lightweight: for example, initializing instrumentation packages and creating some custom instrumentation in Python takes very little code. Golang is far more verbose, but I see that as part and parcel of different cultures for different languages (I've always loved the brevity of Python API design and disliked the verbosity of Go API design).


One of the main reasons I became disillusioned with OTel was that the project treated "automatic instrumentation" as a core assumption and design goal for all supported languages, regardless of any language-specific idioms or constraints.

I'm not an expert in every language, but I am an expert in a few, and this just isn't something that you can assume. Languages like Go deliberately do not provide the sorts of features needed to support "automatic instrumentation" in this sense. You have to fold those concerns into the design of the program itself, via modules or packages which authors explicitly opt-in to at the source level.

I completely understand the enormous value of a single, cross-language, cross-backend set of abstractions and patterns for automatic instrumentation. But (IMO and IME) current technology makes that goal mutually exclusive with performance requirements at any non-trivial scale. You have to specialize -- by language, by access pattern (metrics, logs, etc.), by concrete system (backend), and so on -- to get any kind of reasonable user experience.


The Spec itself is 'badly written Java'. I haven't been a Java dev for about ten years. At this point it's a honeypot for architectural astronauts - a great service to humanity.

That is, until some open standard is defined by said Java astronauts.


> OpenTelemetry is a marketing-driven project, designed by committee, implemented naively and inefficiently, and guided by the primary goal of allowing Fortune X00 CTOs to tick off some boxes on their strategy roadmap documents.

I'm the founder of highlight.io. On the consumer side as a company, we've seen a lot of value from OTel; we've used it to build out language support for quite a few customers at this point, and the community is very receptive.

Here's an example of us putting up a change: https://github.com/open-telemetry/opentelemetry-js/pull/4049

Do you mind sharing why you think no-one should be using it? Some reasoning would be nice.


I don't think that's true. It seems like it's more of an "oh shit, all this open source software emits Prometheus metrics and Jaeger traces, but we want to sell our proprietary alternatives to these and don't want to upstream patches to every project". (Datadog had a literal army of people adding datadog support to OSS projects. Honestly, probably a great early-career job; diving into unfamiliar codebases is a superpower.)

OTel lets the open source projects use an abstraction layer so that you can buy instead of self-host.

None of this has ever made me feel super great, but in the end I would probably consider OTel today for services that people other than my company operate. That way if some user wants to use Datadog, we're not in their way.

I used OTel in the very very early days and was rather disappointed; the Go APIs were extremely inefficient (a context.Context is needed to increment a counter? no IO in my request path please), and abstracted leakily (no way to set histogram buckets when exporting to Prometheus). I assume they probably fixed that stuff at some point, though.


What helps hosted data collectors helps self-hosting setups just as much.

More and more solutions are getting built in OTEL support, which means you can relatively seamlessly switch between backends without changing anything in your application code.


This only makes sense if you're in a world where you're switching backends more than once, which means you're not seriously programming, you're just burning VC money for lottery tickets.


I agree with this. For internal apps, pick a system and stick with it. If you're excited by Datadog's marketing pitch, just buy it and use it. It will not make or break your startup; like if your Datadog bill is what's standing between you and profitability, then you probably didn't actually find product/market fit. Switching to Prometheus at that point also won't help you find product/market fit.

In the 2 jobs where I've set up the production environment, I just picked Prometheus/Jaeger/cloud provider log storage/Grafana on day 1 and have never been disappointed. You explode the helm chart into your cluster over the course of 30 minutes, and then move on to making something great (or spending a week debugging Webpack; can't help you with that one).


You had better have PMF if you pick Datadog!

https://thenewstack.io/datadogs-65m-bill-and-why-developers-...


Also useful for local dev (you can still use it locally without a SaaS backend) and it helps with interoperability. Infrastructure like Envoy and nginx can emit spans that integrate with 1st party code and other 3rd party code. OSS libraries are more likely to implement an open standard so they just plug and play and emit data for internal things they're doing (especially helpful for things like DB and HTTP)


OTel is the backend, in-program equivalent of "we need all of five analytics systems on our frontend to figure out that users bounce when our page takes 10s to load because it has five analytics systems in it".


What do you use?


That's overly harsh, they are doing good work I think their data model is a step forward in the right direction.

Their processors are quite capable and the entire receiver and exporter contrib collection is pretty good.

I'm not saying it's the best solution out there because that clearly depends on each use case but I don't think such harsh criticism makes sense.

Disclaimer: I'm part of the fluent-bit maintainer team.


Being able to have services speak OTLP and having my application configurations simplified to sending data to the OTEL collector is great.

From an ops point-of-view devs can add whatever observability to their code and I can enforce certain filtering in a central place as well as only needing one central ingress path that applications talk to.

Also because everything emits OTLP if we ever want to move to new backends it's just a matter of changing a yaml file and not rewriting applications to support a new logging backend.

Given the choice of going back to the old way of using vendor-specific logging libraries, I will continue using OTEL 10/10 times because even given its warts, it's still a lot nicer than the alternatives.


Switching observability backends is like switching databases -- feasible in theory, impossible in practice for anything but the most trivial use cases.

You can't build a sound product if that's one of the design requirements.


Except it isn't impossible because using OTLP as the data format means you're decoupled from any single backend.

Switching to a new backend is as simple as deploying the new backend, changing 1 line in the OTEL Collector yaml, then having your front-end pull from the new backend. 0 changes to application code necessary.
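As an illustration of that workflow, a hedged sketch of what such a collector config might look like (the exporter names and fields here are assumptions for illustration; check the Collector docs for the real schema of any given exporter):

```yaml
# Applications always send OTLP to the collector;
# only the exporter section names the actual backend.
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  prometheus:              # swap this block for another exporter to change backends
    endpoint: "0.0.0.0:8889"
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]   # point at the new exporter here
```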


Metrics, logs, and traces are abstract categories of telemetry data, representing the most common modalities of how that data is produced and consumed. They are explicitly not specific or concrete types that can be precisely defined by e.g. a Protobuf schema.

These domain concepts are descriptive, not prescriptive. They don't, and can't possibly, have specific wire-level definitions. So another way to phrase my point might be to say that OTel is asserting definitions which don't actually exist.

Telemetry necessarily requires specialization between producer (application/service) and consumer (observability backend) in order to deliver acceptable performance. It's core to the program as it is written: more like error handling than e.g. containerization.


What should people use?

With basic parameters in place so it doesn't blow up your bill, it's been working great for me for years. Initially with New Relic, then Datadog, now a setup with OpenTelemetry is good enough.


Instrumentation isn't solved by any single specific thing. It's a praxis that you apply to your code as you write it, like I guess error handling; it's not a product that you can deploy, like I guess Splunk or New Relic or whatever else.

You should "use" metrics, logs, and traces thru dependencies that are specific to your organization. The interface between your business logic and operational telemetry should be abstract, essentially the same as a database or a remote HTTP endpoint or etc. The concrete system(s) collecting and serving telemetry data are the responsibility of your dev/ops or whatever team.

Main point: instrumentation is part of the development process, not something that's automatic or that can be bolted-on.


Have you worked with OTEL before? Basically all of your points about instrumentation are actually quite sympathetic to OTEL's view of the world. The whole point of OTEL is to provide some standards around how these pieces fit together - not to solve for them automatically.


I've been deeply involved with OTel from even before it was a CNCF project. My experiences with the project, over time, made me basically abandon it as unsound and infeasible a year or two ago. Those experiences also inform comments like the ones I've made here.


Can you elaborate on what unsound and infeasible mean? I'm newer to OTel than you (~6 months of working with it in depth), and don't really understand what you're getting at. It's solving real problems in my organization, with only a "regular" amount of pain for a component of its size.


Okay so what’s the interface? Sounds like what OTEL provides to me


There are well-defined interfaces for specific sub-classes of telemetry data. Prometheus provides a set of interfaces for metrics which are pretty battle-tested by now. There are similar interfaces for logs and traces, authored by various different parties, and with various different capabilities, trade-offs, etc.

There is no one true interface! The interface is a function of the sub-class of telemetry data it serves, the specific properties of the service(s) it supports, the teams it's used by, the organization that maintains it, etc. etc.

OTel tries to assert a general-purpose interface. But this is exactly the issue with the project. That interface doesn't exist.


OTEL is a set of interfaces, so I’m not sure your last point applies. I do agree that battle tested things like Prometheus work great, but why not have a set of standardized interfaces? There is clearly a cost to having them; for some projects this may be too much. For the projects I’ve used it in it let me spin up all the traces and telemetry without thinking hard.


> What should people use?

I recall Apache Skywalking being pretty good, especially for smaller/medium scale projects: https://skywalking.apache.org/

The architecture is simple, the performance is adequate, it doesn't make you spend days configuring it and it even supports various different data stores: https://skywalking.apache.org/docs/main/v9.5.0/en/setup/back...

The problems with it are that it isn't super popular (although has agents for most popular stacks), the docs could be slightly better and I recall them also working on a new UI so there is a little bit of churn: https://skywalking.apache.org/downloads/

Still better than some of the other options when you need something that just works instead of spending a lot of time configuring something (even when that something might be superior in regards to the features): https://github.com/apache/skywalking/blob/master/docker/dock...

Sentry comes to mind (OpenTelemetry also isn't simpler due to how much it tries to do, given all the separate parts), compare its complexity to Skywalking: https://github.com/getsentry/self-hosted/blob/master/docker-...

I wish there was more self-hosted software like that out there, enough to address certain concerns in a simple way on day 1 and leave branching out to more complex options like OpenTelemetry once you have a separate team for that and the cash is rolling in.


I'm honestly thinking that one of the statsd variants with label support would have been just fine if I'd had a time machine. The complexity overhead of labels in OpenTelemetry does not make it the slam-dunk it appears to be.

Internally, OTEL has to keep track of every combination of labels it's seen since process start, which can easily come to dominate the processing time in an existing project. It's another in a long line of tools that dovetail with my overall software development philosophy which is that you can make pretty much any process work for 18 months before the wheels fall off.

By the time you notice OpenTelemetry is a problem, you've got 18 months of work to start trying to roll back.


> Internally, OTEL has to keep track of every combination of labels it's seen since process start, which can easily come to dominate the processing time in an existing project.

Well, every unique combination of labels represents a discrete time series of telemetry data, and the total set of all time series in your entire organization always has to be of finite and reasonable cardinality. This means that label values always have to be finite e.g. enumerations, and never e.g. arbitrary values from user input.

> my overall software development philosophy which is that you can make pretty much any process work for 18 months before the wheels fall off.

The size of the set of labels in your process after (say) 1d of regular traffic should be basically the same size as after (say) 18m of regular traffic. If this isn't the case, it usually signals that you're stuffing invalid data into label values.


I have no idea why you think that all attributes need to be buffered in process forever. Most metrics systems simply keep key sets cached in RAM for as long as they're still being emitted. Many drop unused key sets after like 10 minutes. But as with all metrics processing, you should ideally keep cardinality to a bounded set in order to avoid these types of issues both client and server side.

I'm sure there are valid qualms with OTEL in general, but this ain't one of them. Any and all metrics telemetry systems can fall into the same design constraint you pointed out.


I don’t know which implementations support invalidation, but it’s not happening in the nodejs impl.

Push implementations do not have this problem at the client end.


I kind of agree with you. Clueless managers just asking for “open telemetry” on the roadmap without contextualising the costs/benefits.


Well, roll up your sleeves and fix the performance bugs that affect you (source: I have).


I have no reason to do so, because I don't believe that OpenTelemetry is a project that was created, or is maintained, in good faith to its stated goals.


Care to elaborate a bit more on the goals contrast?


https://opentelemetry.io

> OpenTelemetry is a collection of APIs, SDKs, and tools. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.

You can absolutely categorize telemetry into these high-level pillars, true. But the specifics on how that data is captured, exported, collected, queried, etc. is necessarily unique to each pillar, programming language, backend system, organization, etc.

That's because telemetry data is always larger than the original data it represents: a production request will be of some well-defined size, but the metadata about that request is potentially infinite. Consequently, the main design constraint for telemetry systems is always efficiency.

Efficiency requires specialization, which is in direct tension with features that generalize over backends and tools, e.g.

> Traces, Metrics, Logs -- Create and collect telemetry data from your services and software, then forward them to a variety of analysis tools.

and features that generalize over languages, e.g.

> Drop-In Instrumentation -- OpenTelemetry integrates with popular libraries and frameworks such as Spring, ASP.NET Core, Express, Quarkus, and more! Installation and integration can be as simple as a few lines of code.

I think OTel treats these goals -- which are very valuable to end users!! -- as inviolable core requirements, and then does whatever is necessary to implement them. But these goals are not actually valid, and so the resulting code is often inefficient, incoherent, or even unsound.



