hrpnk 15 hours ago

Has anyone seen OTel being used well for long-running batch/async processes? I wonder how the suggestions stack up for monolith builds of apps that take about an hour.

  • makeavish 14 hours ago

    You can use SpanLinks to analyse your async processes. This guide might be a helpful introduction: https://dev.to/clericcoder/mastering-trace-analysis-with-spa...

    Also, SigNoz supports rendering a practically unlimited number of spans in the trace detail UI and allows filtering them as well, which has been really useful in analyzing batch processes: https://signoz.io/blog/traces-without-limits/

    You can further run aggregation on spans to monitor failures and latency.
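
    For anyone new to span links, a rough Python sketch of the idea (the message object and queue wiring here are invented for illustration, not taken from the guide):

      # Link the consumer/batch span to the producer span instead of parenting it,
      # so each side keeps its own trace but the two stay correlated.
      from opentelemetry import trace
      from opentelemetry.trace import Link
      from opentelemetry.propagate import extract

      tracer = trace.get_tracer("batch-worker")

      def process_message(message):  # `message` is a hypothetical queue message
          producer_ctx = extract(message.headers)  # producer injected its context here
          producer_sc = trace.get_current_span(producer_ctx).get_span_context()

          with tracer.start_as_current_span("process-batch-item",
                                            links=[Link(producer_sc)]) as span:
              span.set_attribute("messaging.message.id", message.id)
              ...  # actual batch work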

    PS: I am a SigNoz maintainer

    • brunoqc 11 minutes ago

      Is SpanLinks supported by Jaeger?

    • ai-christianson 7 hours ago

      Is this better than Honeycomb?

      • mdaniel 6 hours ago

        "Better" is always "for what metric" but if nothing else having the source code to the stack is always "better" IMHO even if one doesn't choose to self-host, and that goes double for SigNoz choosing a permissive license, so one doesn't have to get lawyers involved to run it

        ---

        While digging into Honeycomb's open source story, I did find these two awesome toys, one relevant to the otel discussion and one just neato

        https://github.com/honeycombio/refinery (Apache 2) -- Refinery is a tail-based sampling proxy and operates at the level of an entire trace. Refinery examines whole traces and intelligently applies sampling decisions to each trace. These decisions determine whether to keep or drop the trace data in the sampled data forwarded to Honeycomb.

        https://github.com/honeycombio/gritql (MIT) -- GritQL is a declarative query language for searching and modifying source code

  • zdc1 14 hours ago

    I've tried and failed at tracing transactions that span multiple queues (with different backends). In the end I just published some custom metrics for the transaction's success count / failure count / duration and moved on with my life.

  • sethammons 5 hours ago

    We had a hell of a time attempting to roll out OTel for that kind of work. Our scale was also billions of requests per day.

    We ended up taking tracing out of these jobs and only using it on requests that finish in short order, like UI web requests. For our longer jobs and fan-out work, we started passing a metadata object around that appended timing data related to that specific job; at egress, we would capture the timing metadata and flag abnormalities.
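
    Roughly what that looked like, as a simplified sketch (in Python for illustration; the step names and the 30s threshold are invented, not our real values):

      import time

      def new_job_meta(job_id):
          return {"job_id": job_id, "timings": []}

      def timed_step(meta, name, fn, *args):
          # Append timing data for this step to the job's metadata object.
          start = time.monotonic()
          result = fn(*args)
          meta["timings"].append({"step": name, "seconds": time.monotonic() - start})
          return result

      def flag_abnormalities(meta, threshold_s=30.0):
          # At egress, capture the timing metadata and flag slow steps.
          for t in meta["timings"]:
              if t["seconds"] > threshold_s:
                  print(f"SLOW {meta['job_id']} {t['step']}: {t['seconds']:.1f}s")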

  • dboreham 12 hours ago

    It doesn't matter how long things take. The best way to understand this is to realize that OTel tracing (and all other similar things) are really "fancy logging systems". Some agent code emits a log message every time something happens (e.g. batch job begins, batch job ends). Something aggregates those log messages into some place they can be coherently scanned. Then something scans those messages, generating some visualization you view. Everything could be done with text messages in text files and some awk script. A tracing system is just that with batteries included and a pretty UI.

    Understood this way, it should now be clear why the duration of a monitored task is not relevant -- once the "begin task" message has been generated, all that has to happen is the sampling agent remembers the span ID. Then when the "end task" message is emitted it has the same span ID. That way the two can be correlated and rendered as a task with some duration. There's always a way to propagate the span ID from place to place (e.g. in an http header, so correlation can be done between processes/machines).

    This explains sibling comments about not being able to track tasks between workflows: the span ID wasn't propagated.
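
    A minimal sketch of that propagation step, using the OTel Python API for illustration (the URL and handler names are placeholders):

      import requests
      from opentelemetry import trace
      from opentelemetry.propagate import inject, extract

      tracer = trace.get_tracer("batch")

      def submit_job():
          with tracer.start_as_current_span("submit-batch-job"):
              headers = {}
              inject(headers)  # writes the W3C traceparent header for the current span
              requests.post("https://worker.example/jobs", headers=headers)

      def handle_job(request_headers):
          ctx = extract(request_headers)  # recover the caller's span context
          with tracer.start_as_current_span("run-batch-job", context=ctx):
              ...  # however long this takes, the duration is computed when the span ends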

    • imiric 12 hours ago

      That's a good way of looking at it, but it assumes that both start and end events will be emitted and will successfully reach the backend. What happens if one of them doesn't?

      • candiddevmike 10 hours ago

        AIUI, there aren't really start or end messages, they're spans. A span is technically an "end" message and will have parent or child spans.

        • BoiledCabbage 7 hours ago

          I don't know the details but does a span have a beginning?

          Is that beginning "logged" at a separate point in time from when the span end is logged?

          > AIUI, there aren't really start or end messages,

          Can you explain this sentence a bit more? How does it have a duration without a start and end?

          • nijave 6 hours ago

            A span is a discrete event emitted on completion. It contains arbitrary metadata (plus a few mandatory fields if you're following the OTEL spec).

            As such, it doesn't really have a beginning or end except that it has fields for duration and timestamps.

            I'd check out the OTEL docs since I think seeing the examples as JSON helps clarify things. It looks like they have events attached to spans which is optional. https://opentelemetry.io/docs/concepts/signals/traces/
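
            A tiny sketch with the Python SDK makes this concrete -- nothing is exported until the span ends, and the exported record carries the start/end timestamps (ConsoleSpanExporter just prints the span's JSON form):

              from opentelemetry import trace
              from opentelemetry.sdk.trace import TracerProvider
              from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

              provider = TracerProvider()
              provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
              trace.set_tracer_provider(provider)

              tracer = trace.get_tracer("demo")
              with tracer.start_as_current_span("batch-step"):
                  pass  # nothing has been exported yet
              # the span prints here, after end(), with start_time/end_time fields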

          • hinkley 7 hours ago

            It’s been a minute since I worked on this but IIRC no, which means that if the request times out you have to be careful to end the span, and also all of the dependent calls show up at the collector in reverse chronological order.

            The thing is that at scale you’d never be able to guarantee that the start of the span showed up at a collector in chronological order anyway, especially due to the queuing intervals being distinct per collection sidecar. But what you could do with two events is discover spans with no orderly ending to them. You could easily truncate traces that go over the span limit instead of just dropping them on the floor (fuck you for this, OTEL, this is the biggest bullshit in the entire spec). And you could reduce the number of traceids in your parsing buffer that have no metadata associated with them, both in aggregate and number of messages in the limbo state per thousand events processed.

      • lijok 11 hours ago

        Depends on the visualization system. It can either not display the entire trace or communicate to the user that the start of the trace hasn’t been received or the trace hasn’t yet concluded. It really is just a bunch of structured log lines with a common attribute to tie them together.

      • hinkley 7 hours ago

        Ugh. One of the reasons I never turned on the tracing code I painstakingly refactored into our stats code was discovering that OTEL makes no attempt to introduce a span to the collector prior to child calls talking about it. Is that really how you want to do event correlation? Time traveling seems like an expensive operation when you're dealing with 50,000 trace events per second.

        The other turns out to be our Ops team's problem more than OTEL's. Well, a little of both. If a trace goes over a limit then OTEL just silently drops the entire thing, and the default size on AWS is useful for toy problems, not for retrofitting onto live systems. It's the silent-failure defaults of OTEL that are giant footguns. Give me a fucking error log on data destruction, you asshats.

        I’ll just use Prometheus next time, which is apparently what our OPs team recommended (except one individual who was the one I talked to).

        • nijave 6 hours ago

          You can usually turn logging on but a lot of the OTEL stack defaults to best effort and silently drops data.

          We had Grafana Agent running which was wrapping the reference implementation OTEL collector written in go and it was pretty easy to see when data was being dropped via logs.

          I think some limitation is also on the storage backend. We were using Grafana Cloud Tempo which imposes limits. I'd think using a backend that doesn't enforce recency would help.

          With the OTEL collector I'd think you could utilize some processors/connectors or write your own to handle individual spans that get too big. Not sure on backends but my current company uses Datadog and their proprietary solution handles >30k spans per trace pretty easily.

          I think the biggest issue is the low cohesion, high DIY nature of OTEL. You can build powerful solutions but you really need to get low level and assemble everything yourself tuning timeouts, limits, etc for your use case.

          • hinkley 6 hours ago

            > I think the biggest issue is the low cohesion, high DIY nature of OTEL

            OTEL is the SpringBoot of telemetry and if you think those are fighting words then I picked the right ones.

    • hinkley 7 hours ago

      Every time people talk about OTel I discover half the people are talking about spans rather than stats. For stats it's not a 'fancy logger' because it's condensing the data at various steps.

      And if you've ever tried to trace a call tree using correlation IDs and Splunk queries and still say OTEL is 'just a fancy logger', then you're in dangerous territory, even if it's just by way of explanation. Don't feed the masochists. When masochists derail attempts at pain reduction they become sadists.

  • madduci 14 hours ago

    I use OTel running in a GKE cluster to track Jenkins jobs, and its spans/traces can track long-running jobs pretty well.

totetsu 12 hours ago

I spent some time working on this. First I tried to make a GitHub Action that was triggered on completion of your other actions and passed along the context of the triggering action in the environment. It then used the GitHub API to pull extra details of the steps and tasks etc., plus the logs, turned that all into a process trace, and sent it via an OTel connection to something like Jaeger or Grafana, to get flame chart views of the performance of steps. I thought maybe it would be better to do this directly from the runner hosts by watching log files, but the API has more detailed information.
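
The core of it was just emitting spans with explicit timestamps pulled from the API. A stripped-down Python sketch of that part (the jobs/steps structure mirrors the "list jobs for a workflow run" response, but the field handling is simplified and assumed):

  from datetime import datetime
  from opentelemetry import trace

  tracer = trace.get_tracer("gha-trace")

  def to_ns(iso_ts):
      # GitHub returns ISO-8601 timestamps like "2024-01-01T00:00:00Z"
      return int(datetime.fromisoformat(iso_ts.replace("Z", "+00:00")).timestamp() * 1e9)

  def emit_run_trace(jobs):
      for job in jobs:
          job_span = tracer.start_span(job["name"], start_time=to_ns(job["started_at"]))
          ctx = trace.set_span_in_context(job_span)
          for step in job["steps"]:
              s = tracer.start_span(step["name"], context=ctx,
                                    start_time=to_ns(step["started_at"]))
              s.end(end_time=to_ns(step["completed_at"]))
          job_span.end(end_time=to_ns(job["completed_at"]))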

remram 8 hours ago

I have thought about that before, but I was blocked by the really poor file support for OTel. I couldn't find an easy way to dump a file from the collector running in my CI job and load it on my laptop for analysis, which is the way I would like to go.

Maybe this has changed?

reactordev 16 hours ago

As someone who has some experience in observability at scale: the issue with SigNoz, Prom, etc. is that they can only operate on the data that is exposed by the underlying infrastructure, whereas the IaaS has all the information needed to provide a better experience. Hence CloudWatch.

That said, if you own your infrastructure, I'd build out a SigNoz cluster in a heartbeat. OTel is awesome, but once you set down a path for your org, it's going to be extremely painful to switch. Choose OTel if you're hybrid cloud or you have on-premises stuff. If you're on AWS, CloudWatch is a better option simply because they have the data. Dead simple tracing.

  • FunnyLookinHat 14 hours ago

    I think you're looking at OTel from a strictly infrastructure perspective - which CloudWatch does effectively solve without any added effort. But OTel really begins to shine when you instrument your backends. Some languages (Node.js) have a whole slew of auto-instrumentation, giving you rich traces with spans detailing each step of the http request, every SQL query, and even usage of AWS services. Making those traces even more valuable is that they're linked across services.
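
    For illustration, the Python equivalents of that auto-instrumentation look roughly like this (these are the standard opentelemetry-instrumentation-* packages; the Flask app itself is a placeholder):

      from flask import Flask
      from opentelemetry.instrumentation.flask import FlaskInstrumentor
      from opentelemetry.instrumentation.requests import RequestsInstrumentor

      app = Flask(__name__)
      FlaskInstrumentor().instrument_app(app)  # a span per inbound HTTP request
      RequestsInstrumentor().instrument()      # child spans for outbound HTTP calls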

    We've frequently seen a slowdown or error at the top of our stack, and the teams are able to immediately pinpoint the problem as a downstream service. Not only that, they can see the specific issue in the downstream service almost immediately!

    Once you get to that level of detail, having your infrastructure metrics pulled into your Otel provider does start to make some sense. If you observe a slowdown in a service, being able to see that the DB CPU is pegged at the same time is meaningful, etc.

    [Edit - Typo!]

    • makeavish 13 hours ago

      Agree with you on this. OTel agents allow exporting all host/k8s metrics correlated with your logs and traces. Exporting AWS-service-specific metrics with OTel is not easy, though; to solve this, SigNoz has 1-Click AWS Integrations: https://signoz.io/blog/native-aws-integrations-with-autodisc...

      Also SigNoz has native correlation between different signals out of the box.

      PS: I am a SigNoz maintainer

    • reactordev 5 hours ago

      Not confusing anything. Yes, you can meter your own applications and generate your own metrics, but most organizations start their observability journey with the hardware and latency metrics.

      OTel provides a means to sugar any metric with labels and attributes, which is great (until you have high cardinality), but there are still things at the infrastructure level that only CloudWatch knows about (on AWS). If you're running K8s on your own hardware, OTel would be my first choice.

    • elza_1111 13 hours ago

      FYI for anyone reading, OTel does have great auto-instrumentation for Python, Java and .NET also

  • 6r17 14 hours ago

    I did have some bad experiences with OTel, and I have a lot of freedom on deployment. I had never heard of SigNoz; I will definitely check it out. SigNoz works with OTel, I suppose?

    I wonder if there are any other adapters for trace ingest instead of OTel?

sali0 14 hours ago

Noob question: I'm currently adding telemetry to my backend.

I was at first implementing otel throughout my api, but ran into some minor headaches and a lot of boilerplate. I shopped a bit around and saw that Sentry has a lot of nice integrations everywhere, and seems to have all the same features (metrics, traces, error reporting). I'm considering just using Sentry for both backend and frontend and other pieces as well.

Curious if anyone has thoughts on this. Assuming Sentry can fulfill our requirements, the only thing that really concerns me is vendor lock-in. But I'm wondering about other people's thoughts.

  • srikanthccv 14 hours ago

    >I was at first implementing otel throughout my api, but ran into some minor headaches and a lot of boilerplate

    OTel also has numerous integrations: https://opentelemetry.io/ecosystem/registry/. In contrast, Sentry lacks traditional metrics and other capabilities that OTel offers. IIRC, Sentry experimented with "DDM" (Delightful Developer Metrics), but this feature was deprecated and removed while still in alpha/beta.

    Sentry excels at error tracking and provides excellent browser integration. This might be sufficient for your needs, but if you're looking for the comprehensive observability features that OpenTelemetry provides, you'd likely need a full observability platform.

  • vrosas 10 hours ago

    Think of OTel as just a standard data format for the logs/traces/metrics that your backend(s) emit, plus some open source libraries for dealing with that data. You can pipe it straight to an observability vendor that accepts these formats (pretty much everyone does - Datadog, Stackdriver, etc.) or you can simply write the data to a database and wire up your own dashboards on top of it (e.g. Grafana).
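
    For example, pointing the Python SDK at an OTLP endpoint is all the wiring most backends need (the localhost endpoint here is a placeholder; a vendor just gives you a different URL plus auth headers):

      from opentelemetry import trace
      from opentelemetry.sdk.trace import TracerProvider
      from opentelemetry.sdk.trace.export import BatchSpanProcessor
      from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

      provider = TracerProvider()
      provider.add_span_processor(
          BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
      )
      trace.set_tracer_provider(provider)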

    Otel can take a little while to understand because, like many standards, it's designed by committee and the code/documentation will reflect that. LLMs can help but the last time I was asking them about otel they constantly gave me code that was out of date with the latest otel libraries.

  • stackskipton 7 hours ago

    Ops type here. OTel is great, but if your metrics are not there, please fix that. In particular, consider just importing prometheus_client and going from there.

    Prometheus is bog easy to run, Grafana understands it, and anything involving alerting/monitoring from logs is a bad idea for future you. I PROMISE YOU, PLEASE DON'T!
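
    A bare-bones prometheus_client setup is roughly this (the metric names and the port are placeholders):

      from prometheus_client import Counter, Histogram, start_http_server

      REQUESTS = Counter("app_requests_total", "Total requests", ["status"])
      LATENCY = Histogram("app_request_seconds", "Request latency")

      @LATENCY.time()
      def handle_request():
          ...  # application logic
          REQUESTS.labels(status="ok").inc()

      start_http_server(8000)  # exposes /metrics for Prometheus to scrape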

    • avtar 5 hours ago

      > anything involving alerting/monitoring from logs is bad idea for future you

      Why is issuing alerts for log events a bad idea?

      • stackskipton 27 minutes ago

        Couple of reasons.

        The biggest one: the sample rate is much higher (every log), and this can cause problems if a service goes haywire and starts spewing logs everywhere. Logging pipelines tend to be very rigid as well, for various reasons. Metrics are easier to handle, as you can step back the sample rate, drop certain metrics, or spin up additional Prometheus instances.

        The logging format also becomes very rigid, and if the company goes multi-language this can be problematic, as different languages behave differently. Is this exception something we care about or not? So we throw more code in an attempt to get log-based alerting into a state that doesn't drive everyone crazy, whereas if we were just doing "rate(critical_errors[5m]) > 10" in Prometheus, we would be all set!

      • _kblcuk_ 5 hours ago

        It’s trivial to alter or remove log lines without knowing or realizing that it affects some alerting or monitoring somewhere. That’s why there are dedicated monitoring and alerting systems to start with.

        • sethammons 5 hours ago

          Same with metrics.

          If you need an artifact from your system, it should be tested. We test our logs and many types of metrics. Too many incidents have come from logs or metrics changing and no longer triggering alerts. I never got to build out my alert test bed that exercises all known alerts in prod, verifying they continue to work.

  • whatevermom 14 hours ago

    Sentry isn't really a full-on observability platform. It's for error reporting only (annotated with traces and logs). It turns out that for most projects this is sufficient. Can't comment on the vendor lock-in part.

  • dboreham 12 hours ago

    You can run your own Sentry server (or at least you could the last time I worked with it). But as others have noted, Sentry is not going to provide the same functionality as OTel.

    • mdaniel 7 hours ago

      The word "can" is doing a lot of work in your comment, based on the now horrific number of moving parts[1] and I think David has even said the self-hosting story isn't a priority for them. Also, don't overlook the license, if your shop is sensitive to non-FOSS licensing terms

      1: https://github.com/getsentry/self-hosted/blob/25.5.1/docker-...

candiddevmike 10 hours ago

How does SigNoz compare to the other "all-in-one" OTel platforms? What part of the open-core bit is behind a paywall?

  • makeavish 9 hours ago

    Only SAML, multiple ingestion keys, and premium support are behind the paywall. SSO is not behind the paywall. Check the pricing page for a detailed comparison: https://signoz.io/pricing/

127dot1 8 hours ago

That's a poor title: the article is not about CI/CD in general, it is specifically about GitHub CI/CD and is thus useless for most CI/CD cases.

  • dang 4 hours ago

    Ok, we've added Github to the title above.

bravesoul2 15 hours ago

That's a genius idea. So obvious in retrospect.