The 'upstream connect error or disconnect/reset before headers. reset reason: connection failure' error when using the OpenTelemetry javaagent in OpenShift
We are running a Jaeger collector (1.48.1) in OpenShift and send telemetry from our Java Spring applications using the OpenTelemetry javaagent, version 1.28.0. The agent points at an OpenShift service endpoint such as http://jaeger-collector-headless.qa-app-monitoring.svc:4317. The issue is that when the collector pod restarts, the javaagent cannot reconnect. Instead, it keeps reporting the error below continuously:
ERROR io.opentelemetry.exporter.internal.grpc.OkHttpGrpcExporter - Failed to export spans. Server is UNAVAILABLE. Make sure your collector is running and reachable from this network. Full error message: upstream connect error or disconnect/reset before headers. reset reason: connection failure
What could be causing this? In other words, what networking/infrastructure problem does this error message map to?
If I simulate the situation locally, i.e. run the local jaeger-all-in-one container, stop it, and then restart it, the error is different and the javaagent restores the connection successfully. This is the error I get locally:
[otel.javaagent 2023-12-27 20:19:37:802 -0800] [OkHttp http://localhost:4317/...] ERROR io.opentelemetry.exporter.internal.grpc.OkHttpGrpcExporter - Failed to export spans. The request could not be executed. Full error message: Failed to connect to localhost/0:0:0:0:0:0:0:1:4317
That case is straightforward: the agent can't connect because the container is intentionally down, and it restores the connection as soon as I start the container again.
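For reference, this is roughly how I run the local simulation (the image tag and the COLLECTOR_OTLP_ENABLED flag reflect my local setup and are only illustrative):

```sh
# Start a local Jaeger all-in-one exposing the OTLP gRPC receiver on 4317 and the UI on 16686
docker run --rm --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 4317:4317 -p 16686:16686 \
  jaegertracing/all-in-one:1.48

# Simulate the outage: stop the container, watch the agent log export errors,
# then run the same "docker run" command again to bring the collector back.
docker stop jaeger
```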
What is the difference between the "upstream connect error or disconnect/reset before headers. reset reason: connection failure" error and the "Failed to connect to localhost/0:0:0:0:0:0:0:1:4317" error?
PS: We pass the following settings to the javaagent: OTEL_TRACES_EXPORTER=otlp, OTEL_METRICS_EXPORTER=none, OTEL_EXPORTER_OTLP_ENDPOINT=http://our-openshift.svc:4317, OTEL_EXPORTER_OTLP_PROTOCOL=grpc.
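For completeness, this is roughly how the agent is attached when I test it by hand (the agent jar path and the application jar name are placeholders; in OpenShift the same environment variables are set on the deployment):

```sh
# Exporter configuration read by the OpenTelemetry javaagent
export OTEL_TRACES_EXPORTER=otlp
export OTEL_METRICS_EXPORTER=none
export OTEL_EXPORTER_OTLP_ENDPOINT=http://our-openshift.svc:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc

# Attach the agent to the Spring application at startup
java -javaagent:/path/to/opentelemetry-javaagent.jar -jar our-spring-app.jar
```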