Fix agent lifecycle events: Start instance name and Shutdown delivery#810
Merged
Conversation
The BootService priority contract (higher priority is started earlier and shut down later) was inverted in ServiceManager: prepare()/startup() sorted ascending and shutdown() sorted descending. As a result: - ServiceInstanceGenerator (priority MAX_VALUE) prepared last, so the Start event was built before INSTANCE_NAME was generated, reporting an empty serviceInstance. - GRPCChannelManager (priority MAX_VALUE) shut down first and closed the gRPC channel before EventReportServiceClient could send the Shutdown event. Sort prepare()/startup() descending and shutdown() ascending to match the documented contract, so foundational services come up first and go down last. Align EventReportServiceClient with the TraceSegmentServiceClient / ServiceManagementClient idiom: volatile status/stub, a plain stub built on CONNECTED before the status is published, and a fresh gRPC deadline applied per send. Previously the Shutdown event reused a stub whose absolute deadline was fixed at connect time and had already expired for any JVM alive longer than the upstream timeout. The Start-event gate now uses explicit readiness (bootCompleted + CONNECTED) with reported meaning sent/in-flight, so a failed send can retry on the next connect instead of being dropped. KafkaProducerManager no longer derives its priority from GRPCChannelManager; it uses a constant priority higher than the default-priority Kafka reporters that share its producer, so the producer is closed only after they stop. Add ServiceManagerOrderingTest and a KafkaProducerManager priority test.
wankai123
approved these changes
Jul 1, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fix agent lifecycle events: the Start event missed the service instance name, and the Shutdown event was never delivered
Why the bug exists
BootService#priority()is documented as "BootServices with higher priorities will be started earlier, and shut down later", butServiceManagersorted the phases the opposite way —prepare()/startup()ascending andshutdown()descending. So theInteger.MAX_VALUEservices ran last on boot and first on shutdown, breaking two lifecycle events:ServiceInstanceGenerator(MAX) prepared last, soEventReportServiceClientbuilt the Start event beforeINSTANCE_NAMEwas generated → emptyserviceInstance.GRPCChannelManager(MAX) shut down first, closing the gRPC channel beforeEventReportServiceClient.shutdown()could send the Shutdown event. A second defect compounded it: the event stub's deadline was fixed once at connect time (withDeadlineAfter), so for any JVM alive longer than the upstream timeout the Shutdown RPC failed withDEADLINE_EXCEEDEDeven on a live channel.How it's fixed
ServiceManager: sortprepare()/startup()descending andshutdown()ascending, matching the documented contract (foundational services come up first and go down last).EventReportServiceClient: align with the stableTraceSegmentServiceClient/ServiceManagementClientidiom —volatilestatus/stub, a plain stub built onCONNECTEDbefore the status is published, and a fresh deadline applied per send. The Start-event gate now uses explicit readiness (bootCompleted+CONNECTED), withreportedmeaning sent/in-flight, so a failed send retries on the next connect instead of being dropped (this also fixes the previously invertedcompareAndSetgate).KafkaProducerManager: no longer derives its priority fromGRPCChannelManager— that coupling flipped under the corrected order and would close the shared producer before the Kafka reporters stop. It now uses a constant priority higher than the default-priority Kafka reporters, so the producer is closed only after they stop sending.Verification
Added
ServiceManagerOrderingTest(pins boot descending / shutdown ascending) and aKafkaProducerManagerpriority test;apm-agent-coreandkafka-reporter-pluginsuites pass with 0 checkstyle violations. Also verified end-to-end against a local OAP + BanyanDB: a service instrumented with the built agent reports a Start event carrying the instance name and, on graceful shutdown (including JVMs running well past the upstream timeout), a Shutdown event.CHANGESlog.