
Azure Event Hub binding stops working if storage account returns 503 #3573

Open
MattCosturos opened this issue Oct 18, 2024 · 3 comments · May be fixed by #3614

@MattCosturos

Expected Behavior

The Azure Event Hub binding continues working after receiving a 503 error from the storage account used for checkpointing.

Actual Behavior

The event hub binding dies after receiving a 503 error from blob storage. The component does not appear to retry the calls to the storage account and simply gives up. The logs then fill with "Error from event processor: the Processor has been stopped. Create a new instance to start processing again".

We are using a standard storage account, not a premium one. I could switch to a premium account for better throughput, but I would still be concerned that the component could fail to process messages after any sort of transient error.

Can we somehow configure a retry policy to restart this component instead of filling the logs with errors?
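As a possible stopgap for the retry question, Dapr's resiliency spec can wrap a component's inbound operations in retries. The following is only a hypothetical sketch (the component name component-telemetry is taken from the logs; the policy values are placeholders), and as the analysis later in this thread shows, retries at this layer cannot revive a processor the SDK has already stopped:

```yaml
# Hypothetical Dapr Resiliency sketch: retries inbound operations on the
# component. Values are placeholders, not a verified fix for this issue.
apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: eventhub-retries
spec:
  policies:
    retries:
      retryForever:
        policy: exponential
        maxInterval: 15s
        maxRetries: -1      # retry indefinitely
  targets:
    components:
      component-telemetry:
        inbound:
          retry: retryForever
```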

time="2024-10-17T19:48:21.031356514Z" level=warning msg="The default value for 'spec.metric.http.increasedCardinality' will change to 'false' in Dapr 1.15 or later" app_id=component-telemetry instance=component-telemetry-microservice-54454fbdd7-mxr8s scope=dapr.runtime.diagnostics type=log ver=1.14.4
time="2024-10-17T19:48:24.243756578Z" level=warning msg="The default value for 'spec.metric.http.increasedCardinality' will change to 'false' in Dapr 1.15 or later" app_id=component-telemetry instance=component-telemetry-microservice-54454fbdd7-mxr8s scope=dapr.runtime.http type=log ver=1.14.4
time="2024-10-18T01:25:52.104204839Z" level=error msg="Error processing events from partition client: error receiving events: client has been closed by user" app_id=component-telemetry component="component-telemetry-1 (bindings.azure.eventhubs/v1)" instance=component-telemetry-microservice-54454fbdd7-mxr8s scope=dapr.contrib type=log ver=1.14.4
time="2024-10-18T01:25:56.922083584Z" level=error msg="Error processing events from partition client: error receiving events: client has been closed by user" app_id=component-telemetry component="component-telemetry-0 (bindings.azure.eventhubs/v1)" instance=component-telemetry-microservice-54454fbdd7-mxr8s scope=dapr.contrib type=log ver=1.14.4
time="2024-10-18T01:26:02.932347006Z" level=error msg="Error from event processor: GET https://my-storage-act.blob.core.windows.net/dapr-component-telemetry\n--------------------------------------------------------------------------------\nRESPONSE 503: 503 The server is busy.\nERROR CODE: ServerBusy\n--------------------------------------------------------------------------------\n\ufeff<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>ServerBusy</Code><Message>The server is busy.\nRequestId:9e9a9f71-c01e-0015-15fc-202deb000000\nTime:2024-10-18T01:25:52.0107247Z</Message></Error>\n--------------------------------------------------------------------------------\n" app_id=component-telemetry component="component-telemetry-1 (bindings.azure.eventhubs/v1)" instance=component-telemetry-microservice-54454fbdd7-mxr8s scope=dapr.contrib type=log ver=1.14.4
time="2024-10-18T01:26:07.643955727Z" level=error msg="Error from event processor: GET https://my-storage-act.blob.core.windows.net/dapr-component-telemetry\n--------------------------------------------------------------------------------\nRESPONSE 503: 503 The server is busy.\nERROR CODE: ServerBusy\n--------------------------------------------------------------------------------\n\ufeff<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>ServerBusy</Code><Message>The server is busy.\nRequestId:9e9aad9c-c01e-0015-75fc-202deb000000\nTime:2024-10-18T01:25:56.8329114Z</Message></Error>\n--------------------------------------------------------------------------------\n" app_id=component-telemetry component="component-telemetry-0 (bindings.azure.eventhubs/v1)" instance=component-telemetry-microservice-54454fbdd7-mxr8s scope=dapr.contrib type=log ver=1.14.4
time="2024-10-18T01:26:07.933024175Z" level=error msg="Error from event processor: the Processor has been stopped. Create a new instance to start processing again" app_id=component-telemetry component="component-telemetry-1 (bindings.azure.eventhubs/v1)" instance=component-telemetry-microservice-54454fbdd7-mxr8s scope=dapr.contrib type=log ver=1.14.4
time="2024-10-18T01:26:12.644981572Z" level=error msg="Error from event processor: the Processor has been stopped. Create a new instance to start processing again" app_id=component-telemetry component="component-telemetry-0 (bindings.azure.eventhubs/v1)" instance=component-telemetry-microservice-54454fbdd7-mxr8s scope=dapr.contrib type=log ver=1.14.4

Steps to Reproduce the Problem

Unable to reproduce on demand. Leave the component running long enough and it eventually happens.

MattCosturos added the kind/bug label on Oct 18, 2024
@oising

oising commented Nov 14, 2024

This just happened to us on Dapr 1.14.4, except it was a failure to connect to the storage account that triggered it.

time="2024-11-12T01:47:38.209787491Z" level=error msg="Error from event processor: Get \"https://REDACTED.blob.core.windows.net/checkpoints?comp=list&include=metadata&prefix=REDACTED.servicebus.windows.net%2Fdevice-out%2Fdispatcher%2Fownership%2F&restype=container\": dial tcp:REDACTED:443: connect: connection refused" app_id=dispatcher component="binding-device-out-input (bindings.azure.eventhubs/v1)" instance=dispatcher-848b9dd7fd-2cg2w scope=dapr.contrib type=log ver=1.14.4

@MattCosturos
Author

OK, I did some research now that I have a basic understanding of Go.

eventhubs.go creates the Event Hub Processor on Line 246

The main loop is declared further down on Line 304

This loop does try to resume from errors returned by the call to processor.Run.
However, the Azure SDK's Run only ever returns unrecoverable errors (on cancellation it returns nil).

The docs state:

// Run handles the load balancing loop, blocking until the passed in context is cancelled
// or it encounters an unrecoverable error. On cancellation, it will return a nil error.
//
// This function should run for the lifetime of your application, or for as long as you want
// to continue to claim and process partitions.
//
// Once a Processor has been stopped it cannot be restarted and a new instance must
// be created.

The main loop has a 5-second sleep before retrying, which I believe lines up with the timestamps in my logs.
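Paraphrased, the loop's current shape is roughly this fragment (not verbatim from eventhubs.go, and abbreviated for illustration):

```go
// The processor is created once, before the loop, so after the first
// unrecoverable error every retry calls Run on an instance that is
// permanently stopped.
processor, err := aeh.getProcessorForTopic(ctx, topic)
if err != nil {
	return err
}
for {
	if err := processor.Run(ctx); err != nil {
		aeh.logger.Errorf("Error from event processor: %v", err)
	}
	if ctx.Err() != nil {
		return nil
	}
	time.Sleep(5 * time.Second) // the 5-second gap visible in the logs
}
```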

This component has to handle the fact that the errors returned by processor.Run are not recoverable: it needs to create a new Event Hub processor after each failure.
It might be as simple as moving the call to aeh.getProcessorForTopic into the subscriberLoop func, roughly as sketched below.
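A minimal sketch of that change, with hypothetical stub types standing in for the real SDK processor and the Dapr logger (subscriberLoop and getProcessorForTopic are the names referenced above; everything else is simplified so the sketch compiles on its own):

```go
package eventhubs

import (
	"context"
	"time"
)

// runner stands in for *azeventhubs.Processor; logger for the component's
// logger. Both are hypothetical stubs for illustration.
type runner interface{ Run(ctx context.Context) error }

type logger interface{ Errorf(format string, args ...any) }

// AzureEventHubs is a simplified stand-in for the real component struct.
type AzureEventHubs struct {
	logger               logger
	getProcessorForTopic func(ctx context.Context, topic string) (runner, error)
}

// subscriberLoop builds a fresh processor on every iteration, because a
// Processor that has returned an error from Run can never be restarted.
func (aeh *AzureEventHubs) subscriberLoop(ctx context.Context, topic string) {
	for {
		processor, err := aeh.getProcessorForTopic(ctx, topic)
		if err != nil {
			aeh.logger.Errorf("Error creating processor for topic %s: %v", topic, err)
		} else if err := processor.Run(ctx); err != nil {
			// Run returns nil on context cancellation, so any non-nil
			// error means this instance is dead and must be replaced.
			aeh.logger.Errorf("Error from event processor: %v", err)
		}

		select {
		case <-ctx.Done():
			return
		case <-time.After(5 * time.Second):
			// Preserve the existing 5-second backoff between attempts.
		}
	}
}
```

With this shape, a 503 from the checkpoint store costs one backoff interval instead of permanently stopping the binding.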

yaron2 self-assigned this on Nov 25, 2024
yaron2 added this to the v1.15 milestone on Nov 25, 2024
@yaron2
Member

yaron2 commented Nov 25, 2024

Assigned to 1.15
