
Provide actionable feedback when internal operations are failing rather than silently retrying for over 10mins #346


Closed
dhirschfeld opened this issue Feb 4, 2024 · 6 comments

Comments

@dhirschfeld
Contributor

I'm trying to debug a problem where my request appears to hang indefinitely.

If I pause the program in the debugger it's deep in a retry loop in urllib3:

>>> response
<urllib3.response.HTTPResponse at 0x187b9786ec0>

>>> response.status
403

...and it seems the default timeout for requests is 15 minutes (900 seconds):

>>> self
<databricks.sql.auth.thrift_http_client.THttpClient at 0x187b97858a0>

>>> self._THttpClient__timeout
900.0
@dhirschfeld
Contributor Author

To all intents and purposes the library has hung: there is no feedback that the operation is failing and being retried multiple times, nor any indication of why it is failing.

Maybe this issue is more of a request for better feedback in the case of failing HTTP operations.

@dhirschfeld
Contributor Author

The 403 is pretty fishy.

Debugging a bit further I can see the token being passed in the headers:

>>> headers
{'Content-Type': 'application/x-thrift',
 'Content-Length': '167',
 'User-Agent': 'PyDatabricksSqlConnector/3.0.1 (sqlalchemy + )',
 'Authorization': 'Bearer <snip>d6b6'}

The token being passed is the same as my default environment variable:

❯ echo $env:DATABRICKS_TOKEN
<snip>d6b6

BUT in my calling script I explicitly set the correct token to use:

os.environ['DATABRICKS_TOKEN'] = '<snip>60cd'

...so it seems the token is somehow not being picked up?
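The symptom above is consistent with the token being captured once, before the calling script overrides the environment variable. A minimal sketch of that timing trap (`FakeClient` is hypothetical and only models *when* the token is read, not the real connector):

```python
import os

class FakeClient:
    """Hypothetical client that reads DATABRICKS_TOKEN at construction time."""
    def __init__(self):
        # The token is captured once, here; later changes to the
        # environment variable are invisible to this instance.
        self.token = os.environ.get("DATABRICKS_TOKEN")

os.environ["DATABRICKS_TOKEN"] = "default-d6b6"   # e.g. set in the shell profile
client = FakeClient()
os.environ["DATABRICKS_TOKEN"] = "override-60cd"  # too late: already captured

print(client.token)  # prints "default-d6b6", not the override
```

If the connector (or an auth provider it constructs) reads the variable at import or connection-creation time, setting `os.environ` afterwards would have no effect, which matches the stale token seen in the headers.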

@dhirschfeld dhirschfeld changed the title Set appropriate timeouts Provide actionable feedback when internal operations are failing rather than silently retrying for over 10mins Feb 4, 2024
@susodapop
Contributor

Do you know what kind of Thrift request is being sent that results in the 403? Is it an OpenSession?

> there is no feedback that the operation is failing and being retried multiple times or any indication of why it is failing.

Is this true even when you have your log level set at DEBUG? urllib3 and databricks-sql-connector both emit debug messages when they determine a request is retry-able.

Adding this to your script will help with the repro:

import logging

logging.getLogger("databricks.sql").setLevel(logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)

> The 403 is pretty fishy.

Yeah, generally a 403 should halt everything, since authentication errors are by nature not recoverable. The DBX team should investigate. Likely the fix will be an additional conditional check added to the should_retry() method in DatabricksRetryPolicy.
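The suggested fix can be sketched generically against urllib3's `Retry` class. `HaltOn403Retry` below is hypothetical and merely illustrates the kind of conditional check proposed for `DatabricksRetryPolicy.should_retry()`; it is not the actual implementation:

```python
from urllib3.util.retry import Retry

class HaltOn403Retry(Retry):
    """Sketch: treat HTTP 403 as fatal instead of retryable.

    Illustrates the proposed conditional check; the real fix would live in
    DatabricksRetryPolicy.should_retry(), not in a urllib3 Retry subclass.
    """

    def is_retry(self, method, status_code, has_retry_after=False):
        if status_code == 403:
            # Auth failures are not recoverable: surface them immediately
            # instead of silently retrying for up to 15 minutes.
            return False
        return super().is_retry(method, status_code, has_retry_after)
```

With a policy like this, the first 403 would propagate as an error instead of entering the retry loop shown in the logs below.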

@dhirschfeld
Contributor Author

Circling back since I just hit this again. If I configure logging like:

import logging

logging.basicConfig(level='DEBUG')
logging.getLogger("databricks.sql").setLevel(logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)

I do get some feedback! 🎉

DEBUG:databricks.sql.thrift_backend:retry parameter: _retry_delay_min given_or_default 1.0
DEBUG:databricks.sql.thrift_backend:retry parameter: _retry_delay_max given_or_default 60.0
DEBUG:databricks.sql.thrift_backend:retry parameter: _retry_stop_after_attempts_count given_or_default 30
DEBUG:databricks.sql.thrift_backend:retry parameter: _retry_stop_after_attempts_duration given_or_default 900.0
DEBUG:databricks.sql.thrift_backend:retry parameter: _retry_delay_default given_or_default 5.0
DEBUG:databricks.sql.thrift_backend:Sending request: OpenSession(<REDACTED>)
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): test.cloud.databricks.com:443
DEBUG:urllib3.connectionpool:https://test.cloud.databricks.com:443 "POST /sql/1.0/warehouses/deadbeef HTTP/1.1" 403 54
DEBUG:databricks.sql.auth.retry:This request should be retried: Other
DEBUG:databricks.sql.auth.retry:Failed requests are retried by default per configured DatabricksRetryPolicy
DEBUG:urllib3.util.retry:Incremented Retry for (url='/sql/1.0/warehouses/deadbeef'): DatabricksRetryPolicy(total=29, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /sql/1.0/warehouses/deadbeef
DEBUG:urllib3.connectionpool:https://test.cloud.databricks.com:443 "POST /sql/1.0/warehouses/deadbeef HTTP/1.1" 403 54
DEBUG:databricks.sql.auth.retry:This request should be retried: Other
DEBUG:databricks.sql.auth.retry:Failed requests are retried by default per configured DatabricksRetryPolicy
DEBUG:urllib3.util.retry:Incremented Retry for (url='/sql/1.0/warehouses/deadbeef'): DatabricksRetryPolicy(total=28, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /sql/1.0/warehouses/deadbeef
DEBUG:urllib3.connectionpool:https://test.cloud.databricks.com:443 "POST /sql/1.0/warehouses/deadbeef HTTP/1.1" 403 54
DEBUG:databricks.sql.auth.retry:This request should be retried: Other
DEBUG:databricks.sql.auth.retry:Failed requests are retried by default per configured DatabricksRetryPolicy
DEBUG:urllib3.util.retry:Incremented Retry for (url='/sql/1.0/warehouses/deadbeef'): DatabricksRetryPolicy(total=27, connect=None, read=None, redirect=None, status=None)
... (the same POST → 403 → retry cycle repeats, with total counting down 26, 25, 24, ...)
<ctrl-c>
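The `retry parameter: ... given_or_default` lines above suggest these knobs are configurable rather than hard-coded. A hedged sketch of shortening the retry window; the parameter names are taken from the log, but whether `databricks.sql.connect()` accepts them as underscore-prefixed keyword arguments is an assumption:

```python
# Parameter names and defaults taken from the DEBUG log above
# ("retry parameter: _retry_stop_after_attempts_count given_or_default 30", etc.).
retry_overrides = {
    "_retry_stop_after_attempts_count": 3,        # default 30 attempts
    "_retry_stop_after_attempts_duration": 60.0,  # default 900 s (15 min)
    "_retry_delay_max": 10.0,                     # default 60 s
}

# Assumed usage (untested here; requires a live warehouse):
# from databricks import sql
# connection = sql.connect(
#     server_hostname="test.cloud.databricks.com",
#     http_path="/sql/1.0/warehouses/deadbeef",
#     access_token=os.environ["DATABRICKS_TOKEN"],
#     **retry_overrides,
# )
```

Even if this shortens the hang, it only mitigates the symptom; the 403 would still be retried rather than raised.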

@dhirschfeld
Contributor Author

> I do get some feedback! 🎉

however, I don't think users should have to (know how to) configure logging to get critical usage information from the library. The library should not appear to hang indefinitely without providing any feedback/information to the user.

In this particular case, I don't think the library should be responsible for retries at all - just raise an exception and let the user decide how best to handle the error. Certainly, indefinitely retrying a 403 isn't going to lead to success.

@andrew-christianson

andrew-christianson commented Feb 28, 2024

FWIW, chiming in to say that I hit the same issue (an invalid token), saw exactly the same output from urllib3 (after setting the log level appropriately), and was similarly stymied for an hour by the apparently hung process. Failing loudly on 403 seems like the right solution here.
