Skip to content

Reuse HTTP connections with a connection pool #131

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
Jun 7, 2023
Merged

Conversation

susodapop
Copy link
Contributor

@susodapop susodapop commented May 24, 2023

Background

Right now, PySQL creates a new HTTP connection for every thrift RPC. This is wildly inefficient.

It happens because HTTP connections are handled by the thrift package. This is discussed in the Apache Thrift bug tracker here: https://issues.apache.org/jira/browse/THRIFT-2033

Quoting from that issue:

Currently THttpClient uses deprecated httplib.HTTP class (it is even not documented now). httplib.HTTP only supports HTTP 1.0 connections. As a result every Thrift call results new http or https connection. This is very expensive, especially for https. I suggest use urllib3 library which implements pool of keep-alive http connections.

The linked issue includes a patch that modifies thrift's transport/THttpClient.py

PySQL already subclasses this in sql/auth/thrift_http_client.py file. The goal of this PR is to apply a similar patch to our subclass of THttpClient.

Implementation

I selectively copied over the methods from Thrift's THttpClient and implemented them using urllib3's ConnectionPool classes. After doing so the e2e test suite execution time dipped by 50%.

⚠️ Note about Proxies

We don't have a good way to test proxy support / behaviour in the connector at this time. I've made a best-effort to implement proxy support using ProxyManager. But this could break for some users. I verified that this patch at least attempts to connect through a proxy. Beyond that it's anyone's guess.

We would be more careful about changing this behaviour except that PySQL has not historically supported proxies. We only made them work provisionally in #81 which was incorporated into the v2.5.0 release.

This means that the majority of users are not using proxies today and will not be affected by a regression in proxy support. But everyone benefits from an upgrade to HTTP 1.1 support. We'll rely on users to report if this doesn't work for them.

The mitigation if proxies break in the short-term is to pin to a previous release of PySQL and report any issues with proxies to Databricks.

Related Tickets & Documents

PECO-715

here: https://issues.apache.org/jira/browse/THRIFT-2033

This patch modifies thrift's transport/THttpClient.py

PySQL already subclasses this in our sql/auth/thrift_http_client.py file

So I should be able to apply the same patch there.

Signed-off-by: Jesse Whitehouse <[email protected]>
@susodapop
Copy link
Contributor Author

To make this easier I downloaded the patch from Thrift's bug tracker and ran this command to get it's diff:

cat 0001.patch | colordiff | less -RS

Jesse Whitehouse added 13 commits May 24, 2023 14:29
I ran test_create_insert_drop_table_core to prove that the e2e tests
still function. Meaning that this step didn't break anything.

Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
I was only using this for nice code-completion during refactoring

Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
This reverts commit 3261137.

Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
@susodapop
Copy link
Contributor Author

Connection pools now work for our entire e2e test suite. But I don't have a way to verify yet that one TCP connection is being reused. I would expect to see the e2e test suite execute more quickly if that were the case. But it's actually slower with the connection pool.

In urllib3 this is controlled by calls to release_conn() or by telling the library to automatically release connections. But doing either of these outside the transport's close() method causes weird race conditions.

Also need to implement urllib3's proxy handling. The patch in its current form doesn't support proxies.

This improves the e2e test performance by 50%

Signed-off-by: Jesse Whitehouse <[email protected]>
@susodapop
Copy link
Contributor Author

Since thrift itself doesn't have a hook for calling close() when flush() is called for the last time, I moved the call one step higher in the code (to thrift_backend.py). This change improves the e2e test suite performance by 50%.

The remaining step here is to implement the proxy handling.

Jesse Whitehouse added 3 commits June 6, 2023 13:21
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
@susodapop susodapop marked this pull request as ready for review June 6, 2023 20:12
Signed-off-by: Jesse Whitehouse <[email protected]>
@susodapop susodapop changed the title Reuse TCP connections in THttpClient Reuse HTTP connections with a connection pool Jun 6, 2023
@susodapop
Copy link
Contributor Author

Proxies are now provisionally implemented. We are not testing the proxy support for now.

Jesse Whitehouse added 2 commits June 6, 2023 15:40
Signed-off-by: Jesse Whitehouse <[email protected]>
issues with urllib3

Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Copy link
Contributor

@andrefurlan-db andrefurlan-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. We just need to add comments to the important lines that were updated so other readers know what is going on

Jesse Whitehouse added 3 commits June 7, 2023 12:04
This reverts commit 259702d.

Signed-off-by: Jesse Whitehouse <[email protected]>
This reverts commit aa4fc0a.

Signed-off-by: Jesse Whitehouse <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants