-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Allow read_sql to work with chunksize. #49967
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/io/sql.py
Outdated
@@ -189,6 +190,18 @@ def execute(sql, con, params=None): | |||
# -- Read and write to DataFrames | |||
|
|||
|
|||
def _read_helper(method: Callable, iterate: bool, **kwargs): | |||
"""method returns an iterator. If iterate is False, we need to exhaust the iterator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can there be a more direct fix to leave the connection open if chunksize is provided? It seems fairly roundabout to exhaust the iterator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To keep the connection open, I need to make sure that the context manager at line 864 stays open. My solution is to create generating functions _read_sql_table
, _read_sql_query
, and _read_sql
. If chunksize=None
, these generating functions return a single DataFrame
. Then I have the opposite problem: if the caller wants a single DataFrame, I have to exit out of the context manager at line 864 so that the connection will close. If instead of
for frame in result:
pass
return frame
I use
return next(result)
then the line 864 context manager would not exit. I agree it is roundabout, but I think having SQLDatabase
only accept a Connection
, not an Engine
, and having SQLDatabase.read_query
only return an Iterator[DataFrame]
also simplifies things a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can _sqlalchemy_con
be aware of when chunksize
is specified and keep the connection open as needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are thinking along similar lines. I think it's will be easier to fold _sqlalchemy_con
into the SQLDatabase
class, because I want the connection to be closed at the end of iteration, or if there is an exception.
@mroeschke , an alternative to this PR, which would also address the bug, would be to stop allowing |
Unfortunately, I don't think this would be appropriate since
|
Ok, that makes sense. It might be that I need to add |
@mroeschke I created an |
@mroeschke , do the changes I made look ok? I also updated my original PR, #48576 , so that the
|
con = create_engine(con) | ||
if isinstance(con, Engine): | ||
con = self.exit_stack.enter_context(con.connect()) | ||
if need_transaction: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any chance that this is related to the autocommit warnings we are getting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this makes sense. If we get a connection that was not begun yet, we see the warning since the implicit autocommit handled this previously
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and in case the connection is already in a transaction, my other PR modifies the line if need_transaction:
so as not to call conn.begin()
a second time: 848dc71
cc @mroeschke Could you have a look? I think this is ok as is |
@@ -69,6 +72,14 @@ | |||
# -- Helper functions | |||
|
|||
|
|||
def _cleanup_after_generator(generator, exit_stack: ExitStack): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might have slightly preferred if _query_iterator
used self.exit_stack
directly in the functions, but could be a follow up e.g.
def _query_iterator(...):
with self.exit_stack():
...
Thanks @cdcadman sorry for the delay here |
Thanks @mroeschke and I agree that would have been better to put |
Can you create a small pr for it then? Such things are easy to review and can be merged without much discussion |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.This fixes a bug which I introduced in #49531. That PR refactored code so that
SQLDatabase
only accepts a sqlalchemyConnection
and not anEngine
. I used a context manager to handle cleanup. However, this can breakread_sql
withchunksize
, because the caller might be iterating over the chunks after the connection is closed. I added tests with chunksize, and I useNullPool
withcreate_engine
, because this is more likely to cause the connection to close. I also created a new string connectable to test with.