BUG: Allow read_sql to work with chunksize. #49967

cdcadman · 2022-11-30T00:57:53Z

closes BUG: read_sql with chunksize fails due to connection already closed #50199
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This fixes a bug which I introduced in #49531. That PR refactored code so that SQLDatabase only accepts a sqlalchemy Connection and not an Engine. I used a context manager to handle cleanup. However, this can break read_sql with chunksize, because the caller might be iterating over the chunks after the connection is closed. I added tests with chunksize, and I use NullPool with create_engine, because this is more likely to cause the connection to close. I also created a new string connectable to test with.

mroeschke · 2022-12-01T00:59:20Z

pandas/io/sql.py

@@ -189,6 +190,18 @@ def execute(sql, con, params=None):
 # -- Read and write to DataFrames


+def _read_helper(method: Callable, iterate: bool, **kwargs):
+    """method returns an iterator.  If iterate is False, we need to exhaust the iterator


Can there be a more direct fix to leave the connection open if chunksize is provided? It seems fairly roundabout to exhaust the iterator?

To keep the connection open, I need to make sure that the context manager at line 864 stays open. My solution is to create generating functions _read_sql_table, _read_sql_query, and _read_sql. If chunksize=None, these generating functions return a single DataFrame. Then I have the opposite problem: if the caller wants a single DataFrame, I have to exit out of the context manager at line 864 so that the connection will close. If instead of

for frame in result: pass return frame

I use

return next(result)

then the line 864 context manager would not exit. I agree it is roundabout, but I think having SQLDatabase only accept a Connection, not an Engine, and having SQLDatabase.read_query only return an Iterator[DataFrame] also simplifies things a bit.

Can _sqlalchemy_con be aware of when chunksize is specified and keep the connection open as needed?

We are thinking along similar lines. I think it's will be easier to fold _sqlalchemy_con into the SQLDatabase class, because I want the connection to be closed at the end of iteration, or if there is an exception.

cdcadman · 2022-12-03T19:24:24Z

@mroeschke , an alternative to this PR, which would also address the bug, would be to stop allowing str and sqlalchemy.engine.Engine types to be passed as connection objects to any sql methods. It looks like support for sqlalchemy.engine.Connection was added in response to #7877, and support for str was added in response to #10654. I personally prefer to pass a sqlalchemy.engine.Connection so that I have more control, and this would also simplify the code in pandas.io.sql.

mroeschke · 2022-12-05T20:05:21Z

would be to stop allowing str and sqlalchemy.engine.Engine types to be passed as connection objects to any sql methods.

Unfortunately, I don't think this would be appropriate since

This would need a full major release (2.0 -> 3.0) to deprecate these input types
This would be removing useful functionality a break a lot of downstream users

cdcadman · 2022-12-05T21:22:53Z

Ok, that makes sense. It might be that I need to add __enter__ and __exit__ methods to SQLDatabase and create engines and connections inside of the class. They would either be cleaned up in the __exit__ method or within the generator returned by read_sql if chunksize is not None. The @contextmanager which I applied to pandasSQL_builder in #49531 would be absorbed into SQLDatabase, and I wouldn't need the _read_helper method.

cdcadman · 2022-12-07T18:26:29Z

@mroeschke I created an ExitStack to handle commiting the transaction or closing the Connection, if applicable. In a later PR, it will also dispose of the Engine if the user passes a URI string. ExitStack.close either gets called in SQLDatabase.__exit__ or _cleanup_after_generator, depending on whether chunksize is None.

cdcadman · 2023-01-03T19:59:56Z

@mroeschke , do the changes I made look ok? I also updated my original PR, #48576 , so that the Engine disposal is also carried out by the ExitStack if necessary.

@mroeschke I created an ExitStack to handle commiting the transaction or closing the Connection, if applicable. In a later PR, it will also dispose of the Engine if the user passes a URI string. ExitStack.close either gets called in SQLDatabase.__exit__ or _cleanup_after_generator, depending on whether chunksize is None.

phofl · 2023-01-07T10:06:45Z

pandas/io/sql.py

+            con = create_engine(con)
+        if isinstance(con, Engine):
+            con = self.exit_stack.enter_context(con.connect())
+            if need_transaction:


Any chance that this is related to the autocommit warnings we are getting?

I'm not sure how this PR impacts the autocommit warnings, but I was able to resolve them in #48576 with this commit: 46a6e75 .

Yeah this makes sense. If we get a connection that was not begun yet, we see the warning since the implicit autocommit handled this previously

Yes, and in case the connection is already in a transaction, my other PR modifies the line if need_transaction: so as not to call conn.begin() a second time: 848dc71

phofl · 2023-01-31T17:26:06Z

cc @mroeschke

Could you have a look? I think this is ok as is

mroeschke · 2023-01-31T21:23:10Z

pandas/io/sql.py

@@ -69,6 +72,14 @@
 # -- Helper functions


+def _cleanup_after_generator(generator, exit_stack: ExitStack):


Might have slightly preferred if _query_iterator used self.exit_stack directly in the functions, but could be a follow up e.g.

def _query_iterator(...): with self.exit_stack(): ...

mroeschke · 2023-01-31T21:23:45Z

Thanks @cdcadman sorry for the delay here

cdcadman · 2023-02-01T07:46:38Z

Thanks @mroeschke and I agree that would have been better to put with exit_stack: inside of _query_iterator.

phofl · 2023-02-02T09:28:35Z

Can you create a small pr for it then? Such things are easy to review and can be merged without much discussion

cdcadman mentioned this pull request Nov 30, 2022

Make pandas/io/sql.py work with sqlalchemy 2.0 #48576

Merged

5 tasks

mroeschke reviewed Dec 1, 2022

View reviewed changes

mroeschke added the IO SQL to_sql, read_sql, read_sql_query label Dec 1, 2022

cdcadman marked this pull request as draft December 7, 2022 04:18

cdcadman marked this pull request as ready for review December 7, 2022 18:22

This was referenced Dec 10, 2022

BUG: Make io.sql.execute raise TypeError on Engine or URI string. #50177

Merged

DEPR: pandas.io.sql.execute #50185

Closed

BUG: read_sql with chunksize fails due to connection already closed #50199

Closed

BUG: Allow read_sql to work with chunksize.

654ca5f

cdcadman mentioned this pull request Jan 6, 2023

CI: Try with sqlalchemy warnings #50580

Closed

5 tasks

phofl reviewed Jan 7, 2023

View reviewed changes

mroeschke reviewed Jan 31, 2023

View reviewed changes

mroeschke approved these changes Jan 31, 2023

View reviewed changes

mroeschke added this to the 2.0 milestone Jan 31, 2023

mroeschke merged commit 4033156 into pandas-dev:main Jan 31, 2023

cdcadman deleted the read_sql_bug branch February 1, 2023 07:43

cdcadman mentioned this pull request Feb 2, 2023

CLN: Put exit_stack inside _query_iterator. #51125

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Allow read_sql to work with chunksize. #49967

BUG: Allow read_sql to work with chunksize. #49967

cdcadman commented Nov 30, 2022 •

edited

Loading

mroeschke Dec 1, 2022

cdcadman Dec 1, 2022

mroeschke Dec 5, 2022

cdcadman Dec 5, 2022

cdcadman commented Dec 3, 2022

mroeschke commented Dec 5, 2022

cdcadman commented Dec 5, 2022

cdcadman commented Dec 7, 2022

cdcadman commented Jan 3, 2023

phofl Jan 7, 2023

cdcadman Jan 7, 2023

phofl Jan 7, 2023

cdcadman Jan 7, 2023

phofl commented Jan 31, 2023 •

edited

Loading

mroeschke Jan 31, 2023

mroeschke commented Jan 31, 2023

cdcadman commented Feb 1, 2023

phofl commented Feb 2, 2023

		@@ -69,6 +72,14 @@
		# -- Helper functions


		def _cleanup_after_generator(generator, exit_stack: ExitStack):

BUG: Allow read_sql to work with chunksize. #49967

BUG: Allow read_sql to work with chunksize. #49967

Conversation

cdcadman commented Nov 30, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cdcadman commented Dec 3, 2022

mroeschke commented Dec 5, 2022

cdcadman commented Dec 5, 2022

cdcadman commented Dec 7, 2022

cdcadman commented Jan 3, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

phofl commented Jan 31, 2023 • edited Loading

Choose a reason for hiding this comment

mroeschke commented Jan 31, 2023

cdcadman commented Feb 1, 2023

phofl commented Feb 2, 2023

cdcadman commented Nov 30, 2022 •

edited

Loading

phofl commented Jan 31, 2023 •

edited

Loading