Skip to content

BUG: Allow read_sql to work with chunksize. #49967

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 31, 2023
Merged

BUG: Allow read_sql to work with chunksize. #49967

merged 1 commit into from
Jan 31, 2023

Conversation

cdcadman
Copy link
Contributor

@cdcadman cdcadman commented Nov 30, 2022

This fixes a bug which I introduced in #49531. That PR refactored code so that SQLDatabase only accepts a sqlalchemy Connection and not an Engine. I used a context manager to handle cleanup. However, this can break read_sql with chunksize, because the caller might be iterating over the chunks after the connection is closed. I added tests with chunksize, and I use NullPool with create_engine, because this is more likely to cause the connection to close. I also created a new string connectable to test with.

pandas/io/sql.py Outdated
@@ -189,6 +190,18 @@ def execute(sql, con, params=None):
# -- Read and write to DataFrames


def _read_helper(method: Callable, iterate: bool, **kwargs):
"""method returns an iterator. If iterate is False, we need to exhaust the iterator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can there be a more direct fix to leave the connection open if chunksize is provided? It seems fairly roundabout to exhaust the iterator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To keep the connection open, I need to make sure that the context manager at line 864 stays open. My solution is to create generating functions _read_sql_table, _read_sql_query, and _read_sql. If chunksize=None, these generating functions return a single DataFrame. Then I have the opposite problem: if the caller wants a single DataFrame, I have to exit out of the context manager at line 864 so that the connection will close. If instead of

for frame in result:
    pass
return frame

I use

return next(result)

then the line 864 context manager would not exit. I agree it is roundabout, but I think having SQLDatabase only accept a Connection, not an Engine, and having SQLDatabase.read_query only return an Iterator[DataFrame] also simplifies things a bit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can _sqlalchemy_con be aware of when chunksize is specified and keep the connection open as needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are thinking along similar lines. I think it's will be easier to fold _sqlalchemy_con into the SQLDatabase class, because I want the connection to be closed at the end of iteration, or if there is an exception.

@mroeschke mroeschke added the IO SQL to_sql, read_sql, read_sql_query label Dec 1, 2022
@cdcadman
Copy link
Contributor Author

cdcadman commented Dec 3, 2022

@mroeschke , an alternative to this PR, which would also address the bug, would be to stop allowing str and sqlalchemy.engine.Engine types to be passed as connection objects to any sql methods. It looks like support for sqlalchemy.engine.Connection was added in response to #7877, and support for str was added in response to #10654. I personally prefer to pass a sqlalchemy.engine.Connection so that I have more control, and this would also simplify the code in pandas.io.sql.

@mroeschke
Copy link
Member

would be to stop allowing str and sqlalchemy.engine.Engine types to be passed as connection objects to any sql methods.

Unfortunately, I don't think this would be appropriate since

  1. This would need a full major release (2.0 -> 3.0) to deprecate these input types
  2. This would be removing useful functionality a break a lot of downstream users

@cdcadman
Copy link
Contributor Author

cdcadman commented Dec 5, 2022

Ok, that makes sense. It might be that I need to add __enter__ and __exit__ methods to SQLDatabase and create engines and connections inside of the class. They would either be cleaned up in the __exit__ method or within the generator returned by read_sql if chunksize is not None. The @contextmanager which I applied to pandasSQL_builder in #49531 would be absorbed into SQLDatabase, and I wouldn't need the _read_helper method.

@cdcadman cdcadman marked this pull request as draft December 7, 2022 04:18
@cdcadman cdcadman marked this pull request as ready for review December 7, 2022 18:22
@cdcadman
Copy link
Contributor Author

cdcadman commented Dec 7, 2022

@mroeschke I created an ExitStack to handle commiting the transaction or closing the Connection, if applicable. In a later PR, it will also dispose of the Engine if the user passes a URI string. ExitStack.close either gets called in SQLDatabase.__exit__ or _cleanup_after_generator, depending on whether chunksize is None.

@cdcadman
Copy link
Contributor Author

cdcadman commented Jan 3, 2023

@mroeschke , do the changes I made look ok? I also updated my original PR, #48576 , so that the Engine disposal is also carried out by the ExitStack if necessary.

@mroeschke I created an ExitStack to handle commiting the transaction or closing the Connection, if applicable. In a later PR, it will also dispose of the Engine if the user passes a URI string. ExitStack.close either gets called in SQLDatabase.__exit__ or _cleanup_after_generator, depending on whether chunksize is None.

@cdcadman cdcadman mentioned this pull request Jan 6, 2023
5 tasks
con = create_engine(con)
if isinstance(con, Engine):
con = self.exit_stack.enter_context(con.connect())
if need_transaction:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance that this is related to the autocommit warnings we are getting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how this PR impacts the autocommit warnings, but I was able to resolve them in #48576 with this commit: 46a6e75 .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this makes sense. If we get a connection that was not begun yet, we see the warning since the implicit autocommit handled this previously

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and in case the connection is already in a transaction, my other PR modifies the line if need_transaction: so as not to call conn.begin() a second time: 848dc71

@phofl
Copy link
Member

phofl commented Jan 31, 2023

cc @mroeschke

Could you have a look? I think this is ok as is

@@ -69,6 +72,14 @@
# -- Helper functions


def _cleanup_after_generator(generator, exit_stack: ExitStack):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might have slightly preferred if _query_iterator used self.exit_stack directly in the functions, but could be a follow up e.g.

def _query_iterator(...):
    with self.exit_stack():
        ...

@mroeschke mroeschke added this to the 2.0 milestone Jan 31, 2023
@mroeschke mroeschke merged commit 4033156 into pandas-dev:main Jan 31, 2023
@mroeschke
Copy link
Member

Thanks @cdcadman sorry for the delay here

@cdcadman cdcadman deleted the read_sql_bug branch February 1, 2023 07:43
@cdcadman
Copy link
Contributor Author

cdcadman commented Feb 1, 2023

Thanks @mroeschke and I agree that would have been better to put with exit_stack: inside of _query_iterator.

@phofl
Copy link
Member

phofl commented Feb 2, 2023

Can you create a small pr for it then? Such things are easy to review and can be merged without much discussion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: read_sql with chunksize fails due to connection already closed
3 participants