Skip to content

ENH: Pluggable SQL performance via new SQL engine keyword #40556

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
May 3, 2021

Conversation

yehoshuadimarsky
Copy link
Contributor

@yehoshuadimarsky yehoshuadimarsky commented Mar 21, 2021

  • xref, but is first step in implementing ENH: Pluggable SQL performance #36893
  • tests added and passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry (in the docstring)
  • whatsnew entry (in the docs)

This is the first step in implementing an engine backend to the SQL I/O functions, as discussed in #36893. This PR is simply to refactor the existing SQLAlchemy parts to make them extensible for an external engine.

No new engines were added at this stage. No new tests were added, only verified that existing tests fail ("test behavior, not implementation" suggests that we don't need to test internal refactoring)

Lots of the code was borrowed from the Parquet I/O implementation.

@yehoshuadimarsky
Copy link
Contributor Author

yehoshuadimarsky commented Mar 21, 2021

weird, looks like CI only has one error but it's totally unrelated - something about IPython code blocks in the docs, but it's in the Parquet section, nothing to do with this PR.

I can't reproduce the error locally.

(base) root@eda624c2c1ba:/workspaces/pandas-josh/doc# grep -B10 \"^<<<-------------------------------------------------------------------------$\" sphinx.log
(base) root@eda624c2c1ba:/workspaces/pandas-josh/doc# 

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks reasonable. some changes, would be ok with merging this. then if you could refactor pandas/io/sqlpy to make a sub-module e.g.
pandas/io/sql/engine, .....etc (similar in concept to how we do pandas/io/excel).

could also do that as a pre-cursor here.

@jreback jreback added the IO SQL to_sql, read_sql, read_sql_query label Mar 23, 2021
@jreback
Copy link
Contributor

jreback commented Mar 23, 2021

@pandas-dev/pandas-core if any comments on the structure.

@jreback jreback added this to the 1.3 milestone Mar 23, 2021
@yehoshuadimarsky
Copy link
Contributor Author

Made all requested changes except the test, as noted above

@yehoshuadimarsky
Copy link
Contributor Author

Weird CI failure, unrelated to my changes - something to do with inf in MySQL - but I think it should be good to merge now, can you pls take a look? Thanks. @jreback

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of comments, pls merge master.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm ping on green

@yehoshuadimarsky
Copy link
Contributor Author

lgtm ping on green

ok will do, thanks!

@jreback
Copy link
Contributor

jreback commented Apr 30, 2021

IIRC we had an xfail in place for this partcular test. go ahead an dput it back and can be tackled there.

@lithomas1
Copy link
Member

lithomas1 commented Apr 30, 2021

@jreback This is not a flaky test. This is failing because we are not raising the correct error message. See #34431. Xfailed case is also xfailed here I think and is not the problem.
@yehoshuadimarsky We're a little closer now with 2 tests failing instead of 4. :) You need to also catch ProgrammingError: inf can not be used with MySQL. Why? See PyMySQL/mysqlclient#246, where they actually catch the inf case and raise a nice error. That appears in our CI with the latest pymysql, but with old pymysql(e.g. 0.8.1 where they don't raise a nice error), you need to catch original. (OperationalError: (1054, "Unknown column ... )

EDIT: Please disregard second half of comment. Accidentally copy pasted wrong thing.

@yehoshuadimarsky
Copy link
Contributor Author

yehoshuadimarsky commented Apr 30, 2021

@jreback This is not a flaky test. This is failing because we are not raising the correct error message. See #34431. Xfailed case is also xfailed here I think and is not the problem.
@yehoshuadimarsky We're a little closer now with 2 tests failing instead of 4. :) You need to also catch ProgrammingError: inf can not be used with MySQL. Why? See PyMySQL/mysqlclient#246, where they actually catch the inf case and raise a nice error. That appears in our CI with the latest pymysql, but with old pymysql(e.g. 0.8.1 where they don't raise a nice error), you need to catch original. (OperationalError: (1054, "Unknown column ... )

Great. So I catch these 2?

  • OperationalError: (1054, "Unknown column 'inf' in 'field list'")
  • OperationalError: (1054, "Unknown column 'infe0' in 'field list'")

Basically, just add the prefix OperationalError: ?

@lithomas1
Copy link
Member

lithomas1 commented Apr 30, 2021

@yehoshuadimarsky The last one to catch is ProgrammingError: inf can not be used with MySQL. You already caught the other two.
Complete list should be

OperationalError: (1054, "Unknown column 'inf' in 'field list'")
OperationalError: (1054, "Unknown column 'infe0' in 'field list'")
ProgrammingError: (pymysql.err.ProgrammingError) inf can not be used with MySQL

Sorry if I was unclear before. You can also test locally by running the first two cases in the OP in #34431.

pandas.DataFrame({'foo': [numpy.inf]}).to_sql('foobar1', engine) #ProgrammingError: inf can not be used with MySQL
pandas.DataFrame({'foo': [-numpy.inf]}).to_sql('foobar2', engine) # OperationalError: (1054, "Unknown column 'infe0' in 'field list'")

Both cases need to raise ValueError: inf cannot be used with MySQL. (Third is xfailed).
If this doesn't work, I guess just use the original fix.

pandas/pandas/io/sql.py

Lines 1595 to 1599 in 49c7791

try:
table.insert(chunksize, method=method)
except SQLAlchemyError as err:
# GH 34431 36465
raise ValueError("inf cannot be used with MySQL") from err

EDIT: Was actually sqlalchemy.exc.ProgrammingError: (pymysql.err.ProgrammingError) inf can not be used with MySQL and I was wrong here. See my later review/suggestion.

@yehoshuadimarsky
Copy link
Contributor Author

For the life of me can't figure this out: https://stackoverflow.com/questions/67358166

@yehoshuadimarsky
Copy link
Contributor Author

@lithomas1 shoot I still can't get the CI to pass, any suggestions?

@yehoshuadimarsky
Copy link
Contributor Author

@lithomas1 nope still failed CI

Co-authored-by: Thomas Li <[email protected]>
@lithomas1
Copy link
Member

@yehoshuadimarsky Yay, looks like the CI fix worked finally. I think the last thing to do is to evaluate @xhochy 's proposal. I'm not a core team member so I'll leave this to @jreback and @charlesdong1991.

@yehoshuadimarsky
Copy link
Contributor Author

@yehoshuadimarsky Yay, looks like the CI fix worked finally. I think the last thing to do is to evaluate @xhochy 's proposal. I'm not a core team member so I'll leave this to @jreback and @charlesdong1991.

Thanks for all your help!

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @yehoshuadimarsky very nice (comment for followup)

Comment on lines +662 to +669

with cf.config_prefix("io.sql"):
cf.register_option(
"engine",
"auto",
sql_engine_doc,
validator=is_one_of_factory(["auto", "sqlalchemy"]),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could certainly add this (we already use entry points for plotting), but let's do as a followup (pls create an issue)

@jreback jreback merged commit fea6799 into pandas-dev:master May 3, 2021
@jreback
Copy link
Contributor

jreback commented May 3, 2021

thanks @yehoshuadimarsky very nice!

@jreback
Copy link
Contributor

jreback commented May 3, 2021

@yehoshuadimarsky can you do a followup to add a whatsnew note / update io.rst if needed anywhere.

@yehoshuadimarsky
Copy link
Contributor Author

@yehoshuadimarsky can you do a followup to add a whatsnew note / update io.rst if needed anywhere.

Yes, just for this change? There's not much here that changed to the external user, it's just refactoring to support more future engines.

@jreback
Copy link
Contributor

jreback commented May 3, 2021

yeah just a quick one for the addition of the option (even though it doesn't do anything different yet)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants