Skip to content

PERF: 1.4.0rc1 Execution time of pd.read_sql increased from seconds to minutes #45260

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
SimoneD89 opened this issue Jan 8, 2022 · 9 comments · Fixed by #45371
Closed
2 of 3 tasks

PERF: 1.4.0rc1 Execution time of pd.read_sql increased from seconds to minutes #45260

SimoneD89 opened this issue Jan 8, 2022 · 9 comments · Fixed by #45371
Labels
Blocker Blocking issue or pull request for an upcoming release IO SQL to_sql, read_sql, read_sql_query Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@SimoneD89
Copy link

SimoneD89 commented Jan 8, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

import urllib
from sqlalchemy import create_engine
import pandas as pd

driver = "{ODBC Driver 17 for SQL Server}"
url, database, uid, pwd = ...

params = urllib.parse.quote(
    "DRIVER=" + driver + ";"
    "SERVER=" + url + ";"
    "DATABASE=" + database + ";"
    "UID=" + uid + ";"
    "PWD=" + pwd
)

engine = create_engine(
    "mssql+pyodbc:///?odbc_connect=%s" % params
)

conn = engine.connect()

df = pd.read_sql("SELECT * FROM Table", conn)

After upgrading to 1.4.0rc, I noticed that the execution time of one of my script increased from a few seconds to 2-3 minutes. The performance regression comes from a simple pd.read_sql. Downgrading back to 1.3.5 the execution time is again of the order of seconds.

I'm sorry if I cannot provide more details, I'm not an expert in debugging at this level of expertise.

Installed Versions

INSTALLED VERSIONS

commit : d023ba7
python : 3.10.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.12-200.fc35.x86_64
Version : #1 SMP Wed Dec 29 15:03:38 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : it_IT.UTF-8
LOCALE : it_IT.UTF-8

pandas : 1.4.0rc0
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.1
pip : 21.2.3
setuptools : 57.4.0
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.1.2
blosc : None
feather : None
xlsxwriter : 3.0.2
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.26.0
pandas_datareader: 0.10.0
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.5.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pyxlsb : 1.0.9
s3fs : None
scipy : 1.7.3
sqlalchemy : 1.4.29
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : None
zstandard : None

Prior Performance

pandas : 1.3.5

@SimoneD89 SimoneD89 added Needs Triage Issue that has not been reviewed by a pandas team member Performance Memory or execution speed performance labels Jan 8, 2022
@simonjayhawkins simonjayhawkins added IO SQL to_sql, read_sql, read_sql_query Regression Functionality that used to work in a prior pandas version Needs Triage Issue that has not been reviewed by a pandas team member and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 8, 2022
@simonjayhawkins simonjayhawkins added this to the 1.4 milestone Jan 8, 2022
@mroeschke
Copy link
Member

Thanks for the report.

Unfortunately there's not much reproducibility that would help the triage team. It would help if the following information could be provided:

  • Is this behavior similar in sqllite? Is there a small toy data example that the triage team could emulate locally?
  • What was the prior timing performance in 1.3.5?

@mroeschke mroeschke added Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 13, 2022
@SimoneD89
Copy link
Author

SimoneD89 commented Jan 13, 2022

Thank you for your reply. I knew that there was a problem of reproducibility and one of the reason is that I don't have full access to the database. I will try to reproduce a similar sqlite database tonight/tomorrow.

I can provide some more information:

  • Timing (only pandas changes)
    pandas 1.3.5: 3.47 seconds
    pandas 1.4.0rc: 169.84 seconds
  • I observed the performance issue only in this case (and not for other queries to other databases). Therefore it should be related to the specific database and connected to some change between 1.3.5 and 1.4.0rc

@jorisvandenbossche
Copy link
Member

@SimoneD89 if providing a reproducible example is difficult in this case, you could maybe also try to profile it on your side and pass that information.

If you use IPython, you can use the %prun magic. If you could run %prun pd.read_sql("SELECT * FROM Table", conn) in both environments, and post the results here, that could already be useful to diagnose the problem.

https://github.com/jiffyclub/snakeviz provides a graphical way to present those results, and if you install that, you can replace %prun with %snakeviz in the above (after running %load_ext snakeviz).

If you don't use IPython, (I think) you can also achieve the same with cProfile.run("pd.read_sql('SELECT * FROM Table'"). Or put the full code snippet in a script. See https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script

@SimoneD89
Copy link
Author

Thank you for your help in trying to debug the problem. I executed the magic %prun and below you have the results.

The executed command was
%prun pd.read_sql("SELECT TOP 100 * FROM Table", conn)

Results
pandas 1.4.0rc1: https://pastebin.com/8UR9cruc
pandas 1.3.5: https://pastebin.com/wwNqSt4b

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 14, 2022

Thanks! One more question (I should have thought to ask it directly): could you do the same but with %prun -D read_sql_pd14.prof pd.read_sql(..?
That will save a file to disk, which has more information (or more easily to interpret with other tools) than the text output (which is already kind of a summary), and if you could upload those files here, that would be great.

@jorisvandenbossche
Copy link
Member

One thing that directly stands out is that in the pandas 1.4 version, the pyodbc.Cursor.execute method gets called > 1000 times, while in the pandas 1.3 version, it only gets called twice.
The engine.execute(..) in the actual pandas code only gets called once in both versions, though.

You are sure that eg the sqlalchemy version is the same in both cases?

@SimoneD89
Copy link
Author

I switch from one version to the other by downgrading/upgrading the package through pip.

pip install pandas==1.3.5 --user
pip install --pre --upgrade --user pandas

pip list (and pd.show_versions()) are always giving SQLAlchemy 1.4.29 in both cases.

I attached the two profiles.

read_sql_pd.zip

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 14, 2022

OK, that gives some more insight: in the pandas 1.4 version, the time is completely taken by the "table reflection" when initializing the underlying class (and not the actual sql query). I should actually also have seen that in the text summary output ..

So the time is all taken by this init:

pandas/pandas/io/sql.py

Lines 1376 to 1381 in d023ba7

def __init__(self, engine, schema: str | None = None):
from sqlalchemy.schema import MetaData
self.connectable = engine
self.meta = MetaData(schema=schema)
self.meta.reflect(bind=engine)

This code was touched by #43116, changing

        self.meta = MetaData(self.connectable, schema=schema)

to

        self.meta = MetaData(schema=schema)
        self.meta.reflect(bind=engine)

cc @fangchenli do you remember why the reflect(..) was needed?
The reflect method will load all available table definitions from the database, which can be expensive (as illustrated by this report), and is also not needed generally I think (eg when only executing a sql query).

@jorisvandenbossche jorisvandenbossche added Blocker Blocking issue or pull request for an upcoming release and removed Needs Info Clarification about behavior needed to assess issue labels Jan 14, 2022
fangchenli added a commit to fangchenli/pandas that referenced this issue Jan 14, 2022
@fangchenli
Copy link
Member

The old usage of MetaData will be removed in sqlalchemy 2.0. See https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#implicit-and-connectionless-execution-bound-metadata-removed for detail.

Instead of reflecting all tables in init, #45371 delays the reflection step to get_table method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Blocking issue or pull request for an upcoming release IO SQL to_sql, read_sql, read_sql_query Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants