PERF: 1.4.0rc1 Execution time of pd.read_sql increased from seconds to minutes #45260

SimoneD89 · 2022-01-08T10:49:22Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

import urllib.parse
from sqlalchemy import create_engine
import pandas as pd

driver = "{ODBC Driver 17 for SQL Server}"
url, database, uid, pwd = ...

params = urllib.parse.quote(
    "DRIVER=" + driver + ";"
    "SERVER=" + url + ";"
    "DATABASE=" + database + ";"
    "UID=" + uid + ";"
    "PWD=" + pwd
)

engine = create_engine(
    "mssql+pyodbc:///?odbc_connect=%s" % params
)

conn = engine.connect()

df = pd.read_sql("SELECT * FROM Table", conn)

After upgrading to 1.4.0rc, I noticed that the execution time of one of my scripts increased from a few seconds to 2-3 minutes. The performance regression comes from a simple pd.read_sql call. After downgrading back to 1.3.5, the execution time is again on the order of seconds.

I'm sorry that I cannot provide more details; I'm not experienced in debugging at this level.

Installed Versions

INSTALLED VERSIONS

commit : d023ba7
python : 3.10.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.12-200.fc35.x86_64
Version : #1 SMP Wed Dec 29 15:03:38 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : it_IT.UTF-8
LOCALE : it_IT.UTF-8

pandas : 1.4.0rc0
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.1
pip : 21.2.3
setuptools : 57.4.0
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.1.2
blosc : None
feather : None
xlsxwriter : 3.0.2
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.26.0
pandas_datareader: 0.10.0
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.5.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pyxlsb : 1.0.9
s3fs : None
scipy : 1.7.3
sqlalchemy : 1.4.29
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : None
zstandard : None

Prior Performance

pandas : 1.3.5


mroeschke · 2022-01-13T01:11:14Z

Thanks for the report.

Unfortunately there's not much reproducibility that would help the triage team. It would help if the following information could be provided:

Is this behavior similar in SQLite? Is there a small toy data example that the triage team could emulate locally?
What was the prior timing performance in 1.3.5?

SimoneD89 · 2022-01-13T06:12:57Z

Thank you for your reply. I knew that reproducibility would be a problem, and one of the reasons is that I don't have full access to the database. I will try to set up a similar SQLite database tonight or tomorrow.

I can provide some more information:

Timing (only pandas changes)
pandas 1.3.5: 3.47 seconds
pandas 1.4.0rc: 169.84 seconds
I observed the performance issue only in this case (and not for other queries to other databases). Therefore, it should be related to this specific database in combination with some change between 1.3.5 and 1.4.0rc.
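For anyone who wants to collect comparable timings, here is a minimal sketch using only the standard library. The helper name time_call and the in-memory SQLite query are assumptions for illustration; in the real case the timed call would be pd.read_sql(...) against the actual connection. The last lines only redo the arithmetic on the two timings reported above.

```python
import sqlite3
import time


def time_call(func, *args, **kwargs):
    """Hypothetical helper: run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in for the real database; the issue itself uses SQL Server via pyodbc.
conn = sqlite3.connect(":memory:")
rows, elapsed = time_call(lambda: conn.execute("SELECT 1").fetchall())
print(f"query took {elapsed:.6f} seconds")

# Slowdown factor computed from the timings reported in this comment.
slowdown = 169.84 / 3.47
print(f"1.4.0rc is about {slowdown:.0f}x slower than 1.3.5 for this query")
```

Running the same script unchanged in both environments keeps everything except the pandas version constant.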

jorisvandenbossche · 2022-01-14T12:32:06Z

@SimoneD89 if providing a reproducible example is difficult in this case, you could maybe also try to profile it on your side and pass that information.

If you use IPython, you can use the %prun magic. If you could run %prun pd.read_sql("SELECT * FROM Table", conn) in both environments, and post the results here, that could already be useful to diagnose the problem.

https://github.com/jiffyclub/snakeviz provides a graphical way to present those results, and if you install that, you can replace %prun with %snakeviz in the above (after running %load_ext snakeviz).

If you don't use IPython, (I think) you can also achieve the same with cProfile.run("pd.read_sql('SELECT * FROM Table', conn)"), or by putting the full code snippet in a script. See https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script
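As a rough sketch of the cProfile route outside IPython: the helper profile_call and the stand-in run_query below are hypothetical names, and an in-memory SQLite table replaces the real database, so this only shows the mechanics of profiling one call and sorting the report by cumulative time.

```python
import cProfile
import io
import pstats
import sqlite3


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def run_query(conn, sql):
    # Stand-in for pd.read_sql(sql, conn) in the real script.
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER, value TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)", [(i, str(i)) for i in range(100)]
)

rows, report = profile_call(run_query, conn, "SELECT * FROM demo")
print(report)
```

Sorting by cumulative time tends to surface the expensive call path first, which is usually what matters when a single high-level call became slow.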

SimoneD89 · 2022-01-14T13:54:00Z

Thank you for your help in trying to debug the problem. I executed the magic %prun and below you have the results.

The executed command was
%prun pd.read_sql("SELECT TOP 100 * FROM Table", conn)

Results
pandas 1.4.0rc1: https://pastebin.com/8UR9cruc
pandas 1.3.5: https://pastebin.com/wwNqSt4b

jorisvandenbossche · 2022-01-14T14:08:14Z

Thanks! One more question (I should have thought to ask it directly): could you do the same but with %prun -D read_sql_pd14.prof pd.read_sql(..?
That will save a file to disk, which has more information (and is easier to interpret with other tools) than the text output (which is already a kind of summary). If you could upload those files here, that would be great.

jorisvandenbossche · 2022-01-14T14:13:22Z

One thing that directly stands out is that in the pandas 1.4 version, the pyodbc.Cursor.execute method gets called > 1000 times, while in the pandas 1.3 version, it only gets called twice.
The engine.execute(..) in the actual pandas code only gets called once in both versions, though.

Are you sure that, e.g., the sqlalchemy version is the same in both cases?
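A call-count comparison like the one above can also be read directly from a saved profile file. The sketch below is an assumption-laden illustration: call_counts is a hypothetical helper, and the profile is generated locally from a SQLite cursor rather than taken from the files in this issue.

```python
import cProfile
import os
import pstats
import sqlite3
import tempfile


def call_counts(prof_path, name_fragment):
    """Map profiled function names containing name_fragment to total call counts."""
    stats = pstats.Stats(prof_path)
    counts = {}
    # Keys are (filename, lineno, funcname); values start with
    # (primitive calls, total calls, total time, cumulative time, callers).
    for (_, _, funcname), (_, ncalls, _, _, _) in stats.stats.items():
        if name_fragment in funcname:
            counts[funcname] = counts.get(funcname, 0) + ncalls
    return counts


# Build a small demo profile: 25 cursor.execute calls on an in-memory database.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()


def many_executes():
    for _ in range(25):
        cursor.execute("SELECT 1")


prof_path = os.path.join(tempfile.mkdtemp(), "demo.prof")
profiler = cProfile.Profile()
profiler.runcall(many_executes)
profiler.dump_stats(prof_path)

counts = call_counts(prof_path, "execute")
print(counts)
```

Pointing call_counts at a file written by %prun -D and searching for "execute" should show whether the number of driver-level execute calls differs between two environments.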

SimoneD89 · 2022-01-14T14:25:15Z

I switch from one version to the other by downgrading/upgrading the package through pip.

pip install pandas==1.3.5 --user
pip install --pre --upgrade --user pandas

pip list (and pd.show_versions()) report SQLAlchemy 1.4.29 in both cases.

I attached the two profiles.

read_sql_pd.zip
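The version check can also be scripted so that the output of both environments is easy to diff. This is a small sketch with a hypothetical helper name (installed_versions); it relies only on importlib.metadata from the standard library and reports None for packages that are not installed.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names relevant to this report.
print(installed_versions(["pandas", "SQLAlchemy", "pyodbc"]))
```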

jorisvandenbossche · 2022-01-14T14:51:19Z

OK, that gives some more insight: in the pandas 1.4 version, the time is taken entirely by the "table reflection" when initializing the underlying class (and not by the actual SQL query). I should also have seen that in the text summary output.

So the time is all taken by this init:

pandas/pandas/io/sql.py

Lines 1376 to 1381 in d023ba7

    def __init__(self, engine, schema: str | None = None):
        from sqlalchemy.schema import MetaData
        self.connectable = engine
        self.meta = MetaData(schema=schema)
        self.meta.reflect(bind=engine)

This code was touched by #43116, changing

        self.meta = MetaData(self.connectable, schema=schema)

to

        self.meta = MetaData(schema=schema)
        self.meta.reflect(bind=engine)

cc @fangchenli do you remember why the reflect(..) was needed?
The reflect method will load all available table definitions from the database, which can be expensive (as illustrated by this report), and I think it is generally not needed (e.g. when only executing a SQL query).
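To make the cost concrete, here is a rough stand-in that does not use SQLAlchemy at all: it emulates "reflect everything" on an in-memory SQLite database by issuing one metadata query per table, and counts the statements the driver sees. The function name count_statements and the table layout are assumptions for illustration; the real reflection logic and its queries differ by dialect.

```python
import sqlite3


def count_statements(n_tables, reflect_all):
    """Count SQL statements run for one query, with or without reflecting all tables."""
    conn = sqlite3.connect(":memory:")
    for i in range(n_tables):
        conn.execute(f"CREATE TABLE t{i} (id INTEGER, value TEXT)")

    executed = []
    # Record every statement executed from this point on.
    conn.set_trace_callback(executed.append)

    if reflect_all:
        # Emulated reflection: list all tables, then inspect each one.
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        ]
        for name in names:
            conn.execute(f"PRAGMA table_info({name})").fetchall()

    # The query the user actually asked for.
    conn.execute("SELECT * FROM t0").fetchall()

    conn.set_trace_callback(None)
    conn.close()
    return len(executed)


print("without reflection:", count_statements(500, reflect_all=False))
print("with reflection:   ", count_statements(500, reflect_all=True))
```

The statement count grows with the number of tables in the database rather than with the query, which is consistent with execute being called more than 1000 times in the 1.4 profile.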

fangchenli · 2022-01-14T17:11:40Z

The old usage of MetaData will be removed in sqlalchemy 2.0. See https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#implicit-and-connectionless-execution-bound-metadata-removed for details.

Instead of reflecting all tables in init, #45371 delays the reflection step to the get_table method.
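The general idea of deferring reflection can be sketched as follows. This is not the code from #45371: LazyTableCache and the injected reflect_table callable are hypothetical names, used only to show the pattern of doing no metadata work in __init__ and looking up a single table on first request.

```python
class LazyTableCache:
    """Hypothetical illustration: fetch table metadata on demand, not in __init__."""

    def __init__(self, reflect_table):
        # reflect_table is an assumed callable: table name -> table metadata.
        self._reflect_table = reflect_table
        self._tables = {}

    def get_table(self, name):
        # Reflect only the requested table, and only the first time it is needed.
        if name not in self._tables:
            self._tables[name] = self._reflect_table(name)
        return self._tables[name]


reflected = []


def fake_reflect(name):
    # Stand-in for a real metadata lookup against the database.
    reflected.append(name)
    return {"name": name, "columns": ["id", "value"]}


cache = LazyTableCache(fake_reflect)  # no reflection happens here
cache.get_table("orders")
cache.get_table("orders")  # served from the cache, no second lookup
print(reflected)
```

With this shape, a plain SQL query that never asks for a table object pays no reflection cost at all.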


SimoneD89 added the Needs Triage and Performance labels Jan 8, 2022

simonjayhawkins added this to the 1.4 milestone Jan 8, 2022

mroeschke added the Needs Info label and removed the Needs Triage label Jan 13, 2022

jorisvandenbossche added the Blocker label and removed the Needs Info label Jan 14, 2022

fangchenli added a commit to fangchenli/pandas that referenced this issue Jan 14, 2022

PERF: avoid SQL MetaData reflection in init pandas-dev#45260

7731267

fangchenli mentioned this issue Jan 14, 2022

PERF: avoid SQL MetaData reflection in init #45260 #45371

Merged


jreback closed this as completed in #45371 Jan 16, 2022

jreback pushed a commit that referenced this issue Jan 16, 2022

PERF: avoid SQL MetaData reflection in init #45260 (#45371)

a659f1d

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this issue Jan 16, 2022

Backport PR pandas-dev#45371: PERF: avoid SQL MetaData reflection in init pandas-dev#45260

0e21cee

meeseeksmachine mentioned this issue Jan 16, 2022

Backport PR #45371 on branch 1.4.x (PERF: avoid SQL MetaData reflection in init #45260) #45395

Merged

jreback pushed a commit that referenced this issue Jan 16, 2022

Backport PR #45371: PERF: avoid SQL MetaData reflection in init #45260 (#45395) Co-authored-by: Fangchen Li

e04b37c
