PERF: 1.4.0rc1 Execution time of pd.read_sql increased from seconds to minutes #45260
Comments
Thanks for the report. Unfortunately there's not much reproducibility that would help the triage team. It would help if the following information could be provided:
Thank you for your reply. I knew that reproducibility would be a problem; one of the reasons is that I don't have full access to the database. I will try to build a similar sqlite database that reproduces the issue tonight or tomorrow. I can provide some more information:
@SimoneD89 if providing a reproducible example is difficult in this case, you could also try to profile it on your side and share that information. If you use IPython, you can use the
https://github.com/jiffyclub/snakeviz provides a graphical way to present those results, and if you install that, you can replace
If you don't use IPython, (I think) you can also achieve the same with
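For profiling outside IPython, one option is the standard library's `cProfile` module. The following is a minimal sketch, not the exact command suggested in this thread: `profile_call` is a hypothetical helper name, and the optional dump file is only there so that the result can be opened in a viewer such as snakeviz.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, sort_by="cumulative", top=15, dump_path=None, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report_text). `profile_call` is a hypothetical helper,
    not part of pandas or IPython.
    """
    profiler = cProfile.Profile()
    # runcall executes the callable with profiling enabled and returns its result.
    result = profiler.runcall(func, *args, **kwargs)

    if dump_path is not None:
        # Binary stats file that profile viewers can load.
        profiler.dump_stats(dump_path)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the slowest call chains are listed first.
    stats.sort_stats(sort_by).print_stats(top)
    return result, buffer.getvalue()


# Assumed usage for the case discussed here (query and engine are placeholders):
#   df, report = profile_call(pd.read_sql, query, engine, dump_path="read_sql.prof")
#   print(report)
```

Running the same call under both pandas versions and comparing the top entries of the two reports should show which function accounts for the extra time.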
Thank you for helping to debug the problem. I ran the %prun magic; the results are below. The executed command was
Results
Thanks! One more question (I should have thought to ask it directly): could you do the same but with
One thing that immediately stands out is that in the pandas 1.4 version, the
Are you sure that, e.g., the sqlalchemy version is the same in both cases?
I switch from one version to the other by downgrading/upgrading the package through pip.
I attached the two profiles.
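One way to confirm that only pandas changed between the two runs is to record the installed versions of the relevant packages in each environment. A small sketch using the standard library; `installed_versions` is a hypothetical helper name, and the package list is only an example:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


# Example: run once after installing each pandas version and compare the output.
print(installed_versions(["pandas", "sqlalchemy", "numpy"]))
```

If the sqlalchemy entry is identical in both outputs, the difference in timing can be attributed to the pandas upgrade alone.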
OK, that gives some more insight: in the pandas 1.4 version, the time is taken entirely by the "table reflection" when initializing the underlying class (and not by the actual SQL query). I should have seen that in the text summary output already. So all the time is spent in this init:
Lines 1376 to 1381 in d023ba7
This code was touched by #43116, changing
to
cc @fangchenli do you remember why the
The old usage of MetaData will be removed in sqlalchemy 2.0. See https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#implicit-and-connectionless-execution-bound-metadata-removed for details. Instead of reflecting all tables in init, #45371 delays the reflection step to
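To illustrate why reflecting every table at initialization can dominate the runtime on a database with many tables, here is a conceptual sketch using only the standard library's `sqlite3`. It is not the pandas or SQLAlchemy implementation: `EagerSchema`, `LazySchema`, and `make_demo_db` are hypothetical names, and the `lookups` counter stands in for the cost of one table reflection.

```python
import sqlite3


def table_names(conn):
    """List the user tables in a SQLite database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return [row[0] for row in rows]


def describe_table(conn, name):
    """Return [(column_name, declared_type), ...] for one table."""
    quoted = '"' + name.replace('"', '""') + '"'
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({quoted})")]


class EagerSchema:
    """Inspects every table up front; cost grows with the number of tables."""

    def __init__(self, conn):
        self.lookups = 0
        self.tables = {}
        for name in table_names(conn):
            self.tables[name] = describe_table(conn, name)
            self.lookups += 1

    def columns(self, name):
        return self.tables[name]


class LazySchema:
    """Inspects a table only when it is first requested, then caches it."""

    def __init__(self, conn):
        self.conn = conn
        self.lookups = 0
        self.tables = {}

    def columns(self, name):
        if name not in self.tables:
            self.tables[name] = describe_table(self.conn, name)
            self.lookups += 1
        return self.tables[name]


def make_demo_db(n_tables=50):
    """Build an in-memory database with n_tables small tables (demo data only)."""
    conn = sqlite3.connect(":memory:")
    for i in range(n_tables):
        conn.execute(f"CREATE TABLE t{i} (id INTEGER PRIMARY KEY, value TEXT)")
    return conn
```

With `conn = make_demo_db(50)`, constructing `EagerSchema(conn)` performs 50 lookups before any query runs, while `LazySchema(conn)` performs none until `columns("t3")` is called and then only one. Against a remote database with hundreds of tables, the eager pattern would account for the kind of slowdown reported here.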
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the master branch of pandas.
Reproducible Example
After upgrading to 1.4.0rc, I noticed that the execution time of one of my scripts increased from a few seconds to 2-3 minutes. The performance regression comes from a simple pd.read_sql call. After downgrading to 1.3.5, the execution time is back to the order of seconds.
I'm sorry that I cannot provide more details; I'm not experienced in debugging at this level.
Installed Versions
INSTALLED VERSIONS
commit : d023ba7
python : 3.10.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.12-200.fc35.x86_64
Version : #1 SMP Wed Dec 29 15:03:38 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : it_IT.UTF-8
LOCALE : it_IT.UTF-8
pandas : 1.4.0rc0
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.1
pip : 21.2.3
setuptools : 57.4.0
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.1.2
blosc : None
feather : None
xlsxwriter : 3.0.2
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.26.0
pandas_datareader: 0.10.0
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.5.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pyxlsb : 1.0.9
s3fs : None
scipy : 1.7.3
sqlalchemy : 1.4.29
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : None
zstandard : None
Prior Performance
pandas : 1.3.5