Skip to content

PERF: Why does specifying the index column in pandas significantly increases the read time of a csv? #44158

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
nitinmnsn opened this issue Oct 23, 2021 · 5 comments · Fixed by #44610
Closed
3 tasks done
Labels
IO CSV read_csv, to_csv Performance Memory or execution speed performance
Milestone

Comments

@nitinmnsn
Copy link

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

I am seeing a significantly increased read time for a CSV by pandas when I specify the index_col. I do not understand the reason behind it. Can you help me understand why that is happening and if that is actually the expected behaviour? Below is the code I am using:

import pandas as pd
#save the CSV to be used
pd.DataFrame({'id':np.arange(100000000),'b':np.random.choice(['a','b','c','d'],size=(100000000,),p=[0.25,0.25,0.25,0.25])}).to_csv('df_sp.csv',index=None)
dfpd = pd.read_csv('df_sp.csv')
#read time - 10.3 seconds
dfpd = pd.read_csv('df_sp.csv',index_col='id')
#read time - 1 minute 38.6 seconds

In fact, I am seeing significant improvement if I read the dataset without specifying index_col and then set the index by dfpd = dfpd.set_index('id'). This takes just 1.6 more seconds. Why does pandas not default to always reading the dataframe with index_col as a column and then setting it as the index internally with set_index(index_col) when index_col is specified?

Installed Versions

INSTALLED VERSIONS

commit : 73c6825
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-89-generic
Version : #100-Ubuntu SMP Fri Sep 24 14:50:10 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_IN
LOCALE : en_IN.ISO8859-1

pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.25.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.09.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : 1.4.23
tables : None
tabulate : 0.8.9
xarray : 0.19.0
xlrd : 2.0.1
xlwt : None
numba : 0.53.1

Prior Performance

No response

@nitinmnsn nitinmnsn added Needs Triage Issue that has not been reviewed by a pandas team member Performance Memory or execution speed performance labels Oct 23, 2021
@JonathanHourany
Copy link

I'm not a dev on this project, but I did notice that there's a recent regression fix to the speed of read_csv (see: #44192) that just got merged in. It might fix this problem too.

@jreback jreback added IO CSV read_csv, to_csv and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 27, 2021
@jreback jreback added this to the 1.3.5 milestone Oct 27, 2021
@jreback
Copy link
Contributor

jreback commented Oct 27, 2021

yep very likely

i think we just need a specific asv benchmark for this to close

@nitinmnsn
Copy link
Author

I think this is an issue of not just pandas but how pandas collaborate with OS. I am not seeing this performance issue on mac os and on google colab that I set up to test this. This performance issue is there on ubuntu 20.04 and 18.04. Checked on both pandas 1.3.4 and 1.3.3

https://stackoverflow.com/questions/69692122/why-does-specifying-the-index-column-in-pandas-significantly-increases-the-read

@asishm
Copy link
Contributor

asishm commented Oct 29, 2021

#44192 doesn't fix this issue. this is on a smaller csv with just 2 columns

tested on WSL (Ubuntu 20.04)

In [11]: %timeit df = pd.read_csv("~/df_sp.csv")
789 ms ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: %timeit df = pd.read_csv("~/df_sp.csv", index_col='id')
4.13 s ± 24.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]: pd.__version__
Out[13]: '1.4.0.dev0+1006.g7bf75b0e28'

@simonjayhawkins
Copy link
Member

#44192 doesn't fix this issue.

It appears that fix and the regression reported in #44106 are independant of this performance issue and that this one is not a regression or improved (changed) by the change in #44192. Will remove the milestone.

%timeit dfpd = pd.read_csv('df_sp.csv')
# 10.5 s ± 265 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- master
# 11.8 s ± 597 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- 1.3.x (1.3.4+10.gc803b7eaac)
# 11.3 s ± 457 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- 1.3.4
# 12.4 s ± 351 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- 1.2.5
# 11.9 s ± 386 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- 1.1.5
%timeit dfpd = pd.read_csv('df_sp.csv',index_col='id')
# 57.7 s ± 3.06 s per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- master
# 1min 2s ± 3.55 s per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- 1.3.x (1.3.4+10.gc803b7eaac)
# 1min ± 1.87 s per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- 1.3.4
# 57.8 s ± 762 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- 1.2.5
# 1min ± 1.8 s per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- 1.1.5

Further investigation and contributions welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants