PERF: Why does specifying the index column in pandas significantly increases the read time of a csv? #44158

nitinmnsn · 2021-10-23T21:26:39Z

I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

I am seeing a significantly increased read time for a CSV by pandas when I specify the index_col. I do not understand the reason behind it. Can you help me understand why that is happening and if that is actually the expected behaviour? Below is the code I am using:

import pandas as pd
#save the CSV to be used
pd.DataFrame({'id':np.arange(100000000),'b':np.random.choice(['a','b','c','d'],size=(100000000,),p=[0.25,0.25,0.25,0.25])}).to_csv('df_sp.csv',index=None)
dfpd = pd.read_csv('df_sp.csv')
#read time - 10.3 seconds
dfpd = pd.read_csv('df_sp.csv',index_col='id')
#read time - 1 minute 38.6 seconds

In fact, I am seeing significant improvement if I read the dataset without specifying index_col and then set the index by dfpd = dfpd.set_index('id'). This takes just 1.6 more seconds. Why does pandas not default to always reading the dataframe with index_col as a column and then setting it as the index internally with set_index(index_col) when index_col is specified?

Installed Versions

INSTALLED VERSIONS

commit : 73c6825
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-89-generic
Version : #100-Ubuntu SMP Fri Sep 24 14:50:10 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_IN
LOCALE : en_IN.ISO8859-1

pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.25.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.09.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : 1.4.23
tables : None
tabulate : 0.8.9
xarray : 0.19.0
xlrd : 2.0.1
xlwt : None
numba : 0.53.1

Prior Performance

No response

The text was updated successfully, but these errors were encountered:

JonathanHourany · 2021-10-27T05:04:54Z

I'm not a dev on this project, but I did notice that there's a recent regression fix to the speed of read_csv (see: #44192) that just got merged in. It might fix this problem too.

jreback · 2021-10-27T09:46:59Z

yep very likely

i think we just need a specific asv benchmark for this to close

nitinmnsn · 2021-10-28T06:21:53Z

I think this is an issue of not just pandas but how pandas collaborate with OS. I am not seeing this performance issue on mac os and on google colab that I set up to test this. This performance issue is there on ubuntu 20.04 and 18.04. Checked on both pandas 1.3.4 and 1.3.3

https://stackoverflow.com/questions/69692122/why-does-specifying-the-index-column-in-pandas-significantly-increases-the-read

asishm · 2021-10-29T23:58:10Z

#44192 doesn't fix this issue. this is on a smaller csv with just 2 columns

tested on WSL (Ubuntu 20.04)

In [11]: %timeit df = pd.read_csv("~/df_sp.csv")
789 ms ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: %timeit df = pd.read_csv("~/df_sp.csv", index_col='id')
4.13 s ± 24.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]: pd.__version__
Out[13]: '1.4.0.dev0+1006.g7bf75b0e28'

simonjayhawkins · 2021-11-10T11:25:27Z

#44192 doesn't fix this issue.

It appears that fix and the regression reported in #44106 are independant of this performance issue and that this one is not a regression or improved (changed) by the change in #44192. Will remove the milestone.

%timeit dfpd = pd.read_csv('df_sp.csv')
# 10.5 s ± 265 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- master
# 11.8 s ± 597 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- 1.3.x (1.3.4+10.gc803b7eaac)
# 11.3 s ± 457 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- 1.3.4
# 12.4 s ± 351 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- 1.2.5
# 11.9 s ± 386 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- 1.1.5

%timeit dfpd = pd.read_csv('df_sp.csv',index_col='id')
# 57.7 s ± 3.06 s per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- master
# 1min 2s ± 3.55 s per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- 1.3.x (1.3.4+10.gc803b7eaac)
# 1min ± 1.87 s per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- 1.3.4
# 57.8 s ± 762 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- 1.2.5
# 1min ± 1.8 s per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- 1.1.5

Further investigation and contributions welcome.

nitinmnsn added Needs Triage Issue that has not been reviewed by a pandas team member Performance Memory or execution speed performance labels Oct 23, 2021

jreback added IO CSV read_csv, to_csv and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 27, 2021

jreback added this to the 1.3.5 milestone Oct 27, 2021

simonjayhawkins modified the milestones: 1.3.5, Contributions Welcome Nov 10, 2021

phofl mentioned this issue Nov 24, 2021

PERF: Improve performance in read_csv with numeric index col #44610

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.3.5, 1.4 Nov 25, 2021

jreback closed this as completed in #44610 Nov 25, 2021

This was referenced Mar 2, 2024

Potential performance Issue: Slow read_csv() Function with pandas 1.3.3 jcw024/lichess_database_ETL#3

Open

Potential performance Issue: Slow read_csv() Function with pandas 1.3.3 Dadmatech/DadmaTools#70

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: Why does specifying the index column in pandas significantly increases the read time of a csv? #44158

PERF: Why does specifying the index column in pandas significantly increases the read time of a csv? #44158

nitinmnsn commented Oct 23, 2021

INSTALLED VERSIONS

JonathanHourany commented Oct 27, 2021

jreback commented Oct 27, 2021

nitinmnsn commented Oct 28, 2021

asishm commented Oct 29, 2021 •

edited

Loading

simonjayhawkins commented Nov 10, 2021

PERF: Why does specifying the index column in pandas significantly increases the read time of a csv? #44158

PERF: Why does specifying the index column in pandas significantly increases the read time of a csv? #44158

Comments

nitinmnsn commented Oct 23, 2021

Reproducible Example

Installed Versions

INSTALLED VERSIONS

Prior Performance

JonathanHourany commented Oct 27, 2021

jreback commented Oct 27, 2021

nitinmnsn commented Oct 28, 2021

asishm commented Oct 29, 2021 • edited Loading

simonjayhawkins commented Nov 10, 2021

asishm commented Oct 29, 2021 •

edited

Loading