PERF: Why does specifying the index column in pandas significantly increase the read time of a csv? #44158
Comments
I'm not a dev on this project, but I did notice that there's a recent regression fix to the speed of |
Yep, very likely. I think we just need a specific asv benchmark for this in order to close it. |
I think this is an issue not just with pandas but with how pandas interacts with the OS. I am not seeing this performance issue on macOS or on the Google Colab instance that I set up to test this. The performance issue is present on Ubuntu 20.04 and 18.04. Checked on both pandas 1.3.4 and 1.3.3. |
#44192 doesn't fix this issue. This is on a smaller CSV with just 2 columns, tested on WSL (Ubuntu 20.04):
In [11]: %timeit df = pd.read_csv("~/df_sp.csv")
789 ms ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [12]: %timeit df = pd.read_csv("~/df_sp.csv", index_col='id')
4.13 s ± 24.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [13]: pd.__version__
Out[13]: '1.4.0.dev0+1006.g7bf75b0e28' |
It appears that the fix and the regression reported in #44106 are independent of this performance issue, and that this one is neither a regression nor changed by #44192. Will remove the milestone.
Further investigation and contributions welcome. |
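For anyone picking this up, here is a minimal sketch of how the two code paths from the timings above could be compared, laid out in the asv style (a setup method plus time_* methods). The class name, helper function names, row count, and synthetic data are assumptions made for illustration. The original CSV is not attached to this issue, so the real file may behave differently, and the slowdown may not reproduce on every OS.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def read_with_index_col(path):
    # Path reported as slow: let read_csv build the index itself.
    return pd.read_csv(path, index_col="id")


def read_then_set_index(path):
    # Workaround from the issue: read everything, then set the index.
    return pd.read_csv(path).set_index("id")


class ReadCSVIndexCol:
    """Hypothetical asv-style benchmark; asv times the time_* methods."""

    n_rows = 100_000  # assumption: the real file size is not given

    def setup(self):
        # Synthetic stand-in for the reporter's 2-column file (assumption):
        # a string "id" column plus one numeric column.
        df = pd.DataFrame(
            {
                "id": [f"id_{i}" for i in range(self.n_rows)],
                "value": np.arange(self.n_rows, dtype="float64"),
            }
        )
        self.tmpdir = tempfile.TemporaryDirectory()
        self.path = os.path.join(self.tmpdir.name, "df_sp.csv")
        df.to_csv(self.path, index=False)

    def teardown(self):
        self.tmpdir.cleanup()

    def time_read_csv_index_col(self):
        read_with_index_col(self.path)

    def time_read_csv_then_set_index(self):
        read_then_set_index(self.path)


if __name__ == "__main__":
    # Manual run without asv: time each path once and print the result.
    bench = ReadCSVIndexCol()
    bench.setup()
    try:
        for name in ("time_read_csv_index_col", "time_read_csv_then_set_index"):
            start = time.perf_counter()
            getattr(bench, name)()
            print(f"{name}: {time.perf_counter() - start:.3f} s")
    finally:
        bench.teardown()
```

Both helpers should return the same frame, so any timing gap between them comes from how the index is built rather than from different output.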
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the master branch of pandas.
Reproducible Example
I am seeing a significantly increased read time for a CSV by pandas when I specify the index_col. I do not understand the reason behind it. Can you help me understand why that is happening and if that is actually the expected behaviour? Below is the code I am using:
In fact, I am seeing significant improvement if I read the dataset without specifying index_col and then set the index by dfpd = dfpd.set_index('id'). This takes just 1.6 more seconds. Why does pandas not default to always reading the dataframe with index_col as a column and then setting it as the index internally with set_index(index_col) when index_col is specified?
Installed Versions
INSTALLED VERSIONS
commit : 73c6825
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-89-generic
Version : #100-Ubuntu SMP Fri Sep 24 14:50:10 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_IN
LOCALE : en_IN.ISO8859-1
pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.25.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.09.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : 1.4.23
tables : None
tabulate : 0.8.9
xarray : 0.19.0
xlrd : 2.0.1
xlwt : None
numba : 0.53.1
Prior Performance
No response