Performance degradation when dumping large dataframe with np.datetime to csv #25708
Labels
Datetime
Datetime data dtype
IO CSV
read_csv, to_csv
Performance
Memory or execution speed performance
Milestone
Large DataFrame with dates is slower in 0.24.0
Problem description
Using pandas < 0.24.0 dumping a large dataset (10M rows, 40 columns) which contains text, dates and floats/ints takes about 12 minutes but using version 0.24.0 the same code takes about 2hours and 40 minutes.
We understand the code base to create the csv has change a bit. I didn't notice the speed degradation when the columns are not dates. I've not checked with datetime objects.
We've tried to condense the problem to the few lines above. Below the results I get from %prun in each version.
%prun
for pandas0.24.0
:cProfile of my script with 0.24.0
(I took the liberty of removing paths/username information)
%prun
for pandas0.23.4
:cProfile of my script with 0.23.4
What is Expected
No degradation in speed.
Output of
pd.show_versions()
for 0.24.0INSTALLED VERSIONS
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.72-68.55.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.0
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
Output of
pd.show_versions()
for 0.23.4INSTALLED VERSIONS
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.72-68.55.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 3.7.1
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.1
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
The text was updated successfully, but these errors were encountered: