Performance degradation when dumping large dataframe with np.datetime to csv #25708

alep · 2019-03-13T14:39:44Z

Large DataFrame with dates is slower in 0.24.0

import pandas as pd
import numpy as np

d = '2018-11-29'
dt = '2018-11-26 11:18:27.0'
N = 10000000
df = pd.DataFrame({'dt': [np.datetime64(dt)] * N, 'd': [np.datetime64(d)] * N, 'r': [np.random.uniform()] * N})

Problem description

Using pandas < 0.24.0 dumping a large dataset (10M rows, 40 columns) which contains text, dates and floats/ints takes about 12 minutes but using version 0.24.0 the same code takes about 2hours and 40 minutes.

We understand the code base to create the csv has change a bit. I didn't notice the speed degradation when the columns are not dates. I've not checked with datetime objects.

We've tried to condense the problem to the few lines above. Below the results I get from %prun in each version.

`%prun` for pandas `0.24.0`:

        121342 function calls (120733 primitive calls) in 153.774 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  153.774  153.774 {built-in method builtins.exec}
        1    0.000    0.000  153.774  153.774 <string>:1(<module>)
        1    0.000    0.000  153.774  153.774 generic.py:2873(to_csv)
        1    0.000    0.000  153.773  153.773 csvs.py:130(save)
        1    0.002    0.002  153.684  153.684 csvs.py:272(_save)
      301    0.278    0.001  153.681    0.511 csvs.py:290(_save_chunk)
      301    0.009    0.000  119.472    0.397 blocks.py:2200(to_native_types)
      301    0.002    0.000  101.820    0.338 format.py:1300(_get_format_datetime64_from_values)
      301   96.018    0.319  101.818    0.338 format.py:1249(_is_dates_only)
      301   20.040    0.067   20.040    0.067 {pandas._libs.writers.write_csv_rows}
      301   17.525    0.058   17.580    0.058 {pandas._libs.tslib.format_array_from_datetime}
      602   13.766    0.023   13.766    0.023 {method 'astype' of 'numpy.ndarray' objects}
      301    0.005    0.000    8.095    0.027 blocks.py:1982(to_native_types)
     1204    5.798    0.005    5.798    0.005 {method 'reduce' of 'numpy.ufunc' objects}
      302    0.003    0.000    5.796    0.019 base.py:1023(to_native_types)
      302    0.005    0.000    5.777    0.019 base.py:1049(_format_native_types)
      301    0.002    0.000    5.749    0.019 {method 'sum' of 'numpy.ndarray' objects}
      301    0.001    0.000    5.747    0.019 _methods.py:34(_sum)
      603    0.001    0.000    0.089    0.000 missing.py:25(isna)
      603    0.003    0.000    0.088    0.000 missing.py:105(_isna_new)
      [snip]

cProfile of my script with 0.24.0

(I took the liberty of removing paths/username information)

stats:         3434284 function calls (3423766 primitive calls) in 8642.079 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000 8642.079 8642.079 ../lib/python3.7/site-packages/pandas/core/generic.py:2873(to_csv)
        1    0.000    0.000 8642.077 8642.077 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:130(save)
        1    0.042    0.042 8642.002 8642.002../lib/python3.7/site-packages/pandas/io/formats/csvs.py:272(_save)
     5242    3.588    0.001 8641.950    1.649../lib/python3.7/site-packages/pandas/io/formats/csvs.py:290(_save_chunk)
     5242    0.165    0.000 8294.922    1.582 ../lib/python3.7/site-packages/pandas/core/internals/blocks.py:2200(to_native_types)
     5242    0.044    0.000 8203.096    1.565 ../lib/python3.7/site-packages/pandas/io/formats/format.py:1300(_get_format_datetime64_from_values)
     5242 7757.167    1.480 8203.052    1.565 ../lib/python3.7/site-packages/pandas/io/formats/format.py:1249(_is_dates_only)
    20968  445.184    0.021  445.184    0.021 {method 'reduce' of 'numpy.ufunc' objects}
     5242    0.041    0.000  444.994    0.085 {method 'sum' of 'numpy.ndarray' objects}
     5242    0.023    0.000  444.953    0.085 ../lib/python3.7/site-packages/numpy/core/_methods.py:34(_sum)
     5242  212.677    0.041  212.677    0.041 {pandas._libs.writers.write_csv_rows}
    15726  122.216    0.008  122.216    0.008 {method 'astype' of 'numpy.ndarray' objects}
     5242   90.999    0.017   91.300    0.017 {pandas._libs.tslib.format_array_from_datetime}
     5242    0.570    0.000   79.782    0.015 ../lib/python3.7/site-packages/pandas/core/internals/blocks.py:1982(to_native_types)
    10484    0.200    0.000   43.947    0.004 ../lib/python3.7/site-packages/pandas/core/internals/blocks.py:730(to_native_types)
     5243    0.031    0.000    7.008    0.001 ../lib/python3.7/site-packages/pandas/core/indexes/base.py:1023(to_native_types)
    ...the rest was removed...

`%prun` for pandas `0.23.4`:

        71818 function calls (71517 primitive calls) in 52.371 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   52.371   52.371 {built-in method builtins.exec}
        1    0.000    0.000   52.371   52.371 <string>:1(<module>)
        1    0.000    0.000   52.371   52.371 frame.py:1653(to_csv)
        1    0.000    0.000   52.370   52.370 csvs.py:123(save)
        1    0.002    0.002   51.546   51.546 csvs.py:270(_save)
      301    0.245    0.001   51.543    0.171 csvs.py:288(_save_chunk)
      301   19.846    0.066   19.846    0.066 {pandas._libs.writers.write_csv_rows}
      301    0.007    0.000   18.031    0.060 internals.py:2743(to_native_types)
      301   17.654    0.059   17.698    0.059 {pandas._libs.tslib.format_array_from_datetime}
      602   13.309    0.022   13.309    0.022 {method 'astype' of 'numpy.ndarray' objects}
      301    0.006    0.000    7.480    0.025 internals.py:2052(to_native_types)
      302    0.002    0.000    5.940    0.020 base.py:2408(to_native_types)
      302    0.004    0.000    5.925    0.020 base.py:2434(_format_native_types)
        2    0.823    0.411    0.823    0.411 {method 'close' of '_io.TextIOWrapper' objects}
      301    0.001    0.000    0.293    0.001 format.py:1291(_get_format_datetime64_from_values)
      301    0.249    0.001    0.292    0.001 format.py:1243(_is_dates_only)
      603    0.001    0.000    0.081    0.000 missing.py:32(isna)
      603    0.003    0.000    0.080    0.000 missing.py:112(_isna_new)
      603    0.033    0.000    0.072    0.000 missing.py:189(_isna_ndarraylike)
     1204    0.067    0.000    0.067    0.000 {method 'reduce' of 'numpy.ufunc' objects}
      903    0.001    0.000    0.044    0.000 _methods.py:42(_any)
      302    0.026    0.000    0.026    0.000 {method 'ravel' of 'numpy.ndarray' objects}
      301    0.001    0.000    0.024    0.000 {method 'sum' of 'numpy.ndarray' objects}

cProfile of my script with 0.23.4

stats:         2170825 function calls (2165580 primitive calls) in 442.783 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  442.783  442.783 ../lib/python3.7/site-packages/pandas/core/frame.py:1653(to_csv)
        1    0.000    0.000  442.782  442.782 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:123(save)
        1    0.041    0.041  442.781  442.781 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:270(_save)
     5242    2.815    0.001  442.730    0.084 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:288(_save_chunk)
     5242  216.105    0.041  216.105    0.041 {pandas._libs.writers.write_csv_rows}
    15726  125.188    0.008  125.188    0.008 {method 'astype' of 'numpy.ndarray' objects}
     5242    0.121    0.000   90.872    0.017 ../lib/python3.7/site-packages/pandas/core/internals.py:2743(to_native_types)
     5242   88.329    0.017   88.590    0.017 {pandas._libs.tslib.format_array_from_datetime}
     5242    0.606    0.000   79.149    0.015 ../lib/python3.7/site-packages/pandas/core/internals.py:2052(to_native_types)
    10484    0.188    0.000   46.683    0.004 ../lib/python3.7/site-packages/pandas/core/internals.py:755(to_native_types)
     5243    0.026    0.000    7.077    0.001 ../lib/python3.7/site-packages/pandas/core/indexes/base.py:2408(to_native_types)
     5243    0.039    0.000    6.497    0.001 ../lib/python3.7/site-packages/pandas/core/indexes/base.py:2434(_format_native_types)

What is Expected

No degradation in speed.

Output of `pd.show_versions()` for 0.24.0

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.72-68.55.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Output of `pd.show_versions()` for 0.23.4

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.72-68.55.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.7.1
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.1
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

alep · 2019-03-13T14:41:03Z

I'm sorry but I cannot share the data used.

chris-b1 · 2019-03-13T15:37:04Z

So looking at your profile, it seems like _is_dates_only is the issue.

I don't think that changed in 0.24 and the slowness of that function doesn't reproduce for me (on Windows), so must be something deeper going on. Could you show a prun of just that function?

from pandas.io.formats.format import _is_dates_only
%prun _is_dates_only(df['d'])

alep · 2019-03-13T15:44:46Z

@chris-b1 Hi,

there doesn't seem to be difference in _is_dates_only

0.24.0

         312 function calls (301 primitive calls) in 0.176 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.159    0.159    0.168    0.168 format.py:1249(_is_dates_only)

0.23.4

         161 function calls (159 primitive calls) in 0.184 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.172    0.172    0.181    0.181 format.py:1243(_is_dates_only)

I think it has to do with the chunking, when there's no chunking the speed doesn't vary.

alep · 2019-03-13T15:45:46Z

Also I've observed this in OS X and Linux

hksonngan · 2019-03-15T17:51:35Z

SnakeViz of cProfile in Windows

0.23.4
master

alep changed the title ~~Performance degradation when dumping large dataframe with datetimes to csv~~ Performance degradation when dumping large dataframe with np.datetime to csv Mar 13, 2019

chris-b1 added Performance Memory or execution speed performance IO CSV read_csv, to_csv Datetime Datetime data dtype labels Mar 13, 2019

This was referenced Mar 18, 2019

25708 performance degradation when dumping large dataframe with datetime to csv #25764

Closed

PERF: fixed performance degration becasue doesn't slice values array in to_native_types of DateTimeBlock #25765

Merged

jreback added this to the 0.25.0 milestone Apr 15, 2019

jreback closed this as completed in #25765 Apr 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance degradation when dumping large dataframe with np.datetime to csv #25708

Performance degradation when dumping large dataframe with np.datetime to csv #25708

alep commented Mar 13, 2019

INSTALLED VERSIONS

INSTALLED VERSIONS

alep commented Mar 13, 2019

chris-b1 commented Mar 13, 2019

alep commented Mar 13, 2019

alep commented Mar 13, 2019

hksonngan commented Mar 15, 2019 •

edited

Loading

Performance degradation when dumping large dataframe with np.datetime to csv #25708

Performance degradation when dumping large dataframe with np.datetime to csv #25708

Comments

alep commented Mar 13, 2019

Large DataFrame with dates is slower in 0.24.0

Problem description

%prun for pandas 0.24.0:

cProfile of my script with 0.24.0

%prun for pandas 0.23.4:

cProfile of my script with 0.23.4

What is Expected

Output of pd.show_versions() for 0.24.0

INSTALLED VERSIONS

Output of pd.show_versions() for 0.23.4

INSTALLED VERSIONS

alep commented Mar 13, 2019

chris-b1 commented Mar 13, 2019

alep commented Mar 13, 2019

0.24.0

0.23.4

alep commented Mar 13, 2019

hksonngan commented Mar 15, 2019 • edited Loading

`%prun` for pandas `0.24.0`:

`%prun` for pandas `0.23.4`:

Output of `pd.show_versions()` for 0.24.0

Output of `pd.show_versions()` for 0.23.4

hksonngan commented Mar 15, 2019 •

edited

Loading