Skip to content

Performance degradation when dumping large dataframe with np.datetime to csv #25708

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
alep opened this issue Mar 13, 2019 · 5 comments · Fixed by #25765
Closed

Performance degradation when dumping large dataframe with np.datetime to csv #25708

alep opened this issue Mar 13, 2019 · 5 comments · Fixed by #25765
Labels
Datetime Datetime data dtype IO CSV read_csv, to_csv Performance Memory or execution speed performance
Milestone

Comments

@alep
Copy link

alep commented Mar 13, 2019

Large DataFrame with dates is slower in 0.24.0

import pandas as pd
import numpy as np

d = '2018-11-29'
dt = '2018-11-26 11:18:27.0'
N = 10000000
df = pd.DataFrame({'dt': [np.datetime64(dt)] * N, 'd': [np.datetime64(d)] * N, 'r': [np.random.uniform()] * N})

Problem description

Using pandas < 0.24.0 dumping a large dataset (10M rows, 40 columns) which contains text, dates and floats/ints takes about 12 minutes but using version 0.24.0 the same code takes about 2hours and 40 minutes.

We understand the code base to create the csv has change a bit. I didn't notice the speed degradation when the columns are not dates. I've not checked with datetime objects.

We've tried to condense the problem to the few lines above. Below the results I get from %prun in each version.

%prun for pandas 0.24.0:

        121342 function calls (120733 primitive calls) in 153.774 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  153.774  153.774 {built-in method builtins.exec}
        1    0.000    0.000  153.774  153.774 <string>:1(<module>)
        1    0.000    0.000  153.774  153.774 generic.py:2873(to_csv)
        1    0.000    0.000  153.773  153.773 csvs.py:130(save)
        1    0.002    0.002  153.684  153.684 csvs.py:272(_save)
      301    0.278    0.001  153.681    0.511 csvs.py:290(_save_chunk)
      301    0.009    0.000  119.472    0.397 blocks.py:2200(to_native_types)
      301    0.002    0.000  101.820    0.338 format.py:1300(_get_format_datetime64_from_values)
      301   96.018    0.319  101.818    0.338 format.py:1249(_is_dates_only)
      301   20.040    0.067   20.040    0.067 {pandas._libs.writers.write_csv_rows}
      301   17.525    0.058   17.580    0.058 {pandas._libs.tslib.format_array_from_datetime}
      602   13.766    0.023   13.766    0.023 {method 'astype' of 'numpy.ndarray' objects}
      301    0.005    0.000    8.095    0.027 blocks.py:1982(to_native_types)
     1204    5.798    0.005    5.798    0.005 {method 'reduce' of 'numpy.ufunc' objects}
      302    0.003    0.000    5.796    0.019 base.py:1023(to_native_types)
      302    0.005    0.000    5.777    0.019 base.py:1049(_format_native_types)
      301    0.002    0.000    5.749    0.019 {method 'sum' of 'numpy.ndarray' objects}
      301    0.001    0.000    5.747    0.019 _methods.py:34(_sum)
      603    0.001    0.000    0.089    0.000 missing.py:25(isna)
      603    0.003    0.000    0.088    0.000 missing.py:105(_isna_new)
      [snip]

cProfile of my script with 0.24.0

(I took the liberty of removing paths/username information)

stats:         3434284 function calls (3423766 primitive calls) in 8642.079 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000 8642.079 8642.079 ../lib/python3.7/site-packages/pandas/core/generic.py:2873(to_csv)
        1    0.000    0.000 8642.077 8642.077 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:130(save)
        1    0.042    0.042 8642.002 8642.002../lib/python3.7/site-packages/pandas/io/formats/csvs.py:272(_save)
     5242    3.588    0.001 8641.950    1.649../lib/python3.7/site-packages/pandas/io/formats/csvs.py:290(_save_chunk)
     5242    0.165    0.000 8294.922    1.582 ../lib/python3.7/site-packages/pandas/core/internals/blocks.py:2200(to_native_types)
     5242    0.044    0.000 8203.096    1.565 ../lib/python3.7/site-packages/pandas/io/formats/format.py:1300(_get_format_datetime64_from_values)
     5242 7757.167    1.480 8203.052    1.565 ../lib/python3.7/site-packages/pandas/io/formats/format.py:1249(_is_dates_only)
    20968  445.184    0.021  445.184    0.021 {method 'reduce' of 'numpy.ufunc' objects}
     5242    0.041    0.000  444.994    0.085 {method 'sum' of 'numpy.ndarray' objects}
     5242    0.023    0.000  444.953    0.085 ../lib/python3.7/site-packages/numpy/core/_methods.py:34(_sum)
     5242  212.677    0.041  212.677    0.041 {pandas._libs.writers.write_csv_rows}
    15726  122.216    0.008  122.216    0.008 {method 'astype' of 'numpy.ndarray' objects}
     5242   90.999    0.017   91.300    0.017 {pandas._libs.tslib.format_array_from_datetime}
     5242    0.570    0.000   79.782    0.015 ../lib/python3.7/site-packages/pandas/core/internals/blocks.py:1982(to_native_types)
    10484    0.200    0.000   43.947    0.004 ../lib/python3.7/site-packages/pandas/core/internals/blocks.py:730(to_native_types)
     5243    0.031    0.000    7.008    0.001 ../lib/python3.7/site-packages/pandas/core/indexes/base.py:1023(to_native_types)
    ...the rest was removed...

%prun for pandas 0.23.4:

        71818 function calls (71517 primitive calls) in 52.371 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   52.371   52.371 {built-in method builtins.exec}
        1    0.000    0.000   52.371   52.371 <string>:1(<module>)
        1    0.000    0.000   52.371   52.371 frame.py:1653(to_csv)
        1    0.000    0.000   52.370   52.370 csvs.py:123(save)
        1    0.002    0.002   51.546   51.546 csvs.py:270(_save)
      301    0.245    0.001   51.543    0.171 csvs.py:288(_save_chunk)
      301   19.846    0.066   19.846    0.066 {pandas._libs.writers.write_csv_rows}
      301    0.007    0.000   18.031    0.060 internals.py:2743(to_native_types)
      301   17.654    0.059   17.698    0.059 {pandas._libs.tslib.format_array_from_datetime}
      602   13.309    0.022   13.309    0.022 {method 'astype' of 'numpy.ndarray' objects}
      301    0.006    0.000    7.480    0.025 internals.py:2052(to_native_types)
      302    0.002    0.000    5.940    0.020 base.py:2408(to_native_types)
      302    0.004    0.000    5.925    0.020 base.py:2434(_format_native_types)
        2    0.823    0.411    0.823    0.411 {method 'close' of '_io.TextIOWrapper' objects}
      301    0.001    0.000    0.293    0.001 format.py:1291(_get_format_datetime64_from_values)
      301    0.249    0.001    0.292    0.001 format.py:1243(_is_dates_only)
      603    0.001    0.000    0.081    0.000 missing.py:32(isna)
      603    0.003    0.000    0.080    0.000 missing.py:112(_isna_new)
      603    0.033    0.000    0.072    0.000 missing.py:189(_isna_ndarraylike)
     1204    0.067    0.000    0.067    0.000 {method 'reduce' of 'numpy.ufunc' objects}
      903    0.001    0.000    0.044    0.000 _methods.py:42(_any)
      302    0.026    0.000    0.026    0.000 {method 'ravel' of 'numpy.ndarray' objects}
      301    0.001    0.000    0.024    0.000 {method 'sum' of 'numpy.ndarray' objects}

cProfile of my script with 0.23.4

stats:         2170825 function calls (2165580 primitive calls) in 442.783 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  442.783  442.783 ../lib/python3.7/site-packages/pandas/core/frame.py:1653(to_csv)
        1    0.000    0.000  442.782  442.782 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:123(save)
        1    0.041    0.041  442.781  442.781 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:270(_save)
     5242    2.815    0.001  442.730    0.084 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:288(_save_chunk)
     5242  216.105    0.041  216.105    0.041 {pandas._libs.writers.write_csv_rows}
    15726  125.188    0.008  125.188    0.008 {method 'astype' of 'numpy.ndarray' objects}
     5242    0.121    0.000   90.872    0.017 ../lib/python3.7/site-packages/pandas/core/internals.py:2743(to_native_types)
     5242   88.329    0.017   88.590    0.017 {pandas._libs.tslib.format_array_from_datetime}
     5242    0.606    0.000   79.149    0.015 ../lib/python3.7/site-packages/pandas/core/internals.py:2052(to_native_types)
    10484    0.188    0.000   46.683    0.004 ../lib/python3.7/site-packages/pandas/core/internals.py:755(to_native_types)
     5243    0.026    0.000    7.077    0.001 ../lib/python3.7/site-packages/pandas/core/indexes/base.py:2408(to_native_types)
     5243    0.039    0.000    6.497    0.001 ../lib/python3.7/site-packages/pandas/core/indexes/base.py:2434(_format_native_types)

What is Expected

No degradation in speed.

Output of pd.show_versions() for 0.24.0

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.72-68.55.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Output of pd.show_versions() for 0.23.4

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.72-68.55.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.7.1
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.1
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@alep alep changed the title Performance degradation when dumping large dataframe with datetimes to csv Performance degradation when dumping large dataframe with np.datetime to csv Mar 13, 2019
@alep
Copy link
Author

alep commented Mar 13, 2019

I'm sorry but I cannot share the data used.

@chris-b1
Copy link
Contributor

So looking at your profile, it seems like _is_dates_only is the issue.

I don't think that changed in 0.24 and the slowness of that function doesn't reproduce for me (on Windows), so must be something deeper going on. Could you show a prun of just that function?

from pandas.io.formats.format import _is_dates_only
%prun _is_dates_only(df['d'])

@chris-b1 chris-b1 added Performance Memory or execution speed performance IO CSV read_csv, to_csv Datetime Datetime data dtype labels Mar 13, 2019
@alep
Copy link
Author

alep commented Mar 13, 2019

@chris-b1 Hi,

there doesn't seem to be difference in _is_dates_only

0.24.0

         312 function calls (301 primitive calls) in 0.176 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.159    0.159    0.168    0.168 format.py:1249(_is_dates_only)

0.23.4

         161 function calls (159 primitive calls) in 0.184 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.172    0.172    0.181    0.181 format.py:1243(_is_dates_only)

I think it has to do with the chunking, when there's no chunking the speed doesn't vary.

@alep
Copy link
Author

alep commented Mar 13, 2019

Also I've observed this in OS X and Linux

@hksonngan
Copy link
Contributor

hksonngan commented Mar 15, 2019

SnakeViz of cProfile in Windows

  • 0.23.4
    Pandas 0.23.4
  • master
    Pandas master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype IO CSV read_csv, to_csv Performance Memory or execution speed performance
Projects
None yet
4 participants