DataFrame.groupby.sum() is extremely slow when dtype is timedelta64[ns] compared to int64. #20660
Comments
Thanks for the report. cc @WillAyd in case you have any pointers on where someone could get started here.
Hmm, well after a quick profile it appears that the timedelta data does not leverage the Cython function. The problem starts at `pandas/pandas/core/groupby/groupby.py` line 3938 (commit fa231e8).
For … perhaps the easiest fix is to add … to `pandas/pandas/core/groupby/groupby.py` line 1431 (commit fa231e8).
Running this locally gets the examples back in line with one another:

In [12]: %timeit td.groupby(lambda x: x).sum()
8.53 ms ± 30.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [13]: %timeit i.groupby(lambda x: x).sum()
9.66 ms ± 30.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It's not entirely clear to me what the overall purpose of … is. FWIW we don't have any ASVs for timedelta, so this would be a good opportunity to add some. @wezzman, are you interested in trying a PR for this?
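On the ASV point raised above: a sketch of what a timedelta groupby benchmark could look like (the class and method names here are assumptions, loosely following the style of classes in pandas' `asv_bench/benchmarks`, not actual pandas code):

```python
import numpy as np
import pandas as pd


class GroupBySumDtypes:
    """Compare Series.groupby(...).sum() for int64 vs timedelta64[ns]."""

    def setup(self):
        n = 10_000  # assumed size
        self.keys = np.arange(n) % 100
        self.int_ser = pd.Series(np.arange(n, dtype="int64"))
        self.td_ser = pd.Series(pd.to_timedelta(np.arange(n), unit="s"))

    def time_int64_sum(self):
        # Baseline: hits the Cython groupby aggregation path.
        self.int_ser.groupby(self.keys).sum()

    def time_timedelta_sum(self):
        # The case this issue reports as slow on pandas 0.22.
        self.td_ser.groupby(self.keys).sum()
```

ASV would report the two `time_*` methods side by side, making a regression between the dtypes visible in CI.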
timedelta is not a numeric type (e.g. float, int), so this is correct. We could certainly add more benchmarks, though.
#18053 is different; it's about …
Understood on the numeric type check, but you don't have any objection to relaxing the … ?
@WillAyd, I am unsure what you mean by a PR.
PR="Pull Request"
http://pandas-docs.github.io/pandas-docs-travis/contributing.html
I know this is a fairly old thread, but I thought I'd mention that this is a problem for dtype … as well.
@kaijfox, if you are up for it, PRs are always welcome. I believe the resolution was highlighted in some comments above, so if you can piece it together with benchmarks we would certainly take a look!
The performance looks comparable now. Could use some ASVs.
take |
Steps to demonstrate
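The original reproduction snippet did not survive extraction; below is a minimal sketch consistent with the timings quoted in the comments (`td` and `i` match the variable names in the `%timeit` calls, but the series length is an assumption):

```python
import numpy as np
import pandas as pd

n = 10_000  # assumed size; the original length is unknown
i = pd.Series(np.arange(n, dtype="int64"))
td = pd.Series(pd.to_timedelta(np.arange(n), unit="s"))

# Same grouping as in the quoted %timeit runs: one group per index label.
int_sum = i.groupby(lambda x: x).sum()  # fast Cython aggregation path
td_sum = td.groupby(lambda x: x).sum()  # much slower on pandas 0.22
```

On 0.22 the timedelta call fell back to a non-Cython path, which is the slowdown this issue reports.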
Problem description
When performing a summation on grouped 'timedelta64[ns]' data, there is a significant performance decrease compared to the same data interpreted as 'int64'.
Possibly related to #18053
Expected Behavior
It is my understanding that internally `timedelta64[ns]` values are just `int64` values interpreted as a count of nanoseconds. Shouldn't the summation performance be comparable in that case?
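That premise can be checked directly: a `timedelta64[ns]` array is backed by the same 8-byte integers, so reinterpreting the buffer costs nothing (a small numpy-only sketch):

```python
import numpy as np

td = np.array([1, 2, 3], dtype="timedelta64[ns]")
ints = td.view("int64")  # reinterpret the same buffer; no copy is made

# The underlying nanosecond counts are plain int64 values,
# and both dtypes occupy 8 bytes per element.
print(ints)
print(td.dtype.itemsize, ints.dtype.itemsize)
```

This is why equal (or at least comparable) groupby-sum performance is a reasonable expectation: the slow path was a dispatch issue, not a data-representation one.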
Output of pd.show_versions()

INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None
pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.1
setuptools: 39.0.1
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None