Failure in groupby-apply if aggregating timedelta and datetime columns #15562

Closed
field-cady opened this issue Mar 4, 2017 · 10 comments · Fixed by #29063
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Comments

@field-cady

Code Sample, a copy-pastable example if possible

import pandas as pd

fname = 'foo.csv'

sample_data = '''clientid,datetime
A,2017-02-01 00:00:00
B,2017-02-01 00:00:00
C,2017-02-01 00:00:00'''
with open(fname, 'w') as f:
    f.write(sample_data)

df = pd.read_csv(fname)
df['datetime'] = pd.to_datetime(df.datetime)
df['time_delta_zero'] = df.datetime - df.datetime
# The next line fails with the error message below
print(df.groupby('clientid').apply(
    lambda ddf: pd.Series(dict(
        clientid_age=ddf.time_delta_zero.min(),
        date=ddf.datetime.min()
    ))
))

Problem description

The current behavior is that the groupby-apply line fails, with the error message indicated below.

Expected Output

  clientid                date clientid_age
0        A 2017-02-01 00:00:00       0 days
1        B 2017-02-01 00:00:00       0 days
2        C 2017-02-01 00:00:00       0 days

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 16.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/groupby.py", line 651, in apply
return self._python_apply_general(f)
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/groupby.py", line 660, in _python_apply_general
not_indexed_same=mutated or self.mutated)
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/groupby.py", line 3343, in _wrap_applied_output
axis=self.axis).unstack()
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 2043, in unstack
return unstack(self, level, fill_value)
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/reshape.py", line 408, in unstack
return unstacker.get_result()
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/reshape.py", line 169, in get_result
return DataFrame(values, index=index, columns=columns)
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 255, in __init__
copy=copy)
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 432, in _init_ndarray
return create_block_manager_from_blocks([values], [columns, index])
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/internals.py", line 3993, in create_block_manager_from_blocks
construction_error(tot_items, blocks[0].shape[1:], axes, e)
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/internals.py", line 3967, in construction_error
if block_shape[0] == 0:
IndexError: tuple index out of range

@jreback
Contributor

jreback commented Mar 4, 2017

What you are doing is completely non-performant and far less readable than this idiomatic version:

In [31]: df.groupby('clientid').min()
Out[31]: 
           datetime  time_delta_zero
clientid                            
A        2017-02-01           0 days
B        2017-02-01           0 days
C        2017-02-01           0 days

I'll mark it as a bug, though; it shouldn't error (as you are giving back a Series; the inference logic is already quite complex).

A PR to fix would be welcome, though.

@jreback jreback added this to the Next Major Release milestone Mar 4, 2017
@field-cady
Author

Oh yes, this is definitely horrible pandas code to write! It's a simplified example from a much more complex script doing things more complicated than min().

I found a work-around where I return a DataFrame from the applied function rather than a Series.
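The work-around can be sketched like this (a minimal reconstruction from the description above, using the df built in the original report; it is not the exact code from the larger script):

```python
import pandas as pd

# Rebuild the frame from the original report.
df = pd.DataFrame({
    'clientid': ['A', 'B', 'C'],
    'datetime': pd.to_datetime(['2017-02-01 00:00:00'] * 3),
})
df['time_delta_zero'] = df.datetime - df.datetime

# Work-around: return a one-row DataFrame from the applied function
# instead of a Series, which avoids the failing Series-unstack path.
result = df.groupby('clientid').apply(
    lambda ddf: pd.DataFrame({
        'clientid_age': [ddf.time_delta_zero.min()],
        'date': [ddf.datetime.min()],
    })
)
print(result)
```

Note the result may carry a MultiIndex of (clientid, row) rather than a flat index, so it can need a reset_index/droplevel afterwards.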

@jreback
Contributor

jreback commented Mar 6, 2017

@field-cady my point is that you can compute .min() or other functions via .apply. Don't compute them all at once; simply do:

result = df.groupby(...).agg(['min', my_really_hard_to_compute_function])

or you can be even more specific via

result1 = df.groupby(..).this_column.agg(....)
result2 = df.groupby(..).that_column.agg(...)
result = pd.concat([result1, result2], axis=1)

which is also quite idiomatic / readable.
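A concrete, runnable version of the concat pattern above, using the sample df from this issue (the column and aggregation choices here are just for illustration):

```python
import pandas as pd

# Sample frame from this issue.
df = pd.DataFrame({
    'clientid': ['A', 'B', 'C'],
    'datetime': pd.to_datetime(['2017-02-01 00:00:00'] * 3),
})
df['time_delta_zero'] = df.datetime - df.datetime

# Aggregate each column separately, then concat the results side by side.
result1 = df.groupby('clientid').datetime.agg('min')
result2 = df.groupby('clientid').time_delta_zero.agg('min')
result = pd.concat([result1, result2], axis=1)
print(result)
```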

@PnPie

PnPie commented Mar 16, 2017

It seems that the error comes from creating a DataFrame from n-dimensional timedelta/datetime values; pandas extracts and flattens the embedded n-dimensional array here:
https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L469-L473

For example, here our initial value is a 2-dimensional array like
[[datetime.timedelta(0), datetime.timedelta(0), datetime.timedelta(0)], [datetime.datetime(2017, 3, 16), datetime.datetime(2017, 3, 16), datetime.datetime(2017, 3, 16)]]
and it is flattened to a 1-dimensional array:
[datetime.timedelta(0), datetime.timedelta(0), datetime.timedelta(0), datetime.datetime(2017, 3, 16), datetime.datetime(2017, 3, 16), datetime.datetime(2017, 3, 16)]
I don't quite understand why we want to extract a datetime-like array here?

@jreback
Contributor

jreback commented Mar 16, 2017

So the purpose is to infer from a bunch of objects whether they are datetimelikes (e.g. datetime, datetime w/tz, or timedeltas). This goes through some logic to see if conversion is possible, then tries to convert, backing away if things are not fully convertible (IOW there are mixed, non-identical types that are not NaNs/NaTs). IOW you will get a datetime64[ns], datetime64[tz-aware], or timedelta64[ns], or back what you started with.

The only thing is, I think I originally made this work regardless of the passed-in shape (see the ravel). This is wrong; it should preserve the shape and return a list of array-likes if ndim > 1, or an array-like if ndim == 1. The array-likes are the converted objects, or the original array if the conversion cannot succeed.

So this would fix this issue (and another one; I'll find the reference). Happy to have you put up a patch!
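The inference described above can be sketched roughly like this (a simplified illustration, not pandas' actual implementation; the function names are made up):

```python
import datetime

import numpy as np
import pandas as pd

def infer_datetimelike_1d(values):
    # Try to convert a 1-D object array to datetime64[ns] or
    # timedelta64[ns]; if the values are mixed / not fully convertible,
    # back away and return what we started with.
    values = np.asarray(values, dtype=object)
    if len(values) and all(isinstance(v, datetime.datetime) for v in values):
        return pd.to_datetime(values).values
    if len(values) and all(isinstance(v, datetime.timedelta) for v in values):
        return pd.to_timedelta(values).values
    return values

def infer_datetimelike(values):
    # Shape-preserving variant: for ndim > 1, convert row by row and
    # return a list of array-likes instead of ravel()-ing everything
    # into one flat array (the bug discussed above).
    arr = np.asarray(values, dtype=object)
    if arr.ndim == 1:
        return infer_datetimelike_1d(arr)
    return [infer_datetimelike_1d(row) for row in arr]
```

With the 2-D example from the previous comment, this would yield one timedelta64[ns] array and one datetime64[ns] array rather than a single flattened object array.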

@jreback
Contributor

jreback commented Mar 16, 2017

@PnPie this is the same cause as this: #13287

@mroeschke
Member

Looks to work on master. Could use a test:

In [41]: df.groupby('clientid').apply(
    ...:     lambda ddf: pd.Series(dict(
    ...:         clientid_age = ddf.time_delta_zero.min(),
    ...:         date = ddf.datetime.min()
    ...:     ))
    ...: )
Out[41]:
         clientid_age       date
clientid
A              0 days 2017-02-01
B              0 days 2017-02-01
C              0 days 2017-02-01

In [42]: pd.__version__
Out[42]: '0.26.0.dev0+533.gd8f9be7e3'

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Difficulty Intermediate Groupby Timedelta Timedelta data type Datetime Datetime data dtype labels Oct 11, 2019
@jhereth
Contributor

jhereth commented Oct 16, 2019

Works with the released version (0.25.1), too. Can be closed.

@mroeschke
Member

Care to contribute a regression test @qudade?

@jhereth
Contributor

jhereth commented Oct 17, 2019

@mroeschke Sure, will do!
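A regression test along these lines might look like the following (a hedged sketch of what such a test could be; the test name and the exact expected frame are assumptions, not necessarily the merged test):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def test_apply_aggregating_timedelta_and_datetime():
    # Regression test for GH 15562
    df = pd.DataFrame({
        'clientid': ['A', 'B', 'C'],
        'datetime': [pd.Timestamp('2017-02-01 00:00:00')] * 3,
    })
    df['time_delta_zero'] = df.datetime - df.datetime
    result = df.groupby('clientid').apply(
        lambda ddf: pd.Series(dict(
            clientid_age=ddf.time_delta_zero.min(),
            date=ddf.datetime.min(),
        ))
    )
    expected = pd.DataFrame(
        {
            'clientid_age': [pd.Timedelta(0)] * 3,
            'date': [pd.Timestamp('2017-02-01 00:00:00')] * 3,
        },
        index=pd.Index(['A', 'B', 'C'], name='clientid'),
    )
    assert_frame_equal(result, expected)
```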

@jreback jreback added this to the 1.0 milestone Oct 18, 2019

5 participants