Failure in groupby-apply if aggregating timedelta and datetime columns #15562

Closed
field-cady opened this issue Mar 4, 2017 · 10 comments · Fixed by #29063
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Comments

@field-cady

Code Sample, a copy-pastable example if possible

import pandas as pd

fname = 'foo.csv'

sample_data = '''clientid,datetime
A,2017-02-01 00:00:00
B,2017-02-01 00:00:00
C,2017-02-01 00:00:00'''
with open(fname, 'w') as f:
    f.write(sample_data)

df = pd.read_csv(fname)
df['datetime'] = pd.to_datetime(df.datetime)
df['time_delta_zero'] = df.datetime - df.datetime
# The next line fails with the error message below
print(df.groupby('clientid').apply(
    lambda ddf: pd.Series(dict(
        clientid_age=ddf.time_delta_zero.min(),
        date=ddf.datetime.min()
    ))
))

Problem description

The current behavior is that the groupby-apply line fails, with the error message indicated below.

Expected Output

  clientid                date clientid_age
0        A 2017-02-01 00:00:00       0 days
1        B 2017-02-01 00:00:00       0 days
2        C 2017-02-01 00:00:00       0 days

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 16.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/groupby.py", line 651, in apply
return self._python_apply_general(f)
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/groupby.py", line 660, in _python_apply_general
not_indexed_same=mutated or self.mutated)
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/groupby.py", line 3343, in _wrap_applied_output
axis=self.axis).unstack()
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 2043, in unstack
return unstack(self, level, fill_value)
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/reshape.py", line 408, in unstack
return unstacker.get_result()
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/reshape.py", line 169, in get_result
return DataFrame(values, index=index, columns=columns)
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 255, in __init__
copy=copy)
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 432, in _init_ndarray
return create_block_manager_from_blocks([values], [columns, index])
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/internals.py", line 3993, in create_block_manager_from_blocks
construction_error(tot_items, blocks[0].shape[1:], axes, e)
File "/Users/fieldc/anaconda/lib/python2.7/site-packages/pandas/core/internals.py", line 3967, in construction_error
if block_shape[0] == 0:
IndexError: tuple index out of range

@jreback
Contributor

jreback commented Mar 4, 2017

What you are doing is completely non-performant and far less readable than this idiomatic version:

In [31]: df.groupby('clientid').min()
Out[31]: 
           datetime  time_delta_zero
clientid                            
A        2017-02-01           0 days
B        2017-02-01           0 days
C        2017-02-01           0 days

I'll mark it as a bug, though; it shouldn't error (as you are giving back a Series; the inference logic is already quite complex).

A PR to fix would be welcome, though.

@jreback jreback added this to the Next Major Release milestone Mar 4, 2017
@field-cady
Author

Oh yes, this is definitely horrible pandas code to write! It's a simplified example from a much more complex script doing things more complicated than min().

I found a work-around where I return a DataFrame from the applied function rather than a Series.
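The work-around can be sketched like this (a minimal reconstruction from the description above, using the df built in the original report; it is not the exact code from the larger script):

```python
import pandas as pd

# Rebuild the frame from the original report.
df = pd.DataFrame({
    'clientid': ['A', 'B', 'C'],
    'datetime': pd.to_datetime(['2017-02-01 00:00:00'] * 3),
})
df['time_delta_zero'] = df.datetime - df.datetime

# Work-around: return a one-row DataFrame from the applied function
# instead of a Series, which avoids the failing Series-unstack path.
result = df.groupby('clientid').apply(
    lambda ddf: pd.DataFrame({
        'clientid_age': [ddf.time_delta_zero.min()],
        'date': [ddf.datetime.min()],
    })
)
print(result)
```

Note the result may carry a MultiIndex of (clientid, row) rather than a flat index, so it can need a reset_index/droplevel afterwards.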

@jreback
Contributor

jreback commented Mar 6, 2017

@field-cady my point is that you can compute .min() or other functions via .apply. Don't compute them all at once; simply do:

result = df.groupby(...).agg(['min', my_really_hard_to_compute_function])

or you can be even more specific via

result1 = df.groupby(..).this_column.agg(....)
result2 = df.groupby(..).that_column.agg(...)
result = pd.concat([result1, result2], axis=1)

which is also quite idiomatic / readable.
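A concrete, runnable version of the concat pattern above, using the sample df from this issue (the column and aggregation choices here are just for illustration):

```python
import pandas as pd

# Sample frame from this issue.
df = pd.DataFrame({
    'clientid': ['A', 'B', 'C'],
    'datetime': pd.to_datetime(['2017-02-01 00:00:00'] * 3),
})
df['time_delta_zero'] = df.datetime - df.datetime

# Aggregate each column separately, then concat the results side by side.
result1 = df.groupby('clientid').datetime.agg('min')
result2 = df.groupby('clientid').time_delta_zero.agg('min')
result = pd.concat([result1, result2], axis=1)
print(result)
```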

@PnPie

PnPie commented Mar 16, 2017

It seems that the error comes from creating a DataFrame from n-dimensional timedelta/datetime values; pandas extracts and flattens the embedded n-dimensional array here:
https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L469-L473

For example, here our initial value is a 2-dimensional array like
[[datetime.timedelta(0), datetime.timedelta(0), datetime.timedelta(0)], [datetime.datetime(2017, 3, 16), datetime.datetime(2017, 3, 16), datetime.datetime(2017, 3, 16)]]
and it is flattened to a 1-dimensional array:
[datetime.timedelta(0), datetime.timedelta(0), datetime.timedelta(0), datetime.datetime(2017, 3, 16), datetime.datetime(2017, 3, 16), datetime.datetime(2017, 3, 16)]
I don't quite understand why we want to extract a datetime-like array here?

@jreback
Contributor

jreback commented Mar 16, 2017

So the purpose is to infer from a bunch of objects whether they are datetimelikes (e.g. datetime, datetime w/tz, or timedeltas). This goes through some logic to see if conversion is possible, then tries to convert, backing away if things are not fully convertible (IOW there are mixed, non-identical types that are not NaNs/NaTs). IOW you will get a datetime64[ns], datetime64[tz-aware], or timedelta64[ns], or back what you started with.

The only thing is, I think I originally made this work regardless of the passed-in shape (see the ravel). This is wrong; it should preserve the shape and return a list of array-likes if ndim > 1, or an array-like if ndim == 1. The array-likes are the converted objects, or the original array if the conversion cannot succeed.

So this would fix this issue (and another one; I'll find the reference). Happy to have you put up a patch!
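The inference described above can be sketched roughly like this (a simplified illustration, not pandas' actual implementation; the function names are made up):

```python
import datetime

import numpy as np
import pandas as pd

def infer_datetimelike_1d(values):
    # Try to convert a 1-D object array to datetime64[ns] or
    # timedelta64[ns]; if the values are mixed / not fully convertible,
    # back away and return what we started with.
    values = np.asarray(values, dtype=object)
    if len(values) and all(isinstance(v, datetime.datetime) for v in values):
        return pd.to_datetime(values).values
    if len(values) and all(isinstance(v, datetime.timedelta) for v in values):
        return pd.to_timedelta(values).values
    return values

def infer_datetimelike(values):
    # Shape-preserving variant: for ndim > 1, convert row by row and
    # return a list of array-likes instead of ravel()-ing everything
    # into one flat array (the bug discussed above).
    arr = np.asarray(values, dtype=object)
    if arr.ndim == 1:
        return infer_datetimelike_1d(arr)
    return [infer_datetimelike_1d(row) for row in arr]
```

With the 2-D example from the previous comment, this would yield one timedelta64[ns] array and one datetime64[ns] array rather than a single flattened object array.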

@jreback
Contributor

jreback commented Mar 16, 2017

@PnPie this is the same cause as this: #13287

@mroeschke
Member

Looks to work on master. Could use a test:

In [41]: df.groupby('clientid').apply(
    ...:     lambda ddf: pd.Series(dict(
    ...:         clientid_age = ddf.time_delta_zero.min(),
    ...:         date = ddf.datetime.min()
    ...:     ))
    ...: )
Out[41]:
         clientid_age       date
clientid
A              0 days 2017-02-01
B              0 days 2017-02-01
C              0 days 2017-02-01

In [42]: pd.__version__
Out[42]: '0.26.0.dev0+533.gd8f9be7e3'

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Difficulty Intermediate Groupby Timedelta Timedelta data type Datetime Datetime data dtype labels Oct 11, 2019
@jhereth
Contributor

jhereth commented Oct 16, 2019

Works with the released version (0.25.1), too. Can be closed.

@mroeschke
Member

Care to contribute a regression test @qudade?

@jhereth
Contributor

jhereth commented Oct 17, 2019

@mroeschke Sure, will do!
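A regression test along these lines might look like the following (a hedged sketch of what such a test could be; the test name and the exact expected frame are assumptions, not necessarily the merged test):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def test_apply_aggregating_timedelta_and_datetime():
    # Regression test for GH 15562
    df = pd.DataFrame({
        'clientid': ['A', 'B', 'C'],
        'datetime': [pd.Timestamp('2017-02-01 00:00:00')] * 3,
    })
    df['time_delta_zero'] = df.datetime - df.datetime
    result = df.groupby('clientid').apply(
        lambda ddf: pd.Series(dict(
            clientid_age=ddf.time_delta_zero.min(),
            date=ddf.datetime.min(),
        ))
    )
    expected = pd.DataFrame(
        {
            'clientid_age': [pd.Timedelta(0)] * 3,
            'date': [pd.Timestamp('2017-02-01 00:00:00')] * 3,
        },
        index=pd.Index(['A', 'B', 'C'], name='clientid'),
    )
    assert_frame_equal(result, expected)
```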

@jreback jreback added this to the 1.0 milestone Oct 18, 2019

5 participants