groupby function: different printed result after irrelevant change #14810


Closed
elDan101 opened this issue Dec 6, 2016 · 11 comments

@elDan101

elDan101 commented Dec 6, 2016

(1)

def groupby_func(x):
    # compute speedup relative to the first row
    return x.iloc[0, :] / x.iloc[0:, :]  # (with '0')

sdf_size = odf.loc[:, cols].groupby(by="size").apply(groupby_func)

(2)

def groupby_func(x):
    # compute speedup relative to the first row
    return x.iloc[0, :] / x.iloc[:, :]  # (without '0')

sdf_size = odf.loc[:, cols].groupby(by="size").apply(groupby_func)

Problem description

For versions (1) and (2) I get two different outputs when I print 'sdf_size' on the console:

(1) [screenshot: output with '0:' -- the 'size' group keys appear in the index]

(2) [screenshot: output with ':' -- the original flat index]

Somehow, with '0:' the printed result is how I wanted it to be (see screenshots; the 'size' group keys appear in the index). But after deleting the 0, which I expected to be unnecessary, I got a different result, which I didn't expect (I think '0:' and ':' should be the same -- correct me if I am wrong on this). Explicitly setting group_keys=True didn't change anything.

Just ask if something is unclear.

Thank you.

Output of pd.show_versions()

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-51-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 0.6
Cython: None
numpy: 1.11.2
scipy: 0.18.0
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.9.1
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@TomAugspurger
Contributor

Reproducible example:

In [3]: df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])

In [4]: df['c'] = np.random.choice([0, 1], size=len(df))

In [5]: df
Out[5]:
          a         b  c
0  1.993989 -0.443380  0
1  0.451656 -1.374338  1
2 -0.341937  0.095889  1
3  0.831831 -0.119458  0
4  0.506889 -0.405047  0
5 -1.802596 -0.409155  0
6 -0.085620 -2.005494  1
7 -0.230276 -0.709994  1
8 -0.337890  1.010063  1
9 -0.900651  0.611446  0

In [6]: df.groupby('c').apply(lambda x: x.iloc[0, :] / x.iloc[:, :])
Out[6]:
          a          b    c
0  1.000000   1.000000  NaN
1  1.000000   1.000000  1.0
2 -1.320873 -14.332663  1.0
3  2.397109   3.711587  NaN
4  3.933775   1.094639  NaN
5 -1.106177   1.083648  NaN
6 -5.275115   0.685287  1.0
7 -1.961371   1.935705  1.0
8 -1.336693  -1.360647  1.0
9 -2.213942  -0.725134  NaN

In [7]: df.groupby('c').apply(lambda x: x.iloc[0, :] / x.iloc[0:, :])
Out[7]:
            a          b    c
c
0 0  1.000000   1.000000  NaN
  3  2.397109   3.711587  NaN
  4  3.933775   1.094639  NaN
  5 -1.106177   1.083648  NaN
  9 -2.213942  -0.725134  NaN
1 1  1.000000   1.000000  1.0
  2 -1.320873 -14.332663  1.0
  6 -5.275115   0.685287  1.0
  7 -1.961371   1.935705  1.0
  8 -1.336693  -1.360647  1.0

@elDan101 does changing your code to sdf_size = odf.loc[:, cols].groupby(by="size").transform(groupby_func) solve your issue (using transform instead of apply)? .apply does quite a bit of inference to make things come out "right", but this is a clear case of a transform. I don't know for sure, but I'm guessing there's some kind of identity check in .apply, and df.iloc[0:, :] is not df.

@elDan101
Author

elDan101 commented Dec 6, 2016

I copy-pasted your suggestion, but an exception was thrown that I wasn't able to quickly resolve:

Traceback (most recent call last):
File "viz.py", line 205, in <module>
sdf = create_speedup_dataframe(odf)
File "viz.py", line 183, in create_speedup_dataframe
sdf_size = odf.loc[:, cols].groupby(by="size").transform(groupby_func)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 3541, in transform
return self._transform_general(func, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 3474, in _transform_general
path, res = self._choose_path(fast_path, slow_path, group)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 3588, in _choose_path
res = slow_path(group)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 3583, in <lambda>
lambda x: func(x, *args, **kwargs), axis=self.axis)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4163, in apply
return self._apply_standard(f, axis, reduce=reduce)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4259, in _apply_standard
results[i] = func(v)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 3583, in <lambda>
lambda x: func(x, *args, **kwargs), axis=self.axis)
File "viz.py", line 179, in groupby_func
x = x.iloc[0, :] / x.iloc[0: , :] # computing speedup
File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1309, in __getitem__
return self._getitem_tuple(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1559, in _getitem_tuple
self._has_valid_tuple(tup)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 149, in _has_valid_tuple
raise IndexingError('Too many indexers')
pandas.core.indexing.IndexingError: ('Too many indexers', u'occurred at index procs')

@TomAugspurger
Contributor

TomAugspurger commented Dec 6, 2016

.transform works column by column, so change your indexers from x.iloc[0, :] to just x.iloc[0].
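This column-wise rewrite can be sketched on a small stand-in frame (the original odf from the issue is not available, so the data and column names here are made up for illustration):

```python
import pandas as pd

# Stand-in data; the original 'odf' from the issue is not available
df = pd.DataFrame({'size': [0, 0, 1, 1],
                   'time': [2.0, 4.0, 3.0, 9.0]})

def groupby_func(x):
    # Under .transform, x may be a single column of one group (a Series),
    # so the 2-D indexer x.iloc[0, :] becomes the scalar lookup x.iloc[0]
    return x.iloc[0] / x

# Speedup relative to the first row of each group; the grouping
# column 'size' is dropped from the transformed output
result = df.groupby('size').transform(groupby_func)
```

The same function also works if pandas hands the whole group DataFrame to the callable, since `x.iloc[0] / x` then broadcasts the first row across the group.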

@jreback
Contributor

jreback commented Dec 6, 2016

This is the idiomatic way to transform by groups

In [18]: df.groupby('c').transform('first')/df[['a', 'b']]
Out[18]:
           a          b
0   1.000000   1.000000
1  -0.680665  -0.774614
2   1.000000   1.000000
3  -0.398770   0.821029
4   1.708528  14.396555
5   0.857692  -2.335223
6  20.181797   1.131629
7  -0.794080  -0.447658
8   3.881273  -1.540272
9   7.046538 -18.276593

@elDan101
Author

elDan101 commented Dec 6, 2016

@TomAugspurger
When I use transform, the output is the same for both cases and corresponds to output (2) (screenshot in the opening post). Actually, I am expecting an output like screenshot (1), because that is what is returned when I use built-in groupby functions (like 'mean').

I was not aware of the transform function; there is also no API documentation for it:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.transform.html?highlight=transform#pandas.core.groupby.GroupBy.transform

@TomAugspurger
Contributor

transform is documented in the prose section: http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation, though we could add it to the API docs if desired.

Aggregations like 'mean' set the index to the group keys. If your actual function is an aggregation, use .agg; if you're doing a 1:1 transformation, use .transform.
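The index difference between the two is easy to see on a toy frame (the data and names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'size': [0, 0, 1, 1],
                   'time': [2.0, 4.0, 3.0, 9.0]})

# .agg collapses each group to a single row, indexed by the group keys
agg_res = df.groupby('size')['time'].agg('mean')

# .transform is 1:1 with the input rows and keeps the original index
tr_res = df.groupby('size')['time'].transform('mean')
```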

Closing for now, but ask if you have questions. Please include an actual copy-pastable example though (no screenshots). Hard to help out with your actual problem when I can't run the code.

@TomAugspurger TomAugspurger added this to the No action milestone Dec 6, 2016
@elDan101
Author

elDan101 commented Dec 7, 2016

Thank you. Actually, one question: I am still wondering why "0:" and ":" make a difference.
Is there an intuition that helps me see why these are different?

transform is documented in the prose section: http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation, though we could add it to the API docs if desired.

Maybe a "see also" link to the page you provided is enough. Thanks for this; I was new to grouping with pandas, and I suppose it will come in handy again in the future.

@TomAugspurger
Contributor

I am still wondering why "0:" and ":" make a difference.
Is there an intuition that helps me see why these are different?

I don't know off the top of my head. My guess is that it has to do with the inference df.apply does to try to guess the output shape, and the fact that df.iloc[0:] returns a different object than df:

In [5]: df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])

In [6]: df.iloc[:] is df
Out[6]: True

In [7]: df.iloc[0:] is df
Out[7]: False

Feel free to dig through the source code in pandas.core.groupby if you're interested :)

@jorisvandenbossche
Member

It is true that the identity is not the same, but both are equal:

In [53]: df.iloc[0:].equals(df.iloc[:])
Out[53]: True

so I think it can be considered a bug that groupby.apply does not behave the same in both cases
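The equal-but-not-identical situation is easy to reproduce (a minimal sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 2), columns=['a', 'b'])

# Both slices hold exactly the same data ...
assert df.iloc[0:].equals(df.iloc[:])

# ... but .iloc[0:] constructs a new DataFrame object, so an
# identity check tells them apart even though .equals cannot
assert df.iloc[0:] is not df
```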

@jreback
Contributor

jreback commented Dec 7, 2016

equality is not enough

it has to be identical
this is a view

the reason is we are testing for mutation and there is no good way
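The mutation concern can be illustrated with a toy version of such a check. This is a hypothetical sketch of the kind of identity-based heuristic being described, not the actual pandas source:

```python
import pandas as pd

def possibly_mutated(group, result):
    # Hypothetical heuristic: .equals cannot distinguish "func returned
    # the very object it was given (and may have mutated it in place)"
    # from "func returned an equal copy", but an identity check can
    return result is group

df = pd.DataFrame({'a': [1.0, 2.0]})
print(possibly_mutated(df, df))              # the object itself
print(possibly_mutated(df, df.iloc[0:, :]))  # equal data, but a fresh object
```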

@jorisvandenbossche
Member

equality is not enough, it has to be identical

But in the original issue it was not about identity, I think, as the identical/equal dataframe was only in the denominator: x.iloc[0, :] / x.iloc[:, :] vs x.iloc[0, :] / x.iloc[0:, :]
