groupby function: different printed result after irrelevant change #14810


Closed
elDan101 opened this issue Dec 6, 2016 · 11 comments

@elDan101

elDan101 commented Dec 6, 2016

(1)

def groupby_func(x):
    # compute speedup relative to the first row
    return x.iloc[0, :] / x.iloc[0:, :]  # (with '0')

sdf_size = odf.loc[:, cols].groupby(by="size").apply(groupby_func)

(2)

def groupby_func(x):
    # compute speedup relative to the first row
    return x.iloc[0, :] / x.iloc[:, :]  # (without '0')

sdf_size = odf.loc[:, cols].groupby(by="size").apply(groupby_func)

Problem description

For versions (1) and (2) I get two different outputs when I print 'sdf_size' on the console:

(1) [screenshot: output with '0:' -- the 'size' group keys appear in the index]

(2) [screenshot: output with ':' -- the original flat index]

Somehow, with '0:' the printed result is how I wanted it to be (see screenshots; the 'size' group keys appear in the index). But after deleting the 0, which I expected to be unnecessary, I got a different result, which I didn't expect (I think '0:' and ':' should be the same -- correct me if I am wrong on this). Explicitly setting group_keys=True didn't change anything.

Just ask if something is unclear.

Thank you.

Output of pd.show_versions()

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-51-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 0.6
Cython: None
numpy: 1.11.2
scipy: 0.18.0
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.9.1
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@TomAugspurger
Contributor

Reproducible example:

In [3]: df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])

In [4]: df['c'] = np.random.choice([0, 1], size=len(df))

In [5]: df
Out[5]:
          a         b  c
0  1.993989 -0.443380  0
1  0.451656 -1.374338  1
2 -0.341937  0.095889  1
3  0.831831 -0.119458  0
4  0.506889 -0.405047  0
5 -1.802596 -0.409155  0
6 -0.085620 -2.005494  1
7 -0.230276 -0.709994  1
8 -0.337890  1.010063  1
9 -0.900651  0.611446  0

In [6]: df.groupby('c').apply(lambda x: x.iloc[0, :] / x.iloc[:, :])
Out[6]:
          a          b    c
0  1.000000   1.000000  NaN
1  1.000000   1.000000  1.0
2 -1.320873 -14.332663  1.0
3  2.397109   3.711587  NaN
4  3.933775   1.094639  NaN
5 -1.106177   1.083648  NaN
6 -5.275115   0.685287  1.0
7 -1.961371   1.935705  1.0
8 -1.336693  -1.360647  1.0
9 -2.213942  -0.725134  NaN

In [7]: df.groupby('c').apply(lambda x: x.iloc[0, :] / x.iloc[0:, :])
Out[7]:
            a          b    c
c
0 0  1.000000   1.000000  NaN
  3  2.397109   3.711587  NaN
  4  3.933775   1.094639  NaN
  5 -1.106177   1.083648  NaN
  9 -2.213942  -0.725134  NaN
1 1  1.000000   1.000000  1.0
  2 -1.320873 -14.332663  1.0
  6 -5.275115   0.685287  1.0
  7 -1.961371   1.935705  1.0
  8 -1.336693  -1.360647  1.0

@elDan101 does changing your code to sdf_size = odf.loc[:, cols].groupby(by="size").transform(groupby_func) solve your issue (using transform instead of apply)? .apply does quite a bit of inference to make things come out "right", but this is a clear case of a transform. I don't know for sure, but I'm guessing there's some kind of identity check in .apply, and df.iloc[0:, :] is not df.

@elDan101
Author

elDan101 commented Dec 6, 2016

I copy-pasted your suggestion, but an exception was thrown that I wasn't able to quickly resolve:

Traceback (most recent call last):
File "viz.py", line 205, in <module>
sdf = create_speedup_dataframe(odf)
File "viz.py", line 183, in create_speedup_dataframe
sdf_size = odf.loc[:, cols].groupby(by="size").transform(groupby_func)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 3541, in transform
return self._transform_general(func, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 3474, in _transform_general
path, res = self._choose_path(fast_path, slow_path, group)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 3588, in _choose_path
res = slow_path(group)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 3583, in <lambda>
lambda x: func(x, *args, **kwargs), axis=self.axis)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4163, in apply
return self._apply_standard(f, axis, reduce=reduce)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4259, in _apply_standard
results[i] = func(v)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 3583, in <lambda>
lambda x: func(x, *args, **kwargs), axis=self.axis)
File "viz.py", line 179, in groupby_func
x = x.iloc[0, :] / x.iloc[0: , :] # computing speedup
File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1309, in __getitem__
return self._getitem_tuple(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1559, in _getitem_tuple
self._has_valid_tuple(tup)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 149, in _has_valid_tuple
raise IndexingError('Too many indexers')
pandas.core.indexing.IndexingError: ('Too many indexers', u'occurred at index procs')

@TomAugspurger
Contributor

TomAugspurger commented Dec 6, 2016

.transform works column by column, so change your indexers from x.iloc[0, :] to just x.iloc[0].
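This column-wise rewrite can be sketched on a small stand-in frame (the original odf from the issue is not available, so the data and column names here are made up for illustration):

```python
import pandas as pd

# Stand-in data; the original 'odf' from the issue is not available
df = pd.DataFrame({'size': [0, 0, 1, 1],
                   'time': [2.0, 4.0, 3.0, 9.0]})

def groupby_func(x):
    # Under .transform, x may be a single column of one group (a Series),
    # so the 2-D indexer x.iloc[0, :] becomes the scalar lookup x.iloc[0]
    return x.iloc[0] / x

# Speedup relative to the first row of each group; the grouping
# column 'size' is dropped from the transformed output
result = df.groupby('size').transform(groupby_func)
```

The same function also works if pandas hands the whole group DataFrame to the callable, since `x.iloc[0] / x` then broadcasts the first row across the group.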

@jreback
Contributor

jreback commented Dec 6, 2016

This is the idiomatic way to transform by groups

In [18]: df.groupby('c').transform('first')/df[['a', 'b']]
Out[18]:
           a          b
0   1.000000   1.000000
1  -0.680665  -0.774614
2   1.000000   1.000000
3  -0.398770   0.821029
4   1.708528  14.396555
5   0.857692  -2.335223
6  20.181797   1.131629
7  -0.794080  -0.447658
8   3.881273  -1.540272
9   7.046538 -18.276593

@elDan101
Author

elDan101 commented Dec 6, 2016

@TomAugspurger
When I use transform, the output is the same for both cases and corresponds to output (2) (screenshot in the opening post). Actually, I am expecting an output like screenshot (1), because that is what is returned when I use built-in groupby functions (like 'mean').

I was not aware of the transform function; there is also no API documentation for it:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.transform.html?highlight=transform#pandas.core.groupby.GroupBy.transform

@TomAugspurger
Contributor

transform is documented in the prose section: http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation, though we could add it to the API docs if desired.

Aggregations like 'mean' set the index to the group keys. If your actual function is an aggregation, use .agg; if you're doing a 1:1 transformation, use .transform.
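The index difference between the two is easy to see on a toy frame (the data and names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'size': [0, 0, 1, 1],
                   'time': [2.0, 4.0, 3.0, 9.0]})

# .agg collapses each group to a single row, indexed by the group keys
agg_res = df.groupby('size')['time'].agg('mean')

# .transform is 1:1 with the input rows and keeps the original index
tr_res = df.groupby('size')['time'].transform('mean')
```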

Closing for now, but ask if you have questions. Please include an actual copy-pastable example though (no screenshots). Hard to help out with your actual problem when I can't run the code.

@TomAugspurger TomAugspurger added this to the No action milestone Dec 6, 2016
@elDan101
Author

elDan101 commented Dec 7, 2016

Thank you. Actually, one question: I am still wondering why "0:" and ":" make a difference.
Is there an intuition that helps me see why these are different?

transform is documented in the prose section: http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation, though we could add it to the API docs if desired.

Maybe a "see also" link to the page you provided is enough. Thanks for this; I was new to grouping with pandas, and I suppose it will come in handy again in the future.

@TomAugspurger
Contributor

I am still wondering why "0:" and ":" make a difference.
Is there an intuition that helps me see why these are different?

I don't know off the top of my head. My guess is that it has to do with the inference df.apply does to try to guess the output shape, and the fact that df.iloc[0:] returns a different object than df:

In [5]: df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])

In [6]: df.iloc[:] is df
Out[6]: True

In [7]: df.iloc[0:] is df
Out[7]: False

Feel free to dig through the source code in pandas.core.groupby if you're interested :)

@jorisvandenbossche
Member

It is true that the identity is not the same, but both are equal:

In [53]: df.iloc[0:].equals(df.iloc[:])
Out[53]: True

so I think it can be considered a bug that groupby.apply does not behave the same in both cases
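The equal-but-not-identical situation is easy to reproduce (a minimal sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 2), columns=['a', 'b'])

# Both slices hold exactly the same data ...
assert df.iloc[0:].equals(df.iloc[:])

# ... but .iloc[0:] constructs a new DataFrame object, so an
# identity check tells them apart even though .equals cannot
assert df.iloc[0:] is not df
```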

@jreback
Contributor

jreback commented Dec 7, 2016

equality is not enough

it has to be identical
this is a view

the reason is we are testing for mutation and there is no good way
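The mutation concern can be illustrated with a toy version of such a check. This is a hypothetical sketch of the kind of identity-based heuristic being described, not the actual pandas source:

```python
import pandas as pd

def possibly_mutated(group, result):
    # Hypothetical heuristic: .equals cannot distinguish "func returned
    # the very object it was given (and may have mutated it in place)"
    # from "func returned an equal copy", but an identity check can
    return result is group

df = pd.DataFrame({'a': [1.0, 2.0]})
print(possibly_mutated(df, df))              # the object itself
print(possibly_mutated(df, df.iloc[0:, :]))  # equal data, but a fresh object
```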

@jorisvandenbossche
Member

equality is not enough, it has to be identical

But in the original issue it was not about identity, I think, as the identical/equal dataframe was only in the denominator: x.iloc[0, :] / x.iloc[:, :] vs x.iloc[0, :] / x.iloc[0:, :]
