groupby function: different printed result after irrelevant change #14810
Reproducible example:

In [3]: df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])
In [4]: df['c'] = np.random.choice([0, 1], size=len(df))
In [5]: df
Out[5]:
a b c
0 1.993989 -0.443380 0
1 0.451656 -1.374338 1
2 -0.341937 0.095889 1
3 0.831831 -0.119458 0
4 0.506889 -0.405047 0
5 -1.802596 -0.409155 0
6 -0.085620 -2.005494 1
7 -0.230276 -0.709994 1
8 -0.337890 1.010063 1
9 -0.900651 0.611446 0
In [6]: df.groupby('c').apply(lambda x: x.iloc[0, :] / x.iloc[:, :])
Out[6]:
a b c
0 1.000000 1.000000 NaN
1 1.000000 1.000000 1.0
2 -1.320873 -14.332663 1.0
3 2.397109 3.711587 NaN
4 3.933775 1.094639 NaN
5 -1.106177 1.083648 NaN
6 -5.275115 0.685287 1.0
7 -1.961371 1.935705 1.0
8 -1.336693 -1.360647 1.0
9 -2.213942 -0.725134 NaN
In [7]: df.groupby('c').apply(lambda x: x.iloc[0, :] / x.iloc[0:, :])
Out[7]:
a b c
c
0 0 1.000000 1.000000 NaN
3 2.397109 3.711587 NaN
4 3.933775 1.094639 NaN
5 -1.106177 1.083648 NaN
9 -2.213942 -0.725134 NaN
1 1 1.000000 1.000000 1.0
2 -1.320873 -14.332663 1.0
6 -5.275115 0.685287 1.0
7 -1.961371 1.935705 1.0
8 -1.336693 -1.360647 1.0

@elDan101 does changing your code to
I copy-pasted your suggestion, but an exception was thrown which I wasn't able to quickly resolve.
This is the idiomatic way to transform by groups.
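The suggested snippet itself was not preserved in this thread; below is a hedged sketch of what the transform-based version might look like, reusing the column names from the reproducible example (the seed is my own addition so the data is fixed):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumed seed; the thread's data was unseeded
df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])
df['c'] = np.random.choice([0, 1], size=len(df))

# transform applies the function within each group and returns a result
# aligned to the original index, so no extra group-key index level appears.
result = df.groupby('c')[['a', 'b']].transform(lambda x: x.iloc[0] / x)

# By construction, the first row of each group divided by itself is 1.0.
```

Unlike apply, transform guarantees the output has the same index and shape as the input, which sidesteps the ':' vs '0:' ambiguity entirely.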
@TomAugspurger I was not aware of the transform function; there is also no documentation:
transform is documented in the prose section: http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation, though we could add it to the API docs if desired.

Closing for now, but ask if you have questions. Please include an actual copy-pastable example though (no screenshots); it's hard to help with your actual problem when I can't run the code.
Thank you. Actually, one question: I am still wondering why "0:" and ":" make a difference.

Maybe a "see also" link to the page you provided is enough. Thanks for this; I was new to grouping with pandas, and it will come in handy again in the future, I suppose.
I don't know off the top of my head. My guess is that it has to do with the inference:

In [5]: df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])
In [6]: df.iloc[:] is df
Out[6]: True
In [7]: df.iloc[0:] is df
Out[7]: False

Feel free to dig through the source code in
It is true that the identity is not the same, but both are equal, so I think this can be considered a bug that it does not behave the same in
Equality is not enough; it has to be identical. The reason is that we are testing for mutation, and there is no good way
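A plain-Python illustration (my own sketch, not taken from the pandas source) of why an equality check cannot stand in for an identity check when the goal is detecting mutation:

```python
import pandas as pd

g = pd.DataFrame({'a': [1.0, 2.0]})

returned = g         # a function returned its input (it may mutate in place)
fresh = g.copy()     # a function built a new, equal result

# Equality cannot distinguish the two cases...
assert returned.equals(fresh)

# ...but identity can: only `returned` aliases the caller's data, so an
# in-place mutation through it is visible to the caller.
assert returned is g
assert fresh is not g
returned.loc[0, 'a'] = 99.0
assert g.loc[0, 'a'] == 99.0      # mutation visible through the alias
assert fresh.loc[0, 'a'] == 1.0   # the independent copy is unaffected
```

This is why apply cares about `is` rather than `.equals()`: a function that hands back the very object it was given may have mutated it, while an equal-but-distinct result cannot have.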
But in the original issue it was not about identity, I think, as the identical/equal dataframe was only in the denominator:
(1)
(2)
Problem description
For versions (1) and (2) I get two different outputs when I print 'sdf_size' on the console:

(1)
[screenshot of output (1): grouped result]
(2)
[screenshot of output (2): flat result]
Somehow, with '0:' the (printed) result is how I wanted it to be (see screenshots: the group sizes appear in the index). But after deleting the 0, which I expected to be unnecessary, I got a different result, which I didn't expect (I thought '0:' and ':' were the same -- correct me if I am wrong on this). Explicitly setting "group_keys=True" didn't change anything.
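For what it's worth, as plain indexing the two spellings do select exactly the same rows; a small sketch (variable names are my own) showing that the values are always equal even though the returned objects need not be the same Python object:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])

# As row selections, ':' and '0:' are interchangeable: both take every row,
# so the resulting values are always equal.
assert df.iloc[:].equals(df.iloc[0:])

# Whether a full slice returns the original frame or a new object is an
# internal detail (`df.iloc[:] is df` was True in pandas 0.19), and that
# identity is what groupby.apply's mutation check keyed on.
print(df.iloc[:] is df, df.iloc[0:] is df)
```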
Just ask if something is unclear.
Thank you.
Output of pd.show_versions():
pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 0.6
Cython: None
numpy: 1.11.2
scipy: 0.18.0
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.9.1
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None