BUG: a duplicated index would cause groupby.fillna(method='ffill') a wrong result #43412

Closed
3 tasks
mIgLLL opened this issue Sep 5, 2021 · 7 comments · Fixed by #55408
Labels: Bug, good first issue, Groupby, Needs Tests (unit test(s) needed to prevent regressions)

Comments

@mIgLLL

mIgLLL commented Sep 5, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

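import numpy as np
import pandas as pd

# One value column with gaps; the group labels in 'A' are assigned by label slices.
df1 = pd.DataFrame({'B': [np.nan, 1, 2, np.nan, 4, np.nan, np.nan, 7, 8, 9, 10, 11, np.nan, 13, 14, 15]})

df1.loc[:5, 'A'] = 3
df1.loc[5:10, 'A'] = 2
df1.loc[10:, 'A'] = 4

df2 = df1.copy()

# Appending the copy makes every index label (0..15) appear twice.
# DataFrame.append was removed in pandas 2.0; pd.concat([df1, df2]) is the equivalent there.
df = df1.append(df2)

df['C'] = df.groupby('A')['B'].shift(1)
a = df.groupby('A')['C'].fillna(method='ffill')  # unexpected result with the duplicated index

# after reset_index()

df = df.reset_index(drop=True)
df['D'] = df.groupby('A')['C'].fillna(method='ffill')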

Problem description

I found that when the index contains duplicate labels after appending DataFrames, groupby.fillna() returns a wrong result.
I have already found a workaround: calling reset_index() first avoids the problem.
But I am still confused by the result in "a". It would be a great help if anyone could give me some guidance.
Secondly, I think this may cause serious errors for anyone who does not notice it and just writes a command like:
df['D']=df.groupby('A')['C'].fillna(method='ffill').reset_index().loc[:,'C']

I think this should be fixed, or at least warned about more prominently,
though I know that something like df['D']=df.groupby('A')['C'].fillna(method='ffill') without reset_index won't work.

Expected Output

The result on the DataFrame with the duplicated index should match the result obtained after reset_index().
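The workaround and the expected result can be checked on a small frame. This is a minimal sketch, not the reporter's data: the four-row `part` frame is hypothetical, `pd.concat` stands in for `DataFrame.append` (removed in pandas 2.0), and `GroupBy.ffill()` stands in for `fillna(method='ffill')`. It therefore illustrates the expected positional result rather than reproducing the bug.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two groups in 'A', gaps in 'B'.
part = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1.0, np.nan, np.nan, 4.0]})

# Stacking the frame on itself duplicates every index label: 0,1,2,3,0,1,2,3.
df = pd.concat([part, part])

# Forward fill within each group on the duplicated index.
filled_dup = df.groupby("A")["B"].ffill()

# Workaround from the report: make the index unique first, then fill.
flat = df.reset_index(drop=True)
filled_flat = flat.groupby("A")["B"].ffill()

# Compare by position, since the two results carry different index labels.
np.testing.assert_array_equal(filled_dup.to_numpy(), filled_flat.to_numpy())
print(filled_flat.tolist())  # [1.0, 1.0, nan, 4.0, 1.0, 1.0, 4.0, 4.0]
```

Comparing the underlying arrays avoids any label alignment, which is exactly the step that becomes ambiguous when labels repeat.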

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 5f648bf
python : 3.8.11.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : zh_CN
LOCALE : Chinese (Simplified)_China.936

pandas : 1.3.2
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.2
setuptools : 52.0.0.post20210125
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.26.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

mIgLLL added the Bug and Needs Triage labels Sep 5, 2021
@simonjayhawkins
Member

Thanks @mIgLLL for the report.

> I think this should be fixed or warned more seriously.
> Though I know that something like df['D']=df.groupby('A')['C'].fillna(method='ffill') without reset_index won't work.

df['D']=df.groupby('A')['C'].fillna(method='ffill') raises ValueError: cannot reindex on an axis with duplicate labels

This was not shown in the code sample above. Why is a warning needed when an exception is already raised?


Maybe #8849 (comment) is related; it has a much simpler code sample.

That example on master now gives

df = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 0])
a = df.groupby(level=0).apply(lambda x: x)
print(a)
   a
0  1
0  3
1  2

The result index is in a different order from the DataFrame's index, so assigning the result back to a DataFrame column requires a reindex, which raises

df = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 0])
df["b"] = df.groupby(level=0)["a"].apply(lambda x: x) # ValueError: cannot reindex from a duplicate axis

whereas the shift call returns the same index.

df = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 0])
df.groupby(level=0)["a"].shift()
0    NaN
1    NaN
0    1.0
Name: a, dtype: float64

and can therefore be assigned to the DataFrame without a reindex, so it does not raise.

df = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 0])
df["b"] = df.groupby(level=0)["a"].shift()
print(df)
   a    b
0  1  NaN
1  2  NaN
0  3  1.0
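The two cases above can be put side by side in one script. This is a sketch under stated assumptions: `reordered` is a hand-built, hypothetical stand-in for a group-wise result whose rows come back in a different order, so the script does not depend on how a particular pandas version orders the output of `apply`.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 0])

# Hypothetical stand-in: same labels as df, but in a different order.
reordered = pd.Series([1, 3, 2], index=[0, 0, 1])

# The indexes differ, so pandas must reindex `reordered` to df.index,
# which is ambiguous with duplicate labels and raises ValueError.
try:
    df["b"] = reordered
    raised = False
except ValueError:
    raised = True

# shift() keeps the original index, so no reindex is needed on assignment.
shifted = df.groupby(level=0)["a"].shift()
same_index = shifted.index.equals(df.index)
df["c"] = shifted

print(raised, same_index)  # True True
```

When the indexes are identical, the values are assigned by position. This is why a result that preserves the original index can be assigned safely even when labels repeat.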

simonjayhawkins added the Groupby label and removed the Needs Triage label Sep 8, 2021
@rhshadrach
Member

The OP's code now works and appears to give the correct result.

rhshadrach added the good first issue and Needs Tests labels Jul 15, 2023
@abrowne34

abrowne34 commented Aug 1, 2023

I would like to contribute to this issue and write the test. Could I please take this?

@rhshadrach
Member

@prasad-yashdeep

take

@josemayer
Contributor

@prasad-yashdeep are you still working on it?

@josemayer
Contributor

take

6 participants