BUG: a duplicated index would cause groupby.fillna(method='ffill') a wrong result #43412
Comments
Thanks @mIgLLL for the report.
This was not shown in the code sample above. Why is a warning needed when an exception is raised? Maybe #8849 (comment) is related and has a much simpler code sample. That example on master now gives
The result index has a different order to the dataframe and so assigning the result back to a dataframe column instead raises
whereas the shift call returns the same index.
and therefore can be assigned to the dataframe without a re-index being necessary and does not raise.
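The assignment behaviour described above can be sketched without groupby at all. The frame and both Series below are hypothetical stand-ins, not the data from the report: a Series whose index is identical to the frame's is assigned as-is, while one carrying the same labels in a different order forces a reindex, which pandas refuses to do on duplicate labels.

```python
import pandas as pd

# Hypothetical frame with a duplicated index (not the data from the report).
df = pd.DataFrame(
    {"A": ["x", "y", "x", "y"], "C": [1.0, 2.0, 3.0, 4.0]},
    index=[0, 1, 0, 1],
)

# Same labels in the same order as df.index: no reindex is needed.
same_order = pd.Series([10.0, 20.0, 30.0, 40.0], index=[0, 1, 0, 1])
# Same labels, different order: assignment has to reindex.
reordered = pd.Series([10.0, 30.0, 20.0, 40.0], index=[0, 0, 1, 1])

df["same"] = same_order  # works, values are taken positionally

try:
    df["reordered"] = reordered
    raised = False
except ValueError:
    # pandas cannot reindex a Series whose own index has duplicate labels
    raised = True

print(raised)
```

The exact wording of the error message differs between pandas versions, so the sketch only checks the exception type.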
OP code now works and appears to be the correct result.
I would like to contribute to this issue and write the test. Could I please take this?
@abrowne34 - see the 4th paragraph here: https://pandas.pydata.org/pandas-docs/stable/development/contributing.html?highlight=take#where-to-start
take
@prasad-yashdeep are you working on it yet?
take
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
Problem description
I found that if the index contains duplicates after appending dataframes, groupby.fillna() produces a wrong result.
I have already found a workaround: calling reset_index() first avoids the problem.
But I am still confused by the result in dataframe "a". It would be a great help if anyone could give me some guidance.
Secondly, I think this may cause serious errors if someone is not careful and just writes a command like:
df['D']=df.groupby('A')['C'].fillna(method='ffill').reset_index().loc[:,'C']
I think this should be fixed, or a clearer warning should be given.
I know, though, that something like df['D']=df.groupby('A')['C'].fillna(method='ffill') without reset_index won't work.
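A minimal sketch of why the one-liner above is risky and how resetting the index first avoids it. The original code sample is not reproduced in this text, so the frames below are hypothetical, `pd.concat` stands in for the appending step, and `filled` is a hand-written stand-in for a Series that has had `reset_index()` applied after the fact. The sketch uses `groupby(...).ffill()`, which is assumed here to be an acceptable substitute for `fillna(method='ffill')`.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the sample from the report.
first = pd.DataFrame({"A": ["x", "y"], "C": [1.0, 2.0]})
second = pd.DataFrame({"A": ["x", "y"], "C": [np.nan, 5.0]})
df = pd.concat([first, second])  # index is now 0, 1, 0, 1

# Stand-in for a filled Series with a fresh 0..3 index,
# as a trailing reset_index() would produce.
filled = pd.Series([1.0, 2.0, 1.0, 5.0])

# Assignment aligns by label, so each duplicated label in df picks up
# the value stored under that label in `filled`. No error is raised.
risky = df.copy()
risky["D"] = filled

# Workaround from the report: make the index unique before grouping.
safe = df.reset_index(drop=True)
safe["D"] = safe.groupby("A")["C"].ffill()

print(risky["D"].tolist())
print(safe["D"].tolist())
```

In the risky frame the last row receives the value for label 1 rather than the value at position 3, which is the kind of silent error the report warns about.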
Expected Output
It should behave the same way as it does on a dataframe after reset_index().
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 5f648bf
python : 3.8.11.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : zh_CN
LOCALE : Chinese (Simplified)_China.936
pandas : 1.3.2
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.2
setuptools : 52.0.0.post20210125
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.26.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1