BUG: MultiIndex with nan values obtained by groupby behaves different to MultiIndex.from_tuples() #36060
Comments
When recreating the MI of the groupby operation, the multiplication works as expected, so I suspect the MI creation in the groupby grouping is where it fails.
In [59]: speed = df.groupby(['animal', 'type'], dropna=False)['speed'].first()
    ...: speed.index = pd.MultiIndex.from_tuples(speed.index)
    ...:
    ...: speed * wing
Out[59]:
Falcon  NaN    15960.0
Parrot  NaN     1056.0
dtype: float64
As a workaround:
In [68]: wing.reindex_like(speed) * speed
Out[68]:
animal  type
Falcon  NaN     15960.0
Parrot  NaN      1056.0
dtype: float64
I'm not sure which representation is preferable (storing
Unsure if this is the same underlying issue, but it fits the title:
In [3]: df = pd.DataFrame(data={"a": [1, 2, 3, np.nan, 4], "b": ["a", "b", "c", "d", np.nan], "c": [0, 12, 23, 45, 56]})
In [4]: df
     a    b   c
0  1.0    a   0
1  2.0    b  12
2  3.0    c  23
3  NaN    d  45
4  4.0  NaN  56
In [5]: df.groupby(["a", "c"], dropna=False).sum().groupby(["a", "c"], dropna=True).sum()
b
a c
1.0 0 a
2.0 12 b
3.0 23 c
4.0 56 0
NaN 45 d
In [6]: idx = pd.MultiIndex.from_tuples([(1.0, 0), (2.0, 12), (3.0, 23), (4.0, 56), (np.nan, 45)], names=('a', 'b'))
In [8]: df2 = pd.DataFrame(["a", "b", "c", np.nan, "d"], index=idx)
In [9]: df2
0
a b
1.0 0 a
2.0 12 b
3.0 23 c
4.0 56 NaN
NaN 45 d
In [12]: df2.groupby(["a", "b"], dropna=True).first()
0
a b
1.0 0 a
2.0 12 b
3.0 23 c
4.0 56 NaN
After performing a
My workaround of rewriting the index can cause another issue ( Pandas version: 1.1.2
I reopened another issue providing more detail on the above.
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I checked in 1.1.1
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
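A minimal sketch of a setup consistent with the problem description and with the transcripts in the comments, not necessarily the reporter's exact sample. The names df, speed, wing and the columns animal, type and speed appear in the thread; the data values are assumptions, chosen only so that the products match the 15960.0 and 1056.0 shown in the comments. The names direct, speed_fixed and fixed are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data (assumption): values picked so that the products
# equal the 15960.0 and 1056.0 reported in the comments.
df = pd.DataFrame(
    {
        "animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
        "type": [np.nan, np.nan, np.nan, np.nan],
        "speed": [380.0, 370.0, 24.0, 26.0],
    }
)

# Series obtained from a groupby aggregation that keeps NaN keys.
speed = df.groupby(["animal", "type"], dropna=False)["speed"].first()

# Series constructed manually on what should be the same MultiIndex.
wing = pd.Series(
    [42.0, 44.0],
    index=pd.MultiIndex.from_tuples(
        [("Falcon", np.nan), ("Parrot", np.nan)], names=["animal", "type"]
    ),
)

# Compare how each index stores the NaN key internally.
print(speed.index.codes, speed.index.levels)
print(wing.index.codes, wing.index.levels)

# Direct product: reported in this issue to give an unexpected result
# on pandas 1.1.1; the outcome may differ on other versions.
direct = speed * wing
print(direct)

# Workaround from the comments: rebuild the groupby index from its tuples
# so both indexes are created by MultiIndex.from_tuples.
speed_fixed = speed.copy()
speed_fixed.index = pd.MultiIndex.from_tuples(speed.index)
fixed = speed_fixed * wing
print(fixed)
```

Printing the codes and levels of both indexes is a quick way to check whether the two MultiIndex objects represent the NaN key in the same way before combining the series.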
Problem description
I'm trying to combine two series (say, by multiplication for now). One of them is obtained by a groupby aggregation (say first) and the other series is constructed manually. Both series have a MultiIndex which should be the same, so a multiplication should work fine. However, it seems that groupby(..., dropna=False) creates a different MI, which causes the operation to return an unexpected result.
Expected Output
I would expect the result of speed * wing to be
Output of pd.show_versions()
INSTALLED VERSIONS
commit : f2ca0a2
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.7.17-200.fc32.x86_64
Version : #1 SMP Fri Aug 21 15:23:46 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8
pandas : 1.1.1
numpy : 1.18.4
pytz : 2019.3
dateutil : 2.8.1
pip : 20.2.1
setuptools : 47.3.1
Cython : 0.29.17
pytest : 5.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.17.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.1
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : 2.7.1
odfpy : None
openpyxl : 1.8.6
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.12
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None