Unexpected behaviour of groupby.transform when using 'fillna' #30918

lfiedler · 2020-01-11T12:20:47Z

Code Sample, a copy-pastable example if possible

import datetime

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'baz'],
        'B': [1, 2, np.nan, 3, 3, np.nan, 4],
        'C': [np.nan]*7,
        'D': [0,1,2,3,4,5,6],
        'E': [np.nan] + [datetime.datetime(2020,1,1)]*3 + [datetime.datetime(2020,1,2)]*2 +[datetime.datetime(2020,1,3)],
        'F': list('abcdefg'),
        'G': list('abc') + [np.nan] + list('efg'),
        'id': range(0,7),
    }
).set_index('id')
df.groupby('A').transform('fillna', value=9999)

Output

B	C	D	E	F	G
9999.0	9999.0	2	2020-01-01 00:00:00	c	c
9999.0	9999.0	2	2020-01-01 00:00:00	c	c
9999.0	9999.0	2	2020-01-01 00:00:00	c	c
9999.0	9999.0	2	2020-01-01 00:00:00	c	c
1.0	9999.0	0	9999	a	a
1.0	9999.0	0	9999	a	a
2.0	9999.0	1	2020-01-01 00:00:00	b	b

Problem description

When passing the string 'fillna' to GroupBy.transform, I expected the same result as GroupBy.transform with lambda x: x.fillna(9999). Instead, it seems to also change values that are not NaN. Even worse, it seems to shuffle values between groups.

Is this how it is expected to work?

Expected Output

df.groupby('A').transform(lambda x: x.fillna(9999))

B	C	D	E	F	G
1.0	9999.0	0	9999	a	a
2.0	9999.0	1	2020-01-01 00:00:00	b	b
9999.0	9999.0	2	2020-01-01 00:00:00	c	c
3.0	9999.0	3	2020-01-01 00:00:00	d	9999
3.0	9999.0	4	2020-01-02 00:00:00	e	e
9999.0	9999.0	5	2020-01-02 00:00:00	f	f
4.0	9999.0	6	2020-01-03 00:00:00	g	g
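The expectation above can be phrased as a check: filling should change only the positions that were NaN and leave every other value in its original row. A minimal sketch, using the lambda form on column B only; the helper name `only_nans_filled` is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "foo", "bar", "bar", "baz"],
        "B": [1, 2, np.nan, 3, 3, np.nan, 4],
    }
)


def only_nans_filled(original, filled, value):
    """Hypothetical helper: True if `filled` is `original` with NaNs replaced by `value`."""
    was_nan = original.isna()
    # Non-NaN entries must be unchanged and stay in the same rows.
    kept = filled[~was_nan].equals(original[~was_nan])
    # Every former NaN must now hold the fill value.
    replaced = (filled[was_nan] == value).all()
    return bool(kept and replaced)


result = df.groupby("A")["B"].transform(lambda x: x.fillna(9999))
print(only_nans_filled(df["B"], result, 9999))
```

The same helper could be pointed at the result of the string form to see whether it deviates on a given pandas version.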

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.0.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : None.None

pandas : 0.25.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0.post20200106
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

fujiaxiang · 2020-01-12T04:22:40Z

Tested on latest master.

>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__
'1.0.0rc0+12.g75ecfa448'
>>> df = pd.DataFrame(
...     {
...         'A': ['foo', 'foo', 'bar', 'bar', 'baz'],
...         'B': [1, 2, np.nan, 3, 3],
...     }
... )
>>> df
     A    B
0  foo  1.0
1  foo  2.0
2  bar  NaN
3  bar  3.0
4  baz  3.0
>>> df.groupby('A').transform("fillna", value=9)
     B
0  9.0
1  9.0
2  1.0
3  1.0
4  2.0

Looks like a bug in GroupBy._transform_fast. Will take a look.
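Because the fill value is a constant, the correct grouped result should not depend on the grouping at all: it should match a plain `fillna` on the column. A minimal sketch of that comparison, using the lambda form as the grouped side; this is an illustration of the expected behaviour, not the test added in the linked pull request.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "foo", "bar", "bar", "baz"],
        "B": [1, 2, np.nan, 3, 3],
    }
)

# Reference result: a constant fill ignores group membership.
expected = df["B"].fillna(9)

# Grouped result via the lambda form, which keeps the original row order.
via_lambda = df.groupby("A")["B"].transform(lambda s: s.fillna(9))

# Raises AssertionError if values, index, or dtype differ.
pd.testing.assert_series_equal(via_lambda, expected)
```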

fujiaxiang mentioned this issue Jan 17, 2020

BUG: groupby transform fillna produces wrong result #31101

Merged

jreback added the Groupby and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Jan 24, 2020

jreback added this to the 1.1 milestone Jan 24, 2020

fujiaxiang mentioned this issue Jan 24, 2020

ENH: allow 'pad', 'backfill' and 'cumcount' in groupby.transform #31269

Closed

WillAyd closed this as completed in #31101 Jan 24, 2020
