BUG: incorrect groupby().ffill() in pandas 0.23.0 #21207

adbull · 2018-05-25T16:43:44Z

Code Sample, a copy-pastable example if possible

Input:

import numpy as np
import pandas as pd

df2 = pd.DataFrame(dict(x=0, y=[np.nan]*9 + [1]*9))
print(df2.head())
print(df2.groupby('x').ffill().head())

Output:

   x   y
0  0 NaN
1  0 NaN
2  0 NaN
3  0 NaN
4  0 NaN
   x    y
0  0  NaN
1  0  1.0
2  0  1.0
3  0  1.0
4  0  1.0

Problem description

The new groupby().ffill() in pandas 0.23.0 (#19673) returns incorrect answers, and appears to be permuting the input before filling.

Expected Output

   x   y
0  0 NaN
1  0 NaN
2  0 NaN
3  0 NaN
4  0 NaN
   x   y
0  0 NaN
1  0 NaN
2  0 NaN
3  0 NaN
4  0 NaN
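
One way to cross-check the expected output is to compare the grouped fill with a per-group reference built from plain `Series.ffill`. This is only a sketch: selecting the `y` column and using `transform` with a lambda are choices made here for illustration (they avoid depending on whether the grouping column appears in the result), not something taken from the report.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(dict(x=0, y=[np.nan] * 9 + [1] * 9))

# Reference: forward-fill each group's 'y' on its own with Series.ffill.
expected = df2.groupby('x')['y'].transform(lambda s: s.ffill())

# Result under test: the grouped ffill that this report is about.
actual = df2.groupby('x')['y'].ffill()

# On a pandas version with the fix this should print True;
# on 0.23.0 the two are expected to differ.
print(actual.equals(expected))
```

The leading NaNs have nothing before them to fill from, so the reference keeps them as NaN.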

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.14-200.fc26.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.utf8
LOCALE: en_GB.UTF-8

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 4.2.1
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.2
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

WillAyd · 2018-05-25T21:14:12Z

That does seem odd. Did you notice any significance in choosing 16 elements? I tried the below construct and it worked fine:

In [30]: df = pd.DataFrame({'x': 0, 'y': [np.nan] * 8 + [1] * 8})
In [31]: df.groupby('x').ffill()

As did any count below 8. Am I looking at it wrong or did you notice the same behavior?

WillAyd · 2018-05-25T21:32:23Z

The problem stems from the below line:

pandas/pandas/_libs/groupby.pyx

Line 300 in f6abb61

sorted_labels = np.argsort(labels).astype(np.int64, copy=False)

The intention here is to sort the labels (here column 'x' provides the labels) so that you can iterate over each group's values consecutively, in order of appearance.

Printing after that statement, here's what it shows when there are 16 or fewer records:

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]

Here's what it prints for 18 elements (like your example):

[ 0 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1 16 17]

The latter being out of sequence is what's causing the issue here. Not sure why that happens just yet, but investigating further.
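
To see why the visiting order matters, here is a simplified stand-in for a group-wise forward-fill indexer. It is not the code in groupby.pyx; the function names `ffill_indexer` and `take_filled` are hypothetical, and the two orders are the 18-element permutation printed above and a plain in-order sequence.

```python
import numpy as np

def ffill_indexer(labels, is_missing, order):
    """For each row, give the position to copy from (-1 if none yet).

    Rows are visited in `order`; a simplified sketch, not pandas' code.
    """
    out = np.arange(len(labels))
    last_valid = -1
    for i, idx in enumerate(order):
        if is_missing[idx]:
            out[idx] = last_valid
        else:
            last_valid = idx
        # Forget the last valid row once the next visited row is in another group.
        if i == len(order) - 1 or labels[idx] != labels[order[i + 1]]:
            last_valid = -1
    return out

def take_filled(values, indexer):
    # -1 means there was nothing to fill from, so the result stays NaN.
    return np.where(indexer == -1, np.nan, values[indexer])

labels = np.zeros(18, dtype=np.int64)
y = np.array([np.nan] * 9 + [1.0] * 9)
is_missing = np.isnan(y)

order_reported = np.array([0, 15, 14, 13, 12, 11, 10, 9, 8,
                           7, 6, 5, 4, 3, 2, 1, 16, 17])
order_stable = np.arange(18)

bad = take_filled(y, ffill_indexer(labels, is_missing, order_reported))
good = take_filled(y, ffill_indexer(labels, is_missing, order_stable))

print(bad[:5])   # row 0 stays NaN, rows 1-4 become 1.0, as in the report
print(good[:5])  # all NaN, the expected output
```

With the permuted order, rows 15 down to 9 are visited before rows 8 down to 1, so the missing rows pick up a value that actually comes later in the frame.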

adbull · 2018-05-25T21:53:41Z

Sounds like the issue arises because np.argsort isn't stable by default. Using kind='mergesort' should fix it?
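
A minimal sketch of the difference, assuming the same single-group labels as in the report. The `mixed` array is a made-up second example to show that a stable sort keeps each group's rows in order of appearance.

```python
import numpy as np

labels = np.zeros(18, dtype=np.int64)  # one group, as in the report

# A stable sort keeps equal labels in their original relative order.
stable_order = np.argsort(labels, kind='mergesort')

# The default kind makes no such promise; the order of ties may vary
# with the NumPy version and the input size.
default_order = np.argsort(labels)

print(stable_order)
print(default_order)

# Hypothetical two-group labels: group 0 is rows 1, 3, 4; group 1 is rows 0, 2, 5.
mixed = np.array([1, 0, 1, 0, 0, 1])
mixed_order = np.argsort(mixed, kind='mergesort')
print(mixed_order)  # [1 3 4 0 2 5]
```

Both calls return a valid sort of the labels; only the stable one guarantees that ties come back in positional order, which is what a forward fill relies on.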

WillAyd · 2018-05-25T22:16:30Z

That's what I ended up doing in the PR referencing this

adbull changed the title ~~BUG: incorrect groupby().ffill() in pandas 0.23.0~~ BUG: incorrect groupby().ffill() in pandas 0.23.0 May 25, 2018

WillAyd added Groupby, Regression and removed Regression labels May 25, 2018

WillAyd added the Regression label May 25, 2018

WillAyd added the Bug label May 25, 2018

WillAyd mentioned this issue May 25, 2018

Stable Sorting Algorithm for Fillna Indexer #21212

Merged

4 tasks

jreback added this to the 0.23.1 milestone May 29, 2018

jreback closed this as completed in #21212 May 29, 2018
