DataFrame.reindex with specified fill method fails for MultiIndex #23693

aberres · 2018-11-14T13:29:12Z

Code Sample, a copy-pastable example if possible

import pandas as pd

i = pd.MultiIndex.from_tuples([('a', 'b'), ('d', 'e')])
df = pd.DataFrame([[0, 7], [3, 4]], index=i, columns=['x', 'y'])

print(df)
#      x  y
# a b  0  7
# d e  3  4

i2 = pd.MultiIndex.from_tuples([('a', 'b'), ('d', 'e'), ('h', 'i')])
# same behavior
#i2 = pd.MultiIndex.from_tuples([('a', 'b', 'c'), ('d', 'e', 'f'), ('h', 'i', 'j')])

print(df.reindex(i2, axis=0, method='ffill'))
#       x    y
# a b  3.0  4.0
# d e  NaN  NaN
# h i  0.0  7.0

Problem description

The reindexing operation above adds a row to the MultiIndex. When no fill method is specified, the new row is added and filled with NA, as expected.

When ffill is specified, I cannot explain the behavior. The index is updated as expected, but an NA row appears in the middle of the existing data.
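For comparison, here is how `method='ffill'` behaves on a flat, sorted index: each target label takes the value of the last source label that is less than or equal to it. This is a minimal sketch; the integer labels are made up for illustration and are not part of the example above.

```python
import pandas as pd

# Hypothetical flat index, used only to illustrate the fill semantics.
s = pd.Series([0, 3], index=[10, 20])

# Each target label picks up the value of the last source label <= it:
# 10 -> 0, 15 -> 0, 20 -> 3, 30 -> 3
flat_filled = s.reindex([10, 15, 20, 30], method='ffill')
print(flat_filled)
```

A MultiIndex would presumably be expected to follow the same rule, with tuples compared lexicographically.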

Expected Output

Not sure if ffill for MultiIndexes is designed like this, but I was hoping for:

#       x    y
# a b  3.0  4.0
# d e  0.0  7.0
# h i  0.0  7.0
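A possible way to avoid the `method=` code path entirely is to reindex with exact matching first and forward-fill afterwards. This is only a sketch, and it is not equivalent to `method='ffill'` in general: it fills in the row order of the target, so it assumes the target index is sorted, and it cannot see source rows that are absent from the target.

```python
import pandas as pd

i = pd.MultiIndex.from_tuples([('a', 'b'), ('d', 'e')])
df = pd.DataFrame([[0, 7], [3, 4]], index=i, columns=['x', 'y'])
i2 = pd.MultiIndex.from_tuples([('a', 'b'), ('d', 'e'), ('h', 'i')])

# Step 1: exact-match reindex; the new row ('h', 'i') comes back as NaN.
# Step 2: forward-fill down the rows, in the order given by i2.
two_step = df.reindex(i2).ffill()
print(two_step)
```

With this approach the existing rows keep their own values and the new row `('h', 'i')` takes the values of `('d', 'e')`, the row directly above it.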

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None

pandas: 0.23.4
pytest: 3.7.1
pip: 18.1
setuptools: 39.0.1
Cython: None
numpy: 1.15.0
scipy: None
pyarrow: 0.11.1
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: 1.6.1
bottleneck: 1.2.1
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.5
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.10
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None

aberres · 2018-11-14T13:59:05Z

Interesting:

When the target index has an additional level and level 0 is unchanged, things work:

i = pd.MultiIndex.from_tuples([('a',), ('d',)])
df = pd.DataFrame([[0, 7], [3, 4]], index=i, columns=['x', 'y'])

i2 = pd.MultiIndex.from_tuples([('a', 'b'), ('a', 't'), ('d', 'e'), ('d', 'f')])
df.reindex(i2, axis=0, method='ffill')

When level 0 is changed, wrong NA rows are again introduced:

i = pd.MultiIndex.from_tuples([('a', 'b'), ('d', 'e')])
df = pd.DataFrame([[0, 7], [3, 4]], index=i, columns=['x', 'y'])
i2 = pd.MultiIndex.from_tuples([('a', 'b'), ('a', 't'), ('d', 'e'), ('d', 'f')])
df.reindex(i2, axis=0, method='ffill')

gfyoung · 2018-11-14T18:47:47Z

cc @toobaz

eddy-geek · 2019-09-22T21:51:25Z

Here is another example of an almost identical issue that I ran into:

import pandas as pd
from io import StringIO
r_dfshort = StringIO('''
a,timestamp,foo
A,2018-12-12,12
A,2019-01-02,2.1
A,2019-01-04,4.1
B,2019-01-02,2.2
B,2019-01-04,4.2
''')
r_dflong = StringIO('''
a,timestamp,bar
A,2019-01-01,1
A,2019-01-02,2
A,2019-01-03,3
A,2019-01-04,4
B,2019-01-01,1
B,2019-01-02,2
B,2019-01-03,3
B,2019-01-04,4
''')
d_dfshort = pd.read_csv(r_dfshort).set_index(['a','timestamp'])
d_dflong = pd.read_csv(r_dflong).set_index(['a','timestamp'])
a_dfshort = d_dfshort.reindex(index=d_dflong.index, method='ffill')

print(a_dfshort)

Outputs:

              foo
a timestamp      
A 2019-01-01  nan     #right (although I'd prefer if it were "12")
  2019-01-02 2.10
  2019-01-03  nan     #wrong
  2019-01-04 4.10
B 2019-01-01 4.10     #right
  2019-01-02 2.20
  2019-01-03 4.10     #wrong
  2019-01-04 4.20

Note that "align" seems to work well as a workaround:

a_dfshort, _ = d_dfshort.align(d_dflong, axis='rows', method='ffill', join='right')
print(a_dfshort)

Output

              foo
a timestamp      
A 2019-01-01  nan
  2019-01-02 2.10
  2019-01-03 2.10
  2019-01-04 4.10
B 2019-01-01 4.10
  2019-01-02 2.20
  2019-01-03 2.20
  2019-01-04 4.20
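If the fill should stay within each value of `a` (so that `A 2019-01-01` picks up the `12` from `2018-12-12`, and nothing leaks from `A` into `B`), a per-group forward fill is another option. This is a sketch with different semantics from both `reindex(method='ffill')` and the `align` call above; the names `short`, `target`, `combined` and `per_group` are made up for this example.

```python
from io import StringIO

import pandas as pd

short = pd.read_csv(StringIO('''a,timestamp,foo
A,2018-12-12,12
A,2019-01-02,2.1
A,2019-01-04,4.1
B,2019-01-02,2.2
B,2019-01-04,4.2
''')).set_index(['a', 'timestamp'])

# Target index: every date for every group.
dates = ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04']
target = pd.MultiIndex.from_product([['A', 'B'], dates],
                                    names=['a', 'timestamp'])

# Reindex onto the union so that earlier observations such as
# ('A', '2018-12-12') are still available as a fill source.
combined = short.index.union(target)

per_group = (
    short.reindex(combined)    # exact match only, gaps become NaN
         .sort_index()         # make sure rows are in lexicographic order
         .groupby(level='a')   # fill inside each group separately
         .ffill()
         .reindex(target)      # keep only the rows that were asked for
)
print(per_group)
```

Here `B 2019-01-01` stays NaN because group `B` has no earlier observation, whereas the `align` output above fills it with `4.10` from group `A`.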

pd.show_versions:

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-1062.1.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 5.0.1
pip: 19.2.3
setuptools: 41.0.1
Cython: None
numpy: 1.16.3
scipy: 1.3.0
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.3
bs4: None
html5lib: None
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

eddy-geek · 2019-09-22T22:04:33Z

Code entry-point (and interesting commit to bisect): #9019

gfyoung added the Indexing and MultiIndex labels Nov 14, 2018

phofl mentioned this issue Nov 9, 2020: TST: Add test for filling new rows through reindexing MultiIndex #37726 (merged)

phofl added the Needs Tests label Nov 9, 2020

jreback added this to the 1.2 milestone Nov 11, 2020

jreback closed this as completed in #37726 Nov 13, 2020
