
DataFrame.reindex with specified fill method fails for MultiIndex #23693


Closed
aberres opened this issue Nov 14, 2018 · 4 comments · Fixed by #37726
Labels: Indexing (Related to indexing on series/frames, not to indexes themselves), MultiIndex, Needs Tests (Unit test(s) needed to prevent regressions)
Milestone: 1.2

Comments

@aberres
Contributor

aberres commented Nov 14, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

i = pd.MultiIndex.from_tuples([('a', 'b'), ('d', 'e')])
df = pd.DataFrame([[0, 7], [3, 4]], index=i, columns=['x', 'y'])

print(df)
#      x  y
# a b  0  7
# d e  3  4

i2 = pd.MultiIndex.from_tuples([('a', 'b'), ('d', 'e'), ('h', 'i')])
# same behavior
#i2 = pd.MultiIndex.from_tuples([('a', 'b', 'c'), ('d', 'e', 'f'), ('h', 'i', 'j')])

print(df.reindex(i2, axis=0, method='ffill'))
#       x    y
# a b  3.0  4.0
# d e  NaN  NaN
# h i  0.0  7.0

Problem description

The reindexing operation above adds one row to the MultiIndex. When no fill method is specified, the new row is added and filled with NA, as expected.

When ffill is specified, I cannot explain the behavior. The index is updated as expected, but an NA row appears in the middle of the existing data.

Expected Output

I am not sure whether ffill on a MultiIndex is designed to behave like this, but I was hoping for:

#       x    y
# a b  3.0  4.0
# d e  0.0  7.0
# h i  0.0  7.0
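For comparison, a forward fill over the sorted labels can be emulated without passing method= to reindex at all. The sketch below is only an illustration under stated assumptions: ffill_reindex is a hypothetical helper (not part of pandas), the index is assumed to be unique, and "forward fill" is taken to mean that each new label receives the values of the closest preceding label in lexicographic order. Under that reading the existing rows keep their own values and ('h', 'i') is filled from ('d', 'e'), which is not quite the table above.

```python
import pandas as pd


def ffill_reindex(df, target):
    """Hypothetical helper: forward-fill reindex without using reindex(method=...)."""
    # Union of the old and new labels, in lexicographic order.
    combined = df.index.union(target).sort_values()
    # Exact-match reindex: labels missing from df become all-NaN rows.
    widened = df.reindex(combined)
    # Fill each NaN from the closest preceding row, then keep only the target labels.
    return widened.ffill().reindex(target)


i = pd.MultiIndex.from_tuples([('a', 'b'), ('d', 'e')])
df = pd.DataFrame([[0, 7], [3, 4]], index=i, columns=['x', 'y'])
i2 = pd.MultiIndex.from_tuples([('a', 'b'), ('d', 'e'), ('h', 'i')])

result = ffill_reindex(df, i2)
print(result)
```

One caveat of this approach: ffill() also fills NaN values that were already present in the original data, so it is not a drop-in replacement when the frame contains genuine missing values.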

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.23.4
pytest: 3.7.1
pip: 18.1
setuptools: 39.0.1
Cython: None
numpy: 1.15.0
scipy: None
pyarrow: 0.11.1
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: 1.6.1
bottleneck: 1.2.1
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.5
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.10
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None

@aberres
Contributor Author

aberres commented Nov 14, 2018

Interesting:

When the target index has an additional level and level 0 is unchanged, things work:

i = pd.MultiIndex.from_tuples([('a',), ('d',)])
df = pd.DataFrame([[0, 7], [3, 4]], index=i, columns=['x', 'y'])

i2 = pd.MultiIndex.from_tuples([('a', 'b'), ('a', 't'), ('d', 'e'), ('d', 'f')])
df.reindex(i2, axis=0, method='ffill')

When level 0 is changed, wrong NA rows are introduced again:

i = pd.MultiIndex.from_tuples([('a', 'b'), ('d', 'e')])
df = pd.DataFrame([[0, 7], [3, 4]], index=i, columns=['x', 'y'])
i2 = pd.MultiIndex.from_tuples([('a', 'b'), ('a', 't'), ('d', 'e'), ('d', 'f')])
df.reindex(i2, axis=0, method='ffill')
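To narrow down where the wrong rows come from, it may help to look at the positions the source index reports for each target label. The sketch below assumes that reindex resolves labels through the index's get_indexer method (my understanding, not something confirmed in this thread); it prints which source label, if any, is selected for each target label. No expected output is given because the result depends on the pandas version.

```python
import pandas as pd

i = pd.MultiIndex.from_tuples([('a', 'b'), ('d', 'e')])
i2 = pd.MultiIndex.from_tuples([('a', 'b'), ('a', 't'), ('d', 'e'), ('d', 'f')])

# Position in `i` chosen for each label in `i2`; -1 means no source row was found.
positions = i.get_indexer(i2, method='pad')

for label, pos in zip(i2, positions):
    source = i[pos] if pos != -1 else None
    print(label, '<-', source)
```

If the printed pairs do not match the closest preceding label in sorted order, the problem is in the index lookup rather than in the DataFrame-level fill.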

@gfyoung gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Nov 14, 2018
@gfyoung
Member

gfyoung commented Nov 14, 2018

cc @toobaz

@eddy-geek

eddy-geek commented Sep 22, 2019

Here is another example of an almost identical issue that I stumbled onto:

from io import StringIO

import pandas as pd

r_dfshort = StringIO('''
a,timestamp,foo
A,2018-12-12,12
A,2019-01-02,2.1
A,2019-01-04,4.1
B,2019-01-02,2.2
B,2019-01-04,4.2
''')
r_dflong = StringIO('''
a,timestamp,bar
A,2019-01-01,1
A,2019-01-02,2
A,2019-01-03,3
A,2019-01-04,4
B,2019-01-01,1
B,2019-01-02,2
B,2019-01-03,3
B,2019-01-04,4
''')
d_dfshort = pd.read_csv(r_dfshort).set_index(['a','timestamp'])
d_dflong = pd.read_csv(r_dflong).set_index(['a','timestamp'])
a_dfshort = d_dfshort.reindex(index=d_dflong.index, method='ffill')

print(a_dfshort)

Outputs:

              foo
a timestamp      
A 2019-01-01  nan     #right (although I'd prefer if it were "12")
  2019-01-02 2.10
  2019-01-03  nan     #wrong
  2019-01-04 4.10
B 2019-01-01 4.10     #right
  2019-01-02 2.20
  2019-01-03 4.10     #wrong
  2019-01-04 4.20

Note that align seems to work well as a workaround:

a_dfshort, _ = d_dfshort.align(d_dflong, axis='rows', method='ffill', join='right')
print(a_dfshort)

Output

              foo
a timestamp      
A 2019-01-01  nan
  2019-01-02 2.10
  2019-01-03 2.10
  2019-01-04 4.10
B 2019-01-01 4.10
  2019-01-02 2.20
  2019-01-03 2.20
  2019-01-04 4.20
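If the intent is to forward-fill within each value of level a only, so that nothing leaks from A into B, a per-group fill is another option. This is a sketch of a different semantic from the lexicographic fill shown above, not a description of what reindex or align do: it widens the frame to the union of both indexes, fills within each group, and then selects the target labels. The names short, long_index, and per_group are made up for this illustration.

```python
from io import StringIO

import pandas as pd

short = pd.read_csv(StringIO("""a,timestamp,foo
A,2018-12-12,12
A,2019-01-02,2.1
A,2019-01-04,4.1
B,2019-01-02,2.2
B,2019-01-04,4.2
""")).set_index(['a', 'timestamp'])

# Same labels as d_dflong.index in the example above.
long_index = pd.MultiIndex.from_product(
    [['A', 'B'], ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04']],
    names=['a', 'timestamp'],
)

# Widen to the union first so that (A, 2018-12-12), which is not in the
# target, can still seed the fill for (A, 2019-01-01).
combined = short.index.union(long_index).sort_values()
per_group = (
    short.reindex(combined)   # exact-match reindex, new rows are NaN
    .groupby(level=0)         # group by the first index level ('a')
    .ffill()                  # forward-fill inside each group only
    .reindex(long_index)      # keep just the target labels
)
print(per_group)
```

With this approach (A, 2019-01-01) is filled with 12 from the earlier A row, while (B, 2019-01-01) stays NaN because no earlier B row exists.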

Output of pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-1062.1.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 5.0.1
pip: 19.2.3
setuptools: 41.0.1
Cython: None
numpy: 1.16.3
scipy: 1.3.0
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.3
bs4: None
html5lib: None
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@eddy-geek

Code entry-point (and interesting commit to bisect): #9019

@phofl phofl added the Needs Tests Unit test(s) needed to prevent regressions label Nov 9, 2020
@jreback jreback added this to the 1.2 milestone Nov 11, 2020
5 participants