
How should DataFrame.append behave related to indexes types? #22957

Closed
lowerthansound opened this issue Oct 3, 2018 · 8 comments · Fixed by #44539
Labels
good first issue · Needs Tests (unit test(s) needed to prevent regressions)
Milestone

Comments

@lowerthansound

lowerthansound commented Oct 3, 2018

I'm currently working on refactoring the code of DataFrame.append (#22915). One question that came up was: what should the behavior be when appending DataFrames with different index types?

To simplify this, let's focus on the index itself and assume the discussion applies equally to rows and columns (IMHO, consistency between the two would be good).

Code Sample

>>> df1 = pd.DataFrame([[1, 2, 3]], columns=index1)
>>> df2 = pd.DataFrame([[4, 5, 6]], columns=index2)
>>> # what should happen in the next line when index1 and
>>> # index2 are of different types?
>>> result = df1.append(df2)
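
To make this concrete, here is one arbitrary pair of index types (the choice of Int64Index and DatetimeIndex is just for illustration):

>>> import pandas as pd
>>> index1 = pd.Int64Index([10, 20, 30])
>>> index2 = pd.DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'])
>>> df1 = pd.DataFrame([[1, 2, 3]], columns=index1)
>>> df2 = pd.DataFrame([[4, 5, 6]], columns=index2)
>>> # should this upcast the columns to object, raise, or something else?
>>> result = df1.append(df2)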

Current Behavior

Index 0 (rows)

All types of indexes work together, except for:

  • CategoricalIndex + {another}

When the types don't match, the result is cast to object.
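
For illustration, a minimal sketch of that object cast on axis 0 (the index types are chosen arbitrarily):

>>> df1 = pd.DataFrame({'x': [1]}, index=pd.Int64Index([0]))
>>> df2 = pd.DataFrame({'x': [2]}, index=pd.Index(['a']))
>>> # mismatched row index types fall back to an object-dtype index
>>> df1.append(df2).index
Index([0, 'a'], dtype='object')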

There's also a bug that happens when appending MultiIndex + DatetimeIndex. I will look into it in more detail later and open a separate issue.

Index 1 (columns)

All types of indexes work together, except for:

  • IntervalIndex + {another}
  • MultiIndex + {another}
  • {another} + PeriodIndex
  • {another} + IntervalIndex
  • {another} + MultiIndex
  • DatetimeIndex + {another} (works for df.append(series))
  • TimedeltaIndex + {numeric} (works for df.append(series))
  • PeriodIndex + {another} (works for df.append(series))

The exceptions raised here vary widely and are sometimes very cryptic.

For example:

>>> df1 = pd.DataFrame([[1]], columns=['A'])
>>> df2 = pd.DataFrame([[2]], columns=pd.PeriodIndex([2000], freq='A'))
>>> df1.append(df2)
Traceback (most recent call last):
  File "<ipython-input-15-8ab0723181fb>", line 1, in <module>
    df1.append(df2)
  File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/frame.py", line 6211, in append
    sort=sort)
  File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 226, in concat
    return op.get_result()
  File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 423, in get_result
    copy=self.copy)
  File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/internals.py", line 5416, in concatenate_block_managers
    elif is_uniform_join_units(join_units):
  File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/internals.py", line 5440, in is_uniform_join_units
    all(not ju.is_na or ju.block.is_extension for ju in join_units) and
  File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/internals.py", line 5440, in <genexpr>
    all(not ju.is_na or ju.block.is_extension for ju in join_units) and
AttributeError: 'NoneType' object has no attribute 'is_extension'

Suggestion

Since index 0 already allows almost all types of indexes to be concatenated, we don't want to restrict this behavior. My suggestion is to allow any combination of index types to be appended together (and possibly raise a RuntimeWarning when a cast is necessary).

For example, joining two indexes of the same type should produce the same type (as is the current behavior):

>>> df1 = pd.DataFrame([[1]], columns=pd.Int64Index([1]))
>>> df2 = pd.DataFrame([[2]], columns=pd.Int64Index([2]))
>>> result = df1.append(df2)
>>> result
     1    2
0  1.0  NaN
0  NaN  2.0
>>> result.columns
Int64Index([1, 2], dtype='int64')

Joining two indexes of different types usually upcasts to object (and only when a cast is actually necessary):

>>> df1 = pd.DataFrame([[1]], columns=pd.Int64Index([1]))
>>> df2 = pd.DataFrame([[2]], columns=pd.interval_range(0, 1))
>>> result = df1.append(df2)
>>> result
     1  (0, 1]
0  1.0     NaN
0  NaN     2.0
>>> result.columns
Index([1, (0, 1]], dtype='object')

Empty indexes

I believe that empty indexes (e.g. those created in a DataFrame with no rows or columns) should be ignored when calculating the final dtype of an index merge. This already seems to be the current behavior:

>>> df1 = pd.DataFrame()
>>> df2 = pd.DataFrame(index=[1])
>>> df1.index
Index([], dtype='object')
>>> df2.index
Int64Index([1], dtype='int64')
>>> df1.append(df2).index
Int64Index([1], dtype='int64')

However, when an empty index has a dtype different from object, we may want to preserve it (as it may have been created explicitly by the user).

>>> # current behavior
>>> df1 = pd.DataFrame(index=pd.Float64Index([]))
>>> df2 = pd.DataFrame(index=[1])
>>> df1.append(df2).index
Int64Index([1], dtype='int64')
>>>
>>> # suggested behavior
>>> df1.append(df2).index
Float64Index([1.0], dtype='float64')

Sorry if this was too long. This is my first contribution to open source and I haven't quite gotten the hang of how things work yet. Any suggestions, whether related to the issue or meta, are welcome!

@TomAugspurger
Contributor

I think the rule is that appending any two different types of indexes ends up with object dtype.

The specific example you showed, Index and PeriodIndex, is a definite bug.

@pambot
Contributor

pambot commented Oct 3, 2018

I don't know much about the internals, but digging into the bug, this is the code being run:

def is_uniform_join_units(join_units):
    return (
        # all blocks need to have the same type
        all(type(ju.block) is type(join_units[0].block) for ju in join_units) and  # noqa
        # no blocks that would get missing values (can lead to type upcasts)
        # unless we're an extension dtype.
        all(not ju.is_na or ju.block.is_extension for ju in join_units) and
        # no blocks with indexers (as then the dimensions do not fit)
        all(not ju.indexers for ju in join_units) and
        # disregard Panels
        all(ju.block.ndim <= 2 for ju in join_units) and
        # only use this path when there is something to concatenate
        len(join_units) > 1)

Where,

In[3]: join_units
Out[3]: 
[JoinUnit(IntBlock: slice(0, 1, 1), 1 x 1, dtype: int64, {}),
 JoinUnit(None, {})]

So it looks like PeriodIndex is not returning the right kind of Block, just None, because indexers is giving it an index of -1. Does any of this ring a bell? It seems to me like PeriodIndex should map to a DatetimeBlock or something?

@TomAugspurger
Contributor

FYI, I suspect that all the PeriodIndex ones will be solved by #22862 (which will be in 0.24)

@lowerthansound
Author

Thanks, will take that into account (:

@jreback
Contributor

jreback commented Oct 9, 2018

so this is tested independently (though I suspect it's not fully tested for a cartesian product of index types)

look at Index.append (and subclasses of Index)
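
for reference, the same question can be exercised directly on the indexes (a rough sketch; whether this raises or falls back to object dtype is exactly what's at issue):

>>> idx1 = pd.Index(['A'])
>>> idx2 = pd.PeriodIndex([2000], freq='A')
>>> # the same casting decision, one level below DataFrame.append
>>> idx1.append(idx2)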

@lowerthansound
Author

lowerthansound commented Oct 10, 2018

Will take a look :). I also believe that some of the errors are raised from functions like Index.union and Index.difference.
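
For example (a rough illustration, not the exact call chain inside concat; whether these raise or upcast depends on the pandas version):

>>> left = pd.Int64Index([1])
>>> right = pd.interval_range(0, 1)
>>> # column alignment ends up going through set operations like these
>>> left.union(right)
>>> left.difference(right)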

@mroeschke added the Bug, Indexing (related to indexing on series/frames, not to indexes themselves) and Period (Period data type) labels on Jan 13, 2019
@CDBridger

CDBridger commented Feb 10, 2019

So if I'm reading this right, the reason I'm getting that exact error from the first post when running append is that I have columns between my two frames that have the same name but contain different types of data?

EDIT: Is there an easy way to find out what the offending columns are? I tried casting every column in my two dataframes to strings (resulting in an object dtype), but that didn't fix it. My use case is two separate CSVs with intersecting columns as well as unique columns. I want to append them by stacking them into one CSV, retaining the unique columns from each one and filling them with None/NaN/Null for the rows from the opposing CSV.

e.g

df1:

  A B C D
0 w x y z

df2:

  C D E
0 i j k

merged df:

  A B C D E
0 w x y z  -
1 - - i j k

Except in my case there are obviously many more shared and unique columns (~15 shared columns, 10 unique columns each, for a total of about 35 columns in the merged CSV/dataframe).
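
For what it's worth, a minimal sketch of that use case (the column names and output path are made up):

>>> df1 = pd.DataFrame([['w', 'x', 'y', 'z']], columns=list('ABCD'))
>>> df2 = pd.DataFrame([['i', 'j', 'k']], columns=list('CDE'))
>>> # shared columns (C, D) line up; columns unique to either frame are
>>> # kept and filled with NaN for the rows that lack them
>>> merged = df1.append(df2, ignore_index=True, sort=False)
>>> merged.to_csv('merged.csv', index=False)  # hypothetical output file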

@mroeschke
Member

The example with Index and PeriodIndex looks to work on master. Could use a test:

In [2]: df1 = pd.DataFrame([[1]], columns=['A'])
   ...: df2 = pd.DataFrame([[2]], columns=pd.PeriodIndex([2000], freq='A'))
   ...: df1.append(df2)
Out[2]:
     A  2000
0  1.0   NaN
0  NaN   2.0
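
A regression test along these lines might be enough to close this (a rough sketch; the test name is made up, the expected columns assume the object-dtype result shown above, and the final location should follow the existing append tests):

import pandas as pd
import pandas._testing as tm


def test_append_mixed_string_and_period_columns():
    # GH 22957: appending a frame with PeriodIndex columns to one with
    # string columns used to raise an AttributeError from the internals
    df1 = pd.DataFrame([[1]], columns=['A'])
    df2 = pd.DataFrame([[2]], columns=pd.PeriodIndex([2000], freq='A'))

    result = df1.append(df2)

    # the combined columns should be a single object-dtype Index
    expected_columns = pd.Index(['A', pd.Period('2000', freq='A')], dtype=object)
    tm.assert_index_equal(result.columns, expected_columns)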

@mroeschke added the good first issue and Needs Tests (unit test(s) needed to prevent regressions) labels and removed the Bug, Indexing and Period labels on Jun 23, 2021
@jreback added this to the 1.4 milestone on Dec 27, 2021
6 participants