
How should DataFrame.append behave related to indexes types? #22957

Closed
lowerthansound opened this issue Oct 3, 2018 · 8 comments · Fixed by #44539
Labels
good first issue · Needs Tests (unit test(s) needed to prevent regressions)
Milestone

Comments

@lowerthansound

lowerthansound commented Oct 3, 2018

I'm currently working on refactoring the code of DataFrame.append (#22915). One question that came up was: what should the behavior be when appending DataFrames with different index types?

To simplify this, let's focus on the index itself and assume the discussion applies equally to rows and columns (IMHO, consistency between the two would be good).

Code Sample

>>> df1 = pd.DataFrame([[1, 2, 3]], columns=index1)
>>> df2 = pd.DataFrame([[4, 5, 6]], columns=index2)
>>> # what should happen in the next line when index1 and
>>> # index2 are of different types?
>>> result = df1.append(df2)
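
To make this concrete, here is one arbitrary pair of index types (the choice of Int64Index and DatetimeIndex is just for illustration):

>>> import pandas as pd
>>> index1 = pd.Int64Index([10, 20, 30])
>>> index2 = pd.DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'])
>>> df1 = pd.DataFrame([[1, 2, 3]], columns=index1)
>>> df2 = pd.DataFrame([[4, 5, 6]], columns=index2)
>>> # should this upcast the columns to object, raise, or something else?
>>> result = df1.append(df2)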

Current Behavior

Index 0 (rows)

All types of indexes work together, except for:

  • CategoricalIndex + {another}

When the types don't match, the result is cast to object.
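
For illustration, a minimal sketch of that object cast on axis 0 (the index types are chosen arbitrarily):

>>> df1 = pd.DataFrame({'x': [1]}, index=pd.Int64Index([0]))
>>> df2 = pd.DataFrame({'x': [2]}, index=pd.Index(['a']))
>>> # mismatched row index types fall back to an object-dtype index
>>> df1.append(df2).index
Index([0, 'a'], dtype='object')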

There's also a bug that happens when appending MultiIndex + DatetimeIndex. I will look into it in more detail later and open a separate issue.

Index 1 (columns)

All types of indexes work together, except for:

  • IntervalIndex + {another}
  • MultiIndex + {another}
  • {another} + PeriodIndex
  • {another} + IntervalIndex
  • {another} + MultiIndex
  • DatetimeIndex + {another} (works for df.append(series))
  • TimedeltaIndex + {numeric} (works for df.append(series))
  • PeriodIndex + {another} (works for df.append(series))

The exceptions raised here vary widely and are sometimes very cryptic.

For example:

>>> df1 = pd.DataFrame([[1]], columns=['A'])
>>> df2 = pd.DataFrame([[2]], columns=pd.PeriodIndex([2000], freq='A'))
>>> df1.append(df2)
Traceback (most recent call last):
  File "<ipython-input-15-8ab0723181fb>", line 1, in <module>
    df1.append(df2)
  File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/frame.py", line 6211, in append
    sort=sort)
  File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 226, in concat
    return op.get_result()
  File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 423, in get_result
    copy=self.copy)
  File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/internals.py", line 5416, in concatenate_block_managers
    elif is_uniform_join_units(join_units):
  File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/internals.py", line 5440, in is_uniform_join_units
    all(not ju.is_na or ju.block.is_extension for ju in join_units) and
  File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/internals.py", line 5440, in <genexpr>
    all(not ju.is_na or ju.block.is_extension for ju in join_units) and
AttributeError: 'NoneType' object has no attribute 'is_extension'

Suggestion

Since index 0 already allows almost all types of indexes to be concatenated, we don't want to restrict this behavior. My suggestion is to allow any combination of index types to be appended together (and possibly raise a RuntimeWarning when a cast is necessary).

For example, joining two indexes of the same type should produce the same type (as is the current behavior):

>>> df1 = pd.DataFrame([[1]], columns=pd.Int64Index([1]))
>>> df2 = pd.DataFrame([[2]], columns=pd.Int64Index([2]))
>>> result = df1.append(df2)
>>> result
     1    2
0  1.0  NaN
0  NaN  2.0
>>> result.columns
Int64Index([1, 2], dtype='int64')

Joining two indexes of different types usually upcasts to object (and only when a cast is actually necessary):

>>> df1 = pd.DataFrame([[1]], columns=pd.Int64Index([1]))
>>> df2 = pd.DataFrame([[2]], columns=pd.interval_range(0, 1))
>>> result = df1.append(df2)
>>> result
     1  (0, 1]
0  1.0     NaN
0  NaN     2.0
>>> result.columns
Index([1, (0, 1]], dtype='object')

Empty indexes

I believe that empty indexes (e.g. those created in a DataFrame with no rows or columns) should be ignored when calculating the final dtype of an index merge. This already seems to be the current behavior:

>>> df1 = pd.DataFrame()
>>> df2 = pd.DataFrame(index=[1])
>>> df1.index
Index([], dtype='object')
>>> df2.index
Int64Index([1], dtype='int64')
>>> df1.append(df2).index
Int64Index([1], dtype='int64')

However, when an empty index has a dtype different from object, we may want to preserve it (as it may have been created explicitly by the user).

>>> # current behavior
>>> df1 = pd.DataFrame(index=pd.Float64Index([]))
>>> df2 = pd.DataFrame(index=[1])
>>> df1.append(df2).index
Int64Index([1], dtype='int64')
>>>
>>> # suggested behavior
>>> df1.append(df2).index
Float64Index([1.0], dtype='float64')

Sorry if this was too long. This is my first contribution to open source and I haven't quite gotten the hang of how things work yet. Any suggestions, whether related to the issue or meta, are welcome!

@TomAugspurger
Contributor

I think the rule is that appending any two different types of indexes ends up with object dtype.

The specific example you showed, Index and PeriodIndex, is a definite bug.

@pambot
Contributor

pambot commented Oct 3, 2018

I don't know much about the internals, but digging into the bug, this is the code being run:

def is_uniform_join_units(join_units):
    return (
        # all blocks need to have the same type
        all(type(ju.block) is type(join_units[0].block) for ju in join_units) and  # noqa
        # no blocks that would get missing values (can lead to type upcasts)
        # unless we're an extension dtype.
        all(not ju.is_na or ju.block.is_extension for ju in join_units) and
        # no blocks with indexers (as then the dimensions do not fit)
        all(not ju.indexers for ju in join_units) and
        # disregard Panels
        all(ju.block.ndim <= 2 for ju in join_units) and
        # only use this path when there is something to concatenate
        len(join_units) > 1)

Where,

In[3]: join_units
Out[3]: 
[JoinUnit(IntBlock: slice(0, 1, 1), 1 x 1, dtype: int64, {}),
 JoinUnit(None, {})]

So it looks like PeriodIndex is not returning the right kind of Block, just None, because indexers is giving it an index of -1. Does any of this ring a bell? It seems to me like PeriodIndex should map to a DatetimeBlock or something?

@TomAugspurger
Contributor

FYI, I suspect that all the PeriodIndex ones will be solved by #22862 (which will be in 0.24)

@lowerthansound
Author

Thanks, will take that into account (:

@jreback
Contributor

jreback commented Oct 9, 2018

so this is tested independently (though I suspect it's not fully tested for a cartesian product of index types)

look at Index.append (and subclasses of Index)
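
for reference, the same question can be exercised directly on the indexes (a rough sketch; whether this raises or falls back to object dtype is exactly what's at issue):

>>> idx1 = pd.Index(['A'])
>>> idx2 = pd.PeriodIndex([2000], freq='A')
>>> # the same casting decision, one level below DataFrame.append
>>> idx1.append(idx2)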

@lowerthansound
Author

lowerthansound commented Oct 10, 2018

Will take a look :). I also believe that some of the errors are raised from functions like Index.union and Index.difference.
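
For example (a rough illustration, not the exact call chain inside concat; whether these raise or upcast depends on the pandas version):

>>> left = pd.Int64Index([1])
>>> right = pd.interval_range(0, 1)
>>> # column alignment ends up going through set operations like these
>>> left.union(right)
>>> left.difference(right)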

@mroeschke added the Bug, Indexing (related to indexing on series/frames, not to indexes themselves) and Period (Period data type) labels on Jan 13, 2019
@CDBridger

CDBridger commented Feb 10, 2019

So if I'm reading this right, the reason I'm getting that exact error from the first post when running append is that I have columns between my two frames that have the same name but contain different types of data?

EDIT: Is there an easy way to find out what the offending columns are? I tried casting every column in my two dataframes to strings (resulting in an object dtype), but that didn't fix it. My use case is two separate CSVs with intersecting columns as well as unique columns. I want to append them by stacking them into one CSV, retaining the unique columns from each one and filling them with None/NaN/Null for the rows from the opposing CSV.

e.g

df1:

  A B C D
0 w x y z

df2:

  C D E
0 i j k

merged df:

  A B C D E
0 w x y z  -
1 - - i j k

Except in my case there are obviously many more shared and unique columns (~15 shared columns, 10 unique columns each, for a total of about 35 columns in the merged CSV/dataframe).
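
For what it's worth, a minimal sketch of that use case (the column names and output path are made up):

>>> df1 = pd.DataFrame([['w', 'x', 'y', 'z']], columns=list('ABCD'))
>>> df2 = pd.DataFrame([['i', 'j', 'k']], columns=list('CDE'))
>>> # shared columns (C, D) line up; columns unique to either frame are
>>> # kept and filled with NaN for the rows that lack them
>>> merged = df1.append(df2, ignore_index=True, sort=False)
>>> merged.to_csv('merged.csv', index=False)  # hypothetical output file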

@mroeschke
Member

The example with Index and PeriodIndex looks to work on master. Could use a test:

In [2]: df1 = pd.DataFrame([[1]], columns=['A'])
   ...: df2 = pd.DataFrame([[2]], columns=pd.PeriodIndex([2000], freq='A'))
   ...: df1.append(df2)
Out[2]:
     A  2000
0  1.0   NaN
0  NaN   2.0
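
A regression test along these lines might be enough to close this (a rough sketch; the test name is made up, the expected columns assume the object-dtype result shown above, and the final location should follow the existing append tests):

import pandas as pd
import pandas._testing as tm


def test_append_mixed_string_and_period_columns():
    # GH 22957: appending a frame with PeriodIndex columns to one with
    # string columns used to raise an AttributeError from the internals
    df1 = pd.DataFrame([[1]], columns=['A'])
    df2 = pd.DataFrame([[2]], columns=pd.PeriodIndex([2000], freq='A'))

    result = df1.append(df2)

    # the combined columns should be a single object-dtype Index
    expected_columns = pd.Index(['A', pd.Period('2000', freq='A')], dtype=object)
    tm.assert_index_equal(result.columns, expected_columns)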

@mroeschke added the good first issue and Needs Tests (unit test(s) needed to prevent regressions) labels and removed the Bug, Indexing and Period labels on Jun 23, 2021
@jreback added this to the 1.4 milestone on Dec 27, 2021
6 participants