
df.append should retain columns type if same type #18359


Closed
topper-123 opened this issue Nov 18, 2017 · 7 comments · Fixed by #19021
Labels: Compat, Dtype Conversions, Reshaping

Comments

@topper-123
Contributor

topper-123 commented Nov 18, 2017

Currently, df.append loses the type of the columns index when the columns are a CategoricalIndex:

>>> import pandas as pd
>>> idx = pd.CategoricalIndex('a b'.split())
>>> df = pd.DataFrame([[1, 2]], columns=idx)
>>> ser = pd.Series([3, 4], index=idx, name=1)
>>> df.append(ser).columns
Index(['a', 'b'], dtype='object')

df.append(ser).columns should return a CategoricalIndex equal to idx.

pandas 0.21 has the new CategoricalDtype, so it's now easy to compare CategoricalIndex instances for strict type equality. Hence this issue should be much easier to solve than previously.
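To illustrate what a strict type comparison could look like, here is a minimal sketch. The helper name same_index_kind is hypothetical and not part of pandas; it only assumes that comparing the index classes and their dtype attributes is what "strict type equality" means here.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def same_index_kind(left: pd.Index, right: pd.Index) -> bool:
    """Hypothetical helper: True when both indexes share class and dtype."""
    return type(left) is type(right) and left.dtype == right.dtype


abc = CategoricalDtype(["A", "B", "C"])
c1 = pd.CategoricalIndex(["A", "B"], dtype=abc)
c2 = pd.CategoricalIndex(["B", "C"], dtype=abc)
# Same class and same categories, even though the values differ.
print(same_index_kind(c1, c2))

# Categories inferred from the values (['A', 'B']) do not match abc.
c3 = pd.CategoricalIndex(["A", "B"])
print(same_index_kind(c1, c3))

# A plain Index is a different class altogether.
print(same_index_kind(c1, pd.Index(["A", "B"])))
```

Two categorical indexes only count as the same kind when their categories (and orderedness) match, not merely when both are categorical.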

Solution proposal

In frame.py::DataFrame.append there is this line:

combined_columns = self.columns.tolist() + self.columns.union(
                    other.index).difference(self.columns).tolist()

This line converts CategoricalIndex columns to a plain Index, because going through .tolist() discards the index type. By checking types and dtypes first, it should be easy to return the correct index. The line above could become something like this instead:

same_types = type(self.columns) is type(other.index)
same_dtypes = self.columns.dtype == other.index.dtype
if same_types and same_dtypes:
    combined_columns = self.columns.union(other.index)
else:
    combined_columns = self.columns.tolist() + self.columns.union(
        other.index).difference(self.columns).tolist()

With that change, I think this issue can be solved (I haven't checked all the details yet, so some adjustments may be needed). I'd appreciate comments on whether this approach is OK.
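The proposed check can also be tried outside of DataFrame.append. The following is only a sketch under assumptions: combine_columns is a hypothetical standalone function, not the code that was eventually merged, and it takes the two indexes as plain arguments instead of reading self.columns and other.index.

```python
import pandas as pd


def combine_columns(columns: pd.Index, other_index: pd.Index) -> pd.Index:
    """Hypothetical standalone version of the type/dtype check."""
    same_types = type(columns) is type(other_index)
    same_dtypes = columns.dtype == other_index.dtype
    if same_types and same_dtypes:
        # Stay with Index operations so the class and dtype can survive.
        return columns.union(other_index)
    # Fallback that mirrors the original list-based line.
    extra = columns.union(other_index).difference(columns)
    return pd.Index(columns.tolist() + extra.tolist())


idx = pd.CategoricalIndex("a b".split())
combined = combine_columns(idx, idx)
print(type(combined).__name__)

# Mixing a CategoricalIndex with a plain Index takes the fallback branch.
mixed = combine_columns(idx, pd.Index(["b", "c"]))
print(mixed.tolist())
```

One caveat with this sketch: union may sort the labels, so the column order from the first branch is not guaranteed to match the order the list-based line produces.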

topper-123 changed the title from "df.append should retain columns type" to "df.append should retain columns type if same type" on Nov 18, 2017
@jreback
Contributor

jreback commented Nov 19, 2017

Yeah, this is kind of messy. This should all use Index.append; then none of this is an issue. We shouldn't be using .tolist() at all.

jreback added the Compat, Dtype Conversions, and Reshaping labels on Nov 19, 2017
jreback added this to the Next Major Release milestone on Nov 19, 2017
@topper-123
Contributor Author

Hi, I've started to look into this.

ATM it seems that .union is actually very robust, and I'm leaning towards simply using combined_columns = self.columns.union(other.index). But I wonder why you pointed to Index.append. Did you mean Index.union?

@jreback
Contributor

jreback commented Nov 28, 2017

No, I meant append; you need to append the union of differences (I think this is the symmetric_difference).

jreback closed this as completed Nov 28, 2017
jreback reopened this Nov 28, 2017
@topper-123
Contributor Author

topper-123 commented Nov 29, 2017

symmetric_difference doesn't work:

>>> d = pd.api.types.CategoricalDtype('A B C'.split())
>>> c1 = pd.CategoricalIndex('A B'.split(), dtype=d)
>>> c2 = pd.CategoricalIndex('B C'.split(), dtype=d)
>>> c1.symmetric_difference(c2)
Index(['A', 'C'], dtype='object')  # note the index type; the values are also not suitable for appending

Using just difference works:

>>> c2.append(c1.difference(c2))
CategoricalIndex(['B', 'C', 'A'], categories=['A', 'B', 'C'], ordered=False, dtype='category')

This gives the same result as union (in this case, and maybe generally):

>>> c2.union(c1)
CategoricalIndex(['B', 'C', 'A'], categories=['A', 'B', 'C'], ordered=False, dtype='category')
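The comparison above can be wrapped into a small reusable check. This is a sketch only: append_new_labels is a hypothetical helper name, and comparing the two results as sets is an assumption made here because union may order the labels differently depending on the pandas version.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def append_new_labels(columns: pd.Index, other_index: pd.Index) -> pd.Index:
    """Hypothetical helper: keep existing labels, then append the unseen ones."""
    new_labels = other_index.difference(columns)
    return columns.append(new_labels)


d = CategoricalDtype("A B C".split())
c1 = pd.CategoricalIndex("A B".split(), dtype=d)
c2 = pd.CategoricalIndex("B C".split(), dtype=d)

appended = append_new_labels(c2, c1)
unioned = c2.union(c1)
print(appended)
# Compare labels only; the ordering of the union result is not relied upon.
print(set(appended) == set(unioned))
```

The append-based version keeps the existing labels in their original positions, which is the property that matters for the columns of a DataFrame.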

jreback modified the milestones: Next Major Release, 0.23.0 Jan 14, 2018
jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018
@avnishbm

This issue still exists with DataFrame.append: it changes the index type to pandas.core.indexes.base.Index, even though both of the DataFrames being combined have an index of type pandas.core.indexes.datetimes.DatetimeIndex! Can this be looked into, please?

I have also tried using inplace=True for the append call, but it did not help!

@jreback
Contributor

jreback commented Mar 12, 2021

@avnishbm this is a very old issue.

If you think you have a bug, then please show a reproducible example in a new issue, tested against the latest released version.

@avnishbm

@jreback I have raised a new issue providing the details: #40435
