-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: DataFrame.append preserves columns dtype if possible #19021
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: DataFrame.append preserves columns dtype if possible #19021
Conversation
72214ca
to
6bb6860
Compare
Can't get Travis-ci to complete because it uses too much time, somehow. I assume this has nothing to do with this PR, but is about travis settings? |
rebase and push again, fixed some hanging by Travis CI |
6bb6860
to
9cfa3d3
Compare
Codecov Report
@@ Coverage Diff @@
## master #19021 +/- ##
==========================================
- Coverage 91.84% 91.82% -0.03%
==========================================
Files 153 153
Lines 49295 49299 +4
==========================================
- Hits 45276 45268 -8
- Misses 4019 4031 +12
Continue to review full report at Codecov.
|
Ok, rebased. |
idx_diff = other.index.difference(self.columns) | ||
try: | ||
combined_columns = self.columns.append(idx_diff) | ||
except TypeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this should not be done here, rather if a TypeError happens, then Index.append
handles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand you here, could you expand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In [1]: pd.Index([1, 2, 3]).append(pd.Index(list('abc')))
Out[1]: Index([1, 2, 3, 'a', 'b', 'c'], dtype='object')
In [2]: pd.Index([1, 2, 3]).append(pd.CategoricalIndex(list('abc')))
Out[2]: Index([1, 2, 3, 'a', 'b', 'c'], dtype='object')
In [3]: pd.Index([1, 2, 3]).append(pd.MultiIndex.from_product([[1], ['a']]))
Out[3]: Index([1, 2, 3, (1, 'a')], dtype='object')
In [4]: pd.Index([1, 2, 3]).append(pd.IntervalIndex.from_breaks([1, 2, 3]))
Out[4]: Index([1, 2, 3, (1, 2], (2, 3]], dtype='object')
.append`` should never raise
so you are catching something that does not happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The below fails, though:
>>> pd.CategoricalIndex('A B C'.split()).append(pd.Index([1])
TypeError: cannot append a non-category item to a CategoricalIndex
This is what is tested in the if not isinstance(df_columns, (pd.IntervalIndex, pd.MultiIndex)):
part of the tests. In addition, the reason why pd.IntervalIndex and pd.MultiIndex are excluded for testing there is because of reindexing issues:
>>> ii = pd.IntervalIndex.from_breaks([0,1,2])
>>> df = pd.DataFrame([[1, 2], [4, 5]], columns=ii)
>>> ser = pd.Series([7], index=['a'])
>>> combined = pd.Index(ii.tolist() + ser.index.tolist())
>>> df.reindex(columns=combined)
TypeError: unorderable types: Interval() > int()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, the MI case we did on purpose. However you cannot convert to lists and expect things to work. for the TypeError, do .astype(object)
on each left and right then .append
again. we never should coerce to lists.
pandas/tests/reshape/test_concat.py
Outdated
def test_append_same_columns_type(self, df_columns): | ||
# GH18359 | ||
|
||
# ser.index is a normal pd.Index, result from df.append(ser) should be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this as a separate test (the if part), though I suspect you are actually hiding a bug (see above).
pandas/tests/reshape/test_concat.py
Outdated
def test_append_dtype_coerce(self): | ||
|
||
# GH 4993 | ||
# appending with datetime will incorrectly convert datetime64 | ||
import datetime as dt | ||
from pandas import NaT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can move other imports to top as well
9cfa3d3
to
0d55da6
Compare
Changed and green (except for travis errors that seem unrelated). A different subject, but shouldn't |
pandas/tests/reshape/test_concat.py
Outdated
|
||
# MultiIndex and IntervalIndex will raise if appended or appended to | ||
# a different type | ||
if (isinstance(df_columns, (pd.IntervalIndex, pd.MultiIndex)) or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't do this in a test. I want a separate test for MI/II (or 2 is ok).
8f8b140
to
bd400df
Compare
Hello @topper-123! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on April 17, 2018 at 17:28 Hours UTC |
730ac3a
to
26e7419
Compare
464fdc2
to
48f406b
Compare
All is green again, now that #19112 has been pulled in. |
@topper-123 added some ids to the parametrization. also I added a test case in the combo test or an ordered and unordered category which is failing. (its commented out). can you have a look? |
also pls rebase |
@jreback , wrt. the ordered vs unordered combo in I think the current implementation is reasonable behaviour, as |
pandas/tests/reshape/test_concat.py
Outdated
|
||
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=mi) | ||
ser = pd.Series([7, 8, 9], index=other_type, name=2) | ||
if isinstance(other_type, pd.IntervalIndex): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is wrong, append should raise a TypeError (like all the others). can you fix this in II?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the ValueError
is coming from here:
pandas/pandas/core/indexes/interval.py
Lines 1155 to 1161 in 1245f06
def _as_like_interval_index(self, other, error_msg): | |
self._assert_can_do_setop(other) | |
other = _ensure_index(other) | |
if (not isinstance(other, IntervalIndex) or | |
self.closed != other.closed): | |
raise ValueError(error_msg) | |
return other |
Seems reasonable to split the if
condition, and raise a different exception in each case, along the lines of:
if not isinstance(other, IntervalIndex):
raise TypeError(error_msg)
elif self.closed != other.closed:
raise ValueError(error_msg)
I think this change will only break the test below, at least within the II tests, but all that needs to be changed is ValueError
--> TypeError
:
pandas/pandas/tests/indexes/interval/test_interval.py
Lines 936 to 940 in 1245f06
# non-IntervalIndex | |
msg = ('can only do set operations between two IntervalIndex objects ' | |
'that are closed on the same side') | |
with tm.assert_raises_regex(ValueError, msg): | |
set_op(Index([1, 2, 3])) |
pandas/tests/reshape/test_concat.py
Outdated
dt.datetime(2013, 1, 3, 7, 12)]), | ||
pd.Index([4, 5, 6]), | ||
], ids=lambda x: str(x.dtype)) | ||
def test_append_interval_index_raises(self, other_type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above.
cc @jschendel
cc @jschendel |
pandas/tests/reshape/test_concat.py
Outdated
|
||
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=other_type) | ||
ser = pd.Series([7, 8, 9], index=ii, name=2) | ||
with pytest.raises(ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks to be the same as above
a5deefb
to
860a5d2
Compare
doc/source/whatsnew/v0.23.0.txt
Outdated
- ``IntervalIndex.astype`` now supports conversions between subtypes when passed an ``IntervalDtype`` (:issue:`19197`) | ||
- :meth:`DataFrame.append` now preserves the type of the calling dataframe's columns, when possible (:issue:`18359`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicated the whatsnew, don't remove the whitespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks.
pandas/tests/reshape/test_concat.py
Outdated
# different index type | ||
|
||
ii = pd.IntervalIndex.from_breaks([0, 1, 2, 3]) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this tests now looks superfluous, yes? and you just need to add it to the above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appending IntervalIndex with other index types still fails, only with a TypeError rather than a ValueError, so not sure I understand.
Oh, you mean combine test_append_multi_index_raises
and test_append_interval_index_raises
? I think you're right. I'll get on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. None of these should fail, you are catching the TypeError
already (and then coercing to object
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, combining IntervalIndex
and other index types will stilll fail, because Interval
and other dtypes are not orderable:
>>> ii = pd.IntervalIndex.from_breaks([0,1,2,3])
>>> i = pd.RangeIndex(3)
>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=ii)
>>> ser = pd.Series([7, 8, 9], index=i, name=2)
>>> df.append(ser)
TypeError: unorderable types: Interval() > int()
test_append_multi_index_raises
and test_append_interval_index_raises
are very similar and I've made a changes to combine them into a single test, see test_append_different_columns_types_raises
.
EDIT: I'm wondering if trying to order here is a bug, as you can append to the index itself:
>>> ii.append(i)
Index([(0, 1], (1, 2], (2, 3], 0, 1, 2], dtype='object')
If the above is possible, surely you can use DataFrame.append
as well. I'll take a look tonight.
I've addressed the comments, except one where I've commented specifically. If #19353 (exact matches for CategoricalIndex) or a version thereof could be accepted, it would be possible to append frames/series with CategoricalIndex as well, i.e. move IntervalIndex checks from |
pandas/tests/reshape/test_concat.py
Outdated
columns=combined_columns) | ||
assert_frame_equal(result, expected) | ||
|
||
@pytest.mark.parametrize("this_type", [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indexes = [pd.RangeIndex(3),
pd.Index([4, 5, 6]),
pd.Index(list("abc")),
pd.CategoricalIndex('A B C'.split()),
pd.CategoricalIndex('A B C'.split(), ordered=True),
pd.IntervalIndex.from_breaks([0, 1, 2, 3]),
pd.MultiIndex.from_arrays(['A B C'.split(), 'D E F'.split()]),
pd.DatetimeIndex([dt.datetime(2013, 1, 3, 0, 0),
dt.datetime(2013, 1, 3, 6, 10),
dt.datetime(2013, 1, 3, 7, 12)])]
@pytest.fixture(params=indexes)
def this_type(request):
return request.param
@pytest.fixture(params=indexes)
def that_type(params=indexes):
return request.param
.....
def append_different_coumns_types_raises(self, this_type, that_type):
....
944cf96
to
f2fbdf4
Compare
I've uploaded a version with fixture, but honestly I consider this less readable than the previous version... |
pandas/tests/reshape/test_concat.py
Outdated
pd.MultiIndex.from_arrays(['A B C'.split(), 'D E F'.split()]), | ||
] | ||
|
||
@pytest.fixture(params=indexes_II) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, rather than use the fixtures, let's leave the lists where they are (though you don't need to have indexes_I at all, and pls name them something else more descriptive. and you can use parametrize directrly.
columns=ser_index) | ||
assert_frame_equal(result, expected) | ||
|
||
@pytest.mark.parametrize("df_columns, series_index", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just use 2 parameterize here, one for df_columns and one for series_index; you don't need combinations, pytest already does this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get this approach to work, AFAIKS, parametrize doesn't do this for you, and I've googled it without results. Can you give an example?
can you update |
can you rebase |
@topper-123 can you rebase |
f2fbdf4
to
7e57ff6
Compare
Rebased. I've also simplified the tests a bit. I dropped the fixture-based approach, as I don't think it makes things clearer in this case. |
7e57ff6
to
fe4bc5a
Compare
do you disagree wit the approach, @jreback? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. can you just make the whatsnew a bit more
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -380,6 +380,7 @@ Other Enhancements | |||
- :class:`IntervalIndex` now supports time zone aware ``Interval`` objects (:issue:`18537`, :issue:`18538`) | |||
- :func:`Series` / :func:`DataFrame` tab completion also returns identifiers in the first level of a :func:`MultiIndex`. (:issue:`16326`) | |||
- :func:`read_excel()` has gained the ``nrows`` parameter (:issue:`16645`) | |||
- :meth:`DataFrame.append` now preserves the type of the calling dataframe's columns, when possible (:issue:`18359`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the case this is actually addressing (e.g. categorical)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
fe4bc5a
to
13537ae
Compare
@jreback, could you pull this in? |
thanks @topper-123 |
A related question. I was expecting making a column an index , or level in a MultiIndex, to not change the dtype. df = pd.DataFrame(dict(x=np.array([1,2],dtype = 'int32'),
y =['a', 'b']))
df.set_index('x').reset_index().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
x 2 non-null int64
y 2 non-null object
dtypes: int64(1), object(1)
memory usage: 104.0+ bytes Is this exepected ,and will this enhancement change this? Thanks |
The index will in this case be a Int64Index: >>> df.set_index('x').index
Int64Index([1, 2], dtype='int64', name='x') Pandas doesn't have a Int32Index, so this behaviour is as expected and you shouldn't expect an int32 when resetting (or you'll have to reset manually). |
self, index_can_append, index_cannot_append_with_other): | ||
# GH18359 | ||
# Dataframe.append will raise if IntervalIndex/MultiIndex appends | ||
# or is appended to a different index type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@topper-123 can you help me understand why raising is the correct thing to do here? i have a branch that fixes a Series construction bug:
def test_constructor_dict_multiindex_reindex_flat(self):
# construction involves reindexing with a MultiIndex corner case
data = {('i', 'i'): 0, ('i', 'j'): 1, ('j', 'i'): 2, "j": np.nan}
expected = Series(data)
result = Series(expected[:-1].to_dict(), index=expected.index)
tm.assert_series_equal(result, expected)
but the fix for that (in MultiIndex.reindex) causes this test to fail
git diff upstream/master -u -- "*.py" | flake8 --diff
This PR makes
DataFrame.append
preserve columns dtype, e.g.:Previously, the above returned
Index(['a', 'b'], dtype='object')
, i.e. the index type information was lost when usingappend
.