-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: map() on Index returns an Index, not array #14506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Note that the following changes result in some additional changes to the API which should potentially be addressed so I would appreciate suggestions.
|
@nateyoder use This will need to be in 0.20.0 |
|
||
API changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to 0.20.0
@@ -943,7 +943,7 @@ def map(self, mapper): | |||
|
|||
Returns | |||
------- | |||
applied : Categorical or np.ndarray. | |||
applied : Categorical or Index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually this is ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -766,6 +766,26 @@ def test_sub(self): | |||
self.assertRaises(TypeError, lambda: idx - idx.tolist()) | |||
self.assertRaises(TypeError, lambda: idx.tolist() - idx) | |||
|
|||
def test_map_identity_mapping(self): | |||
for name, cur_index in self.indices.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment with the appropriate github issue numbers (don't go crazy but where appropriate)
|
||
def test_map_that_returns_tuples_creates_index_not_multi_index(self): | ||
boolean_index = tm.makeIntIndex(3).map(lambda x: (x, x == 1)) | ||
expected = Index([(0, False), (1, True), (2, False)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use ensure_index
and this will be a MI
@jreback I may have misunderstood what you were saying but I gave using _ensure_index a shot with the below code but it didn't seem to do anything. I believe this may be because the map returns a np.ndarray rather than a list which would then be converted to a MultiIndex? Should I go ahead an put a tolist in there despite the performance drawbacks or am I totally barking up the wrong tree?
|
hmm, actually it IS prob worth introspecting this and if it is ndim==1, then its no problem, if its is an array-like of tuples, then you can pass to |
Because the outputs of _arrmap were actually always contained in a 1d np.ndarray I actually checked for the presence of a tuple. Let me know if you'd prefer something else. |
Current coverage is 85.31% (diff: 100%)@@ master #14506 diff @@
==========================================
Files 144 144
Lines 51016 51022 +6
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43522 43528 +6
Misses 7494 7494
Partials 0 0
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For DatetimeIndex (and probably timedelta and period as well), the return value is still an array if the result is not a datetime anymore (eg with .map(lambda x: 1)
or .map(lambda x: x.hour)
applied : array | ||
applied : Index | ||
The output of the mapping function applied to the index. | ||
If the function returns a tuple a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing some content
expected = Index([(0, ), (1, ), (2, )]) | ||
self.assert_index_equal(boolean_index, expected) | ||
|
||
def test_map_that_reduces_multi_index_to_single_index_returns_index(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you shorten the test names a bit?
(or else just include them in one single test, then you can add the long name/explanation as a comment)
@@ -38,6 +38,25 @@ Other enhancements | |||
Backwards incompatible API changes | |||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |||
|
|||
- ``map`` on an ``Index`` now returns an ``Index``, not an array (:issue:`12766`) | |||
.. ipython:: python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some sphinx sytax comments (sphinx is rather picky ..):
- blank line above
.. ipython ...
- Can you also indent this line (and all lines below) with two spaces? (to be at the level of the content of the bullet point) -> so everything will be part of this bullet point
Great catch on the DatetimeIndex! Sorry I missed that. I'll try to update the PR today. |
Unfortunately when making this change I ended up causing two other tests (series.test_apply.TestSeriesApply.test_apply_datetimetz and series.test_apply.TestSeriesMap.test_map_datetimetz) to fail because they became np.int64 instead of np.int32. Is this a significant failure that I should try to work around or should I change the dtype in the tests? |
892a7a6
to
aa0120f
Compare
aa0120f
to
77ca253
Compare
Just wanted to check and see if there were any additional actions you would like taken on this PR? It appeared to me as though the AppVeyor failure was not due to my changes but am not familiar with AppVeyor. |
@@ -38,8 +38,27 @@ Other enhancements | |||
Backwards incompatible API changes | |||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |||
|
|||
.. _whatsnew_0200.api: | |||
- ``map`` on an ``Index`` now returns an ``Index``, not an array (:issue:`12766`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make a new sub-section
def test_map_identity_mapping(self): | ||
# GH 12766 | ||
for name, cur_index in self.indices.items(): | ||
self.assert_index_equal(cur_index, cur_index.map(lambda x: x)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we just use assert_index_equal
(not the self), same routine, but more consistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I changed the ones related to my commit but left the other ones in test_base.py alone. Let me know if you'd like them changed as well.
exp = pd.Categorical(list('ababc'), categories=list('cba'), | ||
ordered=True) | ||
tm.assert_categorical_equal(result, exp) | ||
exp = pd.CategoricalIndex(list('ababc'), categories=list('cba'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
show an example like this in the whatsnew as well (e.g. CategoryIndex.map -> CI rather than Category now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jreback. Just wanted to touchbase here since I added this example and then you asked me to remove in another comment below. Did I add it in the wrong place or did you just decide it was a little overkill? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's too long for the what's new; so need to pare it down
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 sounds good. I shortened it up like you suggested. Let me know if you'd like any other changes.
@@ -124,7 +124,7 @@ def test_apply_datetimetz(self): | |||
|
|||
# change dtype | |||
result = s.apply(lambda x: x.hour) | |||
exp = pd.Series(list(range(24)) + [0], name='XX', dtype=np.int32) | |||
exp = pd.Series(list(range(24)) + [0], name='XX', dtype=np.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I don't think this should have changed, these are normally int32
s
any idea? (also this might not be the same on windows)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's true it previously gave int32, but is there any reason for that? We almost always use int64 as the default integer size, and also currently Timestamp.hour gives you back int64 when you replace the map
with an explicit loop:
In [20]: dtidx = pd.date_range(start='2012-01-01', periods=4)
In [29]: dtidx.map(lambda x: x.hour)
Out[29]: array([0, 0, 0, 0], dtype=int32)
In [30]: np.array([x.hour for x in dtidx])
Out[30]: array([0, 0, 0, 0])
In [31]: np.array([x.hour for x in dtidx]).dtype
Out[31]: dtype('int64')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably the result of using asobject
below
In [37]: dtidx.map(lambda x: x.hour)
Out[37]: array([0, 0, 0, 0], dtype=int32)
In [36]: dtidx.asobject.map(lambda x: x.hour)
Out[36]: array([0, 0, 0, 0], dtype=int64)
``
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is actually an implementation detail (as they r stored as int32)
we can change but that should be separate PR
ideally just like to preserve here
raise TypeError | ||
return result | ||
except Exception: | ||
return _algos.arrmap_object(self.asobject.values, f) | ||
return self.asobject.map(f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is better, using .values
can change dtypes (esp timezones)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might actually close another issue...
@@ -104,7 +104,7 @@ def test_dateindex_conversion(self): | |||
for freq in ('B', 'L', 'S'): | |||
dateindex = tm.makeDateIndex(k=10, freq=freq) | |||
rs = self.dtc.convert(dateindex, None, None) | |||
xp = converter.dates.date2num(dateindex._mpl_repr()) | |||
xp = Index(converter.dates.date2num(dateindex._mpl_repr())) | |||
tm.assert_almost_equal(rs, xp, decimals) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compare as index
@@ -1513,8 +1513,8 @@ def test_map(self): | |||
|
|||
f = lambda x: x.days | |||
result = rng.map(f) | |||
exp = np.array([f(x) for x in rng], dtype=np.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might show some of these examples in the whatsnew as well (e.g. period/timedelta)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@nateyoder sorry for the delay. this looks pretty good. couple of comments. Pls have a look at the dtype changes w.r.t. @sinhrks thoughts |
d1fb027
to
9a28a29
Compare
Hi @jreback. Sorry for my confusing message. I understand that it is the current behavior of indices to ALWAYS return |
not sure what you are referring |
@jreback I think the confusion is this: you ask that the return dtype should be int32 (as before: return type of an int32 array), but for an Index this can never be the case as even an int32 array will result in a int64 index. So for The question remains of course if it is possible to keep the dtype of
So for a series it gives in64 for datetimes, but int32 for timezone aware datetimes. So personally I wouldn't mind of all those case become int64 (for Index.map it will be in any case, only leaving tz aware Series to return int32) |
@nateyoder @jorisvandenbossche response is good (I dont think this is a big deal that we r actually returning int32 from the datetime computations - can evaluate that on another issue) I think my original comment was why we needed to change a test from int32 to int64; since we are always returning an Index which by definition have restricted dtypes the test needs to change so ought to document this with an example of that (in whatsnew) |
Sounds good. I updated the whatsnew and believe I have no addressed all of the comments. |
@@ -12,7 +12,7 @@ Highlights include: | |||
|
|||
Check the :ref:`API Changes <whatsnew_0200.api_breaking>` and :ref:`deprecations <whatsnew_0200.deprecations>` before updating. | |||
|
|||
.. contents:: What's new in v0.19.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is already fixed in master, so you may need to rebase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. just shorten up the whatsnew examples a bit. If you want refresh the docs a bit (with some examples), you can search where we use Index.map
(I don't know if its even shown) in indexing.rst or advanced.rst. Can do this in a followup PR (or here if you want)
Map on Index types now return other Index types | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
- ``map`` on an ``Index`` now returns an ``Index``, not an array (:issue:`12766`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
array -> np.array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or 'numpy array'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
.. ipython:: python | ||
|
||
mi = MultiIndex.from_tuples([(1, 2), (2, 4)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put this with where you define the Index as well. You can then put the previous/new for Multi with Index
|
||
|
||
- ``map`` on an ``CategoricalIndex`` now returns a ``CategoricalIndex``, not a Categorical | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can skip this example (the CategoricalIndex) and below
applied : array | ||
applied : Index | ||
The output of the mapping function applied to the index. | ||
If the function returns a tuple with more than one element |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
say the output Index type will be inferred
if isinstance(result, np.ndarray): | ||
self._shallow_copy(result) | ||
|
||
if not isinstance(result, Index): | ||
raise TypeError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is caught in an outer scope and so is not seen. but can you add an informative message to this TypeError
@nateyoder Could you update for the latest comments of @jreback? Would be nice to get this in! |
…categorical index; map on a categorical will either return a categorical or an index (rather than a numpy array)
… create a multiindex instead of an index
…ot a tseries; sphinx changes; fix docstring
…st to create the object
…other uses of assert_index_equal to testing instead os self
3e5c421
to
b36e83c
Compare
@jorisvandenbossche Sorry for the delay. Let me know if you notice anything else that needs to be updated. Thanks I've enjoyed my first contribution! |
@@ -91,8 +91,75 @@ Other enhancements | |||
Backwards incompatible API changes | |||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |||
|
|||
.. _whatsnew_0200.api: | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs a ref tag as well
.. _whatsnew.api_breaking.index_map
mi.map(lambda x: x[0]) | ||
|
||
|
||
- ``map`` on a Series withe datetime64 values may return int64 dtypes rather than int32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
withe -> with
lgtm. just a small doc change. |
thanks! great effort! |
@nateyoder Thanks a lot! |
closes pandas-dev#12766 closes pandas-dev#12798 This is a follow on to pandas-dev#12798. Author: Nate Yoder <[email protected]> Closes pandas-dev#14506 from nateyoder/index_map_index and squashes the following commits: 95e4440 [Nate Yoder] fix typo and add ref tag in whatsnew b36e83c [Nate Yoder] update whatsnew, fix documentation 4635e6a [Nate Yoder] compare as index a17ddab [Nate Yoder] Fix unused import and docstrings per pep8radius docformatter; change other uses of assert_index_equal to testing instead os self ab168e7 [Nate Yoder] Update whatsnew and add git PR to tests to denote changes 504c2a2 [Nate Yoder] Fix tests that weren't run by PyCharm 23c133d [Nate Yoder] Update tests to match dtype int64 07b772a [Nate Yoder] use the numpy results if we can to avoid repeating the computation just to create the object a110be9 [Nate Yoder] make map on time tseries indices return index if dtype of output is not a tseries; sphinx changes; fix docstring a596744 [Nate Yoder] introspect results from map so that if the output array has tuples we create a multiindex instead of an index 5fc66c3 [Nate Yoder] make map return an index if it operates on an index, multi index, or categorical index; map on a categorical will either return a categorical or an index (rather than a numpy array)
git diff upstream/master | flake8 --diff
This is a follow on to #12798.