-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Allow indices to be mapped through through dictionaries or series #15081
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/indexes/base.py
Outdated
@@ -2395,6 +2394,25 @@ def get_indexer_for(self, target, **kwargs): | |||
indexer, _ = self.get_indexer_non_unique(target, **kwargs) | |||
return indexer | |||
|
|||
def get_values_from_dict(self, input_dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a private function (_get_values.....
)
pandas/indexes/base.py
Outdated
|
||
Parameters | ||
---------- | ||
input_dict : dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this data
pandas/indexes/base.py
Outdated
|
||
Returns | ||
------- | ||
Union[np.array, list] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should always return np.ndarray no?
pandas/indexes/base.py
Outdated
@@ -2438,8 +2456,8 @@ def map(self, mapper): | |||
|
|||
Parameters | |||
---------- | |||
mapper : callable | |||
Function to be applied. | |||
mapper : Union[function, dict, Series] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use {callable, dict, Series}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
say new in 0.20.0 for (dict/Series)
pandas/indexes/base.py
Outdated
@@ -2450,7 +2468,18 @@ def map(self, mapper): | |||
|
|||
""" | |||
from .multi import MultiIndex | |||
mapped_values = self._arrmap(self.values, mapper) | |||
|
|||
if isinstance(mapper, ABCSeries): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for a dict, why can' you just Series(data)
then treat it like a Series?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I think its worth the effort to combine this code with Series.map
(in reality you can prob just move the Series.map
code to core.base.Base
I think (with maybe some modifications here).
pandas/tseries/tdi.py
Outdated
def get_values_from_dict(self, input_dict): | ||
"""Return the values of the input dictionary in the order the keys are | ||
in the index. np.nan is returned for index values not in the | ||
dictionary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same doc-string
pandas/tseries/tdi.py
Outdated
Returns | ||
------- | ||
Union[np.array, list] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can move this to tseries.common, then for the DatetimeIndex you do a super call (e.g. with the dict_compat)
pandas/tseries/tests/test_period.py
Outdated
@@ -3511,6 +3511,10 @@ def test_map(self): | |||
result = index.map(lambda x: x + 1) | |||
expected = index + 1 | |||
tm.assert_index_equal(result, expected) | |||
series_map = pd.Series(expected, index) | |||
tm.assert_index_equal(index.map(series_map), expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try with an empty, try with some NaT's
rng = timedelta_range('1 day', periods=10) | ||
|
||
f = lambda x: x.days | ||
result = rng.map(f) | ||
exp = Int64Index([f(x) for x in rng]) | ||
tm.assert_index_equal(result, exp) | ||
|
||
s = Series(exp, index=rng) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sam as above (you might wish to move these tests to test_datetimelike to avoid duplication)
exp_values = [f(x) for x in rng] | ||
exp = Index(exp_values, dtype='<U8') | ||
|
||
tm.assert_index_equal(rng.map(f), exp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test for empty
can you rebase / update |
b02834b
to
dfb4cae
Compare
Codecov Report
@@ Coverage Diff @@
## master #15081 +/- ##
==========================================
- Coverage 91.34% 91.31% -0.03%
==========================================
Files 163 163
Lines 49717 49742 +25
==========================================
+ Hits 45413 45424 +11
- Misses 4304 4318 +14
Continue to review full report at Codecov.
|
Hi @jreback sorry for the delay. Work got a little crazy. Let me know what other changes you'd like to see made. |
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -139,6 +139,8 @@ Other enhancements | |||
- ``pd.merge_asof()`` gained the option ``direction='backward'|'forward'|'nearest'`` (:issue:`14887`) | |||
- ``Series/DataFrame.asfreq()`` have gained a ``fill_value`` parameter, to fill missing values (:issue:`3715`). | |||
- ``Series/DataFrame.resample.asfreq`` have gained a ``fill_value`` parameter, to fill missing values during resampling (:issue:`3715`). | |||
- ``Index.map`` can now accept series and dictionary input object. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number
pandas/indexes/base.py
Outdated
@@ -2450,7 +2468,18 @@ def map(self, mapper): | |||
|
|||
""" | |||
from .multi import MultiIndex | |||
mapped_values = self._arrmap(self.values, mapper) | |||
|
|||
if isinstance(mapper, ABCSeries): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
pandas/indexes/base.py
Outdated
@@ -2450,7 +2468,18 @@ def map(self, mapper): | |||
|
|||
""" | |||
from .multi import MultiIndex | |||
mapped_values = self._arrmap(self.values, mapper) | |||
|
|||
if isinstance(mapper, ABCSeries): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I think its worth the effort to combine this code with Series.map
(in reality you can prob just move the Series.map
code to core.base.Base
I think (with maybe some modifications here).
@nateyoder this was quite close. can you rebase / update and we can get it in? |
Hi Jeff, My apologies as work has swamped me again. I'll try to get everything up to date and working again this weekend and you can let me know if there are other changes you'd like. Thanks for your patience. |
@nateyoder great thanks. yeah when you rebase will take another look. feel free to ping. |
can you rebase / update |
dfb4cae
to
ef7daba
Compare
Hi @jreback. Apologies again on the delay. I updated the tests to include some of the missing cases but was a little unsure how to re-use the mapping code from series like you had suggested. |
I took a shot at combining the code but in at least my current implementation I still had to do some object specific code. If you've got a better idea let me know and I'll be happy to take a look. I also took this opportunity to not create a series from a dict but instead use an index directly which as you can see from the below ASV results provided a slight but pretty consistent performance improvement for dicts and didn't significantly slow down series maps:
EDIT: Note that give that the above method did NOT work in both 2.7 and 3 this performance improvement was reverted and the included changes did not significantly change performance. |
Sorry didn't test with 2.7. I'll go back to just using series even though its a little slower. |
Sorry for the trashing around. I wasn't familiar with what should be an index vs a multi-index in mapping with a dictionary made of tuples. I ended up just going back to using a series because I couldn't keep the performance going straight through the _ensure_index function. |
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -521,6 +521,7 @@ Other Enhancements | |||
- The ``display.show_dimensions`` option can now also be used to specify | |||
whether the length of a ``Series`` should be shown in its repr (:issue:`7117`). | |||
- ``parallel_coordinates()`` has gained a ``sort_labels`` keyword arg that sorts class labels and the colours assigned to them (:issue:`15908`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove for now. This will be in 0.21.0 (though file doesn't exist yet, don't create it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
pandas/core/base.py
Outdated
@@ -8,10 +8,10 @@ | |||
|
|||
from pandas.core.dtypes.missing import isnull | |||
from pandas.core.dtypes.generic import ABCDataFrame, ABCSeries, ABCIndexClass | |||
from pandas.core.dtypes.common import is_object_dtype, is_list_like, is_scalar | |||
from pandas.core.dtypes.common import is_object_dtype, is_list_like, is_scalar, is_extension_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this causes lint issues, use parens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
pandas/core/base.py
Outdated
@@ -933,6 +933,44 @@ def _reduce(self, op, name, axis=0, skipna=True, numeric_only=None, | |||
klass=self.__class__.__name__, op=name)) | |||
return func(**kwds) | |||
|
|||
def _map_values(self, values, arg, na_action=None): | |||
if is_extension_type(self.dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a doc-string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
pandas/core/indexes/base.py
Outdated
|
||
Parameters | ||
---------- | ||
data : {dict, DictWithoutMissing} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just dict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
pandas/core/indexes/base.py
Outdated
@@ -2706,31 +2726,36 @@ def groupby(self, values): | |||
|
|||
return result | |||
|
|||
def map(self, mapper): | |||
"""Apply mapper function to an index. | |||
def map(self, arg, na_action=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically need to deprecate and rename mapper -> arg add a versionadded (0.21.0) for na_action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something I need to do? If so can you show me how you've done this in the past? Otherwise should I just change arg -> mapper?
tm.assert_index_equal(self.index.map(dict_map), expected) | ||
|
||
# empty mappable | ||
nan_index = Index([pd.np.nan] * len(self.index)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use np.nan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
result = self.index.map(lambda x: pd.NaT if x == self.index[0] else x) | ||
expected = Index([pd.NaT] + self.index[1:].tolist()) | ||
tm.assert_index_equal(result, expected) | ||
series_map = pd.Series(expected, self.index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line between cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
index = PeriodIndex([2005, 2007, 2009], freq='A') | ||
result = index.map(lambda x: x + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does this do now, raise? put a test for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it still works. I understood from your previous comment on Jan 9th that I could move it to test_datetimelike but have since tested using them using pandas/tests/indexes/datetimelike.py
@@ -74,3 +74,29 @@ def test_union(self): | |||
for case in cases: | |||
result = first.union(case) | |||
self.assertTrue(tm.equalContents(result, everything)) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally could push some of these (NaT & empty) into pandas/tests/indexes/datetimelike.py
so that they get tested for all DTI, PI, and TDI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then in the sub-classes call the super-class for testing (and add addtional); so these are all called test_map
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I misunderstood your comment on the last PR and thought you asked me to put these here (test_datetimelike
) rather than than there. Thanks for making that clearer.
def test_map_with_categorical_series(self): | ||
# GH 12756 | ||
a = pd.Index([1, 2, 3, 4]) | ||
b = pd.Series(["even", "odd", "even", "odd"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really odd. you are mapping the codes??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to see if someone did try to do this (not sure why you would) that it would work as expected. So yes it maps the codes in this case. Is there a different preferred behavior?
pls rebase / update when you can |
9c51582
to
e58cca2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the long delay here. I ran into (and am still running into) some issues with IntervalIndex in which the get_indexer function appears to be buggy (#16410). I skipped that test for now but depending on the desired behavior will probably have to do a custom map function for that? Also I'm not sure I fully understand the new pytest behavior so sorry if there are issues there as well.
def test_map_with_categorical_series(self): | ||
# GH 12756 | ||
a = pd.Index([1, 2, 3, 4]) | ||
b = pd.Series(["even", "odd", "even", "odd"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to see if someone did try to do this (not sure why you would) that it would work as expected. So yes it maps the codes in this case. Is there a different preferred behavior?
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -521,6 +521,7 @@ Other Enhancements | |||
- The ``display.show_dimensions`` option can now also be used to specify | |||
whether the length of a ``Series`` should be shown in its repr (:issue:`7117`). | |||
- ``parallel_coordinates()`` has gained a ``sort_labels`` keyword arg that sorts class labels and the colours assigned to them (:issue:`15908`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
pandas/core/base.py
Outdated
@@ -8,10 +8,10 @@ | |||
|
|||
from pandas.core.dtypes.missing import isnull | |||
from pandas.core.dtypes.generic import ABCDataFrame, ABCSeries, ABCIndexClass | |||
from pandas.core.dtypes.common import is_object_dtype, is_list_like, is_scalar | |||
from pandas.core.dtypes.common import is_object_dtype, is_list_like, is_scalar, is_extension_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
pandas/core/base.py
Outdated
@@ -933,6 +933,44 @@ def _reduce(self, op, name, axis=0, skipna=True, numeric_only=None, | |||
klass=self.__class__.__name__, op=name)) | |||
return func(**kwds) | |||
|
|||
def _map_values(self, values, arg, na_action=None): | |||
if is_extension_type(self.dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
pandas/core/indexes/base.py
Outdated
|
||
Parameters | ||
---------- | ||
data : {dict, DictWithoutMissing} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
result = self.index.map(lambda x: pd.NaT if x == self.index[0] else x) | ||
expected = Index([pd.NaT] + self.index[1:].tolist()) | ||
tm.assert_index_equal(result, expected) | ||
series_map = pd.Series(expected, self.index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -74,3 +74,29 @@ def test_union(self): | |||
for case in cases: | |||
result = first.union(case) | |||
self.assertTrue(tm.equalContents(result, everything)) | |||
|
|||
def test_map(self): | |||
expected = self.index + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -74,3 +74,29 @@ def test_union(self): | |||
for case in cases: | |||
result = first.union(case) | |||
self.assertTrue(tm.equalContents(result, everything)) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I misunderstood your comment on the last PR and thought you asked me to put these here (test_datetimelike
) rather than than there. Thanks for making that clearer.
@@ -1398,6 +1398,17 @@ def get_value_maybe_box(self, series, key): | |||
key, tz=self.tz) | |||
return _maybe_box(self, values, series, key) | |||
|
|||
@Appender(_index_shared_docs['_get_values_from_dict']) | |||
def _get_values_from_dict(self, data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the above comment.
@@ -704,6 +704,14 @@ def __rsub__(self, other): | |||
def _add_delta(self, other): | |||
return NotImplemented | |||
|
|||
@Appender(_index_shared_docs['_get_values_from_dict']) | |||
def _get_values_from_dict(self, data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would assume it does get hit indirectly as it was essentially pulled into the subclasses from the pd.Series instantiation. It is essentially replicating the behavior that was previously in lines 188-203 of pandas/core/series.py and allowing each object to handle it as seen fit rather than doing it in an if conditional. Personally this way make more sense to me but since I ended up not making use of this functionality outside of the Series instantiation I'm happy to revert the changes if you'd prefer.
can you rebase / update |
@nateyoder can you rebase / update. sorry didn't get to review this soon. |
@nateyoder going to finaly get this in! can you rebase and move note to 0.22.0 |
Hi @jreback. Sorry the problem has totally been on my side. Yes I'd love to get it in as well though I unfortunately won't be able to work on it this weekend. If I try to get you updates early next week would that work? Sorry for the delay! |
@nateyoder sure! that is fine. just ping when ready. |
…d add dict performance test
3ff220c
to
30e7e7a
Compare
also refactored to remove turning things into object dtypes unecessarily PR
0.21.0
|
@TomAugspurger and @jorisvandenbossche if you have any comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a small comment
pandas/tests/indexes/datetimelike.py
Outdated
tm.assert_index_equal(self.index.map(series_map), expected) | ||
|
||
dict_map = {i: e for e, i in zip(expected, self.index)} | ||
tm.assert_index_equal(self.index.map(dict_map), expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question: in the base tests you have the case of an empty mapping generating an Index of NaNs. Should there be a test for that here as well that it gives an Index of NaTs ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no easy way to do this actually, e.g.
In [2]: pd.date_range('20130101', periods=3).map({})
Out[2]: Float64Index([nan, nan, nan], dtype='float64')
I could special case this I suppose (e.g. empty dict).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually tried to fix this, but it goes down a rabbit hole. if you want to open an issue ok I guess.
pandas/core/base.py
Outdated
# Dictionary does not have a default. Thus it's safe to | ||
# convert to an Series for efficiency. | ||
from pandas import Series | ||
mapper = Series(mapper, index=mapper.keys()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is passing index=mapper.keys()
necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually this is to handle some odd-ball cases where you have tuples as keys in the dictionary
pandas/core/indexes/base.py
Outdated
def map(self, mapper): | ||
"""Apply mapper function to an index. | ||
def map(self, mapper, na_action=None): | ||
"""Map values of Series using input correspondence (which can be a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first sentence should be a single line so that it looks OK in the autosummary table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/core/series.py
Outdated
@@ -2245,14 +2230,14 @@ def unstack(self, level=-1, fill_value=None): | |||
# ---------------------------------------------------------------------- | |||
# function application | |||
|
|||
def map(self, arg, na_action=None): | |||
def map(self, mapper, na_action=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a release note for this in API-breaking changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted (was an error)
thanks @nateyoder glad we got this in! |
@jreback thanks for doing all the work to finish it up!! Really appreciate you picking up the slack! |
git diff upstream/master | flake8 --diff