Skip to content

DOC: Improve the docstrings of CategoricalIndex.map #20286

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from 15 commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 12 additions & 6 deletions pandas/core/arrays/categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -1081,20 +1081,26 @@ def remove_unused_categories(self, inplace=False):
return cat

def map(self, mapper):
"""Apply mapper function to its categories (not codes).
"""
Map categories (not codes) using input correspondence (a dict,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to fit on one line - can you condense?

Series, or function).

Maps the categories to new categories. If the mapping
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this is mostly a copy / paste of the original but I find the wording rather confusing - any thoughts on how to make the explanation clearer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you, this is not very clear but I don't know how to make it better. In my understanding a Categorical resembles in some respect a MultiIndex: the categories are a little bit like the levels and the codes are like the labels. Without that analogy in mind, any wording is going to be confusing.
What about something like that (a bit verbose):

"""
Maps the categories (the values of the elements of the array, not the
codes that represent the positions) to new categories. If the mapping
correspondence ...
"""

I added backticks to the codes to make it clear that we are talking about an attribute and not just a generic word.
What do you think? I'm open to suggestions!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd What is the aspect you find confusing? The first "Maps the categories to new categories" or the explanation after it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have news on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first line still need to be consolidated to one line.

As to the Extended Summary I think saying things like "mapping correspondence maps each original category to a different new category" is just very confusing. Could we not be more concise and just say something like "Maps the categories to new categories, retaining any ordering. Missing entries will be replaced with an np.ndarray containing NaN."?

I think that is the what the summary is trying to say now but it's not entirely clear. Adding Examples to show how the mapping works and how missing data is handled would make things a lot clearer.

correspondence maps each original category to a different new category
the result is a Categorical which has the same order property as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can link to Categorical here so :class:`~pandas.Categorical'

the original, otherwise an np.ndarray is returned.

If a dictionary or Series is used any unmapped category is mapped to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When referencing Python types use the actual name and place in backticks, so `dict` here. For pandas types can also link to the class so :class:`~pandas.Series`

NA. Note that if this happens an np.ndarray will be returned.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor but use `NaN` instead of NA

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please notice that NA is widely used across the pandas documentation:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html#pandas.Series.sum

We asked ourselves whether NA or NaN was more appropriate and decided for NA that seems more used than NaN.
What does the official guideline say on the subject?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have the good answer here, that is something we should discuss (in some places I see we use "NA/NaN" ..)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, shall I go for NA or NaN?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use NA when referring to the concept of missing data and NaN to refer to the actual missing value. Here you want the latter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified it.
But please notice that when we have timestamps the missing value is a NaT.

In [42]: idx = pd.CategoricalIndex(['a', 'b', 'c'])

In [43]: cat = idx.values

In [44]: type(cat)
Out[44]: pandas.core.arrays.categorical.Categorical

In [45]: cat.map({'a': pd.Timestamp('2000-01-03'), 'b': pd.Timestamp('2000-01-02')})
Out[45]: DatetimeIndex(['2000-01-03', '2000-01-02', 'NaT'], dtype='datetime64[ns]', freq=None)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But please notice that when we have timestamps the missing value is a NaT.

Yes, we need some better guidelines on this, but you can ignore that inconsistency for now


Parameters
----------
mapper : callable
Function to be applied. When all categories are mapped
to different categories, the result will be Categorical which has
the same order property as the original. Otherwise, the result will
be np.ndarray.
Function to be applied.

Returns
-------
applied : Categorical or Index.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for a period at the end of the type. Also should namespace pandas types (except for DataFrame and Series) so pandas.Categorial or pandas.Index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is outside the original scope of the contribution, see first PR comment.


"""
new_categories = self.categories.map(mapper)
try:
Expand Down
7 changes: 4 additions & 3 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -3121,22 +3121,23 @@ def groupby(self, values):
return result

def map(self, mapper, na_action=None):
"""Map values of Series using input correspondence
"""
Map values using input correspondence (a dict, Series, or function).

Parameters
----------
mapper : function, dict, or Series
Mapping correspondence.
na_action : {None, 'ignore'}
If 'ignore', propagate NA values, without passing them to the
mapping function
mapping correspondence.

Returns
-------
applied : Union[Index, MultiIndex], inferred
The output of the mapping function applied to the index.
If the function returns a tuple with more than one element
a MultiIndex will be returned.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add a small example here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is outside the original scope of the contribution, see first PR comment.

"""

from .multi import MultiIndex
Expand Down
52 changes: 45 additions & 7 deletions pandas/core/indexes/category.py
Original file line number Diff line number Diff line change
Expand Up @@ -661,20 +661,58 @@ def is_dtype_equal(self, other):
take_nd = take

def map(self, mapper):
"""Apply mapper function to its categories (not codes).
"""
Map index values using input correspondence (a dict, Series, or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you leave out 'index' it fits on one line?

function).

Maps the values (their categories, not the codes) of the index to new
categories. If the mapping correspondence maps each original category
to a different new category the result is a CategoricalIndex which has
the same order property as the original, otherwise an Index is
returned.

If a dictionary or Series is used any unmapped category is mapped to
NA. Note that if this happens an Index will be returned.

Parameters
----------
mapper : callable
Function to be applied. When all categories are mapped
to different categories, the result will be a CategoricalIndex
which has the same order property as the original. Otherwise,
the result will be a Index.
mapper : function, dict, or Series
Mapping correspondence.

Returns
-------
applied : CategoricalIndex or Index
CategoricalIndex or Index
Mapped index.

See Also
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since all of these methods are similar why don't you add a See Also in all of the instances that refer to one another?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is outside the original scope of the contribution, see first PR comment.

--------
Index.map : Apply a mapping correspondence on an Index.
Series.map : Apply a mapping correspondence on a Series.
Series.apply : Apply more complex functions on a Series.

Examples
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to see Examples here - why not add to other map implementations as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is outside the original scope of the contribution, see first PR comment.

--------
>>> idx = pd.CategoricalIndex(['a', 'b', 'c'])
>>> idx
CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'],
ordered=False, dtype='category')
>>> idx.map(lambda x: x.upper())
CategoricalIndex(['A', 'B', 'C'], categories=['A', 'B', 'C'],
ordered=False, dtype='category')
>>> idx.map({'a': 'first', 'b': 'second', 'c': 'third'})
CategoricalIndex(['first', 'second', 'third'], categories=['first',
'second', 'third'], ordered=False, dtype='category')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche @l736x the point I was trying to make was that since we mention that the ordering property gets retained with a mapping, that we should have an example for an ordered CategoricalIndex explicitly showing this. It could be something as simple as:

Note that one-to-one mappings will retain the ordering of the CategoricalIndex
idx = pd.CategoricalIndex(['a,'b','c'], ordered=True)
idx.map({'a': 3, 'b': 2, 'c': 1})

Just my $.02 though @jorisvandenbossche I'm good to go whenever you want to merge

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would indeed be good to have an example for this. But maybe just adding ordered=True to the example above is enough? (without adding another example)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me


If the mapping is not bijective an Index is returned:

>>> idx.map({'a': 'first', 'b': 'second', 'c': 'first'})
Index(['first', 'second', 'first'], dtype='object')

If a dictionary is used, all unmapped categories are mapped to NA and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments as before (may be a few other places) use `dict` instead of dictionary and `NaN` instead of NA

the result is an Index:

>>> idx.map({'a': 'first', 'b': 'second'})
Index(['first', 'second', nan], dtype='object')
"""
return self._shallow_copy_with_infer(self.values.map(mapper))

Expand Down
17 changes: 9 additions & 8 deletions pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -2423,25 +2423,26 @@ def unstack(self, level=-1, fill_value=None):

def map(self, arg, na_action=None):
"""
Map values of Series using input correspondence (which can be
a dict, Series, or function)
Map values of Series using input correspondence (a dict, Series, or
function).

Parameters
----------
arg : function, dict, or Series
Mapping correspondence.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add here some details on how a dict and Series are handled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is outside the original scope of the contribution, see first PR comment.

na_action : {None, 'ignore'}
If 'ignore', propagate NA values, without passing them to the
mapping function
mapping correspondence.

Returns
-------
y : Series
same index as caller
Same index as caller.

Examples
--------

Map inputs to outputs (both of type `Series`)
Map inputs to outputs (both of type `Series`):

>>> x = pd.Series([1,2,3], index=['one', 'two', 'three'])
>>> x
Expand Down Expand Up @@ -2492,9 +2493,9 @@ def map(self, arg, na_action=None):

See Also
--------
Series.apply: For applying more complex functions on a Series
DataFrame.apply: Apply a function row-/column-wise
DataFrame.applymap: Apply a function elementwise on a whole DataFrame
Series.apply : For applying more complex functions on a Series.
DataFrame.apply : Apply a function row-/column-wise.
DataFrame.applymap : Apply a function elementwise on a whole DataFrame.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add Series.replace ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is outside the original scope of the contribution, see first PR comment.


Notes
-----
Expand Down