-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
DOC: Improve the docstrings of CategoricalIndex.map #20286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: Improve the docstrings of CategoricalIndex.map #20286
Conversation
pandas/core/arrays/categorical.py
Outdated
to different categories, the result will be Categorical which has | ||
the same order property as the original. Otherwise, the result will | ||
be np.ndarray. | ||
Function to be applied. | ||
|
||
Returns | ||
------- | ||
applied : Categorical or Index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for a period at the end of the type. Also should namespace pandas types (except for DataFrame and Series) so pandas.Categorial or pandas.Index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is outside the original scope of the contribution, see first PR comment.
pandas/core/arrays/categorical.py
Outdated
@@ -1081,20 +1081,26 @@ def remove_unused_categories(self, inplace=False): | |||
return cat | |||
|
|||
def map(self, mapper): | |||
"""Apply mapper function to its categories (not codes). | |||
""" | |||
Map categories (not codes) using input correspondence (a dict, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to fit on one line - can you condense?
pandas/core/arrays/categorical.py
Outdated
|
||
Maps the categories to new categories. If the mapping | ||
correspondence maps each original category to a different new category | ||
the result is a Categorical which has the same order property as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can link to Categorical here so :class:`~pandas.Categorical'
pandas/core/arrays/categorical.py
Outdated
Map categories (not codes) using input correspondence (a dict, | ||
Series, or function). | ||
|
||
Maps the categories to new categories. If the mapping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this is mostly a copy / paste of the original but I find the wording rather confusing - any thoughts on how to make the explanation clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you, this is not very clear but I don't know how to make it better. In my understanding a Categorical resembles in some respect a MultiIndex: the categories are a little bit like the levels and the codes are like the labels. Without that analogy in mind, any wording is going to be confusing.
What about something like that (a bit verbose):
"""
Maps the categories (the values of the elements of the array, not the
codes
that represent the positions) to new categories. If the mapping
correspondence ...
"""
I added backticks to the codes
to make it clear that we are talking about an attribute and not just a generic word.
What do you think? I'm open to suggestions!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd What is the aspect you find confusing? The first "Maps the categories to new categories" or the explanation after it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have news on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first line still need to be consolidated to one line.
As to the Extended Summary I think saying things like "mapping correspondence maps each original category to a different new category" is just very confusing. Could we not be more concise and just say something like "Maps the categories to new categories, retaining any ordering. Missing entries will be replaced with an np.ndarray
containing NaN
."?
I think that is the what the summary is trying to say now but it's not entirely clear. Adding Examples to show how the mapping works and how missing data is handled would make things a lot clearer.
pandas/core/arrays/categorical.py
Outdated
the result is a Categorical which has the same order property as | ||
the original, otherwise an np.ndarray is returned. | ||
|
||
If a dictionary or Series is used any unmapped category is mapped to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When referencing Python types use the actual name and place in backticks, so `dict` here. For pandas types can also link to the class so :class:`~pandas.Series`
pandas/core/arrays/categorical.py
Outdated
the original, otherwise an np.ndarray is returned. | ||
|
||
If a dictionary or Series is used any unmapped category is mapped to | ||
NA. Note that if this happens an np.ndarray will be returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor but use `NaN` instead of NA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please notice that NA is widely used across the pandas documentation:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html#pandas.Series.sum
We asked ourselves whether NA or NaN was more appropriate and decided for NA that seems more used than NaN.
What does the official guideline say on the subject?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have the good answer here, that is something we should discuss (in some places I see we use "NA/NaN" ..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, shall I go for NA or NaN?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use NA when referring to the concept of missing data and NaN
to refer to the actual missing value. Here you want the latter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified it.
But please notice that when we have timestamps the missing value is a NaT.
In [42]: idx = pd.CategoricalIndex(['a', 'b', 'c'])
In [43]: cat = idx.values
In [44]: type(cat)
Out[44]: pandas.core.arrays.categorical.Categorical
In [45]: cat.map({'a': pd.Timestamp('2000-01-03'), 'b': pd.Timestamp('2000-01-02')})
Out[45]: DatetimeIndex(['2000-01-03', '2000-01-02', 'NaT'], dtype='datetime64[ns]', freq=None)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But please notice that when we have timestamps the missing value is a NaT.
Yes, we need some better guidelines on this, but you can ignore that inconsistency for now
pandas/core/indexes/category.py
Outdated
|
||
See Also |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since all of these methods are similar why don't you add a See Also in all of the instances that refer to one another?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is outside the original scope of the contribution, see first PR comment.
pandas/core/indexes/category.py
Outdated
Series.map : Apply a mapping correspondence on a Series. | ||
Series.apply : Apply more complex functions on a Series. | ||
|
||
Examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to see Examples here - why not add to other map
implementations as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is outside the original scope of the contribution, see first PR comment.
pandas/core/indexes/category.py
Outdated
>>> idx.map({'a': 'first', 'b': 'second', 'c': 'first'}) | ||
Index(['first', 'second', 'first'], dtype='object') | ||
|
||
If a dictionary is used, all unmapped categories are mapped to NA and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comments as before (may be a few other places) use `dict` instead of dictionary and `NaN` instead of NA
Make sure you review the validation script - there are definitely some errors with your examples in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Added some additional comments
pandas/core/indexes/category.py
Outdated
@@ -661,20 +661,58 @@ def is_dtype_equal(self, other): | |||
take_nd = take | |||
|
|||
def map(self, mapper): | |||
"""Apply mapper function to its categories (not codes). | |||
""" | |||
Map index values using input correspondence (a dict, Series, or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you leave out 'index' it fits on one line?
|
||
Returns | ||
------- | ||
applied : Union[Index, MultiIndex], inferred | ||
The output of the mapping function applied to the index. | ||
If the function returns a tuple with more than one element | ||
a MultiIndex will be returned. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to add a small example here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is outside the original scope of the contribution, see first PR comment.
DataFrame.applymap: Apply a function elementwise on a whole DataFrame | ||
Series.apply : For applying more complex functions on a Series. | ||
DataFrame.apply : Apply a function row-/column-wise. | ||
DataFrame.applymap : Apply a function elementwise on a whole DataFrame. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add Series.replace ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is outside the original scope of the contribution, see first PR comment.
|
||
Parameters | ||
---------- | ||
arg : function, dict, or Series | ||
Mapping correspondence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add here some details on how a dict and Series are handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is outside the original scope of the contribution, see first PR comment.
Thanks for the comments. I will work on them this evening probably. |
@l736x Ah, sorry, I didn't fully read that comment at the top. I have renamed the PR to make that more clear :) |
For me it's fine both ways: keep it focused on CategoricalIndex (and you can ignore the comments in the others pars), or it's also fine to expand the scope of the PR to try to fo all map methods. As you prefer. |
Just to clarify: I'm not saying that the pull request cannot be improved, I completely agree that also Series.map, Index.map and Categorical.map need a rework. |
I'm more willing to work on the other map methods in a separate pull request to avoid delaying this one, if you are all ok with this. |
That's fine! |
Ok, I thing I'm getting out of my mind. Can you tell me if at the moment this code runs correctly?:
According to the examples it should give:
I swear it was working last saturday, but at the moment I get:
By inspecting the code in algos_common_helper.pxi I would guess that it should indeed fail. (The file did not change recently as far as I can see.) The troubling thing is that even a simple Index is not working with .map(dict), while from the docstring it should.
Any thoughts? |
Typically if you get errors from the Cython layer you need to rebuild the C extensions, so try running
And see if that resolves |
It says everything is up-to-date and does not rebuild anything. :( |
Ok, I got it: for a reason that I don't understand I was picking up the anaconda default pandas instead of the branch I was editing. Problem solved, back to my pull request. |
I updated the pull request. About your proposal for the Extended Summary: "Maps the categories to new categories, retaining any ordering. Missing entries will be replaced with an np.ndarray containing NaN." Let me know if you like this version better. Here the new validation script outputs:
|
I would also say that if the documentation is not clear to you as the writer (or in this case editor) then it would not be clear to the general reader either, hence the request for an update. I agree with you that examples would be nice - given the docstring before was certainly trying to call out something about how the ordering and missing values were being handled can you add examples to illustrate those caveats? Perhaps then a better wording will become clear to you. FWIW I wouldn't use words that fall under a specific technical domain like "bijection" - I think the concept(s) can be explained in general terms so again try adding at least one example each highlighting the sorting and missing value nuances and see what that yields |
@WillAyd I did! Did you read the new docstring? Better explanation:
Example (in both Categorical and CategoricalIndex):
Personally I'm happy with this version. I took into consideration your remarks, I explained why I don't agree with the correction that you proposed and I gave you a new version that to me is clear enough. [edited for a misspelling] |
Codecov Report
@@ Coverage Diff @@
## master #20286 +/- ##
=======================================
Coverage 91.8% 91.8%
=======================================
Files 152 152
Lines 49215 49215
=======================================
Hits 45181 45181
Misses 4034 4034
Continue to review full report at Codecov.
|
I agree with trying to avoid terms like "bijection", I don't think many people know what it it (although you directly explain it here ..). @WillAyd if there is still something else unclear to you in the current version, can you add some inline comments? |
pandas/core/arrays/categorical.py
Outdated
Map categories using input correspondence (dict, Series, or function). | ||
|
||
Maps the categories to new categories. If the mapping correspondence is | ||
a bijection (maps each original category to a different new category) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can use something like "one-to-one mapping" instead ?
ordered=False, dtype='category') | ||
>>> idx.map({'a': 'first', 'b': 'second', 'c': 'third'}) | ||
CategoricalIndex(['first', 'second', 'third'], categories=['first', | ||
'second', 'third'], ordered=False, dtype='category') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche @l736x the point I was trying to make was that since we mention that the ordering property gets retained with a mapping, that we should have an example for an ordered CategoricalIndex
explicitly showing this. It could be something as simple as:
Note that one-to-one mappings will retain the ordering of the CategoricalIndex
idx = pd.CategoricalIndex(['a,'b','c'], ordered=True)
idx.map({'a': 3, 'b': 2, 'c': 1})
Just my $.02 though @jorisvandenbossche I'm good to go whenever you want to merge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would indeed be good to have an example for this. But maybe just adding ordered=True
to the example above is enough? (without adding another example)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me
I replaced the word "bijection" with "one-to-one mapping" as suggested. You are both right, this is better. @WillAyd, sorry I didn't understand that your comment was about a new example. I agree it's a good idea so I added one for both Categorical.map and CategoricalIndex.map. Let me know if you are ok with the present version or you need other fixes.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates! I like the format but just have some slight suggestions on how to improve wording.
Please be sure to also surround the NaN references in backticks (think this was called out in earlier review)
""" | ||
Map categories using input correspondence (dict, Series, or function). | ||
|
||
Maps the categories to new categories. If the mapping correspondence is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is simplified to the point that I think you can now just say "If the mapping correspondence is one-to-one the result is a ..." in the second sentence
pandas/core/arrays/categorical.py
Outdated
is returned. | ||
|
||
If a `dict` or :class:`~pandas.Series` is used any unmapped category is | ||
mapped to NaN. Note that if this happens an :class:`~pandas.Index` will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NaN
should be in backticks, so 'NaN'
pandas/core/arrays/categorical.py
Outdated
[first, second, third] | ||
Categories (3, object): [first, second, third] | ||
|
||
The ordering of the categories is preserved by the map: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job, but to just contrast a little better with what you have below maybe say "If the mapping is one-to-one the ordering is preserved" (we also wouldn't want the reader who stops here to inadvertently think that ordering is always preserved)
pandas/core/arrays/categorical.py
Outdated
>>> cat.map({'a': 'first', 'b': 'second', 'c': 'first'}) | ||
Index(['first', 'second', 'first'], dtype='object') | ||
|
||
If a `dict` is used, all unmapped categories are mapped to NaN and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment with 'NaN'
pandas/core/indexes/category.py
Outdated
Map values using input correspondence (a dict, Series, or function). | ||
|
||
Maps the values (their categories, not the codes) of the index to new | ||
categories. If the mapping correspondence is a one-to-one mapping (maps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above
pandas/core/indexes/category.py
Outdated
the original, otherwise an :class:`~pandas.Index` is returned. | ||
|
||
If a `dict` or :class:`~pandas.Series` is used any unmapped category is | ||
mapped to NaN. Note that if this happens an :class:`~pandas.Index` will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'NaN'
Thanks for the improvements. I missed the |
Thanks for the updates lgtm. @jorisvandenbossche I'm good with this when you are |
Hi core devs,
I worked on the CategoricalIndex.map docstring during the sprint in Paris.
Although we concentrated on this function, while doing that we couldn't help improving (marginally!) also the docstrings for Categorical.map, Index.map and Series.map.
We probably should have refrained from doing it, knowing that other cities were working on those, but it was difficult to change one and leaving the style of the others incoherent with the first.
As a result I'm proposing here a pull request that combines all the changes.
The validation script passes for CategoricalIndex.map but not for the other cases.
If that is not acceptable I can change that, but I will probably need some help with the git part.
Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):
scripts/validate_docstrings.py <your-function-or-method>
git diff upstream/master -u -- "*.py" | flake8 --diff
python doc/make.py --single <your-function-or-method>
Please include the output of the validation script below between the "```" ticks:
If the validation script still gives errors, but you think there is a good reason
to deviate in this case (and there are certainly such cases), please state this
explicitly.