
DOC: Add notice and example for CategoricalDtype with different categories_dtype #57273

Merged
merged 8 commits into from
Feb 17, 2024

Conversation

luke396
Contributor

@luke396 luke396 commented Feb 6, 2024

@luke396 luke396 marked this pull request as draft February 6, 2024 05:07
@VladimirFokow
Contributor

VladimirFokow commented Feb 6, 2024

Thanks @luke396 !

I was hoping that this part would be edited:

Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal
whenever they have the same categories and order. When comparing two
unordered categoricals, the order of the ``categories`` is not considered.

to be a complete and definitive definition, covering all cases right here.

(instead of adding examples and having the user then guess the actual rules)
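
For reference, the quoted rule can be checked directly. The following is a minimal sketch using the public `CategoricalDtype` constructor; the variable names are illustrative only, and it covers only the cases the quoted paragraph mentions (same categories, ordered versus unordered).

```python
from pandas.api.types import CategoricalDtype

# Same categories, listed in a different order, both unordered.
unordered_ab = CategoricalDtype(categories=["a", "b"], ordered=False)
unordered_ba = CategoricalDtype(categories=["b", "a"], ordered=False)

# Same categories, listed in a different order, both ordered.
ordered_ab = CategoricalDtype(categories=["a", "b"], ordered=True)
ordered_ba = CategoricalDtype(categories=["b", "a"], ordered=True)

# Unordered: the listing order of the categories is ignored.
same_when_unordered = unordered_ab == unordered_ba  # True

# Ordered: the listing order is part of the dtype.
same_when_ordered = ordered_ab == ordered_ba  # False

# Differing `ordered` flags never compare equal.
same_across_flags = unordered_ab == ordered_ab  # False
```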

@VladimirFokow
Contributor

VladimirFokow commented Feb 6, 2024

More generally, I am confused:

A categorical's type is fully described by
1. ``categories``: a sequence of unique values and no missing values
2. ``ordered``: a boolean

... If the CategoricalDtype is "fully described" by categories and order,
and categories is defined as a "sequence of unique values and no missing values"
--> [] is a "sequence of unique values and no missing values"
(yes, it's empty, but it is a sequence of 0 unique values and doesn't contain NaNs, i.e. missing values)
--> there is no dtype anywhere in this definition.
So should this description be corrected as well?

These should be good precise definitions giving an accurate conceptual understanding.
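
The gap described above can be illustrated with two empty categoricals. This is a minimal sketch, assuming a recent pandas version; the variable names are illustrative. Both dtypes have zero categories and the same `ordered` flag, so the quoted definition would call them equal, yet the comparison is expected to return `False` because the dtype of the `categories` index differs.

```python
import numpy as np
import pandas as pd

# Two empty categoricals built from empty arrays of different dtypes.
empty_object = pd.Categorical(np.array([], dtype=object))
empty_float = pd.Categorical(np.array([], dtype=float))

# Both have no categories and are unordered ...
n_categories = (len(empty_object.categories), len(empty_float.categories))
both_unordered = not empty_object.ordered and not empty_float.ordered

# ... but the dtype of the (empty) categories index is compared as well.
dtypes_equal = empty_object.dtype == empty_float.dtype  # False
```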

@luke396
Contributor Author

luke396 commented Feb 6, 2024

Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal
whenever they have the same categories and order. When comparing two
unordered categoricals, the order of the ``categories`` is not considered.

Thank you, @VladimirFokow, for your advice on the PR. I hadn't considered it as thoroughly as you did. At the moment, the example in the file doesn't suit my needs well, which is why the PR is still in progress. I have to admit that I don't have as accurate and comprehensive an understanding of Categorical as you do.

These should be good precise definitions giving an accurate conceptual understanding.

You've shown great consideration in enhancing the docstring, and your insights could be used to open a new pull request aimed at improving the general description further.

@VladimirFokow
Contributor

VladimirFokow commented Feb 6, 2024

Maybe they meant that a Categorical 👈 (not a CategoricalDtype) is partly described by .categories...
But in this case, categories is NOT a sequence.
It's an Index containing the categories, with a certain dtype which is determined... automatically? Even if it is constructed from a numpy array of objects that are all integers, the dtype of this index will be int, not object.

How is this dtype determined: by completely disregarding the numpy dtype and just looking at the actual values? And how do we operate with it: can we change it, and what are the best practices when dealing with problems connected to it?
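
The behaviour described in this comment can be reproduced with a short sketch. It assumes a recent pandas version, and the variable names are illustrative. The second half shows one possible way to keep an object-dtype `categories` index, by passing an explicitly typed `pd.Index`; whether that counts as a best practice is not settled in this thread.

```python
import numpy as np
import pandas as pd

# An object-dtype numpy array whose values are all Python ints.
object_values = np.array([1, 2, 3], dtype=object)
cat = pd.Categorical(object_values)

# `categories` is a pandas Index, and its dtype is inferred from the
# values rather than copied from the numpy array's object dtype.
inferred = cat.categories
inferred_is_object = inferred.dtype == object  # False per the comment above

# Passing an Index that already has an explicit dtype keeps that dtype.
explicit = pd.CategoricalDtype(pd.Index([1, 2, 3], dtype=object))
explicit_is_object = explicit.categories.dtype == object
```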


upd: Maybe what they meant here is a purely conceptual description, because after it they say:

This information can be stored in a CategoricalDtype.

@simonjayhawkins simonjayhawkins added the Docs and Categorical (Categorical Data Type) labels Feb 7, 2024
@luke396
Contributor Author

luke396 commented Feb 16, 2024

Hi @rhshadrach, could you please review the PR and provide some comments for improvement?

@luke396 luke396 marked this pull request as ready for review February 16, 2024 12:58
Member

@rhshadrach rhshadrach left a comment

Thanks for the PR!

@luke396
Contributor Author

luke396 commented Feb 17, 2024

@rhshadrach Thanks for your prompt review! I have updated the PR based on your comments.

Member

@rhshadrach rhshadrach left a comment

lgtm

@rhshadrach
Member

/preview

Contributor

Website preview of this PR available at: https://pandas.pydata.org/preview/pandas-dev/pandas/57273/

@rhshadrach rhshadrach merged commit 269f0b5 into pandas-dev:main Feb 17, 2024
@rhshadrach
Member

Thanks @luke396

@luke396 luke396 deleted the add-categoricaldtype-doc branch February 17, 2024 14:44
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
Labels
Categorical Categorical Data Type Docs
Development

Successfully merging this pull request may close these issues.

DOC: CategoricalDtype equality semantics aren't completely described
4 participants