-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DOC: Fix docs on merging categoricals. #28185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -813,16 +813,16 @@ but the categories of these categoricals need to be the same: | |||
res | |||
res.dtypes | |||
|
|||
In this case the categories are not the same, and therefore an error is raised: | |||
If the categories are not exactly the same, merging will coerce the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this wording a little hard to digest - we are just coercing the categorical to their values as part of this right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it could be more clear, but I'm struggling with trying to keep the terminology consistent with existent docs. How about this?
Otherwise the result's dtype will be determined by promotion of the mergands' categories' dtypes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about
If the categories are not identical, categorical columns will be cast to a regular array with
the dtype of the underlying categories, which will likely have higher memory usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good. Should there be an example which doesn't cast to object to make that clear? Something like this:
In [101]: pd.concat([
...: pd.Series([1, 2], dtype="category"),
...: pd.Series([3., 4.], dtype="category")
...: ])
Out[101]:
0 1.0
1 2.0
0 3.0
1 4.0
dtype: float64
I've just noticed there is another section on Concatenation, as opposed to Merging, in the categorical docs. These docs describe different behavior, but still don't quite fit with what actually happens. Having two sections on this seems redundant to me, could these be collapsed to a single section in this PR? |
I would be OK with that. It's what we seem to have set up in the 10 minutes to pandas: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#merge |
@ivirshup can you merge master and we will look again |
846eb70
to
9fb6d67
Compare
* Added an examples where categoricals are concatenated which results in a numeric dtype. * Removed a table of examples which seemed confusion (most entries were equivalent, gave misleading typing info).
@jreback, I've merged master (albeit a couple weeks ago now) and consolidated the sections on concatenation and merging. |
union_categoricals([s1.array, s3.array]) | ||
|
||
|
||
Following table summarizes the results of ``Categoricals`` related concatenations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you keep this table? I think a good summary of what happens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though update to reflect current status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be useful it it said a bit more about the return types, though it might be better to just point to type promotion docs for that. How about something like this?
+----------+--------------------------------------------------------+----------------------------+
| arg1 | arg2 | identical | result |
+==========+========================================================+============================+
| category | category | True | category |
+----------+--------------------------------------------------------+----------------------------+
| category (object)| category (object) | False | object (dtype is inferred) |
+----------+--------------------------------------------------------+----------------------------+
| category (int) | category (float) | False | float (dtype is inferred) |
+----------+--------------------------------------------------------+----------------------------+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems logical. Not sure if we have better verbiage than saying category (object)
to refer to the categories of the categorical; @TomAugspurger might have thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd like to keep it close to the repr, since that should make it easier to relate to practice
>>> pd.Categorical(list("abc"))
[a, b, c]
Categories (3, object): [a, b, c]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay @WillAyd, I thought we were waiting on @TomAugspurger. Just pushed an update. Hopefully I got the table right.
@ivirshup can you address comment on table deletion? I think this is close |
lgtm @jreback or @TomAugspurger if you want to look |
Won't have time nearterm. Go for it when you're ready.
…On Fri, Nov 8, 2019 at 10:10 AM William Ayd ***@***.***> wrote:
lgtm @jreback <https://github.com/jreback> or @TomAugspurger
<https://github.com/TomAugspurger> if you want to look
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#28185?email_source=notifications&email_token=AAKAOISRC3ADR3PICYQNXEDQSWFO3A5CNFSM4IQ5YBY2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDSSLZY#issuecomment-551888359>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOIT2WRDSND3RKKNOTCDQSWFO3ANCNFSM4IQ5YBYQ>
.
|
Thanks @ivirshup |
…ndexing-1row-df * upstream/master: (109 commits) stronger typing in libreduction (pandas-dev#29502) API: rename labels to codes (pandas-dev#29509) CLN: remove unnecessary type checks (pandas-dev#29517) implement _BaseGrouper (pandas-dev#29520) CLN: F-string formatting in pandas/_libs/*.pyx (pandas-dev#29527) Fixed more SS03 errors (pandas-dev#29540) consolidate dim checks (pandas-dev#29536) REF: separate out _get_cython_func_and_vals (pandas-dev#29537) remove unnecessary exception (pandas-dev#29538) TST:Add test to check single category col returns series with single row slice (pandas-dev#29521) Make color validation more forgiving (pandas-dev#29122) DOC: update bottleneck repo and documentation urls (pandas-dev#29516) TST: add test for df construction from dict with tuples (pandas-dev#29497) add test for pd.melt dtypes preservation (pandas-dev#29510) updated DataFrame.equals docstring (pandas-dev#29496) Resolved merge conflicts (pandas-dev#29506) DOC: Improved pandas/compact/__init__.py (pandas-dev#29507) DOC: Update performance comparison section of io docs (pandas-dev#28890) TST: add test for df.where() with category dtype (pandas-dev#29454) DOC: Fix docs on merging categoricals. (pandas-dev#28185) ...
Fixes #28166.
Updated docs about merging categorical to reflect current behavior.