-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Preserve EA dtype in DataFrame.stack #23285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Just a WIP for now. I feel like we're lacking on tests here. Will add one for stacking a single level from a dataframe with a MultiIndex in the columns. |
Hello @TomAugspurger! Thanks for updating the PR.
Comment last updated on October 24, 2018 at 21:13 Hours UTC |
"Fixed" the sparse values. We were failing to handle I'll probably do that today, and then pick up the reshaping PRs afterwards. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still WIP ?
It looks good to me.
I think we need tests where the columns are a MultiIndex.
…________________________________
From: Joris Van den Bossche <[email protected]>
Sent: Tuesday, October 23, 2018 4:24:31 PM
To: pandas-dev/pandas
Cc: Tom Augspurger; Mention
Subject: Re: [pandas-dev/pandas] [WIP]Preserve EA dtype in DataFrame.stack (#23285)
@jorisvandenbossche commented on this pull request.
Is this still WIP ?
It looks good to me.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#23285 (review)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ABQHIqd_gEfs-HuVEgu9F_eRaTpryWp9ks5un4kPgaJpZM4X0JPS>.
|
Codecov Report
@@ Coverage Diff @@
## master #23285 +/- ##
==========================================
+ Coverage 92.24% 92.25% +<.01%
==========================================
Files 161 161
Lines 51224 51237 +13
==========================================
+ Hits 47254 47269 +15
+ Misses 3970 3968 -2
Continue to review full report at Codecov.
|
pandas/core/internals/blocks.py
Outdated
if dtype == np.object_: | ||
# sparse is "special" and preserves sparsity. | ||
# We're changing this in GH-23125 | ||
if dtype == np.object_ and is_sparse(values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use is_object_dtype
pandas/core/internals/blocks.py
Outdated
if is_sparse(self.values): | ||
# Series[Sparse].astype(object) is sparse. | ||
klass = ExtensionBlock | ||
elif is_object_dtype(dtype): | ||
klass = ObjectBlock | ||
elif is_extension_array_dtype(dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so maybe should just move the is_extension_array_dtype
up to first, and add a is_extension_dtype(self.values)
test as well (should encompas your is_sparse check) and is more general
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make that change and run the test suite.
I was kinda worried about "false positives" here, but I suppose it's exactly what we want if an extension array claims it's object dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As posted in the unstack PR, we need to special case Space here, since it's the only (internal) extension type that has special .astype(object)
behavior.
can you rebase |
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -849,7 +849,7 @@ update the ``ExtensionDtype._metadata`` tuple to match the signature of your | |||
- Updated the ``.type`` attribute for ``PeriodDtype``, ``DatetimeTZDtype``, and ``IntervalDtype`` to be instances of the dtype (``Period``, ``Timestamp``, and ``Interval`` respectively) (:issue:`22938`) | |||
- :func:`ExtensionArray.isna` is allowed to return an ``ExtensionArray`` (:issue:`22325`). | |||
- Support for reduction operations such as ``sum``, ``mean`` via opt-in base class method override (:issue:`22762`) | |||
- :meth:`Series.unstack` no longer converts extension arrays to object-dtype ndarrays. The output ``DataFrame`` will now have the same dtype as the input. This changes behavior for Categorical and Sparse data (:issue:`23077`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was accidentally added in the PeriodArray PR. Will be implemented for good in #23284
expected = df.astype(object).stack() | ||
# we need a second astype(object), in case the constructor inferred | ||
# object -> specialized, as is done for period. | ||
expected = expected.astype(object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kinda strange. For DataFrame[ndarray[object]].stack()
of all periods, we actually infer period
-dtype. Do we want that, or should we explicitly pass dtype=object
when creating the new series / frame to ensure that we don't infer the "correct" dtype?
In [1]: import pandas as pd
In [2]: a = pd.core.arrays.period_array(['2000', '2001'], freq='D')
In [3]: pd.DataFrame({"A": a, "B": a}).astype(object).dtypes
Out[3]:
A object
B object
dtype: object
In [4]: pd.DataFrame({"A": a, "B": a}).astype(object).stack().dtype
Out[4]: period[D]
(that's on master)
All green. |
Merged master to fix the merge conflict with the unstack PR. Will ping on green. |
lgtm. |
…fixed * upstream/master: (47 commits) CLN: remove values attribute from datetimelike EAs (pandas-dev#23603) DOC/CI: Add linting to rst files, and fix issues (pandas-dev#23381) PERF: Speeds up creation of Period, PeriodArray, with Offset freq (pandas-dev#23589) PERF: define is_all_dates to shortcut inadvertent copy when slicing an IntervalIndex (pandas-dev#23591) TST: Tests and Helpers for Datetime/Period Arrays (pandas-dev#23502) Update description of Index._values/values/ndarray_values (pandas-dev#23507) Fixes to make validate_docstrings.py not generate warnings or unwanted output (pandas-dev#23552) DOC: Added note about groupby excluding Decimal columns by default (pandas-dev#18953) ENH: Support writing timestamps with timezones with to_sql (pandas-dev#22654) CI: Auto-cancel redundant builds (pandas-dev#23523) Preserve EA dtype in DataFrame.stack (pandas-dev#23285) TST: Fix dtype mismatch on 32bit in IntervalTree get_indexer test (pandas-dev#23468) BUG: raise if invalid freq is passed (pandas-dev#23546) remove uses of (ts)?lib.(NaT|iNaT|Timestamp) (pandas-dev#23562) BUG: Fix error message for invalid HTML flavor (pandas-dev#23550) ENH: Support EAs in Series.unstack (pandas-dev#23284) DOC: Updating DataFrame.join docstring (pandas-dev#23471) TST: coverage for skipped tests in io/formats/test_to_html.py (pandas-dev#22888) BUG: Return KeyError for invalid string key (pandas-dev#23540) BUG: DatetimeIndex slicing with boolean Index raises TypeError (pandas-dev#22852) ...
closes #23077
There were two bugs in master (not present in 0.23.4), probably from the SparseArray PR
EA._concat_same_type
take
to get the correct order.