-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: GH9746 DataFrame.unstack and Series.unstack now take fill_value … #10246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This may interact with #9023 |
I can resolve conflicts where necessary, let me know when and how to proceed. |
18ee145
to
b48791a
Compare
@@ -179,6 +180,10 @@ def get_new_values(self): | |||
if self.mask.all(): | |||
dtype = values.dtype | |||
new_values = np.empty(result_shape, dtype=dtype) | |||
elif self.fill_value is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should happen before the self.mask.all()
check I think
b48791a
to
7070031
Compare
Reworked and simplified, and now just pass fill_value to _maybe_promote, which I believe is more correct. As an effect of this I had to handle a None fill_value passed to _maybe_promote as this is the default of unstack(). As an alternative the default of unstack() could be fill_value=np.nan. |
@@ -865,6 +865,7 @@ Changes to ``Categorical.unique`` | |||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |||
|
|||
``Categorical.unique`` now returns new ``Categoricals`` with ``categories`` and ``codes`` that are unique, rather than returning ``np.array`` (:issue:`10508`) | |||
- ``DataFrame.unstack`` and ``Series.unstack`` now take ``fill_value`` keyword to allow direct replacement of missing values when an unstack results in missing values in the resulting ``DataFrame``. As an added benefit, specifying ``fill_value`` will preserve the data type of the original stacked data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to 0.17.1
56ce589
to
a66be32
Compare
data.index = MultiIndex.from_tuples( | ||
[('x', 'a'), ('x', 'b'), ('y', 'b'), ('z', 'a')]) | ||
|
||
result = data.unstack() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this also test with fill_value
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes... its unfortunate that TestDataFrame inherits from unittest.TestCase. It would be nice to use the nose's generator functionality to yield a set of similar tests, but that doesnt work for unittest.TestCase classes. I may move the tests to another class in test_frame.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just add another class as a mix-in. nose really doesn't support parameterized tests, so just use a loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed other people have done the same (added mix-in classes) for specific tests where they need to use the generator style test creation for nose. I assume this is because the main test class TestDataFrame inherits from unittest.TestCase (to allow use of self.assert* functions from unittest.TestCase), but inheriting from unittest.TestCase prevents use of generator style tests for nose. Seems a bit of a hack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont have a better idea, that doesnt involve copying the assert* functions into util.testing.TestCase, or does something hacky with getattr in util.testing.TestCase, so ill just add a mix in class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went with the simplest approach, just added a few more tests as it is more readable anyway.
d030f7e
to
9700827
Compare
can you rebase/update according to comments. |
9700827
to
2f9fe02
Compare
|
||
.. ipython:: python | ||
|
||
df3 = df.ix[[0, 1, 4, 7], [1, 2]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use .loc
or .iloc
@amcpherson only a couple of doc changes. pls ping when ready & green. thanks for the patience.....i periodically cycle thru PR's and sometimes don't make it all the way thru....(before I start again) :) |
can you run: |
can you rebase / update according to comments |
2f9fe02
to
a86002f
Compare
Apologies for the delay, let me know how it looks now. |
.. versionadded: 0.18.0 | ||
|
||
Unstacking can result in missing values if subgroups do not have the same | ||
set of labels. By default, missing values will be replaced with NaN. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double back-ticks around NaN
(mention that for datetimelike these will be filled with NaT
), so maybe say the fill value for that dtype
286660c
to
fcf2f91
Compare
lmk when you have a chance to update |
…kw for filling NaN when unstack results in a sparse DataFrame
fcf2f91
to
7c2f9d1
Compare
Some tests failing, checking now. |
Nevermind, tests were failing due to out of date .so, all looks good. |
@amcpherson thanks! |
@@ -1127,6 +1127,12 @@ def _maybe_promote(dtype, fill_value=np.nan): | |||
# the proper thing to do here would probably be to upcast | |||
# to object (but numpy 1.6.1 doesn't do this properly) | |||
fill_value = tslib.iNaT | |||
elif issubclass(dtype.type, np.timedelta64): | |||
try: | |||
fill_value = lib.Timedelta(fill_value).value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amcpherson I think we can eliminate this section here, and just upcast to np.object_
if the fill_value cannot be coerced, (and the above about datetimes), can you create an issue for this? (and PR would be great as well!) thanks
This is some pretty old code I think
…kw for filling NaN when unstack results in a sparse DataFrame
closes #9746