Skip to content

ENH: GH9746 DataFrame.unstack and Series.unstack now take fill_value … #10246

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

amcpherson
Copy link
Contributor

…kw for filling NaN when unstack results in a sparse DataFrame

closes #9746

@shoyer
Copy link
Member

shoyer commented Jun 1, 2015

This may interact with #9023

@amcpherson
Copy link
Contributor Author

I can resolve conflicts where necessary, let me know when and how to proceed.

@@ -179,6 +180,10 @@ def get_new_values(self):
if self.mask.all():
dtype = values.dtype
new_values = np.empty(result_shape, dtype=dtype)
elif self.fill_value is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should happen before the self.mask.all() check I think

@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Oct 11, 2015
@amcpherson
Copy link
Contributor Author

Reworked and simplified, and now just pass fill_value to _maybe_promote, which I believe is more correct. As an effect of this I had to handle a None fill_value passed to _maybe_promote as this is the default of unstack(). As an alternative the default of unstack() could be fill_value=np.nan.

@@ -865,6 +865,7 @@ Changes to ``Categorical.unique``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``Categorical.unique`` now returns new ``Categoricals`` with ``categories`` and ``codes`` that are unique, rather than returning ``np.array`` (:issue:`10508`)
- ``DataFrame.unstack`` and ``Series.unstack`` now take ``fill_value`` keyword to allow direct replacement of missing values when an unstack results in missing values in the resulting ``DataFrame``. As an added benefit, specifying ``fill_value`` will preserve the data type of the original stacked data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.17.1

@amcpherson amcpherson force-pushed the unstackfillvalue branch 2 times, most recently from 56ce589 to a66be32 Compare October 22, 2015 14:59
data.index = MultiIndex.from_tuples(
[('x', 'a'), ('x', 'b'), ('y', 'b'), ('z', 'a')])

result = data.unstack()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this also test with fill_value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes... its unfortunate that TestDataFrame inherits from unittest.TestCase. It would be nice to use the nose's generator functionality to yield a set of similar tests, but that doesnt work for unittest.TestCase classes. I may move the tests to another class in test_frame.py

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just add another class as a mix-in. nose really doesn't support parameterized tests, so just use a loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed other people have done the same (added mix-in classes) for specific tests where they need to use the generator style test creation for nose. I assume this is because the main test class TestDataFrame inherits from unittest.TestCase (to allow use of self.assert* functions from unittest.TestCase), but inheriting from unittest.TestCase prevents use of generator style tests for nose. Seems a bit of a hack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont have a better idea, that doesnt involve copying the assert* functions into util.testing.TestCase, or does something hacky with getattr in util.testing.TestCase, so ill just add a mix in class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with the simplest approach, just added a few more tests as it is more readable anyway.

@amcpherson amcpherson force-pushed the unstackfillvalue branch 2 times, most recently from d030f7e to 9700827 Compare October 28, 2015 18:13
@jreback
Copy link
Contributor

jreback commented Nov 18, 2015

can you rebase/update according to comments.


.. ipython:: python

df3 = df.ix[[0, 1, 4, 7], [1, 2]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use .loc or .iloc

@jreback
Copy link
Contributor

jreback commented Dec 11, 2015

@amcpherson only a couple of doc changes. pls ping when ready & green.

thanks for the patience.....i periodically cycle thru PR's and sometimes don't make it all the way thru....(before I start again) :)

@jreback
Copy link
Contributor

jreback commented Jan 2, 2016

can you run: git diff master | flake8 --diff and fix anything. going to shortly be enforcing PEP8.

@jreback
Copy link
Contributor

jreback commented Jan 20, 2016

can you rebase / update according to comments

@amcpherson
Copy link
Contributor Author

Apologies for the delay, let me know how it looks now.

.. versionadded: 0.18.0

Unstacking can result in missing values if subgroups do not have the same
set of labels. By default, missing values will be replaced with NaN.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double back-ticks around NaN (mention that for datetimelike these will be filled with NaT), so maybe say the fill value for that dtype

@jreback jreback added this to the 0.18.0 milestone Jan 20, 2016
@amcpherson amcpherson force-pushed the unstackfillvalue branch 2 times, most recently from 286660c to fcf2f91 Compare January 20, 2016 19:13
@jreback
Copy link
Contributor

jreback commented Jan 30, 2016

lmk when you have a chance to update

…kw for filling NaN when unstack results in a sparse DataFrame
@amcpherson
Copy link
Contributor Author

Some tests failing, checking now.

@amcpherson
Copy link
Contributor Author

Nevermind, tests were failing due to out of date .so, all looks good.

@jreback jreback closed this in de46056 Jan 30, 2016
@jreback
Copy link
Contributor

jreback commented Jan 30, 2016

@amcpherson thanks!

@@ -1127,6 +1127,12 @@ def _maybe_promote(dtype, fill_value=np.nan):
# the proper thing to do here would probably be to upcast
# to object (but numpy 1.6.1 doesn't do this properly)
fill_value = tslib.iNaT
elif issubclass(dtype.type, np.timedelta64):
try:
fill_value = lib.Timedelta(fill_value).value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amcpherson I think we can eliminate this section here, and just upcast to np.object_ if the fill_value cannot be coerced, (and the above about datetimes), can you create an issue for this? (and PR would be great as well!) thanks

This is some pretty old code I think

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fill_value kwarg for unstack
3 participants