Skip to content

BUG: Wrong dtype when resetting a multiindex with missing values. (#1… #27370

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from
Closed

Conversation

ldmnt
Copy link

@ldmnt ldmnt commented Jul 12, 2019

…9602 and #24206)

…9602 and #24206)

- Fixed the bad path in reset_index where the dtype was ignored if there are only
missing values.

- Simplified the structure of _maybe_casted_values in reset_index by
using as much as possible the take method of ExtensionArray's
@ldmnt
Copy link
Author

ldmnt commented Jul 12, 2019

Two possible improvements that i noticed while working on this :

  1. The multiindex path of _maybe_casted_values is essentially a take(…, allow_fill=True) function that works on both ExtensionArrays and ndarrays. Since it's the type of the _values attribute, it might be a good idea to extract it outside of reset_index if it's used in other places. (I thought that it likely is but I didn't really know.)

  2. When refactoring _maybe_casted_values in reset_index, I tried removing the cast at the start of the function (frame.py line 4607) :

            if not isinstance(index, (PeriodIndex, DatetimeIndex)):
                if values.dtype == np.object_:
                    values = lib.maybe_convert_objects(values)`

It was introduced a very long time ago (#440) to try and infer the type of indexes with dtype object. Removing it breaks one test that relies on the fact that an index with integer underlying data and object dtype will be correctly casted to an Int64Index. I didn't have enough knowledge of the code base to know if inferring the type of object indexes is still the intended behavior, but if it's not these lines could be removed.

Louis Dumont added 3 commits July 13, 2019 10:18
- Removed unused import
- Enforced column order in tests for compat with python 3.5
@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Dtype Conversions Unexpected or buggy dtype conversions labels Jul 15, 2019
@ldmnt
Copy link
Author

ldmnt commented Aug 26, 2019

@jreback for info: I have made all the modifications you requested up to now and on my side its ready. If you need further changes I can try to handle them before next friday, but after that I will be away for 3 weeks and unable to update the pull request.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you ensure that we have a basic test for #19602. Something like

In [11]: idx = pd.MultiIndex.from_product([[0, 1], [1, 2]])

In [12]: pd.DataFrame(index=idx)[:0].reset_index().dtypes
Out[12]:
level_0    float64
level_1    float64
dtype: object

asserting that those are ints, not floats?

@@ -1391,3 +1391,42 @@ def maybe_cast_to_integer_array(arr, dtype, copy=False):

if is_integer_dtype(dtype) and (is_float_dtype(arr) or is_object_dtype(arr)):
raise ValueError("Trying to coerce float values to integers")


def maybe_casted_values(index, codes=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think mark this as private with a leading underscore.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure ? It's not really private since it's used in frame.py, and @jreback already asked me to remove the underscore when moving it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe wait to hear from @jreback but .core.* is supposed to be private, but people still use methods from there. We plan to deprecate it, but until then it's probably best to prefix methods with underscores.

# we can have situations where the whole mask is -1,
# meaning there is nothing found in labels, so make all nan's
if mask.all():
values = np.empty(len(mask), dtype=values.dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to get here with an extension dtype? Like pd.date_range('2000', periods=4, tz="CET")? If so, I suspect that will fail.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't have time to really get into it, but after a quick look I would say that it's possible. I'll have to do something about it when i come back.

@jreback
Copy link
Contributor

jreback commented Oct 6, 2019

@ldmnt can you merge master and we'll take a look.

@jreback
Copy link
Contributor

jreback commented Oct 18, 2019

can you merge master

@ldmnt
Copy link
Author

ldmnt commented Nov 1, 2019

Sorry I haven't found time to do that recently. I saw your message, I will try to do the merge and think about Tom's question this week.

@WillAyd
Copy link
Member

WillAyd commented Dec 17, 2019

Thanks for the PR but I think this has gone stale. Closing to clean up the queue for now, but certainly ping if you'd like to pick back up

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
4 participants