Skip to content

BUG: set_index() and reset_index() not preserving object dtypes #30870

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 15 commits into from

Conversation

trevorbye
Copy link

@trevorbye trevorbye commented Jan 9, 2020

  • changes to set_index() and ensure_index_from_sequences() to pass current dtype of key through call stack onto Index constructor, rather than letting Index constructor infer the dtype. For now I just aimed to patch a single key param from set_index(), as that is the lowest hanging fruit for this issue I think.

  • removed maybe_convert_objects() call from _maybe_casted_values(). This will preserve an object dtype on a set_index().reset_index() round-trip. I came to the exact same conclusion as the user in this PR comment, for context. This makes sense conceptually, if your index was an object dtype, it should be an object dtype as a column. Of course, this is somewhat subjective so let me know your opinions, but this is what I interpreted from the discussion on the git issue.

  • both of these changes break quite a few existing tests, most of which are testing specifically for the exact opposite conditions that this PR is intended to patch e.g. tests that look for object dtypes being coerced to more-specific types. What's the process for handling failed tests for a change like this? Would you attempt to refactor all, or delete the ones where the logic is no longer something to test for? I also ran pytest pandas on a fresh pull of master and had 13 fails, with 108 failures on my branch specifically, so I am also curious what to do about those previously-failing tests, or what is acceptable for a merge.

@pep8speaks
Copy link

pep8speaks commented Jan 9, 2020

Hello @trevorbye! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-31 18:25:09 UTC

@trevorbye
Copy link
Author

@jreback for visibility

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try to make an even smaller change this likely will break lots of things

@trevorbye
Copy link
Author

try to make an even smaller change this likely will break lots of things

Any suggestions for making these changes smaller? I tried to touch as little as possible, but not sure how I could slim it down further.

Or another question would be, what do you see breaking a lot?

Thanks

@trevorbye
Copy link
Author

@jreback @TomAugspurger any suggestions how to make this change smaller per your comment?

What are your thoughts on implementing this as non-default behavior? For example, adding a new optional param to both set_index() and reset_index() something like a boolean for preserve_dtype, that would enable the changes I've made here rather than make them the default. This would stop all these current tests from failing, and not break user's code. The tradeoff is the user would have to specify this param to get the new behavior, so I suppose that is a design question.

From the conversation in the issue I had the impression it should be the new default behavior, but I could have misinterpreted. Either way let me know what you think the path forward is and I will work on it. Thanks

@TomAugspurger
Copy link
Contributor

FYI @trevorbye you've commited some pickle files that are being left around by a some test (we haven't discovered which one yet).

I not sure what Jeff had in mind for making this smaller.

I think round-tripping the dtypes should be the default behavior. It not doing so today is a bug.

In theory, we could handle this in the Index constructor and just not have it infer the dtype when given a typed object. But that would probably be an API breaking change.

@trevorbye
Copy link
Author

@TomAugspurger thanks, I'll get rid of those files.

What is your best practice for refactoring/removing tests for a breaking change like this? There are quite a few failing, would you change the logic to reflect the new type inference (or lack thereof, in this case)?

For Index constructor, if you call it yourself you can pass the dtype param which will disable the type coercion. But, when the constructor is called from inside set_index(), it doesn't pass dtype param. The change I made there simply grabs the type of the col you're trying to create an index from, and passes that on to the constructor and uses the logic already in place.

Still learning the code base, but I'm wondering if there's a way to do this with less overall impact than the way I did it.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 14, 2020 via email

@jreback
Copy link
Contributor

jreback commented Jan 14, 2020

@trevorbye yeah I agree with @TomAugspurger what we need to do here is actually fix this. This will by-definition break some tests which will need to be updated, some because we are specifically testing that functionaility, and some by happen-stance (we happened to construct the expected value with an inferred type).

So as I said above, my preference here is to leave the Index constructor alone, and simply pass in the dtype of what we are setting (now this will have to be slightly adjusted, so create a function to do this), e.g. a int* column -> np.dtype64 (as we don't support anything but Int64Index).

@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Jan 14, 2020
@trevorbye
Copy link
Author

@trevorbye yeah I agree with @TomAugspurger what we need to do here is actually fix this. This will by-definition break some tests which will need to be updated, some because we are specifically testing that functionaility, and some by happen-stance (we happened to construct the expected value with an inferred type).

So as I said above, my preference here is to leave the Index constructor alone, and simply pass in the dtype of what we are setting (now this will have to be slightly adjusted, so create a function to do this), e.g. a int* column -> np.dtype64 (as we don't support anything but Int64Index).

@jreback thanks, I can refactor all those tests once we figure out the logic.

I think the change I made here is what you described; No change to Index constructor, I just passed it the current dtype. For reference, here is the main change I made, line 4266 in frame.py in my commit:

if found:
    # get current dtype to preserve through index creation
    current_dtype = self.dtypes.get(col).type

This is then passed to ensure_index_from_sequences(arrays, names, current_dtype) (I created the new optional param here), which is then forwarded on to the Index constructor. I've done some debugging, and from what I'm seeing, the Index (and subclasses like Int64Index etc.) constructors will automatically handle unsupported index types if you attempt to create them. Here's an example:

In [1]: df = pd.DataFrame({'A': pd.Series([1, 2, 3], dtype=np.int32), 'B': [1, 2, 3]})

In [2]: print(df.dtypes.get('A').type)
Out [2]: <class 'numpy.int32'>

In [3]: print(df.set_index('A').index)
Out [3]: Int64Index([1, 2, 3], dtype='int64', name='A')

I ran the above code with my changes, and even though it's grabbing np.int32 from the series type and using that as a param to the Index constructor dtype, it automatically detects that it's unsupported as an index type and coerces to int64.

So I don't think we need a new function to pre-convert these types before passing to Index constructors, as the logic is already there. Is this what you meant by a new function? Apologies if I'm going in circles, but to me it seems like the implementation we're discussing, and the implementation I made, are the same core logic, minus any edge cases or additional considerations I may be missing.

Thanks for your help!

@trevorbye
Copy link
Author

@jreback any feedback on my above comment? Want to see what you think before I go refactor a ton of tests

@jreback
Copy link
Contributor

jreback commented Jan 22, 2020

will look soon; you have a lot of pickle files committed; remove those

@trevorbye trevorbye requested a review from WillAyd January 24, 2020 01:33
@trevorbye
Copy link
Author

@jreback @TomAugspurger want to take another look? See my comment on your review, I tried your suggestion of doing arrays.append(Index(col, dtype=col.dtype), but setting the dtype gets overridden by the outer constructor in that case.

There was some additional code I added as well to allow as many tests to pass as possible without altering them. I excluded datetime64 from this new logic, because there is a lot that breaks if you don't leave that as it is. Also added a try/catch for some edge cases that occur in some tests.

@trevorbye
Copy link
Author

@jreback @jorisvandenbossche @WillAyd @TomAugspurger hi guys, can anyone take a look/see if we can merge? Thanks for your help

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 31, 2020 via email

@@ -19,6 +19,7 @@ def test_grouping_grouper(self, data_for_grouping):
tm.assert_numpy_array_equal(gr1.grouper, df.A.values)
tm.assert_extension_array_equal(gr2.grouper, data_for_grouping)

@pytest.mark.skip(reason="logic change to stop coercing dtypes on set_index()")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you are skipping these because the existing tests are incompatible with the new logic right?

If that's correct then we should actually update the tests and not just skip

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will refactor them. From what I could see looking through tests, it seemed to me like a lot of them are skipped rather than refactored.

Would best practice be to try and always refactor? Thanks for your feedback

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is rarer than others as it intentionally is breaking existing behavior; in such case yes would want to update existing tests and not just skip them

Also needs a little more communication in whatsnew around backwards compatibility but can leave that until later; I think would be helpful to see tests updated for reviewers to have a better understanding of scope of changes this would cause

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that make sense. I'll work on getting these tests updated so it is clearer what the scope of the change is. In general, these few I skipped were asserting two df's to be equal, which is true except of course for the index of each df were no longer equal, thus they failed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you need to actually change the results

@@ -148,6 +148,8 @@ Indexing
^^^^^^^^
- Bug in slicing on a :class:`DatetimeIndex` with a partial-timestamp dropping high-resolution indices near the end of a year, quarter, or month (:issue:`31064`)
- Bug in :meth:`PeriodIndex.get_loc` treating higher-resolution strings differently from :meth:`PeriodIndex.get_value` (:issue:`31172`)
- Bug in :meth:`DataFrame.set_index` not preserving column dtype when creating :class:`Index` from a single column (:issue:`30517`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to a sub-section; show a typical example of before & after (see how this is done in the Enhancements section)

@@ -4320,6 +4320,16 @@ def set_index(
# everything else gets tried as a key; see GH 24969
try:
found = col in self.columns
if found:
# get current dtype to preserve through index creation,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this all mean? why would we not always preserve?

@@ -5458,6 +5458,7 @@ def ensure_index_from_sequences(sequences, names=None):
----------
sequences : sequence of sequences
names : sequence of str
dtype : NumPy dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this a NumPy dtype? type dtype: Dtype

@@ -19,6 +19,7 @@ def test_grouping_grouper(self, data_for_grouping):
tm.assert_numpy_array_equal(gr1.grouper, df.A.values)
tm.assert_extension_array_equal(gr2.grouper, data_for_grouping)

@pytest.mark.skip(reason="logic change to stop coercing dtypes on set_index()")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you need to actually change the results

@WillAyd
Copy link
Member

WillAyd commented Mar 14, 2020

@trevorbye is this still active? Can you address comments and fix merge conflicts?

@WillAyd
Copy link
Member

WillAyd commented Apr 7, 2020

Closing as I think stale, but ping if you'd like to continue and can reopen

@WillAyd WillAyd closed this Apr 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame.set_index() may not preserve dtype
5 participants