-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: set_index() and reset_index() not preserving object dtypes #30870
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @trevorbye! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-01-31 18:25:09 UTC |
@jreback for visibility |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try to make an even smaller change this likely will break lots of things
Any suggestions for making these changes smaller? I tried to touch as little as possible, but not sure how I could slim it down further. Or another question would be, what do you see breaking a lot? Thanks |
@jreback @TomAugspurger any suggestions how to make this change smaller per your comment? What are your thoughts on implementing this as non-default behavior? For example, adding a new optional param to both From the conversation in the issue I had the impression it should be the new default behavior, but I could have misinterpreted. Either way let me know what you think the path forward is and I will work on it. Thanks |
FYI @trevorbye you've commited some pickle files that are being left around by a some test (we haven't discovered which one yet). I not sure what Jeff had in mind for making this smaller. I think round-tripping the dtypes should be the default behavior. It not doing so today is a bug. In theory, we could handle this in the Index constructor and just not have it infer the dtype when given a typed object. But that would probably be an API breaking change. |
@TomAugspurger thanks, I'll get rid of those files. What is your best practice for refactoring/removing tests for a breaking change like this? There are quite a few failing, would you change the logic to reflect the new type inference (or lack thereof, in this case)? For Index constructor, if you call it yourself you can pass the dtype param which will disable the type coercion. But, when the constructor is called from inside Still learning the code base, but I'm wondering if there's a way to do this with less overall impact than the way I did it. |
I'm not sure. But if we're calling the dtype not being preserved a bug,
then we'll just need to update the tests that rely on the constructors
re-inferring dtypes.
…On Mon, Jan 13, 2020 at 10:19 PM trevorbye ***@***.***> wrote:
@TomAugspurger <https://github.com/TomAugspurger> thanks, I'll get rid of
those files.
What is your best practice for refactoring/removing tests for a breaking
change like this? There are quite a few failing, would you change the logic
to reflect the new type inference (or lack thereof, in this case)?
For Index constructor, if you call it yourself you can pass the dtype
param which will disable the type coercion. But, when the constructor is
called from *inside* set_index(), it doesn't pass dtype param. The change
I made there simply grabs the type of the col you're trying to create an
index from, and passes that on to the constructor and uses the logic
already in place.
Still learning the code base, but I'm wondering if there's a way to do
this with less overall impact than the way I did it.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#30870?email_source=notifications&email_token=AAKAOISLMZCB6XOKWHH6HWDQ5U4OBA5CNFSM4KE7USY2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEI3G6YY#issuecomment-573992803>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOIQYIG3QXRLHUHHMLY3Q5U4OBANCNFSM4KE7USYQ>
.
|
@trevorbye yeah I agree with @TomAugspurger what we need to do here is actually fix this. This will by-definition break some tests which will need to be updated, some because we are specifically testing that functionaility, and some by happen-stance (we happened to construct the expected value with an inferred type). So as I said above, my preference here is to leave the Index constructor alone, and simply pass in the dtype of what we are setting (now this will have to be slightly adjusted, so create a function to do this), e.g. a int* column -> np.dtype64 (as we don't support anything but Int64Index). |
@jreback thanks, I can refactor all those tests once we figure out the logic. I think the change I made here is what you described; No change to Index constructor, I just passed it the current dtype. For reference, here is the main change I made, line 4266 in frame.py in my commit: if found:
# get current dtype to preserve through index creation
current_dtype = self.dtypes.get(col).type This is then passed to
I ran the above code with my changes, and even though it's grabbing So I don't think we need a new function to pre-convert these types before passing to Index constructors, as the logic is already there. Is this what you meant by a new function? Apologies if I'm going in circles, but to me it seems like the implementation we're discussing, and the implementation I made, are the same core logic, minus any edge cases or additional considerations I may be missing. Thanks for your help! |
@jreback any feedback on my above comment? Want to see what you think before I go refactor a ton of tests |
will look soon; you have a lot of pickle files committed; remove those |
c4ff16c
to
d45b358
Compare
1085100
to
23bab18
Compare
@jreback @TomAugspurger want to take another look? See my comment on your review, I tried your suggestion of doing There was some additional code I added as well to allow as many tests to pass as possible without altering them. I excluded datetime64 from this new logic, because there is a lot that breaks if you don't leave that as it is. Also added a try/catch for some edge cases that occur in some tests. |
@jreback @jorisvandenbossche @WillAyd @TomAugspurger hi guys, can anyone take a look/see if we can merge? Thanks for your help |
Focused on regressions from the 1.0 release right now. Should have a chance
to look next week.
…On Fri, Jan 31, 2020 at 1:06 PM trevorbye ***@***.***> wrote:
@jreback <https://github.com/jreback> @jorisvandenbossche
<https://github.com/jorisvandenbossche> @WillAyd
<https://github.com/WillAyd> @TomAugspurger
<https://github.com/TomAugspurger> hi guys, can anyone take a look/see if
we can merge? Thanks for your help
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#30870?email_source=notifications&email_token=AAKAOIT4VHU6W6B4AK54D2TRARZB3A5CNFSM4KE7USY2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKPVQMI#issuecomment-580868145>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOIVLXTXFG4I7M3YPZWTRARZB3ANCNFSM4KE7USYQ>
.
|
@@ -19,6 +19,7 @@ def test_grouping_grouper(self, data_for_grouping): | |||
tm.assert_numpy_array_equal(gr1.grouper, df.A.values) | |||
tm.assert_extension_array_equal(gr2.grouper, data_for_grouping) | |||
|
|||
@pytest.mark.skip(reason="logic change to stop coercing dtypes on set_index()") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you are skipping these because the existing tests are incompatible with the new logic right?
If that's correct then we should actually update the tests and not just skip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will refactor them. From what I could see looking through tests, it seemed to me like a lot of them are skipped rather than refactored.
Would best practice be to try and always refactor? Thanks for your feedback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is rarer than others as it intentionally is breaking existing behavior; in such case yes would want to update existing tests and not just skip them
Also needs a little more communication in whatsnew around backwards compatibility but can leave that until later; I think would be helpful to see tests updated for reviewers to have a better understanding of scope of changes this would cause
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that make sense. I'll work on getting these tests updated so it is clearer what the scope of the change is. In general, these few I skipped were asserting two df's to be equal, which is true except of course for the index of each df were no longer equal, thus they failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, you need to actually change the results
@@ -148,6 +148,8 @@ Indexing | |||
^^^^^^^^ | |||
- Bug in slicing on a :class:`DatetimeIndex` with a partial-timestamp dropping high-resolution indices near the end of a year, quarter, or month (:issue:`31064`) | |||
- Bug in :meth:`PeriodIndex.get_loc` treating higher-resolution strings differently from :meth:`PeriodIndex.get_value` (:issue:`31172`) | |||
- Bug in :meth:`DataFrame.set_index` not preserving column dtype when creating :class:`Index` from a single column (:issue:`30517`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this to a sub-section; show a typical example of before & after (see how this is done in the Enhancements section)
@@ -4320,6 +4320,16 @@ def set_index( | |||
# everything else gets tried as a key; see GH 24969 | |||
try: | |||
found = col in self.columns | |||
if found: | |||
# get current dtype to preserve through index creation, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does this all mean? why would we not always preserve?
@@ -5458,6 +5458,7 @@ def ensure_index_from_sequences(sequences, names=None): | |||
---------- | |||
sequences : sequence of sequences | |||
names : sequence of str | |||
dtype : NumPy dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this a NumPy dtype? type dtype: Dtype
@@ -19,6 +19,7 @@ def test_grouping_grouper(self, data_for_grouping): | |||
tm.assert_numpy_array_equal(gr1.grouper, df.A.values) | |||
tm.assert_extension_array_equal(gr2.grouper, data_for_grouping) | |||
|
|||
@pytest.mark.skip(reason="logic change to stop coercing dtypes on set_index()") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, you need to actually change the results
@trevorbye is this still active? Can you address comments and fix merge conflicts? |
Closing as I think stale, but ping if you'd like to continue and can reopen |
changes to
set_index()
andensure_index_from_sequences()
to pass current dtype of key through call stack onto Index constructor, rather than letting Index constructor infer the dtype. For now I just aimed to patch a single key param fromset_index()
, as that is the lowest hanging fruit for this issue I think.removed
maybe_convert_objects()
call from_maybe_casted_values()
. This will preserve anobject
dtype on aset_index().reset_index()
round-trip. I came to the exact same conclusion as the user in this PR comment, for context. This makes sense conceptually, if your index was an object dtype, it should be an object dtype as a column. Of course, this is somewhat subjective so let me know your opinions, but this is what I interpreted from the discussion on the git issue.both of these changes break quite a few existing tests, most of which are testing specifically for the exact opposite conditions that this PR is intended to patch e.g. tests that look for object dtypes being coerced to more-specific types. What's the process for handling failed tests for a change like this? Would you attempt to refactor all, or delete the ones where the logic is no longer something to test for? I also ran
pytest pandas
on a fresh pull of master and had 13 fails, with 108 failures on my branch specifically, so I am also curious what to do about those previously-failing tests, or what is acceptable for a merge.black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff