BUG: set_index() and reset_index() not preserving object dtypes #30870

trevorbye · 2020-01-09T22:51:42Z

changes to set_index() and ensure_index_from_sequences() to pass current dtype of key through call stack onto Index constructor, rather than letting Index constructor infer the dtype. For now I just aimed to patch a single key param from set_index(), as that is the lowest hanging fruit for this issue I think.
removed maybe_convert_objects() call from _maybe_casted_values(). This will preserve an object dtype on a set_index().reset_index() round-trip. I came to the exact same conclusion as the user in this PR comment, for context. This makes sense conceptually, if your index was an object dtype, it should be an object dtype as a column. Of course, this is somewhat subjective so let me know your opinions, but this is what I interpreted from the discussion on the git issue.
both of these changes break quite a few existing tests, most of which are testing specifically for the exact opposite conditions that this PR is intended to patch e.g. tests that look for object dtypes being coerced to more-specific types. What's the process for handling failed tests for a change like this? Would you attempt to refactor all, or delete the ones where the logic is no longer something to test for? I also ran pytest pandas on a fresh pull of master and had 13 fails, with 108 failures on my branch specifically, so I am also curious what to do about those previously-failing tests, or what is acceptable for a merge.

closes DataFrame.set_index() may not preserve dtype #30517
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

pep8speaks · 2020-01-09T22:51:49Z

Hello @trevorbye! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-31 18:25:09 UTC

trevorbye · 2020-01-09T22:53:06Z

@jreback for visibility

jreback

try to make an even smaller change this likely will break lots of things

pandas/tests/frame/test_set_index_preserve_dtype.py

trevorbye · 2020-01-10T01:06:46Z

try to make an even smaller change this likely will break lots of things

Any suggestions for making these changes smaller? I tried to touch as little as possible, but not sure how I could slim it down further.

Or another question would be, what do you see breaking a lot?

Thanks

trevorbye · 2020-01-13T20:12:06Z

@jreback @TomAugspurger any suggestions how to make this change smaller per your comment?

What are your thoughts on implementing this as non-default behavior? For example, adding a new optional param to both set_index() and reset_index() something like a boolean for preserve_dtype, that would enable the changes I've made here rather than make them the default. This would stop all these current tests from failing, and not break user's code. The tradeoff is the user would have to specify this param to get the new behavior, so I suppose that is a design question.

From the conversation in the issue I had the impression it should be the new default behavior, but I could have misinterpreted. Either way let me know what you think the path forward is and I will work on it. Thanks

TomAugspurger · 2020-01-13T20:23:38Z

FYI @trevorbye you've commited some pickle files that are being left around by a some test (we haven't discovered which one yet).

I not sure what Jeff had in mind for making this smaller.

I think round-tripping the dtypes should be the default behavior. It not doing so today is a bug.

In theory, we could handle this in the Index constructor and just not have it infer the dtype when given a typed object. But that would probably be an API breaking change.

trevorbye · 2020-01-14T04:19:43Z

@TomAugspurger thanks, I'll get rid of those files.

What is your best practice for refactoring/removing tests for a breaking change like this? There are quite a few failing, would you change the logic to reflect the new type inference (or lack thereof, in this case)?

For Index constructor, if you call it yourself you can pass the dtype param which will disable the type coercion. But, when the constructor is called from inside set_index(), it doesn't pass dtype param. The change I made there simply grabs the type of the col you're trying to create an index from, and passes that on to the constructor and uses the logic already in place.

Still learning the code base, but I'm wondering if there's a way to do this with less overall impact than the way I did it.

TomAugspurger · 2020-01-14T12:08:34Z

I'm not sure. But if we're calling the dtype not being preserved a bug, then we'll just need to update the tests that rely on the constructors re-inferring dtypes.

…

On Mon, Jan 13, 2020 at 10:19 PM trevorbye ***@***.***> wrote: @TomAugspurger <https://github.com/TomAugspurger> thanks, I'll get rid of those files. What is your best practice for refactoring/removing tests for a breaking change like this? There are quite a few failing, would you change the logic to reflect the new type inference (or lack thereof, in this case)? For Index constructor, if you call it yourself you can pass the dtype param which will disable the type coercion. But, when the constructor is called from *inside* set_index(), it doesn't pass dtype param. The change I made there simply grabs the type of the col you're trying to create an index from, and passes that on to the constructor and uses the logic already in place. Still learning the code base, but I'm wondering if there's a way to do this with less overall impact than the way I did it. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#30870?email_source=notifications&email_token=AAKAOISLMZCB6XOKWHH6HWDQ5U4OBA5CNFSM4KE7USY2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEI3G6YY#issuecomment-573992803>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAKAOIQYIG3QXRLHUHHMLY3Q5U4OBANCNFSM4KE7USYQ> .

jreback · 2020-01-14T12:42:38Z

@trevorbye yeah I agree with @TomAugspurger what we need to do here is actually fix this. This will by-definition break some tests which will need to be updated, some because we are specifically testing that functionaility, and some by happen-stance (we happened to construct the expected value with an inferred type).

So as I said above, my preference here is to leave the Index constructor alone, and simply pass in the dtype of what we are setting (now this will have to be slightly adjusted, so create a function to do this), e.g. a int* column -> np.dtype64 (as we don't support anything but Int64Index).

trevorbye · 2020-01-14T22:25:32Z

@trevorbye yeah I agree with @TomAugspurger what we need to do here is actually fix this. This will by-definition break some tests which will need to be updated, some because we are specifically testing that functionaility, and some by happen-stance (we happened to construct the expected value with an inferred type).

So as I said above, my preference here is to leave the Index constructor alone, and simply pass in the dtype of what we are setting (now this will have to be slightly adjusted, so create a function to do this), e.g. a int* column -> np.dtype64 (as we don't support anything but Int64Index).

@jreback thanks, I can refactor all those tests once we figure out the logic.

I think the change I made here is what you described; No change to Index constructor, I just passed it the current dtype. For reference, here is the main change I made, line 4266 in frame.py in my commit:

if found:
    # get current dtype to preserve through index creation
    current_dtype = self.dtypes.get(col).type

This is then passed to ensure_index_from_sequences(arrays, names, current_dtype) (I created the new optional param here), which is then forwarded on to the Index constructor. I've done some debugging, and from what I'm seeing, the Index (and subclasses like Int64Index etc.) constructors will automatically handle unsupported index types if you attempt to create them. Here's an example:

In [1]: df = pd.DataFrame({'A': pd.Series([1, 2, 3], dtype=np.int32), 'B': [1, 2, 3]})

In [2]: print(df.dtypes.get('A').type)
Out [2]: <class 'numpy.int32'>

In [3]: print(df.set_index('A').index)
Out [3]: Int64Index([1, 2, 3], dtype='int64', name='A')

I ran the above code with my changes, and even though it's grabbing np.int32 from the series type and using that as a param to the Index constructor dtype, it automatically detects that it's unsupported as an index type and coerces to int64.

So I don't think we need a new function to pre-convert these types before passing to Index constructors, as the logic is already there. Is this what you meant by a new function? Apologies if I'm going in circles, but to me it seems like the implementation we're discussing, and the implementation I made, are the same core logic, minus any edge cases or additional considerations I may be missing.

Thanks for your help!

trevorbye · 2020-01-22T01:01:54Z

@jreback any feedback on my above comment? Want to see what you think before I go refactor a ton of tests

jreback · 2020-01-22T01:49:34Z

will look soon; you have a lot of pickle files committed; remove those

pandas/core/frame.py

trevorbye · 2020-01-30T22:41:25Z

@jreback @TomAugspurger want to take another look? See my comment on your review, I tried your suggestion of doing arrays.append(Index(col, dtype=col.dtype), but setting the dtype gets overridden by the outer constructor in that case.

There was some additional code I added as well to allow as many tests to pass as possible without altering them. I excluded datetime64 from this new logic, because there is a lot that breaks if you don't leave that as it is. Also added a try/catch for some edge cases that occur in some tests.

…ype-patch

trevorbye · 2020-01-31T19:06:03Z

@jreback @jorisvandenbossche @WillAyd @TomAugspurger hi guys, can anyone take a look/see if we can merge? Thanks for your help

TomAugspurger · 2020-01-31T19:34:42Z

Focused on regressions from the 1.0 release right now. Should have a chance to look next week.

…

On Fri, Jan 31, 2020 at 1:06 PM trevorbye ***@***.***> wrote: @jreback <https://github.com/jreback> @jorisvandenbossche <https://github.com/jorisvandenbossche> @WillAyd <https://github.com/WillAyd> @TomAugspurger <https://github.com/TomAugspurger> hi guys, can anyone take a look/see if we can merge? Thanks for your help — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#30870?email_source=notifications&email_token=AAKAOIT4VHU6W6B4AK54D2TRARZB3A5CNFSM4KE7USY2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKPVQMI#issuecomment-580868145>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAKAOIVLXTXFG4I7M3YPZWTRARZB3ANCNFSM4KE7USYQ> .

WillAyd · 2020-01-31T20:50:58Z

pandas/tests/extension/base/groupby.py

@@ -19,6 +19,7 @@ def test_grouping_grouper(self, data_for_grouping):
        tm.assert_numpy_array_equal(gr1.grouper, df.A.values)
        tm.assert_extension_array_equal(gr2.grouper, data_for_grouping)

+    @pytest.mark.skip(reason="logic change to stop coercing dtypes on set_index()")


So you are skipping these because the existing tests are incompatible with the new logic right?

If that's correct then we should actually update the tests and not just skip

Yes, I will refactor them. From what I could see looking through tests, it seemed to me like a lot of them are skipped rather than refactored.

Would best practice be to try and always refactor? Thanks for your feedback

This PR is rarer than others as it intentionally is breaking existing behavior; in such case yes would want to update existing tests and not just skip them

Also needs a little more communication in whatsnew around backwards compatibility but can leave that until later; I think would be helpful to see tests updated for reviewers to have a better understanding of scope of changes this would cause

Okay, that make sense. I'll work on getting these tests updated so it is clearer what the scope of the change is. In general, these few I skipped were asserting two df's to be equal, which is true except of course for the index of each df were no longer equal, thus they failed.

yes, you need to actually change the results

jreback · 2020-02-02T23:05:36Z

doc/source/whatsnew/v1.1.0.rst

@@ -148,6 +148,8 @@ Indexing
 ^^^^^^^^
 - Bug in slicing on a :class:`DatetimeIndex` with a partial-timestamp dropping high-resolution indices near the end of a year, quarter, or month (:issue:`31064`)
 - Bug in :meth:`PeriodIndex.get_loc` treating higher-resolution strings differently from :meth:`PeriodIndex.get_value` (:issue:`31172`)
+- Bug in :meth:`DataFrame.set_index` not preserving column dtype when creating :class:`Index` from a single column (:issue:`30517`)


move this to a sub-section; show a typical example of before & after (see how this is done in the Enhancements section)

jreback · 2020-02-02T23:06:41Z

pandas/core/frame.py

@@ -4320,6 +4320,16 @@ def set_index(
                # everything else gets tried as a key; see GH 24969
                try:
                    found = col in self.columns
+                    if found:
+                        # get current dtype to preserve through index creation,


what does this all mean? why would we not always preserve?

jreback · 2020-02-02T23:07:28Z

pandas/core/indexes/base.py

@@ -5458,6 +5458,7 @@ def ensure_index_from_sequences(sequences, names=None):
    ----------
    sequences : sequence of sequences
    names : sequence of str
+    dtype : NumPy dtype


why is this a NumPy dtype? type dtype: Dtype

jreback · 2020-02-02T23:08:06Z

pandas/tests/extension/base/groupby.py

@@ -19,6 +19,7 @@ def test_grouping_grouper(self, data_for_grouping):
        tm.assert_numpy_array_equal(gr1.grouper, df.A.values)
        tm.assert_extension_array_equal(gr2.grouper, data_for_grouping)

+    @pytest.mark.skip(reason="logic change to stop coercing dtypes on set_index()")


yes, you need to actually change the results

WillAyd · 2020-03-14T21:50:51Z

@trevorbye is this still active? Can you address comments and fix merge conflicts?

WillAyd · 2020-04-07T16:22:42Z

Closing as I think stale, but ping if you'd like to continue and can reopen

trevorbye mentioned this pull request Jan 9, 2020

DataFrame.set_index() may not preserve dtype #30517

Open

jreback requested changes Jan 9, 2020

View reviewed changes

pandas/tests/frame/test_set_index_preserve_dtype.py Outdated Show resolved Hide resolved

jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Jan 14, 2020

trevorbye requested a review from WillAyd January 24, 2020 01:33

jreback requested changes Jan 24, 2020

View reviewed changes

pandas/core/frame.py Show resolved Hide resolved

pandas/core/frame.py Show resolved Hide resolved

trevorbye force-pushed the dtype-patch branch from c4ff16c to d45b358 Compare January 29, 2020 06:42

Trevor Bye added 13 commits January 30, 2020 13:50

initial test case

d37df39

test change

c572d3a

changes to set_index to preserve dtype from field

cf2100f

fixing test error

5229cea

removed object type inf from _maybe_casted_values

c6b02a4

missing char

d52bbe8

moved tests

9c8439d

removing test pkl files

7eadae8

hard reset

0fcd059

pep fix

a33e307

black

5c65100

isort

e3e27a3

restoring actions file

23bab18

trevorbye force-pushed the dtype-patch branch from 1085100 to 23bab18 Compare January 30, 2020 22:07

Trevor Bye added 2 commits January 31, 2020 09:39

whatsnew entry

57f7e6d

Merge branch 'master' of https://github.com/pandas-dev/pandas into dt…

85155d8

…ype-patch

WillAyd requested changes Jan 31, 2020

View reviewed changes

jreback requested changes Feb 2, 2020

View reviewed changes

WillAyd closed this Apr 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: set_index() and reset_index() not preserving object dtypes #30870

BUG: set_index() and reset_index() not preserving object dtypes #30870

trevorbye commented Jan 9, 2020 •

edited

Loading

pep8speaks commented Jan 9, 2020 •

edited

Loading

trevorbye commented Jan 9, 2020

jreback left a comment

trevorbye commented Jan 10, 2020

trevorbye commented Jan 13, 2020

TomAugspurger commented Jan 13, 2020

trevorbye commented Jan 14, 2020

TomAugspurger commented Jan 14, 2020 via email

jreback commented Jan 14, 2020

trevorbye commented Jan 14, 2020

trevorbye commented Jan 22, 2020

jreback commented Jan 22, 2020

trevorbye commented Jan 30, 2020

trevorbye commented Jan 31, 2020

TomAugspurger commented Jan 31, 2020 via email

WillAyd Jan 31, 2020

trevorbye Jan 31, 2020

WillAyd Jan 31, 2020

trevorbye Feb 1, 2020

jreback Feb 2, 2020

jreback Feb 2, 2020

jreback Feb 2, 2020

jreback Feb 2, 2020

jreback Feb 2, 2020

WillAyd commented Mar 14, 2020

WillAyd commented Apr 7, 2020

BUG: set_index() and reset_index() not preserving object dtypes #30870

BUG: set_index() and reset_index() not preserving object dtypes #30870

Conversation

trevorbye commented Jan 9, 2020 • edited Loading

pep8speaks commented Jan 9, 2020 • edited Loading

Comment last updated at 2020-01-31 18:25:09 UTC

trevorbye commented Jan 9, 2020

jreback left a comment

Choose a reason for hiding this comment

trevorbye commented Jan 10, 2020

trevorbye commented Jan 13, 2020

TomAugspurger commented Jan 13, 2020

trevorbye commented Jan 14, 2020

TomAugspurger commented Jan 14, 2020 via email

jreback commented Jan 14, 2020

trevorbye commented Jan 14, 2020

trevorbye commented Jan 22, 2020

jreback commented Jan 22, 2020

trevorbye commented Jan 30, 2020

trevorbye commented Jan 31, 2020

TomAugspurger commented Jan 31, 2020 via email

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

WillAyd commented Mar 14, 2020

WillAyd commented Apr 7, 2020

trevorbye commented Jan 9, 2020 •

edited

Loading

pep8speaks commented Jan 9, 2020 •

edited

Loading