Fix select_dtypes(include='int') for Windows. #36808

OlehKSS · 2020-10-02T15:51:32Z

closes BUG: select_dtypes(include="int") has different behaviour on windows and linux #36596
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

simonjayhawkins · 2020-10-02T17:22:33Z

pandas/core/dtypes/common.py

@@ -1796,6 +1796,11 @@ def pandas_dtype(dtype) -> DtypeObj:
    # try a numpy dtype
    # raise a consistent TypeError if failed
    try:
+        # int is mapped to different types (int32, in64) on Windows and Linux


hmm, unfortunately, so is numpy array construction,

existing behaviour

>>> arr = np.array([1, 2, 3])  # or arr = np.array([1, 2, 3], dtype=int)
>>> arr.dtype
dtype('int32')
>>>
>>> ser = pd.Series(arr)
>>> ser
0    1
1    2
2    3
dtype: int32
>>>
>>> df = pd.DataFrame(ser)
>>>
>>> df.select_dtypes(include="int")
   0
0  1
1  2
2  3
>>>

this workflow would break on Windows with this change?

Yes, it wouldn't work on windows with these changes:

>>> df.select_dtypes(include="int")
Empty DataFrame
Columns: []
Index: [0, 1, 2]

In this case, on Windows pandas will interpret int as np.int64 while numpy interprets it as np.int32, so the two disagree. On Linux their behavior is identical. However, based on the example in #36596, pandas was already mapping integers to np.int64 on Windows by default.

we don't want to change this routine at all which has far reaching effects. not averse to a tactical change in .select_dtypes itself.

In that case, I can roll back the changes here and add a warning that int is ambiguous and that np.int64 or np.int32 should be used in df.select_dtypes instead. Another option would be to find out why pandas does not follow numpy's approach and, in some cases, treats int as int64 on all platforms (numpy maps int to np.int64 on Linux and to np.int32 on Windows).
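
To make the ambiguity discussed in this thread concrete, here is a minimal sketch that inspects what plain int resolves to in NumPy on the current machine. It only assumes NumPy is installed; the printed width depends on the platform and the NumPy version, so no fixed output is claimed.

```python
import numpy as np

# np.dtype(int) resolves to NumPy's platform default integer type.
default_int = np.dtype(int)
print(default_int.name)  # width depends on the platform and NumPy version

# Explicit widths mean the same thing on every platform.
arr32 = np.array([1, 2, 3], dtype=np.int32)
arr64 = np.array([1, 2, 3], dtype=np.int64)

# Exactly one of these matches the platform default.
matches_32 = arr32.dtype == default_int
matches_64 = arr64.dtype == default_int
print(matches_32, matches_64)
```

Spelling out np.int32 or np.int64 avoids depending on whichever of the two the platform default happens to be.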

github-actions · 2020-11-03T00:10:40Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

arw2019

Small comment. @OlehKSS can you merge master & resolve conflicts?

cc @simonjayhawkins

arw2019 · 2020-11-06T20:03:07Z

doc/source/whatsnew/v1.2.0.rst

@@ -331,6 +331,8 @@ Numeric
 - Bug in :class:`Series` where two :class:`Series` each have a :class:`DatetimeIndex` with different timezones having those indexes incorrectly changed when performing arithmetic operations (:issue:`33671`)
 - Bug in :meth:`pd._testing.assert_almost_equal` was incorrect for complex numeric types (:issue:`28235`)
 - Bug in :meth:`DataFrame.__rmatmul__` error handling reporting transposed shapes (:issue:`21581`)
+- Bug in :func:`select_dtypes` different behaviour on windows and linux for ``select_dtypes(include="int")`` (:issue:`36569`)


- Bug in :func:`select_dtypes` different behavior between Windows and Linux with ``include="int"`` (:issue:`36569`)

The changes in this pull request will create problems with the existing pandas behavior, as you can see in the comment above. Do you have any suggestions on how I should proceed with this pull request?

jreback · 2020-11-09T03:06:52Z

pandas/core/dtypes/common.py

@@ -1796,6 +1796,11 @@ def pandas_dtype(dtype) -> DtypeObj:
    # try a numpy dtype
    # raise a consistent TypeError if failed
    try:
+        # int is mapped to different types (int32, in64) on Windows and Linux


we don't want to change this routine at all which has far reaching effects. not averse to a tactical change in .select_dtypes itself.


OlehKSS · 2020-11-10T22:15:25Z

@jreback
What I see as possible changes to .select_dtypes:

Treat int both as np.int32 and np.int64. If we have columns with both 64- and 32-bit integers we'll keep all of them
I implemented this behavior in the current version, so now this example and this example should both work
Map int to either np.int32 or np.int64 depending on the type of the given dataframe (np.int32, np.int64)
Raise a warning / error that int is ambiguous, and either np.int32 or np.int64 should be provided

Other possible routes:

Change maybe_convert_objects so it would follow numpy's behavior
Map int to np.int64 all the time, as was done previously in this pull request
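
The first option above (treating int as both np.int32 and np.int64) could be sketched roughly as follows. The helper name expand_int_alias is hypothetical and is not part of pandas; it only illustrates the idea of widening the ambiguous alias before the selection happens, and is not the code that was merged.

```python
import numpy as np


def expand_int_alias(dtypes):
    """Hypothetical helper: expand plain int / "int" into both fixed widths."""
    expanded = []
    for dtype in dtypes:
        # Only the ambiguous spellings are widened; everything else passes through.
        if dtype is int or (isinstance(dtype, str) and dtype == "int"):
            expanded.extend([np.int32, np.int64])
        else:
            expanded.append(dtype)
    return expanded


include = expand_int_alias(["int", "bool"])
print(include)
```

The resulting list could then be passed as include= to df.select_dtypes, so columns of either width are kept regardless of the platform default.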

jbrockmendel · 2020-11-12T04:13:20Z

pandas/tests/frame/methods/test_select_dtypes.py

+        include = np.bool_, "int"
+        r = df.select_dtypes(include=include, exclude=exclude)
+        e = df[["b", "e"]]
+        tm.assert_frame_equal(r, e)


pls separate this out into a dedicated test with a descriptive name

arw2019

@OlehKSS if you merge master and address @jbrockmendel comment we'll re-review

OlehKSS · 2020-11-30T17:32:05Z

@arw2019 I merged this branch to the latest master.
@jbrockmendel I added a separate unit test.
@jreback As you suggested, I moved the code from pandas_dtype into DataFrame.select_dtypes. In detail, this method will now treat int as both np.int32 and np.int64. If a frame has columns with both 64- and 32-bit integers, it will return all of them.
I implemented this behavior in the current version, so now this example and that example should both work.
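
As an illustration of the behavior described above, the following sketch builds a frame with one 32-bit and one 64-bit integer column. The column names are made up for the example. Passing both widths explicitly selects both columns on any platform; the result of include="int" is printed without a claimed output because it depends on the pandas version in use.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": np.array([1, 2, 3], dtype=np.int32),
        "b": np.array([4, 5, 6], dtype=np.int64),
        "c": [1.0, 2.0, 3.0],
    }
)

# Naming both widths explicitly is unambiguous on every platform.
explicit = df.select_dtypes(include=[np.int32, np.int64])
print(list(explicit.columns))  # ['a', 'b']

# The alias under discussion; which integer columns it keeps is version dependent.
by_alias = df.select_dtypes(include="int")
print(list(by_alias.columns))
```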

jreback

looks ok, can you add a whatsnew note in other enhancements for 1.3 and merge master

jreback · 2020-12-29T17:27:13Z

pandas/tests/frame/methods/test_select_dtypes.py

+
+        exclude = ("datetime",)
+        include = "bool", int
+        r = df.select_dtypes(include=include, exclude=exclude)


call this result and expected


jreback · 2020-12-29T17:28:04Z

pandas/tests/frame/methods/test_select_dtypes.py

+        )
+        exclude = (np.datetime64,)
+        include = np.bool_, "int"
+        r = df.select_dtypes(include=include, exclude=exclude)


can you also add the case where include='integer'

OlehKSS · 2021-01-18T11:27:58Z

@jreback I have updated this pull request according to your latest review. Could you take a look at it again?

jreback · 2021-01-19T16:55:37Z

pandas/tests/frame/methods/test_select_dtypes.py

+                "f": pd.date_range("now", periods=3).values,
+            }
+        )
+        exclude = (np.datetime64,)


can you parameterize on these types (e.g. you can add include & exclude as parameters), then this is much simpler to grok.

Yes, that will be nicer for sure. I updated this pull request with the requested changes.
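
For readers unfamiliar with the pattern requested here, this is a minimal sketch of a test parametrized over include, assuming pytest is available. The test name, columns, and cases are illustrative and are not the test that was added in this pull request.

```python
import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "include, expected_cols",
    [
        ([np.int32, np.int64], ["a", "b"]),
        ("integer", ["a", "b"]),
        ("bool", ["d"]),
    ],
)
def test_select_dtypes_include(include, expected_cols):
    df = pd.DataFrame(
        {
            "a": np.array([1, 2, 3], dtype=np.int32),
            "b": np.array([4, 5, 6], dtype=np.int64),
            "c": [1.0, 2.0, 3.0],
            "d": [True, False, True],
        }
    )
    result = df.select_dtypes(include=include)
    expected = df[expected_cols]
    pd.testing.assert_frame_equal(result, expected)
```

Each tuple becomes its own test case, so a failure report names the exact include value that broke.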

jreback · 2021-02-07T17:06:47Z

thanks @OlehKSS very nice!

OlehKSS added 2 commits October 2, 2020 17:54

Fix select_dtypes(include='int') for Windows.

05406dc

Whatsnew entry was added.

4b75067

OlehKSS force-pushed the issue-36596 branch from 8cd2b8a to 4b75067 Compare October 2, 2020 15:55

OlehKSS mentioned this pull request Oct 2, 2020

BUG: select_dtypes(include="int") has different behaviour on windows and linux #36596

Closed

2 tasks

simonjayhawkins reviewed Oct 2, 2020


simonjayhawkins added the Dtype Conversions and Windows labels Oct 2, 2020

Merge branch 'master' into issue-36596

fb8cae9

github-actions bot added the Stale label Nov 3, 2020

arw2019 reviewed Nov 6, 2020


Oleh Kozynets added 2 commits November 8, 2020 20:57

Merge Master

49826cb

Fix code style

bd6e724

jreback requested changes Nov 9, 2020


Oleh Kozynets added 2 commits November 10, 2020 23:04

Merge branch 'issue-36596' of https://github.com/OlehKSS/pandas into …

288efe9

…issue-36596

Fix int inference in select_dtypes.

39ef89b

Merge branch 'master' into issue-36596

f94722b

jbrockmendel reviewed Nov 12, 2020


Add a separate unit test.

810bbab

OlehKSS force-pushed the issue-36596 branch from 43ba17f to 810bbab Compare November 12, 2020 20:42

arw2019 added Needs Review and removed Stale labels Nov 18, 2020

arw2019 reviewed Nov 30, 2020


Oleh Kozynets added 2 commits November 30, 2020 16:59

Merge branch 'master' into issue-36596

9769b9d

Fix dtype in the unit test.

f45f271

Merge branch 'master' into issue-36596

e8b523a

jreback requested changes Dec 29, 2020


Merge branch 'master' into issue-36596

ac82794

Fix whatsnew, add new unit test.

a913bd8

jreback reviewed Jan 19, 2021


OlehKSS added 3 commits February 5, 2021 17:11

Parametrize tests.

502093b

Merge branch 'master' into issue-36596

9f26e18

Trigger CI.

06b7077

jreback approved these changes Feb 7, 2021


jreback added this to the 1.3 milestone Feb 7, 2021

jreback merged commit 8d48776 into pandas-dev:master Feb 7, 2021

CyberQin pushed a commit to CyberQin/pandas that referenced this pull request Feb 8, 2021

Fix select_dtypes(include='int') for Windows. (pandas-dev#36808)

362f039
