Skip to content

Fix select_dtypes(include='int') for Windows. #36808

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Feb 7, 2021
Merged
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -433,6 +433,7 @@ Numeric
- Bug in :class:`DataFrame` allowing arithmetic operations with list of array-likes with undefined results. Behavior changed to raising ``ValueError`` (:issue:`36702`)
- Bug in :meth:`DataFrame.std`` with ``timedelta64`` dtype and ``skipna=False`` (:issue:`37392`)
- Bug in :meth:`DataFrame.min` and :meth:`DataFrame.max` with ``datetime64`` dtype and ``skipna=False`` (:issue:`36907`)
- Bug in :func:`select_dtypes` different behavior between Windows and Linux with ``include="int"`` (:issue:`36569`)

Conversion
^^^^^^^^^^
Expand Down
5 changes: 5 additions & 0 deletions pandas/core/dtypes/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -1796,6 +1796,11 @@ def pandas_dtype(dtype) -> DtypeObj:
# try a numpy dtype
# raise a consistent TypeError if failed
try:
# int is mapped to different types (int32, in64) on Windows and Linux
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, unfortunately, so is numpy array construction,

existing behaviour

>>> arr = np.array([1, 2, 3])  # or arr = np.array([1, 2, 3], dtype=int)
>>> arr.dtype
dtype('int32')
>>>
>>> ser = pd.Series(arr)
>>> ser
0    1
1    2
2    3
dtype: int32
>>>
>>> df = pd.DataFrame(ser)
>>>
>>> df.select_dtypes(include="int")
   0
0  1
1  2
2  3
>>>

this workflow would break on Windows with this change?

Copy link
Contributor Author

@OlehKSS OlehKSS Oct 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it wouldn't work on windows with these changes:

>>>> df.select_dtypes(include="int")

Empty DataFrame
Columns: []
Index: [0, 1, 2]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, pandas will interpret int as np.int64 and numpy as np.int32 on Windows. Their behavior on Linux will be identical. Though, based on the example in #36596, I see that pandas was mapping integers to np.int64 on Windows by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't want to change this routine at all which has far reaching effects. not averse to a tactical change in .select_dtypes itself.

Copy link
Contributor Author

@OlehKSS OlehKSS Nov 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, I can roll back the changes here and add a warning that int is ambiguous and np.int64 or np.int32 should be used in df.select_dtypes instead. Another option will be to see why pandas does not follow numpy's approach and, in some cases, treats int as int64 on all platforms (numpy maps int to np.int64 on Linux, and on Windows it will be np.int32).

# see https://github.com/numpy/numpy/issues/9464
if (isinstance(dtype, str) and dtype == "int") or (dtype is int):
dtype = "int64"

npdtype = np.dtype(dtype)
except SyntaxError as err:
# np.dtype uses `eval` which can raise SyntaxError
Expand Down
24 changes: 13 additions & 11 deletions pandas/tests/dtypes/test_common.py
Original file line number Diff line number Diff line change
Expand Up @@ -43,20 +43,22 @@ def test_invalid_dtype_error(self, box):
com.pandas_dtype(box)

@pytest.mark.parametrize(
"dtype",
"dtype_input,dtype_output",
[
object,
"float64",
np.object_,
np.dtype("object"),
"O",
np.float64,
float,
np.dtype("float64"),
(object, object),
("float64", np.float64),
(np.object_, np.object_),
(np.dtype("object"), np.object_),
("O", object),
(np.float64, np.float64),
(float, float),
(np.dtype("float64"), np.float64),
("int", np.int64),
(int, np.int64),
],
)
def test_pandas_dtype_valid(self, dtype):
assert com.pandas_dtype(dtype) == dtype
def test_pandas_dtype_valid(self, dtype_input, dtype_output):
assert com.pandas_dtype(dtype_input) == dtype_output

@pytest.mark.parametrize(
"dtype", ["M8[ns]", "m8[ns]", "object", "float64", "int64"]
Expand Down
6 changes: 6 additions & 0 deletions pandas/tests/frame/methods/test_select_dtypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -82,6 +82,12 @@ def test_select_dtypes_exclude_include_using_list_like(self):
e = df[["b", "c", "e"]]
tm.assert_frame_equal(r, e)

exclude = (np.datetime64,)
include = np.bool_, "int"
r = df.select_dtypes(include=include, exclude=exclude)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also add the case where inlclude='integer'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

e = df[["b", "e"]]
tm.assert_frame_equal(r, e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls separate this out into a dedicated test with a descriptive name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok


exclude = ("datetime",)
include = "bool", "int64", "int32"
r = df.select_dtypes(include=include, exclude=exclude)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this result and expected

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

Expand Down