-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Fix select_dtypes(include='int') for Windows. #36808
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/dtypes/common.py
Outdated
@@ -1796,6 +1796,11 @@ def pandas_dtype(dtype) -> DtypeObj: | |||
# try a numpy dtype | |||
# raise a consistent TypeError if failed | |||
try: | |||
# int is mapped to different types (int32, in64) on Windows and Linux |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, unfortunately, so is numpy array construction,
existing behaviour
>>> arr = np.array([1, 2, 3]) # or arr = np.array([1, 2, 3], dtype=int)
>>> arr.dtype
dtype('int32')
>>>
>>> ser = pd.Series(arr)
>>> ser
0 1
1 2
2 3
dtype: int32
>>>
>>> df = pd.DataFrame(ser)
>>>
>>> df.select_dtypes(include="int")
0
0 1
1 2
2 3
>>>
this workflow would break on Windows with this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it wouldn't work on windows with these changes:
>>>> df.select_dtypes(include="int")
Empty DataFrame
Columns: []
Index: [0, 1, 2]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, pandas will interpret int as np.int64 and numpy as np.int32 on Windows. Their behavior on Linux will be identical. Though, based on the example in #36596, I see that pandas was mapping integers to np.int64 on Windows by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't want to change this routine at all which has far reaching effects. not averse to a tactical change in .select_dtypes itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, I can roll back the changes here and add a warning that int
is ambiguous and np.int64
or np.int32
should be used in df.select_dtypes
instead. Another option will be to see why pandas does not follow numpy's approach and, in some cases, treats int
as int64
on all platforms (numpy
maps int
to np.int64
on Linux, and on Windows it will be np.int32
).
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small comment. @OlehKSS can you merge master & resolve conflicts?
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -331,6 +331,8 @@ Numeric | |||
- Bug in :class:`Series` where two :class:`Series` each have a :class:`DatetimeIndex` with different timezones having those indexes incorrectly changed when performing arithmetic operations (:issue:`33671`) | |||
- Bug in :meth:`pd._testing.assert_almost_equal` was incorrect for complex numeric types (:issue:`28235`) | |||
- Bug in :meth:`DataFrame.__rmatmul__` error handling reporting transposed shapes (:issue:`21581`) | |||
- Bug in :func:`select_dtypes` different behaviour on windows and linux for ``select_dtypes(include="int")`` (:issue:`36569`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Bug in :func:`select_dtypes` different behavior between Windows and Linux with ``include="int"`` (:issue:`36569`)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes in this pull request will create problems with the existing pandas behavior, as you can see in the comment above. Do you have any suggestions on how should I proceed with this pull request?
pandas/core/dtypes/common.py
Outdated
@@ -1796,6 +1796,11 @@ def pandas_dtype(dtype) -> DtypeObj: | |||
# try a numpy dtype | |||
# raise a consistent TypeError if failed | |||
try: | |||
# int is mapped to different types (int32, in64) on Windows and Linux |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't want to change this routine at all which has far reaching effects. not averse to a tactical change in .select_dtypes itself.
@jreback
Another possible routes:
|
include = np.bool_, "int" | ||
r = df.select_dtypes(include=include, exclude=exclude) | ||
e = df[["b", "e"]] | ||
tm.assert_frame_equal(r, e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls separate this out into a dedicated test with a descriptive name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@OlehKSS if you merge master and address @jbrockmendel comment we'll re-review
@arw2019 I merged this branch to the latest master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks ok, can you add a whatsnew note in other enhancements for 1.3 and merge master
|
||
exclude = ("datetime",) | ||
include = "bool", int | ||
r = df.select_dtypes(include=include, exclude=exclude) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this result and expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
) | ||
exclude = (np.datetime64,) | ||
include = np.bool_, "int" | ||
r = df.select_dtypes(include=include, exclude=exclude) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also add the case where inlclude='integer'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
@jreback I have updated this pull request according to your latest review. Could you take a look at it again? |
"f": pd.date_range("now", periods=3).values, | ||
} | ||
) | ||
exclude = (np.datetime64,) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parameterize on these types (e.g. you can add include & excude as paramters), the this is much simpler to grok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that will be nicer for sure. I updated this pull request with the requested changes.
thanks @OlehKSS very nice! |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff