Skip to content

Fix select_dtypes(include='int') for Windows. #36808

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Feb 7, 2021

Conversation

OlehKSS
Copy link
Contributor

@OlehKSS OlehKSS commented Oct 2, 2020

@@ -1796,6 +1796,11 @@ def pandas_dtype(dtype) -> DtypeObj:
# try a numpy dtype
# raise a consistent TypeError if failed
try:
# int is mapped to different types (int32, in64) on Windows and Linux
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, unfortunately, so is numpy array construction,

existing behaviour

>>> arr = np.array([1, 2, 3])  # or arr = np.array([1, 2, 3], dtype=int)
>>> arr.dtype
dtype('int32')
>>>
>>> ser = pd.Series(arr)
>>> ser
0    1
1    2
2    3
dtype: int32
>>>
>>> df = pd.DataFrame(ser)
>>>
>>> df.select_dtypes(include="int")
   0
0  1
1  2
2  3
>>>

this workflow would break on Windows with this change?

Copy link
Contributor Author

@OlehKSS OlehKSS Oct 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it wouldn't work on windows with these changes:

>>>> df.select_dtypes(include="int")

Empty DataFrame
Columns: []
Index: [0, 1, 2]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, pandas will interpret int as np.int64 and numpy as np.int32 on Windows. Their behavior on Linux will be identical. Though, based on the example in #36596, I see that pandas was mapping integers to np.int64 on Windows by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't want to change this routine at all which has far reaching effects. not averse to a tactical change in .select_dtypes itself.

Copy link
Contributor Author

@OlehKSS OlehKSS Nov 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, I can roll back the changes here and add a warning that int is ambiguous and np.int64 or np.int32 should be used in df.select_dtypes instead. Another option will be to see why pandas does not follow numpy's approach and, in some cases, treats int as int64 on all platforms (numpy maps int to np.int64 on Linux, and on Windows it will be np.int32).

@simonjayhawkins simonjayhawkins added Dtype Conversions Unexpected or buggy dtype conversions Windows Windows OS labels Oct 2, 2020
@github-actions
Copy link
Contributor

github-actions bot commented Nov 3, 2020

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Nov 3, 2020
Copy link
Member

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comment. @OlehKSS can you merge master & resolve conflicts?

cc @simonjayhawkins

@@ -331,6 +331,8 @@ Numeric
- Bug in :class:`Series` where two :class:`Series` each have a :class:`DatetimeIndex` with different timezones having those indexes incorrectly changed when performing arithmetic operations (:issue:`33671`)
- Bug in :meth:`pd._testing.assert_almost_equal` was incorrect for complex numeric types (:issue:`28235`)
- Bug in :meth:`DataFrame.__rmatmul__` error handling reporting transposed shapes (:issue:`21581`)
- Bug in :func:`select_dtypes` different behaviour on windows and linux for ``select_dtypes(include="int")`` (:issue:`36569`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- Bug in :func:`select_dtypes` different behavior between Windows and Linux with ``include="int"`` (:issue:`36569`)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

Copy link
Contributor Author

@OlehKSS OlehKSS Nov 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in this pull request will create problems with the existing pandas behavior, as you can see in the comment above. Do you have any suggestions on how should I proceed with this pull request?

@@ -1796,6 +1796,11 @@ def pandas_dtype(dtype) -> DtypeObj:
# try a numpy dtype
# raise a consistent TypeError if failed
try:
# int is mapped to different types (int32, in64) on Windows and Linux
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't want to change this routine at all which has far reaching effects. not averse to a tactical change in .select_dtypes itself.

@OlehKSS
Copy link
Contributor Author

OlehKSS commented Nov 10, 2020

@jreback
What I see as possible changes to .select_dtypes:

  1. Treat int both as np.int32 and np.int64. If we have columns with both 64- and 32-bit integers we'll keep all of them
    I implemented this behavior in the current version, so now this example and this exaple should both work
  2. Map int to either np.int32 or np.int64 depending on the type of the given dataframe (np.int32, np.int64)
  3. Raise a warning / error that int is ambiguous, and either np.int32 or np.int64 should be provided

Another possible routes:

  1. Change maybe_convert_objects so it would follow numpy's behavior
  2. Map_int_ to np.int64 all the time, as was done previously in this pull request

include = np.bool_, "int"
r = df.select_dtypes(include=include, exclude=exclude)
e = df[["b", "e"]]
tm.assert_frame_equal(r, e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls separate this out into a dedicated test with a descriptive name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

Copy link
Member

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@OlehKSS if you merge master and address @jbrockmendel comment we'll re-review

@OlehKSS
Copy link
Contributor Author

OlehKSS commented Nov 30, 2020

@arw2019 I merged this branch to the latest master.
@jbrockmendel I added a separate unit test.
@jreback As you suggested, I moved code from pandas_dtype into DataFrame.select_dtypes. In detail, this method will treat int both as np.int32 and np.int64. Having columns with both 64- and 32-bit integers it will return all of them.
I implemented this behavior in the current version, so now this example and that exaple should both work.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks ok, can you add a whatsnew note in other enhancements for 1.3 and merge master


exclude = ("datetime",)
include = "bool", int
r = df.select_dtypes(include=include, exclude=exclude)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this result and expected

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

)
exclude = (np.datetime64,)
include = np.bool_, "int"
r = df.select_dtypes(include=include, exclude=exclude)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also add the case where inlclude='integer'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

@OlehKSS
Copy link
Contributor Author

OlehKSS commented Jan 18, 2021

@jreback I have updated this pull request according to your latest review. Could you take a look at it again?

"f": pd.date_range("now", periods=3).values,
}
)
exclude = (np.datetime64,)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize on these types (e.g. you can add include & excude as paramters), the this is much simpler to grok.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that will be nicer for sure. I updated this pull request with the requested changes.

@jreback jreback added this to the 1.3 milestone Feb 7, 2021
@jreback jreback merged commit 8d48776 into pandas-dev:master Feb 7, 2021
@jreback
Copy link
Contributor

jreback commented Feb 7, 2021

thanks @OlehKSS very nice!

CyberQin pushed a commit to CyberQin/pandas that referenced this pull request Feb 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Windows Windows OS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: select_dtypes(include="int") has different behaviour on windows and linux
5 participants