Skip to content

BUG: is_dtype_equal(dtype, "string[pyarrow]") raises if pyarrow not installed #44327

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Dec 15, 2021
Merged
Show file tree
Hide file tree
Changes from 11 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.4.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -509,7 +509,7 @@ Conversion

Strings
^^^^^^^
-
- Fixed bug in :bug:`is_dtype_equal` raising if pyarrow not installed (:issue:`44327`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this won't render & is not public facing anyhow. remove the note

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not understand the part where "is not public facing anyhow.". I will anyhow remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I remove this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can keep it, but reword it a bit to not mention is_dtype_equal explicitly. Something like "Fixed a bug in checking if a dtype is the string[pyarrow] dtype incorrectly raising an error when pyarrow is not installed (:issue:...)"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

-

Interval
Expand Down
5 changes: 4 additions & 1 deletion pandas/core/dtypes/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -599,9 +599,12 @@ def is_dtype_equal(source, target) -> bool:
if not isinstance(source, str):
# GH#38516 ensure we get the same behavior from
# is_dtype_equal(CDT, "category") and CDT == "category"
# GH-44276 ensures behaviour is consistent
# is_dtype_equal (dtype, "string[pyarrow]")
try:
src = get_dtype(source)
if isinstance(src, ExtensionDtype):
if isinstance(src, (ExtensionDtype, np.dtype)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this change still needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is needed because dtypes such as int64 float64 and other dtypes actually are np.dtype, so if I am not adding this check, the test is failing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is needed because dtypes such as int64 float64 and other dtypes actually are np.dtype, so if I am not adding this check, the test is failing.

It is not needed if we catch ImportError more generally, I think; I pushed some small commits to verify this.


return src == target
except (TypeError, AttributeError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can probably catch the ImportError here (so it always returns False if the above raises an ImportError)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return False
Expand Down
6 changes: 6 additions & 0 deletions pandas/tests/dtypes/test_common.py
Original file line number Diff line number Diff line change
Expand Up @@ -128,6 +128,12 @@ def test_dtype_equal(name1, dtype1, name2, dtype2):
assert not com.is_dtype_equal(dtype1, dtype2)


@pytest.mark.parametrize("name,dtype", list(dtypes.items()), ids=lambda x: str(x))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To the above dtypes dict, can you add an entry for "string": pd.StringDtype() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be a mistake because isinstance(pd.StringDtype(), str) is False. Adding such an entry in taking to the wrong location in the code segment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isinstance(pd.StringDtype(), str)

Where is this being checked?

My question is to ensure that is_dtype_equal(pd.StringDtype(), "string[pyarrow]") correctly works, since that was the original bug report.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if isinstance(target, str):
    if not isinstance(source, str):
        # GH#38516 ensure we get the same behavior from
        #  is_dtype_equal(CDT, "category") and CDT == "category"
        # GH-44276 ensures behaviour is consistent
        # is_dtype_equal (dtype, "string[pyarrow]")
        try:
            src = get_dtype(source)
            if isinstance(src, ExtensionDtype):
            if isinstance(src, (ExtensionDtype, np.dtype)):

In this code if not isinstance(source, str): returns True when source is pd.StringDtype()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line returns True for all cases in the current test. Also the other dtype values in the parametrization of this test are actual dtype objects, not strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but on adding this line into the parameters, this if not isinstance(source, str): returns True which is wrong. Shouldn't isinstance(source, str) be True if source is pd.StringDtype()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, isinstance(source, str) checks whether source is an actual python string object, while pd.StringDtype is a dtype object, which is a different class, and thus this is expected to return False (the fact that StringDtype is a dtype object for an array that stores strings doesn't matter in that case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the testcase and fixed the Import Bug as well

def test_pyarrow_string_import_error(name, dtype):
# GH-44276
assert not com.is_dtype_equal(dtype, "string[pyarrow]")


@pytest.mark.parametrize(
"dtype1,dtype2",
[
Expand Down