Skip to content

BUG: is_dtype_equal(dtype, "string[pyarrow]") raises if pyarrow not installed #44327

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Dec 15, 2021

Conversation

shubham11941140
Copy link
Contributor

@shubham11941140 shubham11941140 commented Nov 5, 2021

Added instance to check string preventing import error

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a whatsnew note

@@ -150,6 +151,8 @@ def test_access_by_position(index):
assert index[-1] == index[size - 1]

msg = f"index {size} is out of bounds for axis 0 with size {size}"

assert not is_dtype_equal(index.dtype, "string[pyarrow]")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move to the tests for is_dtype_equal,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -601,7 +601,8 @@ def is_dtype_equal(source, target) -> bool:
# is_dtype_equal(CDT, "category") and CDT == "category"
try:
src = get_dtype(source)
if isinstance(src, ExtensionDtype):
if isinstance(src, ExtensionDtype) or isinstance(src, np.dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the example to the doc-string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

single isinstance call isinstance(src, (ExtensionDtype, np.dtype))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jreback
Copy link
Contributor

jreback commented Nov 6, 2021

do we actually have a test which exercises this? e.g. not on a build where pyarrow is installed?

@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Nov 6, 2021
@shubham11941140
Copy link
Contributor Author

do we actually have a test which exercises this? e.g. not on a build where pyarrow is installed?

I do not have pyarrow running on my system and got the initial import error. Actually I am able to run this on a system where pyarrow is not installed. It does not matter as pyarrow is not imported, so installation will not matter.

@shubham11941140
Copy link
Contributor Author

Whatsnew note added

@@ -115,6 +115,10 @@ def test_construction_from_string(self, dtype):
with pytest.raises(TypeError, match=msg):
CategoricalDtype.construct_from_string("foo")

def test_pyarrow_string_import_error(self, dtype):
# GH-44276
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you flesh this out a tiny bit e.g. 'dont raise if pyarrow is not installed'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

def test_pyarrow_string_import_error(self, dtype):
# GH-44276
assert not is_dtype_equal(dtype, "string[pyarrow]")
with pytest.raises(ImportError, match="pyarrow"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm we don't do this, rather use the decorator to skip the test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -24,7 +24,7 @@ Fixed regressions

Bug fixes
~~~~~~~~~
-
- Fixed bug in :bug:`is_dtype_equal` raising if pyarrow not installed (:issue:`44327`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this goes in 1.4

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jreback
Copy link
Contributor

jreback commented Nov 14, 2021

pls merge master as well

@@ -115,6 +115,12 @@ def test_construction_from_string(self, dtype):
with pytest.raises(TypeError, match=msg):
CategoricalDtype.construct_from_string("foo")

def test_pyarrow_string_import_error(self, dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another note: this test is added to TestCategoricalDtype, so categorical specific tests. I think we should move it to a more generic location. For example in tests/dtypes/test_common.py there is a test_dtype_equal that tests the basic functionality of is_dtype_equal. Maybe add the new test there (or in the arrow string dtype specific tests, @jbrockmendel any preference?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Request you to inform me where do I exactly put the test, in which file and which function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can put it in tests/dtypes/test_common.py, just after the existing test_dtype_equal test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

def test_pyarrow_string_import_error(self, dtype):
# GH-44276
assert not is_dtype_equal(dtype, "string[pyarrow]")
pytest.importorskip("pyarrow")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the decorator on the function itself to skip

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -115,6 +115,11 @@ def test_construction_from_string(self, dtype):
with pytest.raises(TypeError, match=msg):
CategoricalDtype.construct_from_string("foo")

@pytest.importorskip("pyarrow")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this test should be skipped if pyarrow is not installed? Because I understood the issue that checking the dtype equality currently raises if pyarrow is not installed, while it should return False instead.

So both if pyarrow is installed or not, this test should pass and is_dtype_equal should return False.

I think you can simply remove the decorator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -115,6 +115,12 @@ def test_construction_from_string(self, dtype):
with pytest.raises(TypeError, match=msg):
CategoricalDtype.construct_from_string("foo")

def test_pyarrow_string_import_error(self, dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can put it in tests/dtypes/test_common.py, just after the existing test_dtype_equal test.

@jorisvandenbossche jorisvandenbossche changed the title BUG: is_dtype_equal(dtype, "string[pyarrow]") raises if pyarrow not installedimport error resolved BUG: is_dtype_equal(dtype, "string[pyarrow]") raises if pyarrow not installed Nov 26, 2021
@@ -128,6 +128,12 @@ def test_dtype_equal(name1, dtype1, name2, dtype2):
assert not com.is_dtype_equal(dtype1, dtype2)


@pytest.mark.parametrize("name,dtype", list(dtypes.items()), ids=lambda x: str(x))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To the above dtypes dict, can you add an entry for "string": pd.StringDtype() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be a mistake because isinstance(pd.StringDtype(), str) is False. Adding such an entry in taking to the wrong location in the code segment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isinstance(pd.StringDtype(), str)

Where is this being checked?

My question is to ensure that is_dtype_equal(pd.StringDtype(), "string[pyarrow]") correctly works, since that was the original bug report.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if isinstance(target, str):
    if not isinstance(source, str):
        # GH#38516 ensure we get the same behavior from
        #  is_dtype_equal(CDT, "category") and CDT == "category"
        # GH-44276 ensures behaviour is consistent
        # is_dtype_equal (dtype, "string[pyarrow]")
        try:
            src = get_dtype(source)
            if isinstance(src, ExtensionDtype):
            if isinstance(src, (ExtensionDtype, np.dtype)):

In this code if not isinstance(source, str): returns True when source is pd.StringDtype()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line returns True for all cases in the current test. Also the other dtype values in the parametrization of this test are actual dtype objects, not strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but on adding this line into the parameters, this if not isinstance(source, str): returns True which is wrong. Shouldn't isinstance(source, str) be True if source is pd.StringDtype()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, isinstance(source, str) checks whether source is an actual python string object, while pd.StringDtype is a dtype object, which is a different class, and thus this is expected to return False (the fact that StringDtype is a dtype object for an array that stores strings doesn't matter in that case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the testcase and fixed the Import Bug as well

@shubham11941140
Copy link
Contributor Author

@jorisvandenbossche @jreback any updates for me on this?

@@ -509,7 +509,7 @@ Conversion

Strings
^^^^^^^
-
- Fixed bug in :bug:`is_dtype_equal` raising if pyarrow not installed (:issue:`44327`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this won't render & is not public facing anyhow. remove the note

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not understand the part where "is not public facing anyhow.". I will anyhow remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I remove this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can keep it, but reword it a bit to not mention is_dtype_equal explicitly. Something like "Fixed a bug in checking if a dtype is the string[pyarrow] dtype incorrectly raising an error when pyarrow is not installed (:issue:...)"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -596,12 +596,17 @@ def is_dtype_equal(source, target) -> bool:
False
"""
if isinstance(target, str):
if not isinstance(source, str):
if get_dtype(source) == "string":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is directly False, there is no need for further checking for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason i ask is this might be a performance issue here (this code is hit a lot so we care about the code path)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the code to remove this code segment, now there should not be a problem.

return src == target
if isinstance(src, (ExtensionDtype, np.dtype)):

return type(src) == type(target)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't change this to only checking the class of the dtype, we want to check the equality of the instances. See the failing tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted Back to the Original Code, saw the mistake.

return src == target
if isinstance(src, (ExtensionDtype, np.dtype)):

return type(src) == type(target)
except (TypeError, AttributeError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can probably catch the ImportError here (so it always returns False if the above raises an ImportError)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

try:
src = get_dtype(source)
if isinstance(src, ExtensionDtype):
if isinstance(src, (ExtensionDtype, np.dtype)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this change still needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is needed because dtypes such as int64 float64 and other dtypes actually are np.dtype, so if I am not adding this check, the test is failing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is needed because dtypes such as int64 float64 and other dtypes actually are np.dtype, so if I am not adding this check, the test is failing.

It is not needed if we catch ImportError more generally, I think; I pushed some small commits to verify this.

@shubham11941140
Copy link
Contributor Author

@jorisvandenbossche @jreback any updates for me on this?

@jreback
Copy link
Contributor

jreback commented Dec 14, 2021

@shubham11941140 you need to merge master

@shubham11941140
Copy link
Contributor Author

Done.

@jorisvandenbossche jorisvandenbossche merged commit 85c221a into pandas-dev:master Dec 15, 2021
@jorisvandenbossche
Copy link
Member

Thanks @shubham11941140

@jorisvandenbossche jorisvandenbossche added this to the 1.4 milestone Dec 15, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request Dec 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: is_dtype_equal(dtype, "string[pyarrow]") raises if pyarrow not installed
4 participants