TST: [ArrowStringArray] more parameterised testing - part 2 #40749

Merged
2 changes: 2 additions & 0 deletions pandas/conftest.py
@@ -1146,6 +1146,8 @@ def nullable_string_dtype(request):
    * 'string'
    * 'arrow_string'
    """
    from pandas.core.arrays.string_arrow import ArrowStringDtype  # noqa: F401
Member Author

I've added this to force registration of the arrow_string dtype, since running some tests individually sometimes fails. This will be removed in #39908.

Contributor

you need to handle the import error here and xfail

Member Author

Can you elaborate? Otherwise I can just remove it.

Contributor

This would raise an import error on some builds, right? If so, simply xfail when the import errors, and the tests would xfail too.

That way you can use this fixture without worrying too much.
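
One way to read this suggestion, as a minimal sketch rather than code from this PR: wrap the import in a helper that calls `pytest.xfail` when the import fails. The helper name `import_or_xfail` is hypothetical; `importlib.import_module` and `pytest.xfail` are existing APIs.

```python
import importlib

import pytest


def import_or_xfail(module_name):
    """Import a module, or xfail the calling test if the import fails.

    Hypothetical helper for illustration; not part of pandas or pytest.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        # pytest.xfail raises immediately, so nothing after this call runs.
        pytest.xfail(f"could not import {module_name}: {err}")
```

A fixture could call such a helper before returning `request.param`, so every test using the fixture would xfail on builds where the import is unavailable. `pytest.importorskip` covers the same situation but reports the test as skipped rather than xfailed.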

Member Author

The tests are skipped with the fixture. This is about passing tests that may fail if the arrow_string dtype is not registered. The dtype is not part of the public API yet, and won't be, since we are going to use the parameterized dtype in #39908 instead.

Member Author

see also #39908 (comment)

Member Author

> This would raise an import error on some builds, right?

No, importing ArrowStringDtype is safe.

Contributor

I see, OK then.
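
For readers wondering why an otherwise unused import can change test outcomes: a dtype registry is usually filled by a class decorator that only runs when the defining module is imported, so a string alias cannot be resolved before that import happens. The sketch below is a plain-Python toy that assumes this mechanism; it is not pandas' registry, and `register_dtype`, `lookup`, `define_arrow_string` and `ToyArrowStringDtype` are made-up names.

```python
_REGISTRY = {}


def register_dtype(cls):
    """Class decorator: record the class under its string alias."""
    _REGISTRY[cls.name] = cls
    return cls


def lookup(alias):
    """Resolve a string alias; fails if the defining code never ran."""
    try:
        return _REGISTRY[alias]
    except KeyError:
        raise TypeError(f"data type {alias!r} not understood") from None


def define_arrow_string():
    """Stand-in for importing the module that defines the dtype."""

    @register_dtype
    class ToyArrowStringDtype:
        name = "arrow_string"

    return ToyArrowStringDtype
```

In this toy, `lookup("arrow_string")` raises until `define_arrow_string()` has been called once, which mirrors a test that fails only when run in isolation because nothing else has imported the defining module yet.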


    return request.param


15 changes: 14 additions & 1 deletion pandas/tests/frame/methods/test_astype.py
@@ -566,7 +566,6 @@ def test_astype_empty_dtype_dict(self):
    @pytest.mark.parametrize(
        "df",
        [
            DataFrame(Series(["x", "y", "z"], dtype="string")),
            DataFrame(Series(["x", "y", "z"], dtype="category")),
            DataFrame(Series(3 * [Timestamp("2020-01-01", tz="UTC")])),
            DataFrame(Series(3 * [Interval(0, 1)])),
@@ -584,6 +583,20 @@ def test_astype_ignores_errors_for_extension_dtypes(self, df, errors):
            with pytest.raises((ValueError, TypeError), match=msg):
                df.astype(float, errors=errors)

    @pytest.mark.parametrize("errors", ["raise", "ignore"])
Contributor

You may want to consider using a shared function as the implementation to avoid repeating things, though it may not be worth it (e.g. just copy the test in the cases where this is needed, if there are not that many).

Contributor

Do we have a fixture for errors? If not, can you create one?

Member

Alternatively, could you parametrize the test above, but have the dtype be a separate param, so it can be skipped for arrow_string?

Something like

    @pytest.mark.parametrize(
        "data, dtype",
        [
            (["x", "y", "z"], "string"),
            pytest.param(["x", "y", "z"], "arrow_string", marks=..skip...),
            (["x", "y", "z"], "category"),
            (3 * [Timestamp("2020-01-01", tz="UTC")], None),
            (3 * [Interval(0, 1)], None),
        ],
    )

and then inside the test create the dataframe (df = DataFrame(Series(data, dtype=dtype))).

That could also avoid having to duplicate the full test body.
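
A runnable sketch of this idea follows. The skip condition is an assumption, since the comment above elides the actual mark: here the case is skipped when pyarrow is not installed, detected through the hypothetical `HAS_PYARROW` flag. The `CASES` name is also illustrative, `"arrow_string"` is the alias as named in this PR and may not exist in other pandas versions, and the public `pandas.testing.assert_frame_equal` stands in for `tm.assert_frame_equal`.

```python
import importlib.util

import pytest
from pandas import DataFrame, Interval, Series, Timestamp
from pandas.testing import assert_frame_equal

# Assumption: skip the pyarrow-backed case when pyarrow is not installed.
HAS_PYARROW = importlib.util.find_spec("pyarrow") is not None

CASES = [
    (["x", "y", "z"], "string"),
    pytest.param(
        ["x", "y", "z"],
        "arrow_string",  # dtype alias as named in this PR
        marks=pytest.mark.skipif(not HAS_PYARROW, reason="pyarrow not installed"),
    ),
    (["x", "y", "z"], "category"),
    (3 * [Timestamp("2020-01-01", tz="UTC")], None),
    (3 * [Interval(0, 1)], None),
]


@pytest.mark.parametrize("data, dtype", CASES)
@pytest.mark.parametrize("errors", ["raise", "ignore"])
def test_astype_ignores_errors_for_extension_dtypes(data, dtype, errors):
    # Build the frame inside the test so a skipped case never constructs it.
    df = DataFrame(Series(data, dtype=dtype))
    if errors == "ignore":
        result = df.astype(float, errors=errors)
        assert_frame_equal(result, df)
    else:
        msg = "(Cannot cast)|(could not convert)"
        with pytest.raises((ValueError, TypeError), match=msg):
            df.astype(float, errors=errors)
```

Note that `pytest.param` takes the parameter values as separate positional arguments and the marks through the `marks` keyword.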

Member

(although for the other tests where pd.array is used in the parametrization, this might get a bit more annoying to do it in this way ..)

Member Author

> Do we have a fixture for errors? If not, can you create one?

We have no fixture for errors yet in conftest.py (common arguments section).

We have 4 tests parameterized on errors:

test_astype_ignores_errors_for_extension_dtypes in pandas/tests/frame/methods/test_astype.py and pandas/tests/series/methods/test_astype.py, parameterised on ["raise", "ignore"]

test_custom_dateparsing_error in pandas/tests/io/test_sql.py, parameterised on ["ignore", "raise", "coerce"]

and

test_coerce_outside_ns_bounds in pandas/tests/tslibs/test_array_to_datetime.py, parameterised on ["coerce", "raise"]

I will shortly push a commit with @jorisvandenbossche's suggested changes to avoid duplication, so doing anything with the errors parameterisation is now outside the scope of this PR.

I don't think it is worth opening a separate PR for that yet, especially considering the differences in the parameterisation.
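
To make the point about differing parameterisations concrete, here is a small plain-Python sketch using only the parameter lists quoted in this comment. The `ERRORS_PARAMS`, `common` and `differing` names are illustrative. The first entry covers both the frame and the series test module.

```python
# Parameter lists as quoted in the comment above, keyed by test name.
ERRORS_PARAMS = {
    "test_astype_ignores_errors_for_extension_dtypes": ["raise", "ignore"],
    "test_custom_dateparsing_error": ["ignore", "raise", "coerce"],
    "test_coerce_outside_ns_bounds": ["coerce", "raise"],
}

# Values that every test accepts, i.e. what one shared fixture could yield.
common = set.intersection(*(set(values) for values in ERRORS_PARAMS.values()))

# Values used by at least one test but not by all of them.
differing = set().union(*ERRORS_PARAMS.values()) - common
```

Only `"raise"` is shared by all of the tests, so a single `errors` fixture would either under-cover some tests or feed values to tests that do not accept them.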

Member Author

@simonjayhawkins simonjayhawkins Apr 9, 2021

> (although for the other tests where pd.array is used in the parametrization, this might get a bit more annoying to do it in this way ..)

The other two tests have simpler code, so duplication is probably better than a complex setup.

test_astype_ignores_errors_for_extension_dtypes is already duplicated in the frame and series methods modules, so it makes more sense to avoid adding more duplication there. I have updated with the suggestion.

    def test_astype_ignores_errors_for_nullable_string_dtypes(
        self, nullable_string_dtype, errors
    ):
        df = DataFrame(Series(["x", "y", "z"], dtype=nullable_string_dtype))
        if errors == "ignore":
            expected = df
            result = df.astype(float, errors=errors)
            tm.assert_frame_equal(result, expected)
        else:
            msg = "(Cannot cast)|(could not convert)"
            with pytest.raises((ValueError, TypeError), match=msg):
                df.astype(float, errors=errors)

    def test_astype_tz_conversion(self):
        # GH 35973
        val = {"tz": date_range("2020-08-30", freq="d", periods=2, tz="Europe/London")}
7 changes: 6 additions & 1 deletion pandas/tests/frame/methods/test_select_dtypes.py
@@ -391,7 +391,6 @@ def test_select_dtypes_typecodes(self):
        (
            (np.array([1, 2], dtype=np.int32), True),
            (pd.array([1, 2], dtype="Int32"), True),
            (pd.array(["a", "b"], dtype="string"), False),
            (DummyArray([1, 2], dtype=DummyDtype(numeric=True)), True),
            (DummyArray([1, 2], dtype=DummyDtype(numeric=False)), False),
        ),
@@ -402,3 +401,9 @@ def test_select_dtypes_numeric(self, arr, expected):
        df = DataFrame(arr)
        is_selected = df.select_dtypes(np.number).shape == df.shape
        assert is_selected == expected

    def test_select_dtypes_numeric_nullable_string(self, nullable_string_dtype):
        arr = pd.array(["a", "b"], dtype=nullable_string_dtype)
        df = DataFrame(arr)
        is_selected = df.select_dtypes(np.number).shape == df.shape
        assert not is_selected
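
The shape comparison used by these tests can be tried outside the test suite. The following is a small standalone sketch, assuming only the public `"Int32"` and `"string"` dtypes; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame(pd.array([1, 2], dtype="Int32"))
strings = pd.DataFrame(pd.array(["a", "b"], dtype="string"))

# select_dtypes keeps the row index and drops non-matching columns,
# so comparing shapes tells us whether the single column was selected.
numeric_selected = numeric.select_dtypes(np.number).shape == numeric.shape
strings_selected = strings.select_dtypes(np.number).shape == strings.shape
```

Here `numeric_selected` is true and `strings_selected` is false, matching the `True` and `False` expectations in the parametrization above.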
10 changes: 9 additions & 1 deletion pandas/tests/indexing/test_check_indexer.py
@@ -78,7 +78,6 @@ def test_int_raise_missing_values(indexer):
        np.array([1.0, 2.0], dtype="float64"),
        np.array([True, False], dtype=object),
        pd.Index([True, False], dtype=object),
        pd.array(["a", "b"], dtype="string"),
    ],
)
def test_raise_invalid_array_dtypes(indexer):
@@ -89,6 +88,15 @@ def test_raise_invalid_array_dtypes(indexer):
        check_array_indexer(arr, indexer)


def test_raise_nullable_string_dtype(nullable_string_dtype):
    indexer = pd.array(["a", "b"], dtype=nullable_string_dtype)
    arr = np.array([1, 2, 3])

    msg = "arrays used as indices must be of integer or boolean type"
    with pytest.raises(IndexError, match=msg):
        check_array_indexer(arr, indexer)
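
For context, a standalone sketch of the behaviour this test checks, using the public `pandas.api.indexers.check_array_indexer` and the `"string"` dtype only. The `valid` and `message` names are illustrative.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import check_array_indexer

arr = np.array([1, 2, 3])

# An integer indexer is accepted and returned as a NumPy array.
valid = check_array_indexer(arr, [0, 2])

# A string-dtype array is rejected with IndexError.
try:
    check_array_indexer(arr, pd.array(["a", "b"], dtype="string"))
except IndexError as err:
    message = str(err)
else:
    message = None
```

After running this, `message` holds the error text that the test matches against.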


@pytest.mark.parametrize("indexer", [None, Ellipsis, slice(0, 3), (None,)])
def test_pass_through_non_array_likes(indexer):
    arr = np.array([1, 2, 3])