Skip to content

TST: check compatibility with pyarrow types_mapper parameter #44369

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Dec 1, 2021
38 changes: 38 additions & 0 deletions pandas/tests/arrays/masked/test_arrow_compat.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,22 @@ def data(request):
return request.param


@pytest.fixture(
params=[
pd.Int8Dtype,
pd.Int16Dtype,
pd.Int32Dtype,
pd.Int64Dtype,
pd.UInt8Dtype,
pd.UInt16Dtype,
pd.UInt32Dtype,
pd.UInt64Dtype,
]
)
def int_dtype(request):
return request.param()


def test_arrow_array(data):
arr = pa.array(data)
expected = pa.array(
Expand All @@ -39,6 +55,28 @@ def test_arrow_roundtrip(data):
tm.assert_frame_equal(result, df)


@td.skip_if_no("pyarrow")
def test_dataframe_from_arrow_type_mapper(int_dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use this fixture: any_int_ea_dtype and remove the int_dtype from above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are strings, but types_mapper expects actual dtype objects. Maybe there's a way to convert, though?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas.api.types.pandas_dtype is the function I needed.

pyarrow = pytest.importorskip("pyarrow", minversion="3.0.0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the version IN the skip_if_no

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a reference to this issue in a comment (e.g. the gh number)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out the module-level limit of 1.0.1 seems to be OK. Just tested with that version.


def types_mapper(arrow_type):
if pyarrow.types.is_boolean(arrow_type):
return pd.BooleanDtype()
elif pyarrow.types.is_integer(arrow_type):
return int_dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it actually needed to parametrize this over all int EA dtypes? Could also just use pd.Int64Dtype() like you use BooleanDtype.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you want to make sure types_mapper works for any integer dtype?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, I could have a separate column in the arrow record batch for each integer data type, but that seemed too much.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you want to make sure types_mapper works for any integer dtype?

This is already implicitly tested in the test_arrow_roundtrip above (since that is parametrized for every dtype, and will under the hood make use of the same dtype.__from_arrow__)

But I am also fine with keeping the test as you have it now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I'll simplify, but add one column just so we test a code path that requires a copy because of differing data sizes.


bools_array = pyarrow.array([True, None, False], type=pyarrow.bool_())
ints_array = pyarrow.array([1, None, 2], type=pyarrow.int64())
record_batch = pyarrow.RecordBatch.from_arrays(
[bools_array, ints_array], ["bools", "ints"]
)
result = record_batch.to_pandas(types_mapper=types_mapper)
bools = pd.Series([True, None, False], dtype="boolean")
ints = pd.Series([1, None, 2], dtype=int_dtype.name)
expected = pd.DataFrame({"bools": bools, "ints": ints})
tm.assert_frame_equal(result, expected)


@td.skip_if_no("pyarrow")
def test_arrow_load_from_zero_chunks(data):
# GH-41040
Expand Down