
BUG: Slice Arrow buffer before passing it to numpy (#40896) #41046


Merged
merged 13 commits into from
Apr 28, 2021

Conversation

ThomasBlauthQC
Contributor

@ThomasBlauthQC ThomasBlauthQC commented Apr 19, 2021

Add Arrow buffer slicing before handing it over to numpy, which is
needed in case the Arrow buffer contains padding or an offset.

Points I'm not sure about:

  • I created a new test file. Would it be better to place the test somewhere else?
  • The test only tests int32. Should I add more dtypes?

I'm still new to the internals of pandas and always trying to improve so feedback of any kind is highly appreciated!

Add Arrow buffer slicing before handing it over to numpy which is
needed in case the Arrow buffer contains padding or offset.
@jorisvandenbossche
Member

Thanks for the PR!

I created a new test file. Would it be better to place the test somewhere else?

There is a pandas/tests/arrays/masked/test_arrow_compat.py. I think you can put it there.

The test only tests int32. Should I add more types?

Yes, I think it would be good to parameterize the test across different dtypes.

Move tests to pandas/tests/arrays/masked/test_arrow_compat.py
Add more dtypes to test.
@ThomasBlauthQC
Contributor Author

I moved the test and added parameterization for all numeric types.

@jreback jreback added the IO Parquet parquet, feather label Apr 21, 2021
@jreback jreback added this to the 1.3 milestone Apr 21, 2021
@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Apr 21, 2021

@pytest.fixture
def np_dtype_to_arrays(request):
import pyarrow as pa
Contributor

can we just add an import_optional_dependency at the top of this file, rather than having skips on every test.

Contributor Author

I used pyarrow 0.16.0 as the minimal version so test_arrow_array no longer gets run with pyarrow 0.15.0. Hope this is appropriate.


@td.skip_if_no("pyarrow")
@pytest.mark.parametrize(
"np_dtype_to_arrays",
Contributor

can you use the fixture: any_real_dtype or any_numpy_dtype instead

Modify test_arrow_compat.py:
- Use import_optional_dependency to skip the whole module if
  pyarrow is not available.
- Use any_real_dtype fixture.
try:
    pa = import_optional_dependency("pyarrow", min_version="0.16.0")
except ImportError:
    pytestmark = pytest.mark.skip(reason="Pyarrow not available")
Member

@jorisvandenbossche jorisvandenbossche Apr 22, 2021

To avoid the imports in every test (as you did now), you could also do a

try:
    import pyarrow as pa
except ImportError:
    pa = None

(assuming we keep the skip_if_no)

np_dtype = np.dtype(any_real_dtype)
pa_type = pa.from_numpy_dtype(np_dtype)

pa_array = pa.array([0, 1, 2], type=pa_type)
Member

Suggested change
pa_array = pa.array([0, 1, 2], type=pa_type)
pa_array = pa.array([0, 1, 2, None], type=pa_type)

Maybe good to include a missing value, to ensure there is a bitmask buffer?

Contributor Author

Yes, good point!

@ThomasBlauthQC ThomasBlauthQC force-pushed the issue-40896 branch 3 times, most recently from 80fb1c7 to 0f3302c on April 22, 2021 14:46
@@ -882,6 +882,7 @@ ExtensionArray
- Fixed bug where :meth:`Series.idxmax`, :meth:`Series.idxmin` and ``argmax/min`` fail when the underlying data is :class:`ExtensionArray` (:issue:`32749`, :issue:`33719`, :issue:`36566`)
- Fixed a bug where some properties of subclasses of :class:`PandasExtensionDtype` where improperly cached (:issue:`40329`)
- Bug in :meth:`DataFrame.mask` where masking a :class:`Dataframe` with an :class:`ExtensionArray` dtype raises ``ValueError`` (:issue:`40941`)
- Bug in :func:`pyarrow_array_to_numpy_and_mask` where Numpy raises ``ValueError`` if buffer size does not match multiple of dtype size (:issue:`40896`)
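The ``ValueError`` this entry refers to comes straight from numpy, independent of pyarrow; a minimal standalone reproduction:

```python
import numpy as np

# np.frombuffer insists that the buffer length be an exact multiple of
# the dtype's itemsize, which a padded Arrow buffer can violate.
raw = b"\x00" * 10  # 10 bytes: not a multiple of int64's 8-byte itemsize
try:
    np.frombuffer(raw, dtype=np.int64)
    raised = False
except ValueError:
    raised = True
```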
Contributor

can you change this to a user facing note.

import pytest

import pandas.util._test_decorators as td

import pandas as pd
import pandas._testing as tm

try:
Contributor

just use pyimportorskip here. i am not sure we can test anything w/o it.

@@ -882,6 +882,7 @@ ExtensionArray
- Fixed bug where :meth:`Series.idxmax`, :meth:`Series.idxmin` and ``argmax/min`` fail when the underlying data is :class:`ExtensionArray` (:issue:`32749`, :issue:`33719`, :issue:`36566`)
- Fixed a bug where some properties of subclasses of :class:`PandasExtensionDtype` where improperly cached (:issue:`40329`)
- Bug in :meth:`DataFrame.mask` where masking a :class:`Dataframe` with an :class:`ExtensionArray` dtype raises ``ValueError`` (:issue:`40941`)
- Bug in :func:`pyarrow_array_to_numpy_and_mask` when converting a pyarrow array who's data buffer size is not a multiple of dtype size (:issue:`40896`)
Member

Suggested change
- Bug in :func:`pyarrow_array_to_numpy_and_mask` when converting a pyarrow array who's data buffer size is not a multiple of dtype size (:issue:`40896`)
- Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`)

pyarrow_array_to_numpy_and_mask is not a public method, so we should try to avoid mentioning it. So a suggested rewording above.

Also, I would maybe move this to the I/O section

Contributor Author

I changed that as suggested. Also, let me know if I should rebase to get rid of the small commits.

Member

Also, let me know if I should rebase to get rid of the small commits.

No need to do that, we will squash everything when merging (well, github does that for us ;))

Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for the updates! Looking good to me

@@ -796,6 +796,7 @@ I/O
- Bug in :func:`read_excel` dropping empty values from single-column spreadsheets (:issue:`39808`)
- Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
- Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`)
- Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`)
Contributor

would not object to actually showing read_parquet reference here

Member

I don't think that's needed; the original bug report didn't actually involve a plain reading of a Parquet file (not sure this can happen with that), but rather a dataset scan+filter operation (from any source format).

Contributor Author

Yes, I was creating the arrow table using only the arrow api. The bug then occurred on the conversion to pandas.

I did not actually try to reproduce the bug using read_parquet, so I'm also not sure if it can happen this way.

Member

I suppose it could happen when reading a parquet file using the new dataset API by passing a certain filter. But anyway, I don't think we need to make that explicit, IMO the above error message is clear enough.

@@ -89,12 +86,89 @@ def test_arrow_sliced(data):
tm.assert_frame_equal(result, expected)


@pytest.fixture
def np_dtype_to_arrays(any_real_dtype):
Contributor

can you type this (mainly the output for better readability) also adding a doc-string explaining the purpose

Member

This is a fixture in the tests, I don't think we generally ask for annotations there? (eg in conftest.py there is no typing)

Also tests empty pyarrow arrays with non empty buffers.
See https://github.com/pandas-dev/pandas/issues/40896
"""
from pandas.core.arrays._arrow_utils import pyarrow_array_to_numpy_and_mask
Contributor

can you import at the top (after the pyarrow check is fine)

Contributor Author

Should I move the rest of the pandas imports below the pyarrow check as well? Looks a bit weird having them separated by the check.

@jreback
Contributor

jreback commented Apr 26, 2021

thanks @ThomasBlauthQC

i think the checks are running, but can merge on green.

@mlondschien
Contributor

Looks green @jreback. Merge?

@jorisvandenbossche jorisvandenbossche merged commit 17c0b44 into pandas-dev:master Apr 28, 2021
@jorisvandenbossche
Member

@ThomasBlauthQC Thanks a lot for the fix!

Labels
ExtensionArray Extending pandas with custom dtypes or arrays. IO Parquet parquet, feather
Development

Successfully merging this pull request may close these issues.

BUG: the __from_arrow__ conversion for numeric arrays broken if buffer size doesn't match
4 participants