
BUG: Slice Arrow buffer before passing it to numpy (#40896) #41046


Merged
merged 13 commits into from
Apr 28, 2021

Conversation

ThomasBlauthQC
Contributor

@ThomasBlauthQC ThomasBlauthQC commented Apr 19, 2021

Add Arrow buffer slicing before handing it over to numpy, which is
needed in case the Arrow buffer contains padding or an offset.

Points I'm not sure about:

  • I created a new test file. Would it be better to place the test somewhere else?
  • The test only tests int32. Should I add more dtypes?

I'm still new to the internals of pandas and always trying to improve so feedback of any kind is highly appreciated!

Add Arrow buffer slicing before handing it over to numpy which is
needed in case the Arrow buffer contains padding or offset.
@jorisvandenbossche
Member

Thanks for the PR!

I created a new test file. Would it be better to place the test somewhere else?

There is a pandas/tests/arrays/masked/test_arrow_compat.py. I think you can put it there.

The test only tests int32. Should I add more types?

Yes, I think it would be good to parameterize the test across different dtypes.

Move tests to pandas/tests/arrays/masked/test_arrow_compat.py
Add more dtypes to test.
@ThomasBlauthQC
Contributor Author

I moved the test and added parameterization for all numeric types.

@jreback jreback added the IO Parquet parquet, feather label Apr 21, 2021
@jreback jreback added this to the 1.3 milestone Apr 21, 2021
@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Apr 21, 2021

@pytest.fixture
def np_dtype_to_arrays(request):
import pyarrow as pa
Contributor

can we just add an import_optional_dependency at the top of this file, rather than having skips on every test.

Contributor Author

I used pyarrow 0.16.0 as the minimal version so test_arrow_array no longer gets run with pyarrow 0.15.0. Hope this is appropriate.


@td.skip_if_no("pyarrow")
@pytest.mark.parametrize(
"np_dtype_to_arrays",
Contributor

can you use the fixture: any_real_dtype or any_numpy_dtype instead

Modify test_arrow_compat.py:
- Use import_optional_dependency to skip the whole module if
  pyarrow is not available.
- Use any_real_dtype fixture.
try:
    pa = import_optional_dependency("pyarrow", min_version="0.16.0")
except ImportError:
    pytestmark = pytest.mark.skip(reason="Pyarrow not available")
Member

@jorisvandenbossche jorisvandenbossche Apr 22, 2021

To avoid the imports in every test (as you did now), you could also do a

try:
    import pyarrow as pa
except ImportError:
    pa = None

(assuming we keep the skip_if_no)

np_dtype = np.dtype(any_real_dtype)
pa_type = pa.from_numpy_dtype(np_dtype)

pa_array = pa.array([0, 1, 2], type=pa_type)
Member

Suggested change
pa_array = pa.array([0, 1, 2], type=pa_type)
pa_array = pa.array([0, 1, 2, None], type=pa_type)

Maybe good to include a missing value, to ensure there is a bitmask buffer?

Contributor Author

Yes, good point!

@ThomasBlauthQC ThomasBlauthQC force-pushed the issue-40896 branch 3 times, most recently from 80fb1c7 to 0f3302c on April 22, 2021 14:46
@@ -882,6 +882,7 @@ ExtensionArray
- Fixed bug where :meth:`Series.idxmax`, :meth:`Series.idxmin` and ``argmax/min`` fail when the underlying data is :class:`ExtensionArray` (:issue:`32749`, :issue:`33719`, :issue:`36566`)
- Fixed a bug where some properties of subclasses of :class:`PandasExtensionDtype` where improperly cached (:issue:`40329`)
- Bug in :meth:`DataFrame.mask` where masking a :class:`Dataframe` with an :class:`ExtensionArray` dtype raises ``ValueError`` (:issue:`40941`)
- Bug in :func:`pyarrow_array_to_numpy_and_mask` where Numpy raises ``ValueError`` if buffer size does not match multiple of dtype size (:issue:`40896`)
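The ``ValueError`` this entry refers to comes straight from numpy, independent of pyarrow; a minimal standalone reproduction:

```python
import numpy as np

# np.frombuffer insists that the buffer length be an exact multiple of
# the dtype's itemsize, which a padded Arrow buffer can violate.
raw = b"\x00" * 10  # 10 bytes: not a multiple of int64's 8-byte itemsize
try:
    np.frombuffer(raw, dtype=np.int64)
    raised = False
except ValueError:
    raised = True
```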
Contributor

can you change this to a user facing note.

import pytest

import pandas.util._test_decorators as td

import pandas as pd
import pandas._testing as tm

try:
Contributor

just use pyimportorskip here. i am not sure we can test anything w/o it.

@@ -882,6 +882,7 @@ ExtensionArray
- Fixed bug where :meth:`Series.idxmax`, :meth:`Series.idxmin` and ``argmax/min`` fail when the underlying data is :class:`ExtensionArray` (:issue:`32749`, :issue:`33719`, :issue:`36566`)
- Fixed a bug where some properties of subclasses of :class:`PandasExtensionDtype` where improperly cached (:issue:`40329`)
- Bug in :meth:`DataFrame.mask` where masking a :class:`Dataframe` with an :class:`ExtensionArray` dtype raises ``ValueError`` (:issue:`40941`)
- Bug in :func:`pyarrow_array_to_numpy_and_mask` when converting a pyarrow array who's data buffer size is not a multiple of dtype size (:issue:`40896`)
Member

Suggested change
- Bug in :func:`pyarrow_array_to_numpy_and_mask` when converting a pyarrow array who's data buffer size is not a multiple of dtype size (:issue:`40896`)
- Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`)

pyarrow_array_to_numpy_and_mask is not a public method, so we should try to avoid mentioning it. So a suggested rewording above.

Also, I would maybe move this to the I/O section

Contributor Author

I changed that as suggested. Also, let me know if I should rebase to get rid of the small commits.

Member

Also, let me know if I should rebase to get rid of the small commits.

No need to do that, we will squash everything when merging (well, github does that for us ;))

Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for the updates! Looking good to me

@@ -796,6 +796,7 @@ I/O
- Bug in :func:`read_excel` dropping empty values from single-column spreadsheets (:issue:`39808`)
- Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
- Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`)
- Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`)
Contributor

would not object to actually showing read_parquet reference here

Member

I don't think that's needed; the original bug report didn't actually involve a plain reading of a Parquet file (not sure this can happen with that), but rather a dataset scan+filter operation (from any source format).

Contributor Author

Yes, I was creating the arrow table using only the arrow api. The bug then occurred on the conversion to pandas.

I did not actually try to reproduce the bug using read_parquet, so I'm also not sure if it can happen this way.

Member

I suppose it could happen when reading a parquet file using the new dataset API by passing a certain filter. But anyway, I don't think we need to make that explicit, IMO the above error message is clear enough.

@@ -89,12 +86,89 @@ def test_arrow_sliced(data):
tm.assert_frame_equal(result, expected)


@pytest.fixture
def np_dtype_to_arrays(any_real_dtype):
Contributor

can you type this (mainly the output for better readability) also adding a doc-string explaining the purpose

Member

This is a fixture in the tests, I don't think we generally ask for annotations there? (eg in conftest.py there is no typing)

Also tests empty pyarrow arrays with non empty buffers.
See https://github.com/pandas-dev/pandas/issues/40896
"""
from pandas.core.arrays._arrow_utils import pyarrow_array_to_numpy_and_mask
Contributor

can you import at the top (after the pyarrow check is fine)

Contributor Author

Should I move the rest of the pandas imports below the pyarrow check as well? Looks a bit weird having them separated by the check.

@jreback
Contributor

jreback commented Apr 26, 2021

thanks @ThomasBlauthQC

i think the checks are running, but can merge on green.

@mlondschien
Contributor

Looks green @jreback. Merge?

@jorisvandenbossche jorisvandenbossche merged commit 17c0b44 into pandas-dev:master Apr 28, 2021
@jorisvandenbossche
Member

@ThomasBlauthQC Thanks a lot for the fix!

Labels
ExtensionArray Extending pandas with custom dtypes or arrays. IO Parquet parquet, feather
Development

Successfully merging this pull request may close these issues.

BUG: the __from_arrow__ conversion for numeric arrays broken if buffer size doesn't match
4 participants