
BUG: the __from_arrow__ conversion for numeric arrays broken if buffer size doesn't match #40896


Closed
jorisvandenbossche opened this issue Apr 12, 2021 · 3 comments · Fixed by #41046
Labels
Bug IO Parquet parquet, feather NA - MaskedArrays Related to pd.NA and nullable extension arrays

Comments

@jorisvandenbossche
Member

Report from https://issues.apache.org/jira/browse/ARROW-12336

The original reproducer is below (it would be good to find a simpler case for a test):

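import pandas as pd
import pyarrow as pa
import pyarrow.dataset
import pyarrow.fs
import pyarrow.parquet

# Write a small table with a nullable Int64 column to Parquet.
df = pd.DataFrame({"Int_col": [1, 2, 10], "str_col": ["A", "B", "Z"]})
df = df.astype({"Int_col": "Int64"})
table = pa.table(df)
path_1 = "./test_1.parquet"
pa.parquet.write_table(table, path_1)

schema = pa.parquet.read_schema(path_1)
ds = pa.dataset.FileSystemDataset.from_paths(
    paths=[path_1],
    filesystem=pa.fs.LocalFileSystem(),
    schema=schema,
    format=pa.dataset.ParquetFileFormat(),
)
# The filter matches no rows, so the resulting columns are empty.
table = ds.to_table(filter=(pa.dataset.field("str_col") == "C"))

print("Size of array: " + str(table.column(0).nbytes))
# Converting the empty table to pandas triggers the error.
df = table.to_pandas()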

which gives

Traceback (most recent call last):
  File "/Users/xxx/empty_array_buffer_size.py", line 47, in <module>
    df = table.to_pandas()
  File "pyarrow/array.pxi", line 756, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1740, in pyarrow.lib.Table._to_pandas
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 794, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1135, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1135, in <listcomp>
    return [_reconstruct_block(item, columns, extension_columns)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 753, in _reconstruct_block
    pd_ext_arr = pandas_dtype.__from_arrow__(arr)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/integer.py", line 117, in __from_arrow__
    data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=self.type)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/_arrow_utils.py", line 32, in pyarrow_array_to_numpy_and_mask
    data = np.frombuffer(buflist[1], dtype=dtype)[arr.offset : arr.offset + len(arr)]
ValueError: buffer size must be a multiple of element size
@jorisvandenbossche jorisvandenbossche added Bug IO Parquet parquet, feather NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Apr 12, 2021
@jorisvandenbossche jorisvandenbossche added this to the 1.3 milestone Apr 12, 2021
@jorisvandenbossche
Member Author

The issue comes from these lines:

buflist = arr.buffers()
data = np.frombuffer(buflist[1], dtype=dtype)[arr.offset : arr.offset + len(arr)]

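We need to slice the buffer from buflist[1] before passing it to np.frombuffer, instead of slicing the resulting array afterwards. np.frombuffer raises if the size of the full buffer is not a multiple of the element size, which can happen when the buffer carries padding.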
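To illustrate the difference between slicing before and after np.frombuffer, here is a minimal sketch that uses only NumPy. It makes no claim to be the eventual patch. The helper name buffer_to_numpy and the padded bytes object standing in for buflist[1] are assumptions made for the example; a real pyarrow buffer would be sliced in bytes in the same way.

```python
import numpy as np


def buffer_to_numpy(raw, offset, length, dtype):
    """Hypothetical helper: slice the raw bytes first, then hand them to numpy.

    `offset` and `length` are counted in elements, like arr.offset and len(arr).
    """
    dtype = np.dtype(dtype)
    start = offset * dtype.itemsize
    stop = start + length * dtype.itemsize
    # memoryview slicing avoids copying the underlying bytes.
    return np.frombuffer(memoryview(raw)[start:stop], dtype=dtype)


# Stand-in for an Arrow data buffer: three int64 values plus 4 bytes of
# padding, so the total size (28 bytes) is not a multiple of 8.
values = np.array([1, 2, 10], dtype="int64")
raw = values.tobytes() + b"\x00" * 4

# Slicing after np.frombuffer fails, because numpy sees the whole buffer.
try:
    np.frombuffer(raw, dtype="int64")[0:3]
    slicing_afterwards_failed = False
except ValueError:
    slicing_afterwards_failed = True

# Slicing before np.frombuffer only exposes the bytes that belong to the array.
data = buffer_to_numpy(raw, offset=0, length=3, dtype="int64")
tail = buffer_to_numpy(raw, offset=1, length=2, dtype="int64")
```

Converting the element offset and length to bytes with dtype.itemsize means an array offset is handled in the same step as the padding.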

@ThomasBlauthQC
Contributor

I'd be happy to look into this!

@mlondschien
Contributor

Related: JDASoftwareGroup/kartothek#410

ThomasBlauthQC added a commit to ThomasBlauthQC/pandas that referenced this issue Apr 19, 2021
Add Arrow buffer slicing before handing it over to numpy which is
needed in case the Arrow buffer contains padding or offset.
ThomasBlauthQC added a commit to ThomasBlauthQC/pandas that referenced this issue Apr 19, 2021
Use numpy dtype for bitwidth information.
Fix test signature.
ThomasBlauthQC added a commit to ThomasBlauthQC/pandas that referenced this issue Apr 20, 2021
Move tests to pandas/tests/arrays/masked/test_arrow_compat.py
Add more dtypes to test.
ThomasBlauthQC added a commit to ThomasBlauthQC/pandas that referenced this issue Apr 20, 2021
Change argument passed to pyarrow_array_to_numpy_and_mask() from str to
np.dtype.
ThomasBlauthQC added a commit to ThomasBlauthQC/pandas that referenced this issue Apr 21, 2021
Modify test_arrow_compat.py:
- Use import_optional_dependency to skip the whole module if
  pyarrow is not available.
- Use any_real_dtype fixture.
ThomasBlauthQC added a commit to ThomasBlauthQC/pandas that referenced this issue Apr 22, 2021
No changes to the code, only to one docstring. Test if CI passes when
started at a different time of the day.
ThomasBlauthQC added a commit to ThomasBlauthQC/pandas that referenced this issue Apr 22, 2021