BUG: ValueError: buffer source array is read-only during groupby #33410

Closed
3 tasks done
erik-hasse opened this issue Apr 8, 2020 · 6 comments · Fixed by #33446

@erik-hasse
Contributor

erik-hasse commented Apr 8, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
df = pd.DataFrame(data={'x': [1], 'y': [2]})
df.to_parquet('pq_df', partition_cols='x')
df = pd.read_parquet('pq_df')
df.groupby('x', sort=False)

Problem description

The above code raises an exception:

Traceback (most recent call last):
  File "mve.py", line 5, in <module>
    df.groupby('x', sort=False)
  File "/Users/ehasse/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 5798, in groupby
    return groupby_generic.DataFrameGroupBy(
  File "/Users/ehasse/.local/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 402, in __init__
    grouper, exclusions, obj = get_grouper(
  File "/Users/ehasse/.local/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 615, in get_grouper
    Grouping(
  File "/Users/ehasse/.local/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 312, in __init__
    self.grouper, self.all_grouper = recode_for_groupby(
  File "/Users/ehasse/.local/lib/python3.8/site-packages/pandas/core/groupby/categorical.py", line 72, in recode_for_groupby
    cat = cat.add_categories(c.categories[~c.categories.isin(cat.categories)])
  File "/Users/ehasse/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4667, in isin
    return algos.isin(self, values)
  File "/Users/ehasse/.local/lib/python3.8/site-packages/pandas/core/algorithms.py", line 447, in isin
    return f(comps, values)
  File "pandas/_libs/hashtable_func_helper.pxi", line 555, in pandas._libs.hashtable.ismember_int64
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only

This specifically requires the following:

  • The dataframe is loaded from a parquet file.
  • The column being grouped by was used to partition the file.
  • sort=False is passed.

In addition, passing observed=True stops the error from occurring.
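The observed=True workaround can be sketched without a parquet file. The snippet below is only an illustration: it assumes that a partition column comes back from read_parquet as a categorical whose categories sit on a read-only array, and it builds such a column by hand. The extra category 2 is made up for the example.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for a partition column (assumption): a categorical
# whose categories are backed by a read-only NumPy array.
cats = np.array([1, 2])
cats.flags.writeable = False

df = pd.DataFrame(
    {"x": pd.Categorical([1], categories=pd.Index(cats)), "y": [2]}
)

# observed=True restricts the groups to categories present in the data,
# which the report says avoids the failing code path.
grouped = df.groupby("x", sort=False, observed=True)
result = grouped["y"].sum()
print(result)
```

Note that observed=True also changes the result: categories that never appear in the data (2 here) are left out, so it is not a drop-in replacement when those empty groups are wanted.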

I believe this is related to #31710, but they were unable to provide an example for groupby, and the issue remains on 1.0.3.

Expected Output

A DataFrameGroupBy object.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.1.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0.post20200309
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.47.0

erik-hasse added the Bug and Needs Triage labels Apr 8, 2020
@TomAugspurger
Contributor

Here's a reproducer without parquet.

In [29]: cats = np.array([1])

In [30]: cats.flags.writeable = False

In [31]: df = pd.DataFrame({"a": [1], "b": pd.Categorical([1], categories=pd.Index(cats))})

In [32]: df.groupby("b", sort=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-882468ddac04> in <module>
----> 1 df.groupby("b", sort=False)

~/sandbox/pandas/pandas/core/frame.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed)
   5825             group_keys=group_keys,
   5826             squeeze=squeeze,
-> 5827             observed=observed,
   5828         )
   5829

~/sandbox/pandas/pandas/core/groupby/groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated)
    408                 sort=sort,
    409                 observed=observed,
--> 410                 mutated=self.mutated,
    411             )
    412

~/sandbox/pandas/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate)
    623                 in_axis=in_axis,
    624             )
--> 625             if not isinstance(gpr, Grouping)
    626             else gpr
    627         )

~/sandbox/pandas/pandas/core/groupby/grouper.py in __init__(self, index, grouper, obj, name, level, sort, observed, in_axis)
    310
    311                 self.grouper, self.all_grouper = recode_for_groupby(
--> 312                     self.grouper, self.sort, observed
    313                 )
    314                 categories = self.grouper.categories

~/sandbox/pandas/pandas/core/groupby/categorical.py in recode_for_groupby(c, sort, observed)
     69     # including those missing from the data (GH-13179), which .unique()
     70     # above dropped
---> 71     cat = cat.add_categories(c.categories[~c.categories.isin(cat.categories)])
     72
     73     return c.reorder_categories(cat.categories), None

~/sandbox/pandas/pandas/core/indexes/base.py in isin(self, values, level)
   4872         if level is not None:
   4873             self._validate_index_level(level)
-> 4874         return algos.isin(self, values)
   4875
   4876     def _get_string_slice(self, key: str_t, use_lhs: bool = True, use_rhs: bool = True):

~/sandbox/pandas/pandas/core/algorithms.py in isin(comps, values)
    452             comps = comps.astype(object)
    453
--> 454     return f(comps, values)
    455
    456

~/sandbox/pandas/pandas/_libs/hashtable_func_helper.pxi in pandas._libs.hashtable.ismember_int64()
    553 @cython.wraparound(False)
    554 @cython.boundscheck(False)
--> 555 def ismember_int64(int64_t[:] arr, int64_t[:] values):
    556     """
    557     Return boolean of values in arr on an

~/sandbox/pandas/pandas/_libs/hashtable.cpython-37m-darwin.so in View.MemoryView.memoryview_cwrapper()

~/sandbox/pandas/pandas/_libs/hashtable.cpython-37m-darwin.so in View.MemoryView.memoryview.__cinit__()

ValueError: buffer source array is read-only

There's a standard way to get Cython to accept read-only arrays (adding const?), but I don't recall exactly.

TomAugspurger added the Groupby label and removed the Needs Triage label Apr 9, 2020
TomAugspurger added this to the Contributions Welcome milestone Apr 9, 2020
@jbrockmendel
Member

In hashtable_func_helper.pxi.in (line 209), {{c_type}}[:] arr needs to be changed to const {{c_type}}[:] arr.
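For background on why const matters: a Cython typed memoryview declared without const asks the array for a writable buffer, and NumPy refuses that request for a read-only array. The plain-Python sketch below does not call the Cython code; it only shows that a read-only array still exposes a readable buffer, which is all a const memoryview needs.

```python
import numpy as np

arr = np.array([1], dtype="int64")
arr.flags.writeable = False

# Reading through the buffer protocol is still allowed; the resulting
# view is simply flagged as read-only.
view = memoryview(arr)
print(view.readonly)  # True

# Writing is what NumPy forbids for such an array.
try:
    arr[0] = 5
    write_failed = False
except ValueError:
    write_failed = True
print(write_failed)  # True
```

Since ismember only reads from arr, declaring the argument const should not change its behavior for writable inputs.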

@TomAugspurger
Contributor

Thanks. @erik-hasse are you interested in making a PR with that change and tests?
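A regression test for this could look roughly like the sketch below. It reuses the read-only categories from the reproducer above; the test name and the extra category 2 are hypothetical and not taken from the eventual PR.

```python
import numpy as np
import pandas as pd


def test_groupby_sort_false_readonly_categories():
    # Hypothetical test name; the setup mirrors the reproducer above.
    cats = np.array([1, 2])
    cats.flags.writeable = False
    df = pd.DataFrame(
        {"a": [1], "b": pd.Categorical([1], categories=pd.Index(cats))}
    )

    # sort=False with observed=False is the combination that raised
    # "buffer source array is read-only" before the fix.
    result = df.groupby("b", sort=False, observed=False).size()

    # One row falls in category 1; category 2 is kept as a group.
    assert result.loc[1] == 1
    assert len(result) == 2
```

Run under pytest, this should raise the ValueError on an affected version and pass once the Cython signature accepts read-only input.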

@erik-hasse
Contributor Author

Sure. I haven't contributed to pandas before. Is there anything I should read before making the change and writing the test?

@TomAugspurger
Contributor

Our contributing doc page is apparently broken right now; doc/source/development/contributing.rst should have all the information you need.

@jorisvandenbossche
Member

> Our contributing doc page is apparently broken right now; doc/source/development/contributing.rst should have all the information you need.

The dev docs are still (or again since a few days) working fine: https://pandas.pydata.org/docs/dev/development/contributing.html
