COMPAT: hashtable vectors depend on refcount semantics which do not work on PyPy #15854


Closed
mattip opened this issue Mar 31, 2017 · 3 comments
Labels: Compat (pandas objects compatibility with Numpy or Python functions), Internals (related to non-user accessible pandas implementation)

@mattip
Contributor

mattip commented Mar 31, 2017

PyPy 5.7 can build and pip install pandas. Most of the tests pass; instructions to reproduce are here. However, this code

import pandas
df = pandas.Series(['b', 'b', 'b', 'a', 'a', 'b'])
print(df.unique())

does not work on PyPy:

Traceback (most recent call last):
  File "<module>", line 1, in <module>
  File "pypy_stuff/pypy-latest/site-packages/pandas/core/series.py", line 1241, in unique
    result = super(Series, self).unique()
  File "pypy_stuff/pypy-latest/site-packages/pandas/core/base.py", line 973, in unique
    result = unique1d(values)
  File "pypy_stuff/pypy-latest/site-packages/pandas/core/nanops.py", line 811, in unique1d
    uniques = table.unique(_ensure_object(values))
  File "pandas/src/hashtable_class_helper.pxi", line 826, in pandas.hashtable.PyObjectHashTable.unique (pandas/hashtable.c:14521)
ValueError: cannot resize an array with refcheck=True on PyPy.
Use the resize function or refcheck=False

Calls to array.resize(..., refcheck=True) (note that refcheck=True is the default) check whether data reallocation is needed and, if so, check that no other object depends on the array by testing the array object's refcount. This check is unreliable on PyPy, since we do not use a reference-counting garbage collector.
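To make the refcheck behaviour concrete, here is a minimal sketch using only numpy's ndarray.resize. The helper try_resize and the names buf and holders are made up for illustration and are not part of pandas. On CPython the checked call fails because extra references to the array exist; on PyPy the checked call is expected to fail regardless, as in the traceback above.

```python
import numpy as np


def try_resize(arr, new_size, refcheck):
    """Hypothetical helper: resize in place, return the error text or None."""
    try:
        arr.resize(new_size, refcheck=refcheck)
    except ValueError as exc:
        return str(exc)
    return None


buf = np.arange(4, dtype=np.int64)
holders = [buf, buf, buf]  # extra references to the same array object

# Growing the array needs a reallocation, so the reference check runs and fails.
err_checked = try_resize(buf, 8, refcheck=True)

# With the check skipped, the in-place resize goes ahead; new slots are zero-filled.
err_unchecked = try_resize(buf, 8, refcheck=False)

print(err_checked)
print(err_unchecked, buf.shape)
```

Skipping the check is harmless here because every reference points at the same array object. The risk with refcheck=False comes from views or other objects that share the old data buffer.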

I started a branch that simply exposes the refcheck keyword through the various places needed; the commit can be seen on my fork of pandas here.

The caller in this patch can know with certainty that there are no other users of the object, and so refcheck=False is safe. The changes are a bit intrusive, so I am opening this as an issue first hoping it generates some discussion before I issue a pull request.

@jreback
Contributor

jreback commented Mar 31, 2017

@mattip I think you could just change this to refcheck=False. This is used internally to avoid holding the GIL for a variable-length array (for numeric types); for object dtypes this doesn't do anything. The .resize is purely internal to the routine; ultimately the result is then returned to the user.
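The pattern described here, a buffer that is resized only inside the routine and handed to the user at the end, might look roughly like the sketch below. GrowableVector and unique_ints are hypothetical names for illustration only; this is not the pandas hashtable vector code, just an assumption-laden analogue in pure Python.

```python
import numpy as np


class GrowableVector:
    """Hypothetical append-only int64 buffer; not the pandas implementation."""

    def __init__(self, capacity=4):
        # The buffer is private: no reference to it escapes while it grows.
        self._data = np.empty(capacity, dtype=np.int64)
        self._n = 0

    def append(self, value):
        capacity = self._data.shape[0]
        if self._n == capacity:
            # Only this object holds the buffer, so skipping the refcount
            # check is safe, and it also works on PyPy.
            self._data.resize(max(4, 2 * capacity), refcheck=False)
        self._data[self._n] = value
        self._n += 1

    def to_array(self):
        # Trim to the used length, then hand the buffer over to the caller.
        self._data.resize(self._n, refcheck=False)
        out, self._data = self._data, np.empty(0, dtype=np.int64)
        self._n = 0
        return out


def unique_ints(values):
    """Collect first occurrences in order, growing the buffer as needed."""
    seen = set()
    vec = GrowableVector()
    for v in values:
        if v not in seen:
            seen.add(v)
            vec.append(v)
    return vec.to_array()


result = unique_ints([3, 3, 1, 3, 1, 2, 5, 7, 7, 9])
print(result)
```

The design point is that ownership transfers exactly once, in to_array, after the last resize. Once the array has been returned, the vector starts over with a fresh buffer rather than resizing memory the caller may now hold.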

jreback added the Compat and Internals labels on Mar 31, 2017
mattip mentioned this issue on May 2, 2017
@mattip
Contributor Author

mattip commented May 2, 2017

The change unfortunately percolates out to pure-Python code where uniques is allocated; see, for instance, the diff in factorize() from algorithms.py in pull request #16193.

@jreback
Contributor

jreback commented May 11, 2017

closed by #16258

jreback closed this as completed on May 11, 2017