Skip to content

BUG: Hashtable size hint cap #22805

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jan 15, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1828,6 +1828,7 @@ Groupby/Resample/Rolling
- Calling :meth:`pandas.core.groupby.GroupBy.rank` with empty groups and ``pct=True`` was raising a ``ZeroDivisionError`` (:issue:`22519`)
- Bug in :meth:`DataFrame.resample` when resampling ``NaT`` in ``TimeDeltaIndex`` (:issue:`13223`).
- Bug in :meth:`DataFrame.groupby` did not respect the ``observed`` argument when selecting a column and instead always used ``observed=False`` (:issue:`23970`)
- Bug preventing hash table creation with very large number (2^32) of rows (:issue:`22805`)

Reshaping
^^^^^^^^^
Expand Down
12 changes: 8 additions & 4 deletions pandas/_libs/hashtable_class_helper.pxi.in
Original file line number Diff line number Diff line change
Expand Up @@ -262,9 +262,10 @@ dtypes = [('Float64', 'float64', True, 'np.nan'),

cdef class {{name}}HashTable(HashTable):

def __cinit__(self, size_hint=1):
def __cinit__(self, int64_t size_hint=1):
self.table = kh_init_{{dtype}}()
if size_hint is not None:
size_hint = min(size_hint, _SIZE_HINT_LIMIT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can type the size_hint as a uint_t I think to avoid this problem

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's getting typed as a uint on the next line - we actually want it untyped here so it could accept arbitrarily large values, just needs capped before before being converted to the c-type (what was causing the original error)

kh_resize_{{dtype}}(self.table, size_hint)

def __len__(self):
Expand Down Expand Up @@ -573,9 +574,10 @@ cdef class StringHashTable(HashTable):
# or a sentinel np.nan / None missing value
na_string_sentinel = '__nan__'

def __init__(self, int size_hint=1):
def __init__(self, int64_t size_hint=1):
self.table = kh_init_str()
if size_hint is not None:
size_hint = min(size_hint, _SIZE_HINT_LIMIT)
kh_resize_str(self.table, size_hint)

def __dealloc__(self):
Expand Down Expand Up @@ -876,9 +878,11 @@ cdef class StringHashTable(HashTable):

cdef class PyObjectHashTable(HashTable):

def __init__(self, size_hint=1):
def __init__(self, int64_t size_hint=1):
self.table = kh_init_pymap()
kh_resize_pymap(self.table, size_hint)
if size_hint is not None:
size_hint = min(size_hint, _SIZE_HINT_LIMIT)
kh_resize_pymap(self.table, size_hint)

def __dealloc__(self):
if self.table is not NULL:
Expand Down
8 changes: 8 additions & 0 deletions pandas/tests/test_algos.py
Original file line number Diff line number Diff line change
Expand Up @@ -1410,6 +1410,14 @@ def test_hashtable_factorize(self, htable, tm_dtype, writable):
expected_reconstruct = s_duplicated.dropna().values
tm.assert_numpy_array_equal(result_reconstruct, expected_reconstruct)

@pytest.mark.parametrize('hashtable', [
ht.PyObjectHashTable, ht.StringHashTable,
ht.Float64HashTable, ht.Int64HashTable, ht.UInt64HashTable])
def test_hashtable_large_sizehint(self, hashtable):
# GH 22729
size_hint = np.iinfo(np.uint32).max + 1
tbl = hashtable(size_hint=size_hint) # noqa


def test_quantile():
s = Series(np.random.randn(100))
Expand Down