BUG: 1.3.0rc1 pd.util.hash_array on Index fails to access _values_for_factorize #42003


Closed
2 of 3 tasks
TheNeuralBit opened this issue Jun 14, 2021 · 8 comments · Fixed by #45301
Labels
Enhancement Error Reporting Incorrect or improved errors from pandas

@TheNeuralBit
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas. (does not exist in 1.2.4, only seen in 1.3.0rc1 and master)

  • (optional) I have confirmed this bug exists on the master branch of pandas. (confirmed on 0b68d87)


Code Sample, a copy-pastable example

In [1]: import pandas as pd

In [2]: pd.util.hash_array(pd.DatetimeIndex(['2018-10-28 01:20:00'], tz='Europe/Berlin'))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-389898c5af02> in <module>
----> 1 pd.util.hash_array(pd.DatetimeIndex(['2018-10-28 01:20:00'], tz='Europe/Berlin'))

~/working_dir/pandas/pandas/core/util/hashing.py in hash_array(vals, encoding, hash_key, categorize)
    287     elif not isinstance(vals, np.ndarray):
    288         # i.e. ExtensionArray
--> 289         vals, _ = vals._values_for_factorize()
    290 
    291     return _hash_ndarray(vals, encoding, hash_key, categorize)

AttributeError: 'DatetimeIndex' object has no attribute '_values_for_factorize'

In [3]: pd.util.hash_array(pd.Index([1,2,3]))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-e66efa244441> in <module>
----> 1 pd.util.hash_array(pd.Index([1,2,3]))

~/working_dir/pandas/pandas/core/util/hashing.py in hash_array(vals, encoding, hash_key, categorize)
    287     elif not isinstance(vals, np.ndarray):
    288         # i.e. ExtensionArray
--> 289         vals, _ = vals._values_for_factorize()
    290 
    291     return _hash_ndarray(vals, encoding, hash_key, categorize)

AttributeError: 'Int64Index' object has no attribute '_values_for_factorize'

In [4]: pd.util.hash_array(pd.RangeIndex(1,3))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-af0900aae979> in <module>
----> 1 pd.util.hash_array(pd.RangeIndex(1,3))

~/working_dir/pandas/pandas/core/util/hashing.py in hash_array(vals, encoding, hash_key, categorize)
    287     elif not isinstance(vals, np.ndarray):
    288         # i.e. ExtensionArray
--> 289         vals, _ = vals._values_for_factorize()
    290 
    291     return _hash_ndarray(vals, encoding, hash_key, categorize)

AttributeError: 'RangeIndex' object has no attribute '_values_for_factorize'

Problem description

This issue looks similar to #41817, but that is specifically for DatetimeIndex with tz defined, while this seems to happen for any Index instance.

Expected Output

A hash of the input index, as in pandas < 1.3.0.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 2dd9e9b
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.28-1rodete1-amd64
Version : #1 SMP Debian 5.10.28-1rodete1 (2021-04-30)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0rc1
numpy : 1.19.5
pytz : 2021.1
dateutil : 2.8.1
pip : 20.2.1
setuptools : 49.2.1
Cython : 0.29.22
pytest : 6.2.2
hypothesis : 6.4.0
sphinx : 3.5.1
blosc : 1.10.2
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.21.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.7
fastparquet : 0.5.0
gcsfs : 0.7.2
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : 0.5.2
scipy : 1.6.1
sqlalchemy : 1.3.23
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.17.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.52.0

@TheNeuralBit TheNeuralBit added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 14, 2021
@jbrockmendel
Member

hash_array is only supposed to get ndarray or ExtensionArray (see the annotation)

@TheNeuralBit
Contributor Author

(see the annotation)

I'm sorry I'm not sure what annotation you're referring to?

It does look like this is working as intended (WAI) though, and I should prefer pd.util.hash_pandas_object for my application.

Would a cleaner error message be a welcome change here?

@jbrockmendel
Member

I'm sorry I'm not sure what annotation you're referring to?

The annotation for vals on hash_array is ArrayLike, which is an alias for np.ndarray | ExtensionArray.
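For readers who want to keep calling hash_array, one way to satisfy that annotation is to convert the Index to an ndarray first. This is a minimal sketch, assuming a plain integer Index as in the report; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

idx = pd.Index([1, 2, 3])

# hash_array is annotated for ndarray / ExtensionArray input,
# so hand it the underlying ndarray rather than the Index itself.
hashed = pd.util.hash_array(idx.to_numpy())

print(hashed.dtype)  # one uint64 hash per element
print(hashed.shape)
```

Other Index types may need a different conversion; this sketch only covers the integer case.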

and I should prefer pd.util.hash_pandas_object for my application

Yes, that should accept Indexes just fine.
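As a rough illustration of that suggestion, the sketch below passes two of the Index types from the report to pd.util.hash_pandas_object. It assumes default arguments; the variable names are illustrative only.

```python
import pandas as pd

int_index = pd.Index([1, 2, 3])
range_index = pd.RangeIndex(1, 3)

# hash_pandas_object accepts Index (as well as Series and DataFrame)
# and returns a Series of uint64 hashes, one per element.
int_hashes = pd.util.hash_pandas_object(int_index)
range_hashes = pd.util.hash_pandas_object(range_index)

print(int_hashes.dtype, len(int_hashes))
print(range_hashes.dtype, len(range_hashes))
```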

@jbrockmendel
Member

Would a cleaner error message be a welcome change here?

sure
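A sketch of what a cleaner error could look like, written as a user-side wrapper rather than as the change that eventually landed in pandas. The function name checked_hash_array and the message wording are hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray


def checked_hash_array(vals):
    """Hypothetical wrapper: validate the input type, then delegate to hash_array."""
    if not isinstance(vals, (np.ndarray, ExtensionArray)):
        # Fail early with a TypeError instead of an AttributeError
        # raised from deep inside the hashing code.
        raise TypeError(
            "hash_array requires np.ndarray or ExtensionArray, "
            f"not {type(vals).__name__}. Use hash_pandas_object instead."
        )
    return pd.util.hash_array(vals)


try:
    checked_hash_array(pd.Index([1, 2, 3]))
except TypeError as err:
    message = str(err)
    print(message)
```

The point of the check is that the error names the accepted types and the alternative function, instead of exposing the private _values_for_factorize attribute.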

@lithomas1 lithomas1 added Error Reporting Incorrect or improved errors from pandas Enhancement and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 14, 2021
@simonjayhawkins
Member

This issue looks similar to #41817, but that is specifically for DatetimeIndex with tz defined, while this seems to happen for any Index instance.

The code sample in the OP includes a tz, so that is not a change in 1.3.0rc1.

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jun 15, 2021
@simonjayhawkins
Member

simonjayhawkins commented Jun 15, 2021

The public api for hash_array is at https://pandas.pydata.org/docs/reference/api/pandas.util.hash_array.html

The docstring has been updated in #39949 and the api widened to accept any EA and numpy array. https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/pandas.util.hash_array.html

So Index is not an accepted input, but maybe we should also add something to the 1.3 release notes since it worked for some Index types before. @jbrockmendel?

@simonjayhawkins
Member

  • does not exist in 1.2.4, only seen in 1.3.0rc1 and master

first bad commit: [9664284] TYP: fix ignores (#40452)

@jbrockmendel
Member

maybe we should also add something to the 1.3 release notes since it worked for some Index types before

makes sense

5 participants