BUG: pd.Series.duplicated/pd.core.algorithms.duplicated fails with float-based SparseArray #48788

Closed
3 tasks done
tehunter opened this issue Sep 26, 2022 · 2 comments · Fixed by #55255
Labels
Bug · duplicated (duplicated, drop_duplicates) · Sparse (Sparse Data Type)

Comments

@tehunter
Contributor

tehunter commented Sep 26, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> import numpy as np
>>> arr = pd.arrays.SparseArray([0,1,np.nan,1])
>>> pd.core.algorithms.duplicated(arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pandas/pandas/core/algorithms.py", line 1063, in duplicated
    values = _ensure_data(values)
  File "/home/pandas/pandas/core/algorithms.py", line 177, in _ensure_data
    if values.dtype.itemsize in [2, 12, 16]:  # type: ignore[union-attr]
AttributeError: 'SparseDtype' object has no attribute 'itemsize'

The same error occurs with pd.Series(arr).duplicated().

Issue Description

_ensure_data reads dtype.itemsize for float dtypes. A float-based SparseArray is treated as a float type, but its SparseDtype does not implement an itemsize attribute, hence the AttributeError. Addressing #48424 may solve this; the missing itemsize attribute also causes the problem reported in #42070.
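To make the description above concrete, here is a minimal sketch that inspects the dtype of the array from the reproducible example. It only uses public attributes; the variable names are illustrative, and whether SparseDtype exposes itemsize may depend on the pandas version, so that check is printed rather than assumed.

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([0, 1, np.nan, 1])

sparse_dtype = arr.dtype          # the extension dtype wrapping the values
subtype = sparse_dtype.subtype    # the underlying NumPy dtype

print(type(sparse_dtype).__name__)   # SparseDtype
print(subtype, subtype.itemsize)     # float64 8

# The float branch of _ensure_data looks up itemsize on the dtype itself.
# On the version in this report the attribute is missing; other versions
# may differ, so the result is printed rather than asserted.
print(hasattr(sparse_dtype, "itemsize"))
```

The NumPy subtype does carry an itemsize, which is why the lookup only fails when it is made on the SparseDtype wrapper.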

Expected Behavior

pd.core.algorithms.duplicated(arr) and pd.Series(arr).duplicated() return without error.
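For reference, the dense equivalent of the same values shows the result one would expect. Converting to a dense array first is also a possible stopgap; this is an assumption on my part, not a fix proposed in this issue, and it gives up the memory savings of the sparse representation.

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([0, 1, np.nan, 1])

# Densify first so the values reach duplicated() as a plain float64 ndarray.
dense = np.asarray(arr)
mask = pd.Series(dense).duplicated()

print(mask.tolist())  # [False, False, False, True]
```

Only the second 1 is flagged, because duplicated() defaults to keep="first".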

Installed Versions

INSTALLED VERSIONS

commit : 24da5be
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release :
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.6.0.dev0+203.g24da5be949.dirty
numpy : 1.22.4
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 65.3.0
pip : 22.2.2
Cython : 0.29.32
pytest : 7.1.3
hypothesis : 6.54.6
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.0.3
IPython : 8.5.0
pandas_datareader: 0.10.0
bs4 : 4.11.1
bottleneck : 1.3.5
brotli :
fastparquet : 0.8.3
fsspec : 2021.11.0
gcsfs : 2021.11.0
matplotlib : 3.6.0
numba : 0.55.2
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.17.8
pyarrow : 9.0.0
pyreadstat : 1.1.9
pyxlsb : 1.0.9
s3fs : 2021.11.0
scipy : 1.9.1
snappy :
sqlalchemy : 1.4.41
tables : 3.7.0
tabulate : 0.8.10
xarray : 2022.6.0
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : 0.18.0
tzdata : None

@tehunter added the Bug and Needs Triage labels on Sep 26, 2022
@tehunter changed the title from "BUG: pd.core.algorithms.duplicated fails with float-based SparseArray" to "BUG: pd.Series.duplicated/pd.core.algorithms.duplicated fails with float-based SparseArray" on Sep 26, 2022
@EarsAndEggs

Greetings, I'm running into the same error: AttributeError: 'SparseDtype' object has no attribute 'itemsize'

This is how I currently generate the DataFrame. Writing it with HDFStore fails with that error:

import pandas as pd

# sparse_matrix is a SciPy sparse matrix; column_names holds its column labels
mat = pd.DataFrame.sparse.from_spmatrix(sparse_matrix, columns=column_names)
store = pd.HDFStore('sparse_matrix.h5')
store.put('mutation_matrix', mat, format='table', data_columns=True)
store.close()

Are there any updates on fixes?
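Until a fix lands, one option is to convert the sparse columns to dense ones before handing the frame to other code. The sketch below is only an illustration under stated assumptions: the small frame and its column names are made up, and this thread does not confirm whether densifying avoids the HDFStore error. For a large matrix the dense copy may not fit in memory.

```python
import pandas as pd

# Hypothetical stand-in for the frame built with from_spmatrix.
sparse_df = pd.DataFrame(
    {
        "a": pd.arrays.SparseArray([0, 1, 0, 0]),
        "b": pd.arrays.SparseArray([0.0, 0.0, 2.5, 0.0]),
    }
)

# The .sparse accessor converts every sparse column to its dense dtype.
dense_df = sparse_df.sparse.to_dense()

print(sparse_df.dtypes)
print(dense_df.dtypes)
```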

@topper-123 added the Sparse and duplicated labels and removed the Needs Triage label on May 16, 2023
@topper-123
Contributor

I can confirm this bug. A PR would be welcome.
