Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example

import pandas as pd

df1 = pd.DataFrame({"a": ["a1", "a2", "a3"]})
df1.to_csv("test.csv", index=False)
df2 = pd.read_csv("test.csv", dtype_backend="pyarrow")
df2.drop_duplicates(subset="a")  # error
Issue Description

Read the CSV with read_csv and dtype_backend="pyarrow", and confirm that the dtype of column a is string[pyarrow]. Calling drop_duplicates on that column then raises NotImplementedError.
The detailed error info:

NotImplementedError                       Traceback (most recent call last)
Cell In[6], line 1
----> 1 df2.drop_duplicates(subset="a")

File /opt/homebrew/lib/python3.11/site-packages/pandas/core/frame.py:6569, in DataFrame.drop_duplicates(self, subset, keep, inplace, ignore_index)
   6566 inplace = validate_bool_kwarg(inplace, "inplace")
   6567 ignore_index = validate_bool_kwarg(ignore_index, "ignore_index")
-> 6569 result = self[-self.duplicated(subset, keep=keep)]
   6570 if ignore_index:
   6571     result.index = default_index(len(result))

File /opt/homebrew/lib/python3.11/site-packages/pandas/core/frame.py:6705, in DataFrame.duplicated(self, subset, keep)
   6701     raise KeyError(Index(diff))
   6703 if len(subset) == 1 and self.columns.is_unique:
   6704     # GH#45236 This is faster than get_group_index below
-> 6705     result = self[subset[0]].duplicated(keep)
   6706     result.name = None
   6707 else:

File /opt/homebrew/lib/python3.11/site-packages/pandas/core/series.py:2484, in Series.duplicated(self, keep)
   2408 def duplicated(self, keep: DropKeep = "first") -> Series:
   2409     """
   2410     Indicate duplicate Series values.
   2411
   (...)
   2482     dtype: bool
   2483     """
-> 2484     res = self._duplicated(keep=keep)
   2485     result = self._constructor(res, index=self.index, copy=False)
   2486     return result.__finalize__(self, method="duplicated")

File /opt/homebrew/lib/python3.11/site-packages/pandas/core/base.py:1368, in IndexOpsMixin._duplicated(self, keep)
   1366 @final
   1367 def _duplicated(self, keep: DropKeep = "first") -> npt.NDArray[np.bool_]:
-> 1368     return algorithms.duplicated(self._values, keep=keep)

File /opt/homebrew/lib/python3.11/site-packages/pandas/core/algorithms.py:1002, in duplicated(values, keep)
   1000 if hasattr(values, "dtype"):
   1001     if isinstance(values.dtype, ArrowDtype):
-> 1002         values = values._to_masked()  # type: ignore[union-attr]
   1004     if isinstance(values.dtype, BaseMaskedDtype):
   1005         values = cast("BaseMaskedArray", values)

File /opt/homebrew/lib/python3.11/site-packages/pandas/core/arrays/arrow/array.py:1964, in ArrowExtensionArray._to_masked(self)
   1962     na_value = True
   1963 else:
-> 1964     raise NotImplementedError
   1966 dtype = _arrow_dtype_mapping()[pa_dtype]
   1967 mask = self.isna()

NotImplementedError:
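The last two frames of the traceback suggest why the call fails: duplicated converts every Arrow-backed array with _to_masked, and _to_masked ends in a bare raise for types it has no masked equivalent for. The toy model below only illustrates that control flow. It is not pandas source code: the function names and the string "kinds" are hypothetical, and the numeric branch is an assumption, since the traceback shows only the na_value = True branch and the final else.

```python
# Toy model of the dispatch visible in the traceback; NOT pandas source code.
# All names here are hypothetical and exist only for illustration.


def placeholder_for_missing(kind: str):
    """Pick a fill value for the masked array that would replace an Arrow array."""
    if kind in ("integer", "floating"):
        # Assumed branch: it is not visible in the traceback.
        return 1
    elif kind == "boolean":
        # Corresponds to the visible line `na_value = True`.
        return True
    else:
        # Corresponds to the visible line `raise NotImplementedError`.
        raise NotImplementedError(f"no masked equivalent for {kind!r}")


def duplicated_via_masked(kind: str) -> str:
    """Convert first, as algorithms.duplicated does, and let any error propagate."""
    placeholder_for_missing(kind)
    return "ok"


for kind in ("integer", "boolean", "string"):
    try:
        print(kind, "->", duplicated_via_masked(kind))
    except NotImplementedError as exc:
        print(kind, "-> NotImplementedError:", exc)
```

In this model a string column falls through to the final branch, which matches the error reported above.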
Expected Behavior

drop_duplicates should work on a string[pyarrow] column when a local CSV is read with the pyarrow dtype backend.
Installed Versions

commit : ba1cccd
python : 3.11.5.final.0
python-bits : 64
OS : Darwin
OS-release : 23.0.0
Version : Darwin Kernel Version 23.0.0: Tue Aug 1 03:25:17 PDT 2023; root:xnu-10002.0.242.0.6~31/RELEASE_ARM64_T8103
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.1.0
numpy : 1.25.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 68.1.2
pip : 23.2.1
Cython : None
pytest : 7.4.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
I think this is a duplicate of #54904 (fixed by #54913). It works correctly on the main branch, so you will need to wait for the 2.1.1 release.
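Until 2.1.1 is available, one possible workaround on 2.1.0 is to compute the duplicate mask on an object-dtype copy of the key columns and use it to filter the original frame. This is only a sketch: the helper name drop_duplicates_compat is hypothetical (it is not part of pandas), and it assumes the extra object-dtype copy is affordable for the columns involved.

```python
import pandas as pd


def drop_duplicates_compat(df: pd.DataFrame, subset, keep="first") -> pd.DataFrame:
    """Drop duplicate rows using an object-dtype copy of the key columns.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Accept either a single column label or a list of labels.
    cols = [subset] if isinstance(subset, str) else list(subset)
    # Casting to object is meant to avoid the Arrow-specific conversion
    # that raises in the traceback above.
    mask = df[cols].astype(object).duplicated(keep=keep)
    # Filter the original frame so that its column dtypes are preserved.
    return df[~mask]


# Sample data with default dtypes, so the snippet runs without pyarrow.
df = pd.DataFrame({"a": ["a1", "a2", "a1"], "b": [1, 2, 3]})
result = drop_duplicates_compat(df, subset="a")
print(result)
```

With a frame returned by read_csv(..., dtype_backend="pyarrow") the call is the same. Once the fix is installed, the plain df.drop_duplicates(subset="a") call should be preferred.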