I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

for keep_val in ['first', 'last']:
    print(f"{keep_val = }")

    series = pd.Series([1, 2, 2, 3, 4, 4, 5, 5, 5])
    # Identify duplicates (erroneously finds 5 twice)
    mask = series.duplicated(keep=keep_val)
    print(series[mask])

    data = pd.Series(['1', '2', '2', '3', '4', '4', '5', '5', '5'])
    # Identify duplicates (erroneously finds 5 twice)
    mask = data.duplicated(keep=keep_val)
    print(data[mask])
Issue Description
keep='first'|'last' should only return one instance of each duplicated value.
In the above examples it returns ['2', '4', '5', '5'], not ['2', '4', '5'].
keep='last' returns the '5' at index 6 and 7.
keep='first' returns the '5' at index 7 and 8.
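To make the reported positions concrete, the sketch below lists which index labels `Series.duplicated` flags for each `keep` option on the integer series from the example. The dictionary name `flagged` is only an illustrative choice, and `keep=False` is included for comparison even though the report does not use it.

```python
import pandas as pd

series = pd.Series([1, 2, 2, 3, 4, 4, 5, 5, 5])

# For each keep option, collect the index labels that duplicated() marks True.
flagged = {
    keep: series.index[series.duplicated(keep=keep)].tolist()
    for keep in ("first", "last", False)
}

for keep, labels in flagged.items():
    print(f"keep={keep!r}: {labels}")
# keep='first': [2, 5, 7, 8]
# keep='last': [1, 4, 6, 7]
# keep=False: [1, 2, 4, 5, 6, 7, 8]
```

The value 5 occurs three times, so two of its rows are flagged under either `'first'` or `'last'`; only the single kept occurrence is left unflagged.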
Expected Behavior
series.duplicated(keep='first'|'last') should only return one instance of each duplicated value.
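If the goal is one entry per value that occurs more than once, one possible sketch is to collapse the flagged rows with `drop_duplicates`. This is only one way to get that result, not something the report itself proposes; the names `repeats` and `one_per_value` are illustrative.

```python
import pandas as pd

data = pd.Series(['1', '2', '2', '3', '4', '4', '5', '5', '5'])

# Rows flagged as repeats of an earlier value (a value seen three times
# contributes two rows here).
repeats = data[data.duplicated(keep="first")]

# Collapse the flagged rows so each repeated value appears exactly once.
one_per_value = repeats.drop_duplicates(keep="first")

print(one_per_value.tolist())  # ['2', '4', '5']
```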
Installed Versions
pandas : 2.2.2
numpy : 1.25.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.1.2
pip : 23.2.1
Cython : 3.0.0a10
pytest : 7.4.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : 2.9.7
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.9.0
gcsfs : None
matplotlib : 3.7.2
numba : 0.58.1
numexpr : 2.8.5
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2023.9.0
scipy : 1.11.2
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2023.8.0
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
take
Idiot mistake.
Series.duplicated marks all duplicates as True except for the first/last occurrence.
Therefore
mask = series.duplicated(keep=keep_val)
print(series[mask])
shows the items to drop in order to remove duplicates, not the first occurrence of every duplicate.
Therefore there is no bug here.
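The explanation above can be checked with a short sketch: filtering out the flagged rows should give the same result as `Series.drop_duplicates` with the same `keep`. The `results` dictionary and the variable names are illustrative.

```python
import pandas as pd

series = pd.Series([1, 2, 2, 3, 4, 4, 5, 5, 5])

results = {}
for keep in ("first", "last"):
    mask = series.duplicated(keep=keep)
    kept = series[~mask]     # rows that survive deduplication
    dropped = series[mask]   # rows the mask says to drop

    # Removing the flagged rows matches drop_duplicates with the same keep.
    assert kept.equals(series.drop_duplicates(keep=keep))
    results[keep] = (kept.index.tolist(), dropped.index.tolist())

for keep, (kept_idx, dropped_idx) in results.items():
    print(f"keep={keep!r}: kept {kept_idx}, dropped {dropped_idx}")
# keep='first': kept [0, 1, 3, 4, 6], dropped [2, 5, 7, 8]
# keep='last': kept [0, 2, 3, 5, 8], dropped [1, 4, 6, 7]
```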
Fix issue pandas-dev#59333
1b7dc81
NavaneetthaSundararaj