BUG: DataFrame.drop_duplicates method fails when a column with a list dtype is present #56784

Closed
3 tasks done
zdafoe opened this issue Jan 8, 2024 · 7 comments · Fixed by #57670
Labels: Docs, duplicated, good first issue

Comments

zdafoe commented Jan 8, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

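import pandas as pd

# Two identical rows; the item_ids column holds Python lists.
df = pd.DataFrame(
    [
        {"number": 1, "item_ids": [1, 2, 3]},
        {"number": 1, "item_ids": [1, 2, 3]},
    ]
)
df.drop_duplicates()  # raises TypeError: unhashable type: 'list'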

Issue Description

DataFrame.drop_duplicates and DataFrame.duplicated fail when subset includes a column whose values are lists and subset contains more than one column. When subset is omitted, as in the example above, all columns are used.

Expected Behavior

The call should output:

   number   item_ids
0      1   [1, 2, 3]

Installed Versions

INSTALLED VERSIONS

commit : 04b45b1
python : 3.11.7.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-14-generic
Version : #14-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 14 14:59:49 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.3.0.dev0+54.g04b45b10b1
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 69.0.3
pip : 23.3.1
Cython : None
pytest : 7.4.3
...
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None

zdafoe added the Bug and Needs Triage labels on Jan 8, 2024
rhshadrach (Member) commented Jan 8, 2024

Thanks for the report, but these methods rely on hashing and mutable objects are not hashable. As such, I don't think we can support such columns.

However, this limitation is not in the documentation. I think it would be good to add this documentation.
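
A minimal sketch of one possible workaround, not something proposed in this thread: since the failure comes from hashing list values, the duplicate mask can be computed on a copy where the lists are converted to tuples, which are hashable. The names `hashable` and `deduped` are illustrative, and the sketch assumes the lists are flat (a list nested inside a list would still be unhashable after a single `tuple` call).

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"number": 1, "item_ids": [1, 2, 3]},
        {"number": 1, "item_ids": [1, 2, 3]},
    ]
)

# Lists are unhashable but tuples are hashable, so build a hashable copy
# of the frame with the list column converted element by element.
hashable = df.assign(item_ids=df["item_ids"].apply(tuple))

# Compute the duplicate mask on the hashable copy, then filter the
# original frame so the surviving rows still hold lists.
deduped = df[~hashable.duplicated()]
print(deduped)
```

Filtering the original frame with the mask keeps the list values intact; calling `hashable.drop_duplicates()` directly would also work but would leave tuples in the result.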

rhshadrach added the Docs, duplicated, and good first issue labels and removed the Bug and Needs Triage labels on Jan 8, 2024
EdDonlon commented Jan 9, 2024

take

manlattan removed their assignment on Jan 19, 2024
manlattan (Contributor) commented

take

dvl-mehnaz commented

import pandas as pd

data={ "number": 1, "item_ids":[[1, 2, 3]],
"number": 1, "item_ids":[[1, 2, 3]]}
df=pd.DataFrame(data)
df

drop_duplicates and duplicated do not work when the subset includes a list column.
If we change how the DataFrame is created, we don't need the pandas functions duplicated and drop_duplicates.
The duplicates are removed without calling those functions.

flaviaouyang (Contributor) commented

take

KonstantineTzaferis commented

This is a nice issue to work on since I am a newcomer.
Can I work on this?

KonstantineTzaferis commented

import pandas as pd

data={ "number": 1, "item_ids":[[1, 2, 3]],
"number": 1, "item_ids":[[1, 2, 3]]}
df=pd.DataFrame(data)
df

drop_duplicates and duplicated do not work when the subset includes a list column. If we change how the DataFrame is created, we don't need the pandas functions duplicated and drop_duplicates. The duplicates are removed without calling those functions.

In my code the function still fails when a column holds list values:
it throws TypeError: unhashable type: 'list'
Could you elaborate on how to create the DataFrame in a way that avoids the error?
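
A short sketch of what the dictionary-based snippet quoted above most likely does, offered as an explanation rather than a fix: a Python dict literal with repeated keys keeps only one entry per key, so the "duplicate" is discarded by Python before pandas ever sees it. Variable names follow the quoted snippet.

```python
import pandas as pd

# A dict literal with repeated keys keeps one entry per key
# (the last value written for each key wins).
data = {"number": 1, "item_ids": [[1, 2, 3]],
        "number": 1, "item_ids": [[1, 2, 3]]}
print(len(data))  # 2 keys: "number" and "item_ids"

# The scalar 1 is broadcast against the one-element outer list,
# so the resulting frame has a single row.
df = pd.DataFrame(data)
print(df.shape)  # (1, 2)
```

No deduplication happens inside pandas here. A frame that really contains two rows with list values, as in the original report, still hits the TypeError.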
