df.duplicated and drop_duplicates raise TypeError with unhashable values. #12693
Comments
I guess you are using a list-like value inside a cell of a frame. This is quite inefficient and not generally supported. Pull requests to fix this are accepted in any event.
Current pandas gives a slightly different TypeError. In any case, since you're dealing with an object dtype, there is no guarantee that the next row won't contain a set or a list, so this deduplication only gets worse from there. pandas treats each value separately and processes them as long as they are hashable. Just try a column with three tuples and it will work; then change the last one to a set and it will fail on that very value. So I'm not sure there's a solid implementation that would work here given that lists are unhashable. There could potentially be a fix for sets, which would be converted to frozensets upon hash-map insertion, but that seems hacky and arbitrary.
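To make the behavior described above concrete, here is a small sketch (the column name `x` and the sample values are illustrative, not from the original report):

```python
import pandas as pd

# Hashable values (tuples) inside object-dtype cells: deduplication works,
# because every cell can be inserted into the hash table.
df = pd.DataFrame({"x": [("a", "b"), ("b",), ("a", "b")]})
print(df.duplicated().tolist())  # [False, False, True]

# Swap the last value for a set and, on pandas releases predating the fix
# mentioned later in this thread, the same call raised
# TypeError: unhashable type: 'set' on that very value.
df_mixed = pd.DataFrame({"x": [("a", "b"), ("b",), {"a", "b"}]})
```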
How about ignoring unhashable columns for the purposes of dropping duplicates?
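That suggestion could be sketched with a small helper (hypothetical, not a pandas API; `drop_duplicates_hashable` and the sample frame are illustrative) that deduplicates only on the columns whose values are all hashable:

```python
from collections.abc import Hashable

import pandas as pd

def drop_duplicates_hashable(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, considering only all-hashable columns.

    Note: a tuple that contains a list passes the isinstance check
    but would still fail to hash; this is a rough sketch, not a
    complete solution.
    """
    hashable_cols = [
        c for c in df.columns
        if df[c].map(lambda v: isinstance(v, Hashable)).all()
    ]
    return df.loc[~df.duplicated(subset=hashable_cols)]

df = pd.DataFrame({"key": [1, 1, 2], "payload": [["a"], ["b"], ["c"]]})
print(drop_duplicates_hashable(df))  # rows 0 and 2 survive; row 1 dropped
```

One downside, as noted in the reply above, is that rows differing only in the unhashable columns would be collapsed.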
The case in the OP is fixed on main:

```python
import pandas as pd

print(pd.__version__)
df = pd.DataFrame([[{"a", "b"}], [{"b", "c"}], [{"b", "a"}]])
print(df.duplicated())
print(df.drop_duplicates())
```

and for lists too:

```python
df = pd.DataFrame([[["a", "b"]], [["b"]], [["a", "b"]]])
print(df.duplicated())
print(df.drop_duplicates())
```

Fixed in commit [235113e] PERF: Improve performance for df.duplicated with one column subset (#45534), but it will still fail for a multi-column DataFrame:

```python
print(pd.__version__)
df = pd.DataFrame([[{"a", "b"}], [{"b", "c"}], [{"b", "a"}]]).T
print(df.duplicated())
```
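Until the multi-column case is handled, a common workaround (a sketch, not a pandas API; `to_hashable` and the sample frame are illustrative) is to compute the duplicate mask on an elementwise-converted copy and index back into the original:

```python
import pandas as pd

def to_hashable(v):
    # Map common unhashable containers onto hashable stand-ins,
    # as suggested above (sets -> frozensets, lists -> tuples).
    if isinstance(v, set):
        return frozenset(v)
    if isinstance(v, list):
        return tuple(v)
    return v

df = pd.DataFrame({"tags": [{"a", "b"}, {"b", "c"}, {"b", "a"}],
                   "n": [1, 2, 1]})

# Deduplicate on the converted copy, keep the original values.
mask = df.apply(lambda col: col.map(to_hashable)).duplicated()
print(df.loc[~mask])  # row 2 dropped: {"b", "a"} == {"a", "b"} and n matches
```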
I have a test case that also throws this error when trying to use uncertainties in anything other than a Series (or a one-column DataFrame). Not only does the third case fail (using a combination of uncertainties and quantities), but the fourth case fails with the aforementioned TypeError.
AffineScalarFunc is a synonym for UFloat from the uncertainties package. It results from doing math on a ufloat(nominal_value, error_value), which makes it affine and no longer simply a ufloat.