Hashable DataFrames #3882

Closed
hayd opened this issue Jun 13, 2013 · 12 comments · Fixed by #3884

@hayd
Contributor

hayd commented Jun 13, 2013

See this SO answer; they want to use memoisation.

OP points out that this gives a different result for each new DataFrame, even with identical contents (presumably it hashes off `id`):

hash(pd.DataFrame([1,2,3])) 

Should DataFrames be hashable, or should `hash` raise? (Does it defeat the point of hashing if hashing is expensive?) cc @cpcloud
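To illustrate the memoisation problem raised above, here is a minimal sketch in plain Python. `Frame`, `memoise`, and `total` are hypothetical stand-ins (not pandas code); the only assumption is that the object falls back to Python's default identity-based hash, as the issue suspects.

```python
class Frame:
    """Hypothetical stand-in for a DataFrame-like container (not pandas)."""

    def __init__(self, values):
        self.values = list(values)

    # No __eq__ or __hash__ is defined, so Python falls back to
    # identity-based equality and an id-derived hash.


def memoise(func):
    """Cache results in a dict keyed on the (hashed) argument."""
    cache = {}

    def wrapper(arg):
        if arg not in cache:
            cache[arg] = func(arg)
        return cache[arg]

    wrapper.cache = cache
    return wrapper


@memoise
def total(frame):
    return sum(frame.values)


a = Frame([1, 2, 3])
b = Frame([1, 2, 3])

total(a)
total(b)
# Same contents, but two cache entries: a fresh object never hits the cache.
cache_size = len(total.cache)

# Mutating `a` does not change its identity, so the cache returns a stale sum.
a.values.append(4)
stale = total(a)
```

With an identity-based hash the cache is both useless for equal-but-distinct objects and wrong after mutation, which is the case for making `hash` raise instead.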

@cpcloud
Member

cpcloud commented Jun 13, 2013

i've been thinking about this off and on. a somewhat related issue is that of the empty frame, i.e., DataFrame(). i think the PandasObject NDFrame should raise in all cases, since that's what numpy does (overriding where it makes sense and is useful). i guess you could make the empty DataFrame hashable, but that doesn't seem worth the effort; who needs to hash empty DataFrames?

@jreback
Contributor

jreback commented Jun 13, 2013

Series raises on `__hash__`, as should all NDFrame objects: because they are mutable, hashing is meaningless. OTOH, Index objects are hashable, as they are immutable.
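The mutability argument can be sketched as follows. `ContentHashed` is a hypothetical class (not part of pandas) that hashes by its current contents; the example shows a dict key becoming unreachable once it is mutated, and that Python's built-in mutable `list` refuses to hash for the same reason.

```python
class ContentHashed:
    """Hypothetical mutable container that hashes by its contents."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        return isinstance(other, ContentHashed) and self.values == other.values

    def __hash__(self):
        return hash(tuple(self.values))


key = ContentHashed([1, 2, 3])
lookup = {key: "cached result"}

found_before = key in lookup  # hash matches the one stored at insertion

key.values.append(4)          # mutate the key after insertion
found_after = key in lookup   # the hash changed, so the entry is no longer found

# Built-in mutable containers avoid this trap by raising.
try:
    hash([1, 2, 3])
    list_error = None
except TypeError as exc:
    list_error = str(exc)
```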

@cpcloud
Member

cpcloud commented Jun 13, 2013

Indexes are currently not hashable, since they try to hash the underlying ndarray.
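A rough sketch of why delegating to the underlying array fails: NumPy arrays are unhashable, so any `__hash__` that forwards to one raises. To keep the example dependency-free, a plain `list` stands in for the ndarray, and `ArrayBackedIndex` is a hypothetical name, not the pandas class.

```python
class ArrayBackedIndex:
    """Hypothetical index wrapper; a list stands in for the ndarray."""

    def __init__(self, data):
        self._data = list(data)

    def __hash__(self):
        # Forwarding to an unhashable container propagates its TypeError.
        return hash(self._data)


try:
    hash(ArrayBackedIndex(["a", "b"]))
    hashable = True
except TypeError:
    hashable = False
```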

@jreback
Contributor

jreback commented Jun 13, 2013

yes..you are right...oh well my argument is bad then!

@hayd
Contributor Author

hayd commented Jun 13, 2013

Ah, you're right, I didn't even check Series; it's just DataFrame which should raise.

Easy fix (raise in `__hash__` for generics), PR on the way.

@cpcloud
Member

cpcloud commented Jun 13, 2013

could implement this for indices...thoughts?

@cpcloud
Member

cpcloud commented Jun 13, 2013

in that case u should probably hash the name, number of levels, class, and dtype
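One way to read this suggestion, as a sketch only: hash cheap metadata and leave the values to `__eq__`. `MetaIndex` and its attributes are hypothetical and do not mirror the actual pandas `Index` API.

```python
class MetaIndex:
    """Hypothetical index-like object that hashes metadata only."""

    def __init__(self, values, name=None, nlevels=1, dtype="object"):
        self.values = tuple(values)
        self.name = name
        self.nlevels = nlevels
        self.dtype = dtype

    def _meta(self):
        # Name, number of levels, class, and dtype, as suggested above.
        return (self.name, self.nlevels, type(self), self.dtype)

    def __hash__(self):
        # Cheap: independent of how many values the index holds.
        return hash(self._meta())

    def __eq__(self, other):
        return (
            isinstance(other, MetaIndex)
            and self._meta() == other._meta()
            and self.values == other.values
        )


x = MetaIndex([1, 2], name="k", dtype="int64")
y = MetaIndex([3, 4], name="k", dtype="int64")

same_hash = hash(x) == hash(y)  # identical metadata, so the hashes collide
distinct = len({x, y})          # __eq__ still tells the two apart
```

The trade-off is that hashing stays cheap, but indices sharing metadata always collide, so lookups among them fall back to full equality comparisons.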

@jreback
Contributor

jreback commented Jun 13, 2013

still have the mutability issue

though I suppose if the user accepts this it would be nice to deal with it

I would table to 0.12 for now

@hayd
Contributor Author

hayd commented Jun 13, 2013

So, at the moment I've put this in NDFrame.

Maybe it should go in PandasObject, and then have objects which should hash override it (like if we can get indices to hash using that clever method). Are there any besides Index/MultiIndex?

@cpcloud
Member

cpcloud commented Jun 13, 2013

i vote for default to not hashable. better to alert the user to non-hashability rather than possibly giving misleading ideas about the hashability of things

@jreback
Contributor

jreback commented Jun 13, 2013

agree....non-hashability is/should be the default until we change the API

@hayd
Contributor Author

hayd commented Jun 13, 2013

ok I've moved it to PandasObject, removes repeated code too. :)
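The pattern agreed on in this thread (unhashable by default in the base class, with an opt-in override for immutable types) might look roughly like this. The class names and the error message are illustrative assumptions, not the code from the linked PR.

```python
class BaseObject:
    """Hypothetical base class standing in for PandasObject."""

    def __hash__(self):
        # Default: refuse to hash, rather than silently hashing by id.
        raise TypeError(
            f"{type(self).__name__!r} objects are mutable and cannot be hashed"
        )


class MutableFrame(BaseObject):
    """Inherits the raising __hash__ unchanged."""

    def __init__(self, values):
        self.values = list(values)


class FrozenLabels(BaseObject):
    """Immutable subclass that opts back in to hashing."""

    def __init__(self, labels):
        self._labels = tuple(labels)

    def __eq__(self, other):
        return isinstance(other, FrozenLabels) and self._labels == other._labels

    def __hash__(self):
        return hash(self._labels)


try:
    hash(MutableFrame([1, 2, 3]))
    frame_hashable = True
except TypeError:
    frame_hashable = False

labels_match = hash(FrozenLabels("ab")) == hash(FrozenLabels("ab"))
```

Setting `__hash__ = None` in the class body is the other common way to mark a class unhashable in Python; raising explicitly allows a more descriptive message.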


3 participants