ENH: consider using sets and not maps for isin, unique and duplicated #39799
Labels
duplicated
duplicated, drop_duplicates
Enhancement
isin
isin method
Performance
Memory or execution speed performance
Is your feature request related to a problem?
Currently, in
isin
,unique
andduplicated
temporary hash-maps are created, albeit sets would be enough (forisin
and for some inputs tounique
andduplicated
). Thus has at least two drawbacks:Describe the solution you'd like
Pandas uses khash, which also provides set-implementations, which would solve the above two problems.
Given the current infrastructure it would be no big deal to add sets as well.
Another advantage would be: prior investigations have shown, that the usages of sets (as in
isin
) and of maps lead to different optimal strategies for hash-functions and collision-resolution-strategies (see e.g. #36729 (comment))Describe alternatives you've considered
Given that the most OSes nowdays don't really reserve memory for a proccess if it is not used/written to, one could claim, that the above two problems aren't problems at all (at least for bigger datasets) - because in the above algorithms no values are written to
values
.The above claim is true, as long as there is no rehashing involved. Once a map is rehashed, the garbage values are copied to the new location, which leads OS to reserve this memory to the process. Thus, preallocating big enough maps/hashes would not trigger the usage of additional memory. However, right now it doesn't not happend for different reasons, one of them being the limitation of the size of the preallocated table to 2²⁰ elements (which is a sane approach imo).
However, using sets and not maps would be a more robust and explicit way to avoid unnecessary memory usage.
The text was updated successfully, but these errors were encountered: