Skip to content

ENH: consider using sets and not maps for isin, unique and duplicated #39799

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
realead opened this issue Feb 13, 2021 · 2 comments
Open

ENH: consider using sets and not maps for isin, unique and duplicated #39799

realead opened this issue Feb 13, 2021 · 2 comments
Labels
duplicated duplicated, drop_duplicates Enhancement isin isin method Performance Memory or execution speed performance

Comments

@realead
Copy link
Contributor

realead commented Feb 13, 2021

Is your feature request related to a problem?

Currently, in isin, unique and duplicated temporary hash-maps are created, albeit sets would be enough (for isin and for some inputs to unique and duplicated). Thus has at least two drawbacks:

  1. Quite a memory overhead: For every key there are additional 8 bytes overhead for values. This ratio will become even worse for 32bit/16bit keys, once they are be used.
  2. For big datasets, the running time of the above algorithms is dominated by cache-misses. Thus having twice as many cache-misses, because also values are touched could mean a factor 2 slowdown.

Describe the solution you'd like

Pandas uses khash, which also provides set-implementations, which would solve the above two problems.

Given the current infrastructure it would be no big deal to add sets as well.

Another advantage would be: prior investigations have shown, that the usages of sets (as in isin) and of maps lead to different optimal strategies for hash-functions and collision-resolution-strategies (see e.g. #36729 (comment))

Describe alternatives you've considered

Given that the most OSes nowdays don't really reserve memory for a proccess if it is not used/written to, one could claim, that the above two problems aren't problems at all (at least for bigger datasets) - because in the above algorithms no values are written to values.

The above claim is true, as long as there is no rehashing involved. Once a map is rehashed, the garbage values are copied to the new location, which leads OS to reserve this memory to the process. Thus, preallocating big enough maps/hashes would not trigger the usage of additional memory. However, right now it doesn't not happend for different reasons, one of them being the limitation of the size of the preallocated table to 2²⁰ elements (which is a sane approach imo).

However, using sets and not maps would be a more robust and explicit way to avoid unnecessary memory usage.

@realead realead added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 13, 2021
@jbrockmendel
Copy link
Member

makes sense. any idea how this would affect build size?

@realead
Copy link
Contributor Author

realead commented Feb 14, 2021

@jbrockmendel my best guess would be similar to this change: about 1MB in the wheel and about 4MB in installation.

However, if you are serious about keeping the size small, maybe using -g0 or -g1 while compiling would be an option?

It looks as if there is options to build with debug symbols in the setup file:

pandas/setup.py

Lines 422 to 425 in a27244d

if debugging_symbols_requested:
extra_compile_args.append("-g")
extra_compile_args.append("-UNDEBUG")
extra_compile_args.append("-O0")

However, in the log I see -g even without --with-debugging-symbols, probably because the python-version was built this way.

When I add -g0 or -g1 (maybe a better trade-off) I get the following sizes for current hashtable.xxxx.so-file

     -g      8MB
     -g0    1MB
     -g1    2MB

I like to have symbols in the *.so (-g1 option), which makes profiling easier/possible to understand. But how often does somebody debug an C-extension? It is probably Ok to say, they have to build with debug symbols themselves rather than to let all the others pay for that.

@jbrockmendel jbrockmendel added Performance Memory or execution speed performance and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 6, 2021
@mroeschke mroeschke added duplicated duplicated, drop_duplicates isin isin method labels Aug 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duplicated duplicated, drop_duplicates Enhancement isin isin method Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

3 participants