Reducing memory footprint for categoricals #6219
Comments
Can you give an example of what you are thinking?
Here's a comparison of memory use for a Series of 8-character strings. I used pympler to measure the memory footprint of Python string types. Again, we don't treat categorical data as categorical, and for string data in particular the overhead is substantial.
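The per-string overhead described above can be reproduced with the standard library alone. Here `sys.getsizeof` is a rough stand-in for pympler's `asizeof` (it ignores interning and shared storage), and the 8-character payload matches the example in the comment:

```python
import sys

s = "abcdefgh"                     # an 8-character ASCII string
total = sys.getsizeof(s)           # whole object: header + payload
overhead = total - len(s)          # bytes spent on the object header alone
print(total, overhead)             # the header dwarfs the 8 bytes of data
```

On CPython 3 the object header alone is several times larger than the 8 characters of actual data, which is why a dictionary-encoded (categorical) representation pays off so quickly for repeated strings.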
If this is your use case, then absolutely this is a great idea. I don't think it would be that difficult, actually: you would simply inherit a new block type. We already have all of the hash table machinery, so it is easy to map locations to values (and back) for fast lookups. The downside is that slicing becomes more tricky (and potentially slower, because you have to figure out what the slice is and then create it, as opposed to taking a direct slice of memory; but maybe not so bad, because you don't need to potentially copy big memory either). A nice project. 👍
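A minimal sketch of the value-to-code mapping described above, with a plain Python dict standing in for pandas' internal hash tables (the names here are illustrative, not pandas API):

```python
import numpy as np

def dictionary_encode(values):
    """Map each distinct value to a small integer code (factorization)."""
    uniques = []                              # code -> value
    table = {}                                # hash table: value -> code
    codes = np.empty(len(values), dtype=np.int32)
    for i, v in enumerate(values):
        code = table.get(v)
        if code is None:
            code = table[v] = len(uniques)
            uniques.append(v)
        codes[i] = code
    return codes, uniques

codes, uniques = dictionary_encode(["a", "b", "a", "a", "c"])
# codes   -> [0, 1, 0, 0, 2]
# uniques -> ["a", "b", "c"]
```

Slicing then operates on the small `codes` array, and decoding a position is just `uniques[codes[i]]`, which is the "figure out what the slice is" step the comment mentions.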
Closing in favor of #5313, which already covers this.
Python uses ~32 bytes for every int.
NumPy uses 4 bytes for an int32 (with which you can index billions of keys).
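The size gap is easy to check; the exact Python-object figure varies by interpreter version and platform, so the printed numbers are indicative only:

```python
import sys
import numpy as np

py_int_bytes = sys.getsizeof(0)               # whole Python int object, header included
np_int_bytes = np.dtype("int32").itemsize     # raw element size inside a numpy array
print(py_int_bytes, np_int_bytes)
```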
We currently have a hardwired `if` preventing single-level MultiIndexes,
which would otherwise let us take advantage of factorization to reduce memory
consumption by ~8x for indices that have lots of duplicates.
Indexes tend to be unique (by their nature), but the same principle applies
to string columns, which are often highly degenerate.
Basically, if we store categorical data as categorical, we can
reduce the memory footprint drastically. If we do the factorization
at the stage where the data is read in (perhaps in conjunction with the iterator-based
reader planned for 0.14, #2193), we can drastically reduce peak memory usage as well.
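With the categorical dtype that modern pandas ships (the direction tracked in #5313), the savings are straightforward to demonstrate. The column contents below are made up for illustration:

```python
import numpy as np
import pandas as pd

n = 100_000
# A highly degenerate string column: only 3 distinct values.
colors = pd.Series(np.random.choice(["red", "green", "blue"], size=n))

obj_bytes = colors.memory_usage(deep=True)                     # object dtype: pointer + string object per row
cat_bytes = colors.astype("category").memory_usage(deep=True)  # small integer codes + 3 stored strings
print(obj_bytes, cat_bytes, round(obj_bytes / cat_bytes, 1))
```

For a column this degenerate the categorical form typically uses an order of magnitude less memory, since each row shrinks from a pointer plus a full string object down to a single-byte code.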
Update: #5313