ENH: change the category type to use unsigned int for the internal pointers #38918
Comments
We actually use int8 through int64 for the codes, depending on the number of categories. Please show a real example where this actually matters.
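One way to see which integer width is chosen is to inspect the dtype of `Categorical.codes` for a few category counts. This is a minimal sketch; `codes_dtype` is a hypothetical helper name, and the exact switch-over points are whatever the installed pandas version implements.

```python
import pandas as pd


def codes_dtype(n_categories):
    """Return the dtype pandas picks for the codes of a Categorical
    with `n_categories` distinct integer values."""
    cat = pd.Categorical(list(range(n_categories)))
    return cat.codes.dtype


# Print the codes dtype and bytes per element for a few category counts.
for n in (126, 127, 40_000):
    dtype = codes_dtype(n)
    print(n, dtype, dtype.itemsize)
```

The per-row cost of a categorical column is the `itemsize` of this dtype, so each step up in width doubles the memory used by the codes.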
Hi (sorry for the delay, I was out for a couple of weeks),
My point is that using uint8 through uint64, rather than int8 through int64, might cut memory usage in half in certain cases. In the example below you can see that increasing the number of unique values from 126 (col 'a') to 127 (col 'b') results in column 'b' consuming roughly twice the memory of column 'a'. This implies that the pointer size increased from 1 byte to 2 bytes. If the pointer were of type uint8 and not int8, the size should not increase until there are more than 256 unique values in the column.
code:
```python
import pandas as pd

df = pd.DataFrame()
# Column 'a' has 126 unique values, column 'b' has 127.
df['a'] = (list(range(126))*1000)[0:100_000]
df['b'] = (list(range(127))*1000)[0:100_000]
df[['a','b']] = df[['a','b']].astype('category')
df.info()  # info() prints directly and returns None
print()
print(df.memory_usage(deep=True))
```
output:
@meizy the trouble with this approach is that Categoricals encode missing values with -1.
@jbrockmendel thanks. While maybe not elegant, -1 can continue to be a special value even when moving to uint. It would be a sort of pre-existing category representing missing values. I admit I'm not familiar with this code. What do you think?
So you'd be relying on overflow so effectively using e.g. 255 as the sentinel. I'm not sure off the top of my head how that will interact with the usage of |
I think we can close this issue? As long as we store missing values in the codes as -1, it doesn't seem like a good idea to change to unsigned types.
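The sentinel discussed in this thread can be observed directly. The sketch below shows the -1 code that marks a missing value, and what the same byte looks like when reinterpreted as uint8 (the "255 as the sentinel" case mentioned above). The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A small Categorical with one missing value.
cat = pd.Categorical(["x", None, "y", "x"])

codes = cat.codes                    # signed codes; the missing entry is -1
as_unsigned = codes.view(np.uint8)   # same bytes reinterpreted as unsigned

print(codes.dtype, codes.tolist())
print(as_unsigned.dtype, as_unsigned.tolist())

# Code that tests for missing values with `codes == -1` no longer matches
# once the codes are unsigned; the sentinel would have to be compared
# against the wrapped-around value instead.
print((codes == -1).tolist())
print((as_unsigned == 255).tolist())
```

With an unsigned dtype the top value of each width is consumed by the sentinel, so uint8 would leave room for 255 categories rather than 256.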
Is your feature request related to a problem?
It seems that the current implementation of the category type uses signed ints for the per-item pointers.
Thus, it takes 2 bytes (int16) per item as long as the number of unique items is fewer than 32K, then it goes up to 4 bytes per item.
Since there is no need for the sign here, we could use unsigned ints (uint16) and keep the pointers at 2 bytes for up to 64K unique items.
The same is true for moving from 1 byte to 2 bytes, or from 4 bytes to 8 bytes.
This can save a lot of memory for very large dataframes which use category types extensively.
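The arithmetic behind this request can be sketched in plain Python. `code_width_bytes` is a hypothetical helper using idealized capacities (2^(bits-1) non-negative values for signed, 2^bits for unsigned); it does not reproduce pandas' exact thresholds, and it ignores any value reserved as a missing-value sentinel. The row and category counts are made-up figures.

```python
def code_width_bytes(n_categories, signed=True):
    """Smallest integer width (in bytes) whose non-negative range can
    index `n_categories` distinct values (idealized, no sentinel)."""
    for bits in (8, 16, 32, 64):
        capacity = 2 ** (bits - 1) if signed else 2 ** bits
        if n_categories <= capacity:
            return bits // 8
    raise ValueError("too many categories for a 64-bit code")


n_rows = 100_000_000   # assumed size of a large column
n_categories = 40_000  # between 32K and 64K unique items

signed_bytes = n_rows * code_width_bytes(n_categories, signed=True)
unsigned_bytes = n_rows * code_width_bytes(n_categories, signed=False)

print(signed_bytes // 10**6, "MB with signed codes")
print(unsigned_bytes // 10**6, "MB with unsigned codes")
```

The saving only appears when the category count falls in the upper half of an unsigned range (for example between 32K and 64K); below that, both variants use the same width.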
Describe the solution you'd like
Change the internal pointers to use unsigned int.
API breaking implications
n/a
Describe alternatives you've considered
n/a
Additional context