ENH: change the category type to use unsigned int for the internal pointers #38918

Closed
meizy opened this issue Jan 3, 2021 · 6 comments
Labels: Categorical, Closing Candidate, Needs Discussion, Performance

Comments

@meizy

meizy commented Jan 3, 2021

Is your feature request related to a problem?

It seems that the current implementation of the category type uses signed integers for the per-item pointers.
As a result, each item takes 2 bytes (int16) as long as the number of unique items is below roughly 32K; beyond that it goes up to 4 bytes per item.
Since the sign is not needed here, we could use unsigned integers (uint16) and keep the pointers at 2 bytes for up to 64K unique items.
The same applies when moving from 1 byte to 2 bytes, or from 4 bytes to 8 bytes.
This could save a lot of memory for very large DataFrames that use category types extensively.
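
To make the arithmetic concrete, here is a minimal sketch in plain Python of the largest value each integer width can hold, signed versus unsigned. It only illustrates the ranges; it does not show how pandas actually chooses a dtype, and the helper name `max_values` is made up for this example.

```python
def max_values(bits):
    """Return (signed_max, unsigned_max) for an integer of the given bit width."""
    signed_max = 2 ** (bits - 1) - 1   # one bit is spent on the sign
    unsigned_max = 2 ** bits - 1       # every bit holds magnitude
    return signed_max, unsigned_max


for bits in (8, 16, 32, 64):
    signed_max, unsigned_max = max_values(bits)
    print(f"{bits:>2}-bit: signed max = {signed_max}, unsigned max = {unsigned_max}")
```

At each width the unsigned maximum is roughly double the signed one, which is where the "half the memory near a boundary" argument comes from.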

Describe the solution you'd like

Change the internal pointers to use unsigned integers.

API breaking implications

n/a

Describe alternatives you've considered

n/a

Additional context

meizy added the Enhancement and Needs Triage labels on Jan 3, 2021
@jreback
Contributor

jreback commented Jan 3, 2021

we actually use int8 thru int64 for the codes depending on the size

pls show a real example where this actually matters
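
One way to see the size-dependent choice of codes dtype is to build Categoricals with different numbers of categories and inspect `.codes.dtype`. This is a minimal sketch; the helper `codes_dtype` is a hypothetical name, and the exact switch-over points may depend on the pandas version.

```python
import pandas as pd


def codes_dtype(n_categories):
    """Return the dtype of the codes for a Categorical with n distinct integer categories."""
    return pd.Categorical(list(range(n_categories))).codes.dtype


# Probe a few sizes around the int8 and int16 limits.
for n in (126, 127, 32766, 32767):
    print(n, codes_dtype(n))
```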

@meizy
Author

meizy commented Jan 22, 2021

Hi,

(sorry for the delay, was out for a couple of weeks)

My point is that using uint8 through uint64, rather than int8 through int64, could cut memory usage in half in certain cases.

In the example below, increasing the number of unique values from 126 (col 'a') to 127 (col 'b') results in column 'b' consuming roughly twice the memory of column 'a'. This implies that the pointer size increased from 1 byte to 2 bytes. If the pointer were of type uint8 rather than int8, the size would not need to increase until there are more than 256 unique values in the column.

code:

import pandas as pd

df = pd.DataFrame()

df['a'] = (list(range(126))*1000)[0:100_000]
df['b'] = (list(range(127))*1000)[0:100_000]
df[['a','b']] = df[['a','b']].astype('category')

print(df.info())
print()
print(df.memory_usage(deep=True))

output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype   
---  ------  --------------   -----   
 0   a       100000 non-null  category
 1   b       100000 non-null  category
dtypes: category(2)
memory usage: 305.1 KB
None

Index       128
a        106128
b        206136
dtype: int64
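
The inference about pointer size can be checked directly by looking at the codes behind each column. The following is a minimal sketch that rebuilds the two columns as standalone Series (an assumption made for brevity; the original uses a DataFrame) and prints the codes dtype and the raw byte count of the codes array.

```python
import pandas as pd

n_rows = 100_000
a = pd.Series((list(range(126)) * 1000)[:n_rows]).astype("category")
b = pd.Series((list(range(127)) * 1000)[:n_rows]).astype("category")

for name, col in (("a", a), ("b", b)):
    codes = col.cat.codes  # one integer code per row
    print(name, codes.dtype, codes.to_numpy().nbytes)
```

The remainder of each figure reported by `memory_usage(deep=True)` comes from the categories themselves and per-object overhead, which this sketch does not break down.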

@jbrockmendel
Member

@meizy the trouble with this approach is that Categoricals encode missing values with -1. There has been some discussion of changing that, but implementation is not at all trivial.
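
For readers unfamiliar with the encoding, a minimal sketch of how a missing value shows up in the codes (the sample values are arbitrary):

```python
import pandas as pd

cat = pd.Categorical(["x", None, "y", "x"])

print(cat.codes)       # the None entry is stored as code -1
print(cat.categories)  # only the real values appear as categories
```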

jbrockmendel added the Categorical and Performance labels and removed the Enhancement and Needs Triage labels on Jun 6, 2021
@meizy
Author

meizy commented Jun 7, 2021

@jbrockmendel thanks. While maybe not elegant, -1 can continue to be a special value even if moving to uint. Sort of pre-existing category representing missing values. I admit I'm not familiar with this code. What do you think?

@jbrockmendel
Member

> -1 can continue to be a special value even if moving to uint. Sort of pre-existing category representing missing values. I admit I'm not familiar with this code. What do you think?

So you'd be relying on overflow, effectively using e.g. 255 as the sentinel. I'm not sure off the top of my head how that would interact with the usage of get_indexer, so it may be tricky. I'd focus my efforts on #37930
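
A minimal sketch of both halves of that concern, using NumPy and pandas directly rather than any Categorical internals: reinterpreting an int8 -1 as uint8 yields 255, and `Index.get_indexer` reports labels it cannot find as -1. How the two would interact inside Categorical is not shown here.

```python
import numpy as np
import pandas as pd

# Reinterpret the same bytes as unsigned: -1 becomes the largest uint8 value.
signed = np.array([0, 1, -1], dtype=np.int8)
as_unsigned = signed.view(np.uint8)
print(as_unsigned)

# get_indexer marks labels that are not in the index with -1.
idx = pd.Index(["x", "y"])
positions = idx.get_indexer(["y", "z"])
print(positions)
```

Note that reserving 255 as the sentinel would leave codes 0 through 254 for real categories, so a uint8 could address 255 categories rather than 256.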

mroeschke added the Needs Discussion label on Aug 14, 2021
@jorisvandenbossche
Member

I think we can close this issue? As long as we store missing values in the codes as -1, it doesn't seem like a good idea to change to unsigned types.

mroeschke added the Closing Candidate label on Dec 23, 2021