ENH: change the category type to use unsigned int for the internal pointers #38918

meizy · 2021-01-03T11:40:05Z

Is your feature request related to a problem?

It seems that the current implementation of the category type uses signed integers for the per-item pointers.
Thus, it takes 2 bytes (int16) per item as long as the number of unique items is less than 32K, then it goes up to 4 bytes per item.
Since there is no need for the sign here, we could use unsigned integers (uint16) and keep the pointers at 2 bytes for up to 64K unique items.
The same is true for moving from 1 byte to 2 bytes, or from 4 bytes to 8 bytes.
This can save a lot of memory for very large dataframes which use category types extensively.
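
To make the arithmetic behind these thresholds concrete, here is a minimal sketch in plain Python that computes the largest value each integer width can hold. It only illustrates the ranges of the types; it does not call into pandas, and the name `limits` is just an illustrative variable.

```python
# Maximum value representable by signed vs. unsigned integers of each width.
limits = {}
for bits in (8, 16, 32, 64):
    signed_max = 2 ** (bits - 1) - 1   # one bit is spent on the sign
    unsigned_max = 2 ** bits - 1       # all bits are available for magnitude
    limits[bits] = (signed_max, unsigned_max)
    print(f"int{bits}: {signed_max}   uint{bits}: {unsigned_max}")
```

At every width the unsigned maximum is roughly double the signed one, which is where the "up to half the memory in certain cases" estimate comes from.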

Describe the solution you'd like

Change the internal pointers to use unsigned int.

API breaking implications

n/a

Describe alternatives you've considered

n/a

Additional context

jreback · 2021-01-03T15:02:58Z

we actually use int8 thru int64 for the codes depending on the size

pls show a real example where this actually matters

meizy · 2021-01-22T11:41:03Z

Hi,

(sorry for the delay, was out for a couple of weeks)

My point is that using uint8 thru uint64, rather than int8 thru int64, might cut memory usage in half in certain cases.

In the example below you can see that increasing the number of unique values from 126 (col 'a') to 127 (col 'b') results in column 'b' consuming roughly twice the memory of column 'a'. This implies that the pointer size increased from 1 byte to 2 bytes. If the pointer were of type uint8 rather than int8, the size would not need to increase until there are more than 256 unique values in the column.

code:

import pandas as pd

df = pd.DataFrame()

df['a'] = (list(range(126))*1000)[0:100_000]
df['b'] = (list(range(127))*1000)[0:100_000]
df[['a','b']] = df[['a','b']].astype('category')

print(df.info())
print()
print(df.memory_usage(deep=True))

output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype   
---  ------  --------------   -----   
 0   a       100000 non-null  category
 1   b       100000 non-null  category
dtypes: category(2)
memory usage: 305.1 KB
None

Index       128
a        106128
b        206136
dtype: int64
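
The jump can also be checked directly by looking at the integer codes rather than at total memory. This is a small sketch that rebuilds the two columns from the example and inspects `Series.cat.codes`; the variable names are illustrative only.

```python
import pandas as pd

# Same construction as in the example: 126 vs. 127 distinct values.
a = pd.Series((list(range(126)) * 1000)[:100_000]).astype("category")
b = pd.Series((list(range(127)) * 1000)[:100_000]).astype("category")

for name, col in (("a", a), ("b", b)):
    codes = col.cat.codes  # integer codes pointing into col.cat.categories
    print(name, codes.dtype, codes.to_numpy().nbytes)
```

The remaining bytes reported by `memory_usage(deep=True)` come from storing the categories themselves, which is nearly identical for the two columns.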

jbrockmendel · 2021-06-06T17:45:10Z

@meizy the trouble with this approach is that Categoricals encode missing values with -1. There has been some discussion of changing that, but implementation is not at all trivial.
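
For readers unfamiliar with the encoding mentioned here, a short sketch showing how a missing value appears in the codes of a `pd.Categorical`. The sample data is made up for illustration.

```python
import pandas as pd

# One missing entry among two distinct categories.
cat = pd.Categorical(["x", None, "y", "x"])

print(cat.categories)  # the distinct non-missing values
print(cat.codes)       # the missing entry is stored as -1
```

Because -1 is not a valid position in `cat.categories`, it can act as the missing-value marker only while the codes are stored in a signed type.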

meizy · 2021-06-07T04:57:02Z

@jbrockmendel thanks. While maybe not elegant, -1 can continue to be a special value even if moving to uint. Sort of a pre-existing category representing missing values. I admit I'm not familiar with this code. What do you think?

jbrockmendel · 2021-06-07T19:54:20Z

> -1 can continue to be a special value even if moving to uint. Sort of pre-existing category representing missing values. I admit I'm not familiar with this code. What do you think?

So you'd be relying on overflow, effectively using e.g. 255 as the sentinel. I'm not sure off the top of my head how that will interact with the usage of get_indexer, so it may be tricky. I'd focus my efforts on #37930
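
The overflow being described can be sketched with NumPy alone. This only demonstrates the wraparound of -1 to 255 for an 8-bit unsigned type; the "reserved sentinel" layout is a hypothetical illustration of the proposal, not something pandas does.

```python
import numpy as np

# Casting -1 to an unsigned 8-bit value wraps around to the top of the range.
wrapped = np.array([-1], dtype=np.int8).astype(np.uint8)
print(wrapped)  # [255]

# Hypothetical layout: if 255 were reserved as the missing-value sentinel,
# the usable codes would be 0 through 254.
sentinel = int(wrapped[0])
usable_codes = sentinel
print(usable_codes)  # 255
```

Under that assumption, any code that compares against -1 would have to compare against a width-dependent value instead (255 for uint8, 65535 for uint16, and so on).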

jorisvandenbossche · 2021-12-23T19:49:34Z

I think we can close this issue? As long as we store missing values in the codes as -1, it doesn't seem like a good idea to change to unsigned types.

meizy added the Enhancement and Needs Triage labels Jan 3, 2021

jbrockmendel added the Categorical and Performance labels and removed the Enhancement and Needs Triage labels Jun 6, 2021

mroeschke added the Needs Discussion label Aug 14, 2021

mroeschke added the Closing Candidate label Dec 23, 2021

mroeschke closed this as completed Mar 13, 2022
