Commit fdc51c2

pdpark authored and root committed
Doc: Different example using categorical data type for efficient storage
Closes: pandas-dev#12509
1 parent 360e8a1 commit fdc51c2

1 file changed

+43
-75
lines changed

doc/source/cookbook.rst

+43-75
@@ -1323,107 +1323,75 @@ Categorical Data
13231323
----------------
13241324

13251325
`Using categorical data type to store data more efficiently and consistently in multiple DataFrames
1326-
<https://stackoverflow.com/questions/29709918/pandas-and-category-replacement/29712287#29712287>`
13271326

1328-
`More information about categorical data <http://pandas.pydata.org/pandas-docs/stable/categorical.html>`
1327+
`For more information about categorical data see <http://pandas.pydata.org/pandas-docs/stable/categorical.html>`
13291328

1330-
.. ipython:: python
1331-
1332-
import pandas as pd
1333-
import string
1334-
import numpy as np
1335-
from pandas.api.types import CategoricalDtype
1336-
1337-
Create a numpy array of all the ascii letters
1338-
1339-
.. ipython:: python
1340-
1341-
np.random.seed(1234)
1342-
pd.set_option('max_rows',10)
1343-
uniques = np.array(list(string.ascii_letters))
1344-
uniques
1345-
1346-
Create a DataFrame of one column from a subset of the unique values.
1347-
1348-
.. ipython:: python
1349-
1350-
df1 = pd.DataFrame({'A' : uniques.take(np.random.randint(0,len(uniques)/2+5,size=1000000))})
1351-
df1.head()
1352-
1353-
.. ipython:: python
1354-
1355-
df1.A.nunique()
1356-
1357-
Create a second DataFrame also of one column but utilizing all of the unique values this time.
1358-
1359-
.. ipython:: python
1360-
1361-
df2 = pd.DataFrame({'A' : uniques.take(np.random.randint(0,len(uniques),size=1000000))})
1362-
df2.head()
1329+
Create two DataFrames with one column each from a subset of the unique values with overlap. Append the dataframes and examine the memory usage.
13631330

13641331
.. ipython:: python
13651332
1366-
df2.A.nunique()
1367-
1368-
Create a second column in the first DataFrame with the values of the first column using a Categorical data type. The unique set of category values for the new column are derived from the data used to create the column. Extract the unique set of categories for this column as an index object.
1369-
1370-
.. ipython:: python
1371-
1372-
df1['B'] = df1.A.astype('category')
1373-
i = df1.B.cat.categories
1374-
i
1375-
1376-
Extract the unique set of categories as a second index from the column in the second DataFrame.
1333+
from pandas.api.types import CategoricalDtype
13771334
1378-
.. ipython:: python
1335+
domain = np.array(['A','B','C','D','E','F','G','H','I','J','a','b','c','d','e','f','g','h','i','j'])
13791336
1380-
i2 = df2.A.astype('category').cat.categories
1381-
i2
1337+
df1 = pd.DataFrame({'A': domain.take(np.random.randint(0,12,size=100000))})
1338+
df2 = pd.DataFrame({'A': domain.take(np.random.randint(8,20,size=100000))})
1339+
print('df1.A Data Type:', df1.A.dtype)
1340+
print('df2.A Data Type:', df2.A.dtype)
13821341
1383-
Use the symmetric difference operator on the two indexes to get the unique set of categorical values not in both lists and add those values to the categories from the df1.B categories.
1342+
df3 = df1.append(df2)
1343+
print('df3.A Data Type:', df3.A.dtype)
1344+
print(df3.memory_usage())
13841345
1385-
`Index symmetric_difference<https://pandas.pydata.org/pandas-docs/version/0.21.0/generated/pandas.Index.symmetric_difference.html>`
1346+
Try changing data types to categories to see if memory usage can be improved.
13861347

13871348
.. ipython:: python
13881349
1389-
cats = i.tolist() + (i ^ i2).tolist()
1390-
print(cats)
1350+
dfc1 = df1
1351+
dfc2 = df2
1352+
dfc1.A = dfc1.A.astype('category')
1353+
dfc2.A = dfc2.A.astype('category')
1354+
print(dfc1.memory_usage())
1355+
print(dfc2.memory_usage())
1356+
dfc3 = dfc1.append(dfc2)
1357+
print('Data Type:', dfc3.A.dtype)
1358+
print(dfc3.memory_usage())
13911359
1392-
We've now recovered the original list of unique ascii letters.
1360+
That seemed to work: the first two DataFrames used a lot less memory. That is, until the second DataFrame was appended to the first one, at which point we're back to a column with an object data type taking as much memory as before.
1361+
The problem is that the categories of the two DataFrames' columns are incompatible.
13931362

13941363
.. ipython:: python
13951364
1396-
(np.array(sorted(cats)) == sorted(uniques)).all()
1365+
print(dfc1.A.cat.categories)
1366+
print(dfc2.A.cat.categories)
13971367
1398-
This unique list of categorical values can be used to create a CategoricalDtype. Columns created with this type will be of type Categorical and have the Categories of the CategoricalDtype specified rather than building a set of categories based on the data in the column.
1368+
Both original DataFrame columns must have the same set of categories for the column of the final DataFrame to keep the Category data type.
1369+
First, get the union of the categories in the two columns.
13991370

14001371
.. ipython:: python
14011372
1402-
cat_type = CategoricalDtype(categories=cats)
1403-
df2['B'] = df2['A'].astype(cat_type)
1373+
cats = df1.A.astype('category').cat.categories | df2.A.astype('category').cat.categories
1374+
(cats == domain).all()
14041375
1405-
Comparing the codes used for the categorical columns of the two DataFrames shows that the same codes are used for each.
1406-
1407-
Note: Column B in df2 has an extra value since it contains "Z" whereas column B in df1 does not.
1376+
This confirms that the union of the two sets of categories is the same as the original domain.
1377+
Second try: this time the categories are specified for both DataFrame category columns.
14081378

14091379
.. ipython:: python
14101380
1411-
df1[df1.B.isin(['A','a','z','Z'])].B.cat.codes.unique()
1412-
1413-
.. ipython:: python
1381+
dfc1 = df1
1382+
dfc2 = df2
1383+
dfc1.A = pd.Categorical(df1.A, categories=cats)
1384+
dfc2.A = pd.Categorical(df2.A, categories=cats)
1385+
print(list(enumerate(dfc1.A.cat.categories)) == list(enumerate(dfc2.A.cat.categories)))
1386+
dfc3 = dfc1.append(dfc2)
1387+
print('Data Type:', dfc3.A.dtype)
1388+
print(dfc3.memory_usage())
14141389
1415-
df2[df2.B.isin(['A','a','z','Z'])].B.cat.codes.unique()
1416-
1417-
The memory usage of the categorical column is much more efficient than the object type.
1390+
Much better! The resulting DataFrame's memory usage is far smaller now that the data type of the final DataFrame's column is Category.
1391+
In this case the mapping between categories and codes is the same in both of the original Categorical columns, so the column that results from appending keeps the same Categories.
14181392

14191393
.. ipython:: python
14201394
1421-
df2.dtypes
1422-
1423-
.. ipython:: python
1424-
1425-
df2.A.to_frame().memory_usage()
1426-
1427-
.. ipython:: python
1395+
[(cat,code) for code,cat in enumerate(dfc3.A.cat.categories)]
14281396
1429-
df2.B.to_frame().memory_usage()
1397+
Note: in case you are tempted to substitute set(df1.A.unique()) for df1.A.astype('category').cat.categories in the first step above, the latter is an order of magnitude faster.
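
For readers on pandas 2.0 or later, where `DataFrame.append` has been removed, the same idea can be sketched with `pd.concat`. This is a minimal sketch under stated assumptions, not the code from this commit: it uses `Index.union` instead of the `|` operator, builds a shared `CategoricalDtype`, and uses `astype`, which returns new objects so `df1` and `df2` are not modified through an alias. The names `shared`, `mismatched` and `matched` are illustrative only.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

rng = np.random.default_rng(1234)
domain = np.array(list("ABCDEFGHIJabcdefghij"))

# Two one-column frames drawn from overlapping slices of the domain.
df1 = pd.DataFrame({"A": domain.take(rng.integers(0, 12, size=100_000))})
df2 = pd.DataFrame({"A": domain.take(rng.integers(8, 20, size=100_000))})

# Categories inferred separately differ between the two columns,
# so the concatenated column does not stay categorical.
mismatched = pd.concat(
    [df1.astype("category"), df2.astype("category")], ignore_index=True
)

# Build one dtype from the union of the observed values and apply it to both.
cats = pd.Index(df1["A"].unique()).union(pd.Index(df2["A"].unique()))
shared = CategoricalDtype(categories=cats)
matched = pd.concat([df1.astype(shared), df2.astype(shared)], ignore_index=True)

print("mismatched dtype:", mismatched["A"].dtype)
print("matched dtype:   ", matched["A"].dtype)
print("mismatched bytes:", mismatched["A"].memory_usage(deep=True))
print("matched bytes:   ", matched["A"].memory_usage(deep=True))
```

Passing `deep=True` to `memory_usage` counts the string payloads as well as the pointers, which makes the comparison between the two results more representative.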

0 commit comments

Comments
 (0)