Commit 360e8a1

pdpark authored and root committed
Doc: Adds example of categorical data for efficient storage and
consistency across DataFrames Resolves pandas-dev#12509
1 parent 44c822d commit 360e8a1

1 file changed: +109 -0 lines changed

doc/source/cookbook.rst

+109
@@ -1318,3 +1318,112 @@ of the data values:
       'weight': [100, 140, 180],
       'sex': ['Male', 'Female']})
   df

Categorical Data
----------------

`Using categorical data type to store data more efficiently and consistently in multiple DataFrames
<https://stackoverflow.com/questions/29709918/pandas-and-category-replacement/29712287#29712287>`__

`More information about categorical data <http://pandas.pydata.org/pandas-docs/stable/categorical.html>`__

.. ipython:: python

   import pandas as pd
   import string
   import numpy as np
   from pandas.api.types import CategoricalDtype

Create a NumPy array of all the ASCII letters.

.. ipython:: python

   np.random.seed(1234)
   pd.set_option('display.max_rows', 10)
   uniques = np.array(list(string.ascii_letters))
   uniques

Create a one-column DataFrame from a subset of the unique values. The upper bound passed to ``randint`` is exclusive, so only the first 31 letters (``a`` to ``z`` and ``A`` to ``E``) can be drawn.

.. ipython:: python

   df1 = pd.DataFrame({'A': uniques.take(np.random.randint(0, len(uniques) // 2 + 5, size=1000000))})
   df1.head()

.. ipython:: python

   df1.A.nunique()
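
The arithmetic behind that subset can be checked without pandas. The following is a minimal sketch; the names ``upper`` and ``reachable`` are introduced here for illustration only and are not part of the recipe.

```python
import string

letters = string.ascii_letters  # lowercase a-z followed by uppercase A-Z

# Exclusive upper bound that the recipe passes to np.random.randint.
upper = len(letters) // 2 + 5

# Only letters at positions 0 .. upper - 1 can ever be drawn.
reachable = letters[:upper]
```

With 52 letters in total, ``upper`` is 31, which is why ``df1.A.nunique()`` cannot exceed 31.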

Create a second one-column DataFrame, this time drawing from all of the unique values.

.. ipython:: python

   df2 = pd.DataFrame({'A': uniques.take(np.random.randint(0, len(uniques), size=1000000))})
   df2.head()

.. ipython:: python

   df2.A.nunique()

Create a second column in the first DataFrame that holds the values of the first column as a categorical data type. The set of categories for the new column is derived from the data used to create it. Extract that set of categories as an ``Index`` object.

.. ipython:: python

   df1['B'] = df1.A.astype('category')
   i = df1.B.cat.categories
   i
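
To see what "derived from the data" means on a small scale, here is a sketch with a four-element Series (the Series ``s`` is made up for illustration). The inferred categories are the sorted unique values, and each code is the position of a value within those categories.

```python
import pandas as pd

s = pd.Series(["b", "a", "b", "c"])
cat = s.astype("category")

# Categories are inferred from the data and sorted.
categories = cat.cat.categories.tolist()

# Each code is the index of the value in the categories.
codes = cat.cat.codes.tolist()
```

Because the categories depend on the data, two columns with different contents can assign different codes to the same value, which is the problem the rest of this recipe addresses.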

Extract the set of categories from the column in the second DataFrame as a second ``Index``.

.. ipython:: python

   i2 = df2.A.astype('category').cat.categories
   i2

Take the symmetric difference of the two indexes to get the values that appear in only one of them, and append those values to the categories of ``df1.B``.

`Index symmetric_difference <https://pandas.pydata.org/pandas-docs/version/0.21.0/generated/pandas.Index.symmetric_difference.html>`__

.. ipython:: python

   cats = i.tolist() + i.symmetric_difference(i2).tolist()
   print(cats)
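
A small sketch of the same step, using two made-up indexes ``left`` and ``right`` in place of ``i`` and ``i2``. It assumes, as in the recipe, that every value in ``left`` also appears in ``right``.

```python
import pandas as pd

left = pd.Index(["A", "a", "b"])
right = pd.Index(["A", "Z", "a", "b"])

# Values present in exactly one of the two indexes.
only_one_side = left.symmetric_difference(right)

# Keep the existing categories first, then append the new ones.
combined = left.tolist() + only_one_side.tolist()

# Alternative that only picks up values missing from ``left``.
only_right = right.difference(left)
```

If ``left`` could contain values that ``right`` lacks, the symmetric difference would repeat them in ``combined``, and categories must be unique. In that situation ``right.difference(left)`` is probably the safer choice.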

Sorting both sequences confirms that ``cats`` now holds the full original set of ASCII letters.

.. ipython:: python

   (np.array(sorted(cats)) == sorted(uniques)).all()

This list of categories can be used to create a ``CategoricalDtype``. Columns cast to this type are categorical and use the categories specified by the dtype, rather than a set of categories inferred from the data in the column.

.. ipython:: python

   cat_type = CategoricalDtype(categories=cats)
   df2['B'] = df2['A'].astype(cat_type)
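
The effect of an explicit dtype can be sketched with two short Series. The category list and the values below are made up for illustration. Note that a value outside the declared categories becomes missing and gets the code ``-1``.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

shared = CategoricalDtype(categories=["A", "a", "z", "Z"])

s1 = pd.Series(["a", "A", "z"]).astype(shared)
s2 = pd.Series(["Z", "a", "q"]).astype(shared)  # "q" is not a category

# Codes come from the declared categories, not from each Series' data.
codes1 = s1.cat.codes.tolist()
codes2 = s2.cat.codes.tolist()
```

Both Series map ``"a"`` to the same code because the mapping is fixed by ``shared``.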

Comparing the codes of the categorical columns in the two DataFrames shows that each value maps to the same code in both.

Note: Column ``B`` in ``df2`` yields one extra code because it contains "Z", whereas column ``B`` in ``df1`` does not.

.. ipython:: python

   df1[df1.B.isin(['A', 'a', 'z', 'Z'])].B.cat.codes.unique()

.. ipython:: python

   df2[df2.B.isin(['A', 'a', 'z', 'Z'])].B.cat.codes.unique()
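
The codes agree because the new categories were appended after the existing ones, so the positions of the original categories did not change. The sketch below illustrates this on made-up data. It also uses ``Series.cat.set_categories``, which is not part of the recipe, to give the first Series the same dtype as the second.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Categories inferred from the data: ['A', 'a'].
s1 = pd.Series(["a", "A"]).astype("category")
codes_before = s1.cat.codes.tolist()

# Append the missing category at the end, as the recipe does with cats.
cats = s1.cat.categories.tolist() + ["Z"]
s1_aligned = s1.cat.set_categories(cats)
s2 = pd.Series(["Z", "a"]).astype(CategoricalDtype(categories=cats))

# Existing codes are unchanged, and both Series now share one dtype.
codes_after = s1_aligned.cat.codes.tolist()
same_dtype = s1_aligned.dtype == s2.dtype
```

Sharing a dtype is optional for comparing codes, but it may help with operations that expect identical categories on both sides.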

The categorical column uses much less memory than the object column.

.. ipython:: python

   df2.dtypes

.. ipython:: python

   df2.A.to_frame().memory_usage()

.. ipython:: python

   df2.B.to_frame().memory_usage()
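
By default ``memory_usage`` counts only the pointers stored in an object column, not the Python strings they refer to. The following sketch, on a smaller made-up Series, passes ``deep=True`` to include the strings as well. The size of 100,000 rows and the seed are arbitrary choices.

```python
import string

import numpy as np
import pandas as pd

letters = np.array(list(string.ascii_letters))
rng = np.random.RandomState(0)
drawn = letters.take(rng.randint(0, len(letters), size=100_000))

as_object = pd.Series(drawn, dtype=object)
as_category = as_object.astype("category")

# deep=True also counts the memory held by the string objects.
object_bytes = as_object.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)
```

With only 52 categories the codes fit in ``int8``, so the categorical Series needs about one byte per row plus the small category index.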
