Doc: Adds example of categorical data for efficient storage and consistency across DataFrames

Resolves pandas-dev#12509

pdpark · root · commit 360e8a1fc859 · 2018-02-16T19:42:48.000-08:00
diff --git a/doc/source/cookbook.rst b/doc/source/cookbook.rst
@@ -1318,3 +1318,112 @@ of the data values:
        'weight': [100, 140, 180],
        'sex': ['Male', 'Female']})
    df
+
+Categorical Data
+----------------
+
+`Using categorical data type to store data more efficiently and consistently in multiple DataFrames
+<https://stackoverflow.com/questions/29709918/pandas-and-category-replacement/29712287#29712287>`__
+
+`More information about categorical data <http://pandas.pydata.org/pandas-docs/stable/categorical.html>`__
+
+.. ipython:: python
+
+   import pandas as pd
+   import string
+   import numpy as np
+   from pandas.api.types import CategoricalDtype
+
+Create a NumPy array of all the ASCII letters.
+
+.. ipython:: python
+
+   np.random.seed(1234)
+   pd.set_option('display.max_rows', 10)
+   uniques = np.array(list(string.ascii_letters))
+   uniques
+
+Create a DataFrame of one column from a subset of the unique values.
+
+.. ipython:: python
+
+   df1 = pd.DataFrame({'A': uniques.take(np.random.randint(0, len(uniques) // 2 + 5, size=1000000))})
+   df1.head()
+
+.. ipython:: python
+
+   df1.A.nunique()
+
+Create a second single-column DataFrame, this time drawing from all of the unique values.
+
+.. ipython:: python
+
+   df2 = pd.DataFrame({'A': uniques.take(np.random.randint(0, len(uniques), size=1000000))})
+   df2.head()
+
+.. ipython:: python
+
+   df2.A.nunique()
+
+Create a second column in the first DataFrame from the values of the first column, using a categorical data type. The set of categories for the new column is derived from the data used to create it. Extract the categories for this column as an Index object.
+
+.. ipython:: python
+
+   df1['B'] = df1.A.astype('category')
+   i = df1.B.cat.categories
+   i
+
+Extract the unique set of categories as a second index from the column in the second DataFrame.
+
+.. ipython:: python
+
+   i2 = df2.A.astype('category').cat.categories
+   i2
+
+Use ``Index.symmetric_difference`` on the two indexes to get the categories that appear in only one of them, then append those values to the categories of ``df1.B``.
+
+`Index symmetric_difference <https://pandas.pydata.org/pandas-docs/version/0.21.0/generated/pandas.Index.symmetric_difference.html>`__
+
+.. ipython:: python
+
+   cats = i.tolist() + i.symmetric_difference(i2).tolist()
+   print(cats)
+
+We've now recovered the full set of unique ASCII letters.
+
+.. ipython:: python
+
+   (np.array(sorted(cats)) == sorted(uniques)).all()
+
+This list of categorical values can be used to create a ``CategoricalDtype``. A column cast to this type is categorical and uses exactly the categories specified in the ``CategoricalDtype``, rather than deriving its categories from the data in the column.
+
+.. ipython:: python
+
+   cat_type = CategoricalDtype(categories=cats)
+   df2['B'] = df2['A'].astype(cat_type)
+
+Comparing the codes of the categorical columns in the two DataFrames shows that each value maps to the same code in both.
+
+Note: Column B in ``df2`` yields one extra code because it contains "Z", whereas column B in ``df1`` does not.
+
+.. ipython:: python
+
+   df1[df1.B.isin(['A', 'a', 'z', 'Z'])].B.cat.codes.unique()
+
+.. ipython:: python
+
+   df2[df2.B.isin(['A', 'a', 'z', 'Z'])].B.cat.codes.unique()
+
+The categorical column uses far less memory than the object column, since it stores one small integer code per row rather than a reference to a Python string.
+
+.. ipython:: python
+
+   df2.dtypes
+
+.. ipython:: python
+
+   df2.A.to_frame().memory_usage()
+
+.. ipython:: python
+
+   df2.B.to_frame().memory_usage()
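
For readers who want to try the idea outside the docs build, the following is a condensed sketch of the technique the new section walks through: build one `CategoricalDtype` from the values seen in both DataFrames, cast both columns to it, and check that codes agree. It is not the code in the patch. The names `small_df`, `full_df`, `shared_cats`, and `code_of` are hypothetical, the frames are smaller (10,000 rows), and the category list is built with `Index.union` rather than the symmetric-difference construction, so the actual code values will differ from those in the cookbook output.

```python
import string

import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

rng = np.random.RandomState(1234)
letters = np.array(list(string.ascii_letters))

# small_df draws only from the first 31 letters; full_df draws from all 52.
small_df = pd.DataFrame({'A': letters.take(rng.randint(0, 31, size=10000))})
full_df = pd.DataFrame(
    {'A': letters.take(rng.randint(0, len(letters), size=10000))})

# Assumption: a sorted union of the observed values is an acceptable
# category order. Any fixed order works as long as both frames share it.
shared_cats = pd.Index(small_df['A'].unique()).union(
    pd.Index(full_df['A'].unique())).tolist()
shared_dtype = CategoricalDtype(categories=shared_cats)

small_df['B'] = small_df['A'].astype(shared_dtype)
full_df['B'] = full_df['A'].astype(shared_dtype)


def code_of(frame, letter):
    """Return the integer code used for `letter` in frame['B'], or None."""
    codes = frame.loc[frame['B'] == letter, 'B'].cat.codes.unique()
    return int(codes[0]) if len(codes) else None


# Each letter present in both frames maps to the same code in both.
same_codes = all(
    code_of(small_df, ch) == code_of(full_df, ch) for ch in ['A', 'a', 'z'])

# deep=True counts the string payloads, not just the per-row references.
object_bytes = full_df['A'].memory_usage(index=False, deep=True)
category_bytes = full_df['B'].memory_usage(index=False, deep=True)
```

Because both columns are cast to the same dtype object, the code for a given letter depends only on its position in `shared_cats`, not on which letters happen to occur in a particular frame.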