Doc: Different example using categorical data type for efficient storage

pdpark · pdpark · commit 428f9affad5b · 2018-02-02T22:25:32.000-08:00
Closes: pandas-dev#12509
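
The behaviour the new cookbook example relies on can be sketched on a small scale. This is an illustrative sketch, not part of the patch: the frames `left` and `right` and their values are hypothetical, and `pd.concat` stands in for `DataFrame.append`, which recent pandas releases no longer provide.

```python
import pandas as pd

# Two small single-column frames whose values overlap only partially
# (hypothetical data, chosen for illustration).
left = pd.DataFrame({"A": ["a", "b", "c", "a"]})
right = pd.DataFrame({"A": ["c", "d", "c", "d"]})

# Letting each column infer its own categories yields two different
# category sets, so the combined column does not stay categorical.
left_inferred = left.assign(A=left["A"].astype("category"))
right_inferred = right.assign(A=right["A"].astype("category"))
mismatched = pd.concat([left_inferred, right_inferred], ignore_index=True)

# Building both columns from one shared category set keeps the
# categorical dtype after concatenation.
shared = pd.Index(left["A"].unique()).union(pd.Index(right["A"].unique()))
left_shared = left.assign(A=pd.Categorical(left["A"], categories=shared))
right_shared = right.assign(A=pd.Categorical(right["A"], categories=shared))
matched = pd.concat([left_shared, right_shared], ignore_index=True)

print(mismatched["A"].dtype)
print(matched["A"].dtype)
```

Only the second concatenation should report a `category` dtype, which is the point the rewritten cookbook section makes with larger data.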
diff --git a/doc/source/cookbook.rst b/doc/source/cookbook.rst
@@ -1323,107 +1323,75 @@ Categorical Data
 ----------------
 
 `Using categorical data type to store data more efficiently and consistently in multiple DataFrames
-<https://stackoverflow.com/questions/29709918/pandas-and-category-replacement/29712287#29712287>`
 
-`More information about categorical data <http://pandas.pydata.org/pandas-docs/stable/categorical.html>`
+For more information, see the `categorical data documentation <http://pandas.pydata.org/pandas-docs/stable/categorical.html>`__.
 
-.. ipython:: python
-
-   import pandas as pd
-   import string
-   import numpy as np
-   from pandas.api.types import CategoricalDtype
-
-Create a numpy array of all the ascii letters
-
-.. ipython:: python
-
-   np.random.seed(1234)
-   pd.set_option('max_rows',10)
-   uniques = np.array(list(string.ascii_letters))
-   uniques
-
-Create a DataFrame of one column from a subset of the unique values.
-
-.. ipython:: python
-
-   df1 = pd.DataFrame({'A' : uniques.take(np.random.randint(0,len(uniques)/2+5,size=1000000))})
-   df1.head()
-
-.. ipython:: python
-
-   df1.A.nunique()
-
-Create a second DataFrame also of one column but utilizing all of the unique values this time.
-
-.. ipython:: python
-
-   df2 = pd.DataFrame({'A' : uniques.take(np.random.randint(0,len(uniques),size=1000000))})
-   df2.head()
+Create two DataFrames, each with a single column drawn from overlapping subsets of a common domain of values. Append one DataFrame to the other and examine the memory usage.
 
 .. ipython:: python
 
-   df2.A.nunique()
-
-Create a second column in the first DataFrame with the values of the first column using a Categorical data type. The unique set of category values for the new column are derived from the data used to create the column. Extract the unique set of categories for this column as an index object.
-
-.. ipython:: python
-
-   df1['B'] = df1.A.astype('category')
-   i = df1.B.cat.categories
-   i
-
-Extract the unique set of categories as a second index from the column in the second DataFrame.
+   from pandas.api.types import CategoricalDtype
 
-.. ipython:: python
+   domain = np.array(['A','B','C','D','E','F','G','H','I','J','a','b','c','d','e','f','g','h','i','j'])
 
-   i2 = df2.A.astype('category').cat.categories
-   i2
+   df1 = pd.DataFrame({'A': domain.take(np.random.randint(0, 12, size=100000))})
+   df2 = pd.DataFrame({'A': domain.take(np.random.randint(8, 20, size=100000))})
+   print('df1.A Data Type:', df1.A.dtype)
+   print('df2.A Data Type:', df2.A.dtype)
 
-Use the symmetric difference operator on the two indexes to get the unique set of categorical values not in both lists and add those values to the categories from the df1.B categories.
+   df3 = pd.concat([df1, df2])
+   print('df3.A Data Type:', df3.A.dtype)
+   print(df3.memory_usage())
 
-`Index symmetric_difference<https://pandas.pydata.org/pandas-docs/version/0.21.0/generated/pandas.Index.symmetric_difference.html>`
+Convert each column to the category data type to see whether memory usage improves.
 
 .. ipython:: python
 
-   cats = i.tolist() + (i ^ i2).tolist()
-   print(cats)
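+   dfc1 = df1.copy()  # copy so that df1 keeps its object column
+   dfc2 = df2.copy()  # copy so that df2 keeps its object column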
+   dfc1.A = dfc1.A.astype('category')
+   dfc2.A = dfc2.A.astype('category')
+   print(dfc1.memory_usage())
+   print(dfc2.memory_usage())
+   dfc3 = pd.concat([dfc1, dfc2])
+   print('Data Type:', dfc3.A.dtype)
+   print(dfc3.memory_usage())
 
-We've now recovered the original list of unique ascii letters.
+That seemed to work: the first two DataFrames used far less memory. That is, until the second DataFrame was appended to the first, at which point the column reverted to the object data type and took as much memory as before.
+The problem is that the two columns have different sets of categories, so they cannot be combined into a single categorical column.
 
 .. ipython:: python
 
-   (np.array(sorted(cats)) == sorted(uniques)).all()
+   print(dfc1.A.cat.categories)
+   print(dfc2.A.cat.categories)
 
-This unique list of categorical values can be used to create a CategoricalDtype. Columns created with this type will be of type Categorical and have the Categories of the CategoricalDtype specified rather than building a set of categories based on the data in the column.  
+Both original columns must share the same set of categories for the column in the final DataFrame to keep the category data type.
+First, get the union of the categories in the two columns.
 
 .. ipython:: python
 
-   cat_type = CategoricalDtype(categories=cats)
-   df2['B'] = df2['A'].astype(cat_type)
+   cats = df1.A.astype('category').cat.categories.union(df2.A.astype('category').cat.categories)
+   (cats == domain).all()
 
-Comparing the codes used for the categorical columns of the two DataFrames shows that the same codes are used for each. 
-
-Note: Column B in df2 has an extra value since it contains "Z" whereas column B in df1 does not.
+This confirms that the union of the two sets of categories is the same as the original domain.
+Second attempt: this time both columns are created with the same explicitly specified categories.
 
 .. ipython:: python
 
-   df1[df1.B.isin(['A','a','z','Z'])].B.cat.codes.unique()
-
-.. ipython:: python
+   dfc1 = df1.copy()  # start again from copies of the original frames
+   dfc2 = df2.copy()
+   dfc1.A = pd.Categorical(df1.A, categories=cats)
+   dfc2.A = pd.Categorical(df2.A, categories=cats)
+   print(list(enumerate(dfc1.A.cat.categories)) == list(enumerate(dfc2.A.cat.categories)))
+   dfc3 = pd.concat([dfc1, dfc2])
+   print('Data Type:', dfc3.A.dtype)
+   print(dfc3.memory_usage())
 
-   df2[df2.B.isin(['A','a','z','Z'])].B.cat.codes.unique()
-
-The memory usage of the categorical column is much more efficient than the object type.
+Much better! The resulting DataFrame uses far less memory now that its column keeps the category data type.
+Because both original categorical columns map the same categories to the same codes, the appended column keeps those categories.
 
 .. ipython:: python
 
-   df2.dtypes
-
-.. ipython:: python
-
-   df2.A.to_frame().memory_usage()
-
-.. ipython:: python
+   [(cat,code) for code,cat in enumerate(dfc3.A.cat.categories)]
 
-   df2.B.to_frame().memory_usage()
+Note: if you are tempted to substitute ``set(df1.A.unique())`` for ``df1.A.astype('category').cat.categories`` in the first step above, the latter is an order of magnitude faster.