Commit 5c2b355

root authored and committed
Doc: Updated example of using categorical data type to save on storage space
Resolves: pandas-dev#12509
1 parent bef45a5 commit 5c2b355

1 file changed: +46 -28 lines

doc/source/cookbook.rst

@@ -1322,9 +1322,7 @@ of the data values:
 Categorical Data
 ----------------
 
-`Using categorical data type to store data more efficiently and consistently in multiple DataFrames
-
-`For more information about categorical data see <http://pandas.pydata.org/pandas-docs/stable/categorical.html>`
+Using categorical data type to store data more efficiently and consistently in multiple DataFrames
 
 Create two DataFrames with one column each from a subset of the unique values with overlap. Append the dataframes and examine the memory usage.
 

@@ -1336,62 +1334,82 @@ Create two DataFrames with one column each from a subset of the unique values wi
 
    df1 = pd.DataFrame({'A': domain.take(np.random.randint(0,12,size=100000))})
    df2 = pd.DataFrame({'A': domain.take(np.random.randint(8,20,size=100000))})
-   print('df1.A Data Type:', df1.A.dtype)
-   print('df2.A Data Type:', df2.A.dtype)
 
    df3 = df1.append(df2)
-   print('df3.A Data Type:', df3.A.dtype)
-   print(df3.memory_usage())
+   df3.memory_usage()
+
+.. ipython:: python
 
-Try changing data types to categories to see if memory usage can be improved.
+   df1.A.dtype
+
+.. ipython:: python
+
+   df2.A.dtype
+
+.. ipython:: python
+
+   df3.A.dtype
+
+The data type of all three columns is object. Using Categorical columns should improve memory usage.
 
 .. ipython:: python
 
    dfc1 = df1
    dfc2 = df2
    dfc1.A = dfc1.A.astype('category')
    dfc2.A = dfc2.A.astype('category')
-   print(dfc1.memory_usage())
-   print(dfc2.memory_usage())
+   dfc1.memory_usage()
+
+.. ipython:: python
+
+   dfc2.memory_usage()
+
+.. ipython:: python
+
    dfc3 = dfc1.append(dfc2)
-   print('Data Type:', dfc3.A.dtype)
-   print(dfc3.memory_usage())
+   dfc3.memory_usage()
 
-That seemed to work, the first two Dataframes used a lot less memory. That is until the second Dataframe was appended to the first one, at which point we're back to a column with an object data type taking as much memory as before.
-The problem is that the categories of the two DataFrames' columns are incompatible.
+.. ipython:: python
+
+   dfc3.A.dtype
+
+The first two DataFrames used a lot less memory because their columns are Categorical, but the column in the third DataFrame has an object data type and uses as much memory as before. The appended category columns were converted into an object column because the columns' categories are incompatible:
 
 .. ipython:: python
 
-   print(dfc1.A.cat.categories)
-   print(dfc2.A.cat.categories)
+   dfc1.A.cat.categories
+
+.. ipython:: python
+
+   dfc2.A.cat.categories
+
+The columns to be appended must have the same set of categories for the data type of the final column to be Category. To fix this, take the union of the categories from the two columns.
 
 .. ipython:: python
 
-   cats = df1.A.astype('category').cat.categories | df2.A.astype('category').cat.categories
+   cats = df1.A.cat.categories | df2.A.cat.categories
    (cats == domain).all()
 
-This confirms that the union of the two sets of categories is the same as the original domain.
-Second try: this time the categories are specified for both DataFrame category columns.
+This confirms that the union of the two sets of categories is the same as the original domain used to create the two columns. Note: substituting ``set(df1.A.unique())`` for ``df1.A.cat.categories`` in the step above is an order of magnitude slower.
+
+
+This time the same categories are specified for both of the starting columns.
 
 .. ipython:: python
 
    dfc1 = df1
    dfc2 = df2
    dfc1.A = pd.Categorical(df1.A, categories=cats)
    dfc2.A = pd.Categorical(df2.A, categories=cats)
-   print(list(enumerate(dfc1.A.cat.categories)) == list(enumerate(dfc2.A.cat.categories)))
    dfc3 = dfc1.append(dfc2)
-   print('Data Type:', dfc3.A.dtype)
-   print(dfc3.memory_usage())
+   dfc3.memory_usage()
 
-Much better! The resulting DataFrame's memory usage is far smaller now that the data type of the final DataFrame's column is Category.
-In this case the mapping of Category indices with codes used by both of the original Categorical columns match, resulting in a final column after appending with the same Categories.
+The resulting DataFrame's memory usage is far smaller now that the data type of the final column is Category. It worked as expected this time because the code-to-category mapping was the same for both of the original columns, which carried over to the final column:
 
 .. ipython:: python
 
-   [(cat,code) for code,cat in enumerate(dfc3.A.cat.categories)]
+   [(code, cat) for code, cat in enumerate(dfc3.A.cat.categories)]
+
+`More information about categorical data <http://pandas.pydata.org/pandas-docs/stable/categorical.html>`__
+
 
-Note: in case you are tempted to substitute set(df1.A.unique()) for df1.A.astype('category').cat.categories in the first step above, the latter is an order of magnitude faster.
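The recipe in this commit relies on ``DataFrame.append``, which was deprecated in pandas 1.4 and removed in 2.0. A minimal sketch of the same idea for current pandas, using ``pd.concat`` together with ``pandas.api.types.union_categoricals`` to align the categories (the ``domain`` values here are made up for illustration, not part of the commit):

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals

# Two DataFrames whose 'A' columns draw from overlapping subsets of one domain.
domain = pd.Series([f"val{i}" for i in range(20)])
df1 = pd.DataFrame({"A": domain.take(np.random.randint(0, 12, size=100_000)).astype("category")})
df2 = pd.DataFrame({"A": domain.take(np.random.randint(8, 20, size=100_000)).astype("category")})

# Align both columns on the union of their categories before concatenating,
# so the result keeps the category dtype instead of falling back to object.
union = union_categoricals([df1["A"], df2["A"]])
df1["A"] = df1["A"].cat.set_categories(union.categories)
df2["A"] = df2["A"].cat.set_categories(union.categories)
df3 = pd.concat([df1, df2], ignore_index=True)

print(df3["A"].dtype)  # category
print(df3.memory_usage(deep=True))
```

``union_categoricals`` does the category alignment in one step, which replaces the manual ``cats = ... | ...`` union used in the cookbook text.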

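As a quick standalone check of the storage claim in the commit message, one can compare ``memory_usage(deep=True)`` for the same low-cardinality column stored as object and as category (a sketch with made-up data, not part of this commit):

```python
import numpy as np
import pandas as pd

# The same low-cardinality string column stored as object vs. as category.
values = pd.Series(np.random.choice(["red", "green", "blue"], size=100_000))
as_object = values.memory_usage(deep=True)
as_category = values.astype("category").memory_usage(deep=True)

# The category version stores small integer codes plus one copy of each label,
# so it should come out far smaller than 100,000 separate string objects.
print(as_object, as_category)
```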