Commit 5c2b355

root authored and committed
Doc: Updated example of using categorical data type to save on storage space
Resolves: pandas-dev#12509
1 parent bef45a5 commit 5c2b355

1 file changed: +46 -28 lines

doc/source/cookbook.rst

@@ -1322,9 +1322,7 @@ of the data values:
 Categorical Data
 ----------------
 
-`Using categorical data type to store data more efficiently and consistently in multiple DataFrames
-
-`For more information about categorical data see <http://pandas.pydata.org/pandas-docs/stable/categorical.html>`
+Using categorical data type to store data more efficiently and consistently in multiple DataFrames
 
 Create two DataFrames with one column each from a subset of the unique values with overlap. Append the dataframes and examine the memory usage.
 

@@ -1336,62 +1334,82 @@ Create two DataFrames with one column each from a subset of the unique values wi
 
    df1 = pd.DataFrame({'A': domain.take(np.random.randint(0,12,size=100000))})
    df2 = pd.DataFrame({'A': domain.take(np.random.randint(8,20,size=100000))})
-   print('df1.A Data Type:', df1.A.dtype)
-   print('df2.A Data Type:', df2.A.dtype)
 
    df3 = df1.append(df2)
-   print('df3.A Data Type:', df3.A.dtype)
-   print(df3.memory_usage())
+   df3.memory_usage()
+
+.. ipython:: python
 
-Try changing data types to categories to see if memory usage can be improved.
+   df1.A.dtype
+
+.. ipython:: python
+
+   df2.A.dtype
+
+.. ipython:: python
+
+   df3.A.dtype
+
+The data type of all three columns is object. Using Categorical columns should improve memory usage.
 
 .. ipython:: python
 
    dfc1 = df1
    dfc2 = df2
    dfc1.A = dfc1.A.astype('category')
    dfc2.A = dfc2.A.astype('category')
-   print(dfc1.memory_usage())
-   print(dfc2.memory_usage())
+   dfc1.memory_usage()
+
+.. ipython:: python
+
+   dfc2.memory_usage()
+
+.. ipython:: python
+
    dfc3 = dfc1.append(dfc2)
-   print('Data Type:', dfc3.A.dtype)
-   print(dfc3.memory_usage())
+   dfc3.memory_usage()
 
-That seemed to work, the first two Dataframes used a lot less memory. That is until the second Dataframe was appended to the first one, at which point we're back to a column with an object data type taking as much memory as before.
-The problem is that the categories of the two DataFrames' columns are incompatible.
+.. ipython:: python
+
+   dfc3.A.dtype
+
+The first two DataFrames used a lot less memory because their columns are Categorical, but the column in the third DataFrame has an object data type and uses as much memory as before. The appended category columns were converted into an object column because the columns' categories are incompatible:
 
 .. ipython:: python
 
-   print(dfc1.A.cat.categories)
-   print(dfc2.A.cat.categories)
+   dfc1.A.cat.categories
+
+.. ipython:: python
+
+   dfc2.A.cat.categories
+
+The columns to be appended must have the same set of categories for the data type of the final column to be Category. To fix this, take the union of the categories from the two columns.
 
 .. ipython:: python
 
-   cats = df1.A.astype('category').cat.categories | df2.A.astype('category').cat.categories
+   cats = df1.A.cat.categories | df2.A.cat.categories
    (cats == domain).all()
 
-This confirms that the union of the two sets of categories is the same as the original domain.
-Second try: this time the categories are specified for both DataFrame category columns.
+This confirms that the union of the two sets of categories is the same as the original domain used to create the two columns. Note: substituting ``set(df1.A.unique())`` for ``df1.A.cat.categories`` in the step above is an order of magnitude slower.
+
+
+This time the same categories are specified for both of the starting columns.
 
 .. ipython:: python
 
    dfc1 = df1
    dfc2 = df2
    dfc1.A = pd.Categorical(df1.A, categories=cats)
    dfc2.A = pd.Categorical(df2.A, categories=cats)
-   print(list(enumerate(dfc1.A.cat.categories)) == list(enumerate(dfc2.A.cat.categories)))
    dfc3 = dfc1.append(dfc2)
-   print('Data Type:', dfc3.A.dtype)
-   print(dfc3.memory_usage())
+   dfc3.memory_usage()
 
-Much better! The resulting DataFrame's memory usage is far smaller now that the data type of the final DataFrame's column is Category.
-In this case the mapping of Category indices with codes used by both of the original Categorical columns match, resulting in a final column after appending with the same Categories.
+The resulting DataFrame's memory usage is far smaller now that the data type of the final column is Category. It worked as expected this time because the code-to-category mapping was the same for both of the original columns, which carried over to the final column:
 
 .. ipython:: python
 
-   [(cat,code) for code,cat in enumerate(dfc3.A.cat.categories)]
+   [(code, cat) for code, cat in enumerate(dfc3.A.cat.categories)]
+
+`More information about categorical data <http://pandas.pydata.org/pandas-docs/stable/categorical.html>`__
+
 
-Note: in case you are tempted to substitute set(df1.A.unique()) for df1.A.astype('category').cat.categories in the first step above, the latter is an order of magnitude faster.
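The recipe in this commit relies on ``DataFrame.append``, which was deprecated in pandas 1.4 and removed in 2.0. A minimal sketch of the same idea for current pandas, using ``pd.concat`` together with ``pandas.api.types.union_categoricals`` to align the categories (the ``domain`` values here are made up for illustration, not part of the commit):

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals

# Two DataFrames whose 'A' columns draw from overlapping subsets of one domain.
domain = pd.Series([f"val{i}" for i in range(20)])
df1 = pd.DataFrame({"A": domain.take(np.random.randint(0, 12, size=100_000)).astype("category")})
df2 = pd.DataFrame({"A": domain.take(np.random.randint(8, 20, size=100_000)).astype("category")})

# Align both columns on the union of their categories before concatenating,
# so the result keeps the category dtype instead of falling back to object.
union = union_categoricals([df1["A"], df2["A"]])
df1["A"] = df1["A"].cat.set_categories(union.categories)
df2["A"] = df2["A"].cat.set_categories(union.categories)
df3 = pd.concat([df1, df2], ignore_index=True)

print(df3["A"].dtype)  # category
print(df3.memory_usage(deep=True))
```

``union_categoricals`` does the category alignment in one step, which replaces the manual ``cats = ... | ...`` union used in the cookbook text.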

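As a quick standalone check of the storage claim in the commit message, one can compare ``memory_usage(deep=True)`` for the same low-cardinality column stored as object and as category (a sketch with made-up data, not part of this commit):

```python
import numpy as np
import pandas as pd

# The same low-cardinality string column stored as object vs. as category.
values = pd.Series(np.random.choice(["red", "green", "blue"], size=100_000))
as_object = values.memory_usage(deep=True)
as_category = values.astype("category").memory_usage(deep=True)

# The category version stores small integer codes plus one copy of each label,
# so it should come out far smaller than 100,000 separate string objects.
print(as_object, as_category)
```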