Create two DataFrames with one column each from a subset of the unique values with overlap. Append the dataframes and examine the memory usage.
 .. ipython:: python
 
-   df2.A.nunique()
-
-Create a second column in the first DataFrame with the values of the first column using a Categorical data type. The unique set of category values for the new column are derived from the data used to create the column. Extract the unique set of categories for this column as an index object.
-
-.. ipython:: python
-
-   df1['B'] = df1.A.astype('category')
-   i = df1.B.cat.categories
-   i
-
-Extract the unique set of categories as a second index from the column in the second DataFrame.
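The steps in this hunk can be sketched roughly as follows. The letter subsets, the row count `n`, and the variable names `subset1`, `subset2`, `obj_bytes`, and `cat_bytes` are illustrative assumptions, not the data used in the original document.

```python
import string

import pandas as pd

# Hypothetical data: two overlapping subsets of the lowercase ASCII letters.
letters = list(string.ascii_lowercase)
subset1, subset2 = letters[:18], letters[8:]  # 'i' through 'r' appear in both

n = 10_000
df1 = pd.DataFrame(
    {"A": pd.Series([subset1[k % len(subset1)] for k in range(n)], dtype=object)}
)
df2 = pd.DataFrame(
    {"A": pd.Series([subset2[k % len(subset2)] for k in range(n)], dtype=object)}
)

# The categories are derived from the values present in each column.
i = df1["A"].astype("category").cat.categories
i2 = df2["A"].astype("category").cat.categories

# Shallow memory comparison: one small integer code per row for the
# categorical column versus one object pointer per row for the object column.
obj_bytes = df1["A"].memory_usage(index=False)
cat_bytes = df1["A"].astype("category").memory_usage(index=False)
print(obj_bytes, cat_bytes)
```

Because each column only sees its own subset of the letters, `i` and `i2` overlap but are not equal.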
Use the symmetric difference operator on the two indexes to get the unique set of categorical values not in both lists and add those values to the categories from the df1.B categories.
Try changing data types to categories to see if memory usage can be improved.
 .. ipython:: python
 
-   cats = i.tolist() + (i ^ i2).tolist()
-   print(cats)
+   dfc1 = df1
+   dfc2 = df2
+   dfc1.A = dfc1.A.astype('category')
+   dfc2.A = dfc2.A.astype('category')
+   print(dfc1.memory_usage())
+   print(dfc2.memory_usage())
+   dfc3 = dfc1.append(dfc2)
+   print('Data Type:', dfc3.A.dtype)
+   print(dfc3.memory_usage())
 
-We've now recovered the original list of unique ascii letters.
+That seemed to work: the first two DataFrames used a lot less memory. That is, until the second DataFrame was appended to the first one, at which point we're back to a column with an object data type taking as much memory as before.
+The problem is that the categories of the two DataFrames' columns are incompatible.
 
 .. ipython:: python
 
-   (np.array(sorted(cats)) == sorted(uniques)).all()
+   print(dfc1.A.cat.categories)
+   print(dfc2.A.cat.categories)
 
-This unique list of categorical values can be used to create a CategoricalDtype. Columns created with this type will be of type Categorical and have the Categories of the CategoricalDtype specified rather than building a set of categories based on the data in the column.
+The original DataFrame columns must share the same set of categories for the final DataFrame's column to keep the Category data type.
+First, get the union of the categories in the two columns.
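A minimal sketch of the problem and of the union step, using small hypothetical stand-ins for `dfc1` and `dfc2`. It uses `pd.concat` rather than `DataFrame.append`, since `append` was removed in pandas 2.0. The names `mixed`, `all_cats`, and `shared` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical stand-ins for dfc1 and dfc2: the categories overlap but differ.
dfc1 = pd.DataFrame({"A": pd.Categorical(["a", "b", "c", "b"])})
dfc2 = pd.DataFrame({"A": pd.Categorical(["b", "c", "d", "d"])})

# With mismatched categories, the combined column is no longer categorical.
mixed = pd.concat([dfc1, dfc2], ignore_index=True)
print(mixed["A"].dtype)

# Union of the two category indexes, wrapped in a dtype both columns can share.
all_cats = dfc1["A"].cat.categories.union(dfc2["A"].cat.categories)
shared = pd.CategoricalDtype(categories=all_cats)
print(shared)
```

Building one `CategoricalDtype` from the union is only one option; calling `cat.set_categories` on each column with the same list should have a similar effect.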
The memory usage of the categorical column is much more efficient than the object type.
+Much better! The resulting DataFrame's memory usage is far smaller now that the data type of the final DataFrame's column is Category.
+In this case the mapping of categories to codes is the same in both of the original Categorical columns, so the column that results from appending keeps the same categories.
 
 .. ipython:: python
 
-   df2.dtypes
-
-.. ipython:: python
-
-   df2.A.to_frame().memory_usage()
-
-.. ipython:: python
+   [(cat, code) for code, cat in enumerate(dfc3.A.cat.categories)]
 
-   df2.B.to_frame().memory_usage()
+Note: in case you are tempted to substitute set(df1.A.unique()) for df1.A.astype('category').cat.categories in the first step above, the latter is an order of magnitude faster.
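To illustrate the outcome described in this hunk, here is a small sketch in which both columns are built with the same `CategoricalDtype`. The data and the names `shared`, `pairs`, and `codes` are hypothetical, and `pd.concat` stands in for the `append` call used in the diff.

```python
import pandas as pd

# Hypothetical shared dtype covering every value in both columns.
shared = pd.CategoricalDtype(categories=["a", "b", "c", "d"])

dfc1 = pd.DataFrame({"A": pd.Series(["a", "b", "c", "b"], dtype=shared)})
dfc2 = pd.DataFrame({"A": pd.Series(["b", "c", "d", "d"], dtype=shared)})

# Both columns use the identical dtype, so the result stays categorical.
dfc3 = pd.concat([dfc1, dfc2], ignore_index=True)
print(dfc3["A"].dtype)

# Each code is the position of its value in the shared category list.
pairs = [(cat, code) for code, cat in enumerate(dfc3["A"].cat.categories)]
codes = dfc3["A"].cat.codes.tolist()
print(pairs)
print(codes)
```

Since the codes mean the same thing in both inputs, the combined column can keep its compact integer representation.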