Try changing data types to categories to see if memory usage can be improved.

.. ipython:: python

   df1.A.dtype

.. ipython:: python

   df2.A.dtype

.. ipython:: python

   df3.A.dtype
The data type of all the columns is Object. Using Categorical columns should improve memory usage.

.. ipython:: python

   dfc1 = df1
   dfc2 = df2
   dfc1.A = dfc1.A.astype('category')
   dfc2.A = dfc2.A.astype('category')
   dfc1.memory_usage()

.. ipython:: python

   dfc2.memory_usage()

.. ipython:: python

   dfc3 = dfc1.append(dfc2)
   dfc3.memory_usage()

.. ipython:: python

   dfc3.A.dtype

The first two DataFrames used far less memory because their columns are Categorical, but the column in the third DataFrame has an Object data type and uses as much memory as before. The appended category columns were converted into an object column because the columns' categories are incompatible:

.. ipython:: python

   dfc1.A.cat.categories

.. ipython:: python

   dfc2.A.cat.categories

The columns to be merged must have the same set of categories in order for the data type of the final column to be Category. To fix this, first get the union of the categories from the two columns.
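
The union computation itself is not shown in this excerpt; as a rough sketch of one way to do it, using hypothetical sample data for ``df1`` and ``df2``:

```python
import pandas as pd

# Hypothetical sample data standing in for df1 and df2 from the text.
df1 = pd.DataFrame({"A": ["a", "b", "a", "b"]})
df2 = pd.DataFrame({"A": ["b", "c", "b", "c"]})

dfc1 = df1.copy()
dfc2 = df2.copy()
dfc1.A = dfc1.A.astype("category")
dfc2.A = dfc2.A.astype("category")

# Union of the two columns' category Indexes.
uniques = dfc1.A.cat.categories.union(dfc2.A.cat.categories)
print(list(uniques))  # ['a', 'b', 'c']
```

``Index.union`` returns a sorted Index of the distinct categories from both columns.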

This confirms that the union of the two sets of categories is the same as the original domain used to create the two columns. Note: substituting ``set(df1.A.unique())`` for ``df1.A.cat.categories`` in the step above is an order of magnitude slower.

This time the same categories are specified for both of the starting columns.
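
The second-try code falls outside this excerpt; a minimal sketch of the idea, assuming hypothetical sample data and an explicit category list standing in for the union computed above (``pd.concat`` is used here as the modern spelling of ``DataFrame.append``):

```python
import pandas as pd

# Hypothetical sample data; the explicit category list stands in for
# the union of categories computed in the previous step.
cat_type = pd.CategoricalDtype(categories=["a", "b", "c"])
df1 = pd.DataFrame({"A": ["a", "b", "a", "b"]})
df2 = pd.DataFrame({"A": ["b", "c", "b", "c"]})

# Both starting columns get the same set of categories.
dfc1 = df1.astype({"A": cat_type})
dfc2 = df2.astype({"A": cat_type})

# Equivalent to dfc1.append(dfc2); the result stays Categorical.
dfc3 = pd.concat([dfc1, dfc2])
print(dfc3.A.dtype)  # category
```

Because the two columns share one ``CategoricalDtype``, concatenation preserves it instead of falling back to object.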

Much better! The resulting DataFrame's memory usage is far smaller now that the data type of the final column is Category. It worked as expected this time because the code --> category mapping was the same for both of the original columns, which carried over to the final column:

.. ipython:: python

   [(code, cat) for code, cat in enumerate(dfc3.A.cat.categories)]
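
The same mapping is what ``.cat.codes`` stores per row; a small self-contained sketch with hypothetical data:

```python
import pandas as pd

# Minimal sketch (sample data is hypothetical): the code -> category
# mapping is exactly what .cat.codes records for each row.
s = pd.Series(["a", "b", "c", "a"],
              dtype=pd.CategoricalDtype(["a", "b", "c"]))
mapping = [(code, cat) for code, cat in enumerate(s.cat.categories)]
print(mapping)            # [(0, 'a'), (1, 'b'), (2, 'c')]
print(list(s.cat.codes))  # [0, 1, 2, 0]
```

The small integer codes, not the category values themselves, are what make Categorical columns memory-efficient.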

`More information about categorical data <http://pandas.pydata.org/pandas-docs/stable/categorical.html>`__