Commit d1c0271

Merge pull request #7619 from asobrien/df-mem-info
ENH: dataframe memory usage
2 parents 5a3af87 + ab97e0a commit d1c0271

7 files changed, +188 -7 lines changed

doc/source/faq.rst

+75
@@ -24,6 +24,81 @@ Frequently Asked Questions (FAQ)
2424
options.display.mpl_style='default'
2525
from pandas.compat import lrange
2626
27+
28+
.. _df-memory-usage:
29+
30+
DataFrame memory usage
31+
~~~~~~~~~~~~~~~~~~~~~~
32+
As of pandas version 0.15.0, the memory usage of a dataframe (including
33+
the index) is shown when calling the ``info`` method of a dataframe. A
34+
configuration option, ``display.memory_usage`` (see :ref:`options`),
35+
specifies whether the dataframe's memory usage is displayed when
36+
invoking the ``df.info()`` method.
37+
38+
For example, the memory usage of the dataframe below is shown
39+
when calling df.info():
40+
41+
.. ipython:: python
42+
43+
dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]',
44+
'complex128', 'object', 'bool']
45+
n = 5000
46+
data = dict([ (t, np.random.randint(100, size=n).astype(t))
47+
for t in dtypes])
48+
df = DataFrame(data)
49+
50+
df.info()
51+
52+
By default the display option is set to ``True``, but it can be explicitly
53+
overridden by passing the ``memory_usage`` argument when invoking ``df.info()``.
54+
Note that ``memory_usage=None`` is the default value for the ``df.info()``
55+
method and follows the setting specified by ``display.memory_usage``.
56+
57+
The memory usage of each column can be found by calling the ``memory_usage``
58+
method. This returns a Series indexed by column name, with the memory
59+
usage of each column given in bytes. For the dataframe above,
60+
the memory usage of each column and the total memory usage of the
61+
dataframe can be found with the ``memory_usage`` method:
62+
63+
.. ipython:: python
64+
65+
df.memory_usage()
66+
67+
# total memory usage of dataframe
68+
df.memory_usage().sum()
69+
70+
By default the memory usage of the dataframe's index is not shown in the
71+
returned Series; the memory usage of the index can be shown by passing
72+
the ``index=True`` argument:
73+
74+
.. ipython:: python
75+
76+
df.memory_usage(index=True)
77+
78+
The memory usage displayed by the ``info`` method uses the
79+
``memory_usage`` method to determine the memory usage of a dataframe
80+
and formats the output in human-readable units (base-2
81+
representation; i.e., 1 KB = 1024 bytes).
82+
83+
Pandas version 0.15.0 introduces a new categorical data type (see
84+
:ref:`categorical`), which can be used in Series and DataFrames.
85+
Significant memory savings can be achieved when using the category
86+
datatype. This is demonstrated below:
87+
88+
.. ipython:: python
89+
90+
df['bases_object'] = Series(np.array(['adenine', 'cytosine', 'guanine', 'thymine']).take(np.random.randint(0,4,size=len(df))))
91+
92+
df['bases_categorical'] = df['bases_object'].astype('category')
93+
94+
df.memory_usage()
95+
96+
While *bases_object* and *bases_categorical* appear as identical
97+
columns in the dataframe, the memory savings of the categorical
98+
datatype, versus the object datatype, are revealed by ``memory_usage``.
99+
100+
101+
27102
.. _ref-monkey-patching:
28103

29104
Adding Features to your pandas Installation
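
The object-versus-category comparison in the FAQ text above can be reproduced roughly as follows. This is a minimal sketch, assuming a current pandas and NumPy installation rather than the 0.15.0 development tree; the seeded generator `rng` and the variable names `usage`, `object_bytes`, and `categorical_bytes` are illustrative and not part of the commit. `index=False` is passed explicitly so the result does not rely on the method's default.

```python
import numpy as np
import pandas as pd

n = 5000
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
bases = np.array(['adenine', 'cytosine', 'guanine', 'thymine'], dtype=object)

# Build an object-dtype column of repeated strings, then a categorical copy.
df = pd.DataFrame({
    'bases_object': pd.Series(bases[rng.integers(0, 4, size=n)], dtype=object)
})
df['bases_categorical'] = df['bases_object'].astype('category')

# Per-column usage in bytes, without the index entry.
usage = df.memory_usage(index=False)
object_bytes = int(usage['bases_object'])
categorical_bytes = int(usage['bases_categorical'])

print(usage)
print("categorical column is smaller:", categorical_bytes < object_bytes)
```

The exact byte counts depend on the platform and pandas version, so the sketch only compares the two columns instead of quoting figures.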

doc/source/options.rst

+3
@@ -348,6 +348,9 @@ display.max_seq_items 100 when pretty-printing a long sequence,
348348
of "..." to the resulting string.
349349
If set to None, the number of items
350350
to be printed is unlimited.
351+
display.memory_usage True This specifies if the memory usage of
352+
a DataFrame should be displayed when the
353+
df.info() method is invoked.
351354
display.mpl_style None Setting this to 'default' will modify
352355
the rcParams used by matplotlib
353356
to give plots a more pleasing visual
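
The interaction between the ``display.memory_usage`` option and the ``memory_usage`` argument of ``df.info()`` can be checked with a short script. This is a sketch under the assumption that a current pandas release still exposes the option under the same name; the helper `info_text` is hypothetical and exists only to capture the printed summary.

```python
import io

import pandas as pd

df = pd.DataFrame({'float': [0.1, 0.2, 0.3], 'int': [1, 2, 3]})


def info_text(frame, **kwargs):
    """Return the text that frame.info() would otherwise print (hypothetical helper)."""
    buf = io.StringIO()
    frame.info(buf=buf, **kwargs)
    return buf.getvalue()


with pd.option_context('display.memory_usage', False):
    follows_option = info_text(df)                 # memory_usage=None follows the option
    overridden = info_text(df, memory_usage=True)  # explicit argument overrides it

with pd.option_context('display.memory_usage', True):
    shown_by_default = info_text(df)
    suppressed = info_text(df, memory_usage=False)

print(overridden)
```

Using `option_context` keeps the change local to the `with` block, so the global setting is left as it was.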

doc/source/v0.15.0.txt

+10
@@ -259,6 +259,16 @@ API changes
259259

260260
- ``DataFrame.plot`` and ``Series.plot`` keywords are now have consistent orders (:issue:`8037`)
261261

262+
- Implements methods to find the memory usage of a DataFrame (:issue:`6852`). A new display option, ``display.memory_usage`` (see :ref:`options`), sets the default behavior of the ``memory_usage`` argument of the ``df.info()`` method. By default ``display.memory_usage`` is ``True``, but this can be overridden by explicitly passing the ``memory_usage`` argument to ``df.info()``, as shown below. Additionally, ``memory_usage`` is a method of a dataframe object that returns the memory usage of each column (for more information see :ref:`df-memory-usage`):
263+
264+
.. ipython:: python
265+
266+
df = DataFrame({ 'float' : np.random.randn(1000), 'int' : np.random.randint(0,5,size=1000)})
267+
df.memory_usage()
268+
269+
df.info(memory_usage=True)
270+
271+
262272
.. _whatsnew_0150.dt:
263273

264274
.dt accessor

pandas/core/config_init.py

+8
@@ -203,6 +203,12 @@
203203
Setting this to None/False restores the values to their initial value.
204204
"""
205205

206+
pc_memory_usage_doc = """
207+
: bool or None
208+
This specifies if the memory usage of a DataFrame should be displayed when
209+
df.info() is called.
210+
"""
211+
206212
style_backup = dict()
207213

208214

@@ -274,6 +280,8 @@ def mpl_style_cb(key):
274280
# redirected to width, make defval identical
275281
cf.register_option('line_width', get_default_val('display.width'),
276282
pc_line_width_doc)
283+
cf.register_option('memory_usage', True, pc_memory_usage_doc,
284+
validator=is_instance_factory([type(None), bool]))
277285

278286
cf.deprecate_option('display.line_width',
279287
msg=pc_line_width_deprecation_warning,
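
The option registered above accepts only ``True``, ``False``, or ``None``. The idea behind such a type-based validator can be sketched in plain Python. `make_instance_validator` below is a hypothetical stand-in written for illustration; it is not pandas' `is_instance_factory` and may differ from it in error type and message.

```python
def make_instance_validator(allowed_types):
    """Build a validator that accepts only instances of the given types.

    Hypothetical illustration; not the pandas implementation.
    """
    allowed = tuple(allowed_types)

    def validate(value):
        if not isinstance(value, allowed):
            names = ", ".join(t.__name__ for t in allowed)
            raise ValueError("Value must be an instance of: %s" % names)

    return validate


validate_memory_usage = make_instance_validator([type(None), bool])

# The three legal settings pass silently.
for ok in (True, False, None):
    validate_memory_usage(ok)

# Anything else is rejected, including the integer 1 (an int is not a bool).
rejected = []
for bad in ("yes", 1, 0.5):
    try:
        validate_memory_usage(bad)
    except ValueError:
        rejected.append(bad)

print(rejected)
```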

pandas/core/frame.py

+53-2
@@ -1390,7 +1390,7 @@ def to_latex(self, buf=None, columns=None, col_space=None, colSpace=None,
13901390
if buf is None:
13911391
return formatter.buf.getvalue()
13921392

1393-
def info(self, verbose=None, buf=None, max_cols=None):
1393+
def info(self, verbose=None, buf=None, max_cols=None, memory_usage=None):
13941394
"""
13951395
Concise summary of a DataFrame.
13961396
@@ -1404,6 +1404,12 @@ def info(self, verbose=None, buf=None, max_cols=None):
14041404
max_cols : int, default None
14051405
Determines whether full summary or short summary is printed.
14061406
None follows the `display.max_info_columns` setting.
1407+
memory_usage : boolean, default None
1408+
Specifies whether total memory usage of the DataFrame
1409+
elements (including index) should be displayed. None follows
1410+
the `display.memory_usage` setting. True or False overrides
1411+
the `display.memory_usage` setting. Memory usage is shown in
1412+
human-readable units (base-2 representation).
14071413
"""
14081414
from pandas.core.format import _put_lines
14091415

@@ -1462,6 +1468,14 @@ def _verbose_repr():
14621468
def _non_verbose_repr():
14631469
lines.append(self.columns.summary(name='Columns'))
14641470

1471+
def _sizeof_fmt(num):
1472+
# returns size in human readable format
1473+
for x in ['bytes', 'KB', 'MB', 'GB', 'TB']:
1474+
if num < 1024.0:
1475+
return "%3.1f %s" % (num, x)
1476+
num /= 1024.0
1477+
return "%3.1f %s" % (num, 'PB')
1478+
14651479
if verbose:
14661480
_verbose_repr()
14671481
elif verbose is False: # specifically set to False, not nesc None
@@ -1474,9 +1488,46 @@ def _non_verbose_repr():
14741488

14751489
counts = self.get_dtype_counts()
14761490
dtypes = ['%s(%d)' % k for k in sorted(compat.iteritems(counts))]
1477-
lines.append('dtypes: %s\n' % ', '.join(dtypes))
1491+
lines.append('dtypes: %s' % ', '.join(dtypes))
1492+
if memory_usage is None:
1493+
memory_usage = get_option('display.memory_usage')
1494+
if memory_usage: # append memory usage of df to display
1495+
lines.append("memory usage: %s\n" %
1496+
_sizeof_fmt(self.memory_usage(index=True).sum()))
14781497
_put_lines(buf, lines)
14791498

1499+
def memory_usage(self, index=False):
1500+
"""Memory usage of DataFrame columns.
1501+
1502+
Parameters
1503+
----------
1504+
index : bool
1505+
Specifies whether to include memory usage of DataFrame's
1506+
index in returned Series. If `index=True` (default is False)
1507+
the first entry of the Series is labeled `Index`.
1508+
1509+
Returns
1510+
-------
1511+
sizes : Series
1512+
A series with column names as index and memory usage of
1513+
columns with units of bytes.
1514+
1515+
Notes
1516+
-----
1517+
Memory usage does not include memory consumed by elements that
1518+
are not components of the array.
1519+
1520+
See Also
1521+
--------
1522+
numpy.ndarray.nbytes
1523+
"""
1524+
result = Series([ c.values.nbytes for col, c in self.iteritems() ],
1525+
index=self.columns)
1526+
if index:
1527+
result = Series(self.index.values.nbytes,
1528+
index=['Index']).append(result)
1529+
return result
1530+
14801531
def transpose(self):
14811532
"""Transpose index and columns"""
14821533
return super(DataFrame, self).transpose(1, 0)
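
The base-2 formatting done by the nested `_sizeof_fmt` helper in the `info` hunk above can be tried in isolation. The sketch below copies its logic into a standalone function named `sizeof_fmt` (the name and the sample inputs are chosen here for illustration).

```python
def sizeof_fmt(num):
    """Format a byte count in base-2 units (1 KB = 1024 bytes)."""
    for unit in ['bytes', 'KB', 'MB', 'GB', 'TB']:
        if num < 1024.0:
            return "%3.1f %s" % (num, unit)
        num /= 1024.0
    # Anything that is still >= 1024 TB is reported in PB.
    return "%3.1f %s" % (num, 'PB')


examples = [sizeof_fmt(n) for n in (512, 1024, 280000, 1024 ** 5)]
print(examples)
# ['512.0 bytes', '1.0 KB', '273.4 KB', '1.0 PB']
```

Note that a value of exactly 1024 rolls over to the next unit, because the comparison is a strict `<`.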

pandas/tests/test_format.py

+1-1
@@ -43,7 +43,7 @@ def has_info_repr(df):
4343
def has_non_verbose_info_repr(df):
4444
has_info = has_info_repr(df)
4545
r = repr(df)
46-
nv = len(r.split('\n')) == 5 # 1. <class>, 2. Index, 3. Columns, 4. dtype, 5. trailing newline
46+
nv = len(r.split('\n')) == 6 # 1. <class>, 2. Index, 3. Columns, 4. dtype, 5. memory usage, 6. trailing newline
4747
return has_info and nv
4848

4949
def has_horizontally_truncated_repr(df):
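
The change from 5 to 6 in `has_non_verbose_info_repr` follows from how `str.split` treats a trailing newline. The snippet below illustrates this with a made-up five-line summary; the `sample` text is illustrative and is not real pandas output.

```python
# Five summary lines followed by a trailing newline (illustrative text only).
sample = "\n".join([
    "<class 'pandas.core.frame.DataFrame'>",
    "Index: ...",
    "Columns: ...",
    "dtypes: ...",
    "memory usage: ...",
]) + "\n"

# Splitting on '\n' yields an extra empty string after the final newline.
with_trailing = sample.split("\n")

# Stripping first removes that empty element, which is the approach
# the updated assertions in test_frame.py take.
stripped = sample.strip().split("\n")

print(len(with_trailing), len(stripped))
# 6 5
```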

pandas/tests/test_frame.py

+38-4
@@ -6553,7 +6553,7 @@ def test_info_max_cols(self):
65536553
buf = StringIO()
65546554
df.info(buf=buf, verbose=verbose)
65556555
res = buf.getvalue()
6556-
self.assertEqual(len(res.split('\n')), len_)
6556+
self.assertEqual(len(res.strip().split('\n')), len_)
65576557

65586558
for len_, verbose in [(10, None), (5, False), (10, True)]:
65596559

@@ -6562,23 +6562,57 @@ def test_info_max_cols(self):
65626562
buf = StringIO()
65636563
df.info(buf=buf, verbose=verbose)
65646564
res = buf.getvalue()
6565-
self.assertEqual(len(res.split('\n')), len_)
6565+
self.assertEqual(len(res.strip().split('\n')), len_)
65666566

65676567
for len_, max_cols in [(10, 5), (5, 4)]:
65686568
# setting truncates
65696569
with option_context('max_info_columns', 4):
65706570
buf = StringIO()
65716571
df.info(buf=buf, max_cols=max_cols)
65726572
res = buf.getvalue()
6573-
self.assertEqual(len(res.split('\n')), len_)
6573+
self.assertEqual(len(res.strip().split('\n')), len_)
65746574

65756575
# setting wouldn't truncate
65766576
with option_context('max_info_columns', 5):
65776577
buf = StringIO()
65786578
df.info(buf=buf, max_cols=max_cols)
65796579
res = buf.getvalue()
6580-
self.assertEqual(len(res.split('\n')), len_)
6580+
self.assertEqual(len(res.strip().split('\n')), len_)
65816581

6582+
def test_info_memory_usage(self):
6583+
# Ensure memory usage is displayed, when asserted, on the last line
6584+
dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]',
6585+
'complex128', 'object', 'bool']
6586+
data = {}
6587+
n = 10
6588+
for i, dtype in enumerate(dtypes):
6589+
data[i] = np.random.randint(2, size=n).astype(dtype)
6590+
df = DataFrame(data)
6591+
buf = StringIO()
6592+
# display memory usage case
6593+
df.info(buf=buf, memory_usage=True)
6594+
res = buf.getvalue().splitlines()
6595+
self.assertTrue("memory usage: " in res[-1])
6596+
# do not display memory usage case
6597+
df.info(buf=buf, memory_usage=False)
6598+
res = buf.getvalue().splitlines()
6599+
self.assertTrue("memory usage: " not in res[-1])
6600+
6601+
# Test a DataFrame with duplicate columns
6602+
dtypes = ['int64', 'int64', 'int64', 'float64']
6603+
data = {}
6604+
n = 100
6605+
for i, dtype in enumerate(dtypes):
6606+
data[i] = np.random.randint(2, size=n).astype(dtype)
6607+
df = DataFrame(data)
6608+
df.columns = dtypes
6609+
# Ensure df size is as expected
6610+
df_size = df.memory_usage().sum()
6611+
exp_size = len(dtypes) * n * 8 # cols * rows * bytes
6612+
self.assertEqual(df_size, exp_size)
6613+
# Ensure number of cols in memory_usage is the same as df
6614+
size_df = np.size(df.columns.values) # index=False; default
6615+
self.assertEqual(size_df, np.size(df.memory_usage()))
65826616

65836617
def test_dtypes(self):
65846618
self.mixed_frame['bool'] = self.mixed_frame['A'] > 0
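
The expected size used in `test_info_memory_usage` (`len(dtypes) * n * 8`) relies on `int64` and `float64` both occupying 8 bytes per element. A minimal check with NumPy alone, assuming NumPy is installed, with the arrays standing in for the DataFrame columns:

```python
import numpy as np

dtypes = ['int64', 'int64', 'int64', 'float64']
n = 100

# One array per column; nbytes is the size of the array's data buffer.
columns = [np.zeros(n, dtype=dtype) for dtype in dtypes]
total_bytes = sum(col.nbytes for col in columns)

expected = len(dtypes) * n * 8  # cols * rows * bytes per element
print(total_bytes, expected)
# 3200 3200
```

The arithmetic would not hold for narrower dtypes such as `bool` or `int32`, which is presumably why the duplicate-column case in the test uses only 8-byte types.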
