Skip to content

ENH: Provide dict object for to_dict() #16122 #16220

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
May 16, 2017
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ New features

Other Enhancements
^^^^^^^^^^^^^^^^^^
- ``Series.to_dict()`` and ``DataFrame.to_dict()`` now support an ``into`` keyword which allows you to specify the ``collections.Mapping`` subclass that you would like returned. The default is ``dict``, which is backwards compatible. (:issue:`16122`)



Expand Down
37 changes: 37 additions & 0 deletions pandas/core/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,8 @@
import warnings
from datetime import datetime, timedelta
from functools import partial
import inspect
import collections

import numpy as np
from pandas._libs import lib, tslib
Expand Down Expand Up @@ -479,6 +481,41 @@ def _dict_compat(d):
for key, value in iteritems(d))


def standardize_mapping(into):
"""
Helper function to standardize a supplied mapping.
.. versionadded:: 0.21.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a blank line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a versionadded

Parameters
----------
into : instance or subclass of collections.Mapping
Must be a class, an initialized collections.defaultdict,
or an empty instance of a collections.Mapping subclass.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this last statement is true, it doesn't have to be empty (nor should we require it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed "empty," which should make this correct.


Returns
-------
mapping : a collections.Mapping subclass or other constructor
a callable object that can accept an iterator to create
the desired Mapping.

See Also
--------
DataFrame.to_dict
Series.to_dict
"""
if not inspect.isclass(into):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if its not a class, then you just look at the class itself, it doesn't matter if its empty. this should return a class only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I wrote it this way is because I imagined someone might pass a pre-populated mapping to this function and expect the series or frame to be appended to what was already in the passed mapping. The error was to make it explicit that this would not happen.

Also happy to simplify this function a little - I'll remove this logic if you don't think its worth checking an empty mapping.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are defining the interface here; this is an internal function, so it should do one thing and ideally have a limited input set. This is just determining if its a valid mapping, no need to get complicated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I would remove this check. This function should just be to ensure that we have an instance that's satisfies the Mapping interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all makes sense - I've reworked the function. It now only returns a class or a partial (if a defaultdict is passed)

if isinstance(into, collections.defaultdict):
return partial(
collections.defaultdict, into.default_factory)
into = type(into)
if not issubclass(into, collections.Mapping):
raise TypeError('unsupported type: {}'.format(into))
elif into == collections.defaultdict:
raise TypeError(
'to_dict() only accepts initialized defaultdicts')
return into


def sentinel_factory():
class Sentinel(object):
pass
Expand Down
88 changes: 73 additions & 15 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,8 @@
_default_index,
_values_from_object,
_maybe_box_datetimelike,
_dict_compat)
_dict_compat,
standardize_mapping)
from pandas.core.generic import NDFrame, _shared_docs
from pandas.core.index import Index, MultiIndex, _ensure_index
from pandas.core.indexing import (maybe_droplevels, convert_to_index_sliceable,
Expand Down Expand Up @@ -860,7 +861,7 @@ def from_dict(cls, data, orient='columns', dtype=None):

return cls(data, index=index, columns=columns, dtype=dtype)

def to_dict(self, orient='dict'):
def to_dict(self, orient='dict', into=dict):
"""Convert DataFrame to dictionary.

Parameters
Expand All @@ -882,32 +883,89 @@ def to_dict(self, orient='dict'):
Abbreviations are allowed. `s` indicates `series` and `sp`
indicates `split`.

into : class, default dict
The collections.Mapping subclass used for all Mappings
in the return value. Can be the actual class or an empty
instance of the mapping type you want. If you want a
collections.defaultdict, you must pass an initialized
instance.

.. versionadded:: 0.21.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line


Returns
-------
result : dict like {column -> {index -> value}}
result : collections.Mapping like {column -> {index -> value}}
If ``into`` is collections.defaultdict, the return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some examples of using this.

value's default_factory will be None.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand this one. It seems to contrast with what is said above about a defaultdict needing to be initialized. So in that case, the default_factory will just be what is provided in the initialized defaultdict? (and which can be None, if you didn't provide a default_factory, but that seems a bit besides the point to mention here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is from an earlier version, and I forgot to update the docs. This sentence is wrong, and its gone now.


Examples
--------
>>> from pandas import DataFrame
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can remove this import, but use pd.DataFrame instead of DataFrame in the other places

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

>>> from collections import OrderedDict, defaultdict
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this import to the example where it is used?

>>> df = DataFrame(
{'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])
>>> df
col1 col2
a 1 0.1
b 2 0.2
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 0.1 and 0.2 are wrong, should be 0.5 and 0.75

>>> df.to_dict()
{'col1': {'a': 1, 'b': 2}, 'col2': {'a': 0.5, 'b': 0.75}}

You can specify the return orientation.

>>> df.to_dict('series')
{'col1': a 1
b 2
Name: col1, dtype: int64, 'col2': a 0.50
b 0.75
Name: col2, dtype: float64}
>>> df.to_dict('split')
{'columns': ['col1', 'col2'],
'data': [[1.0, 0.5], [2.0, 0.75]],
'index': ['a', 'b']}
>>> df.to_dict('records')
[{'col1': 1.0, 'col2': 0.5}, {'col1': 2.0, 'col2': 0.75}]
>>> df.to_dict('index')
{'a': {'col1': 1.0, 'col2': 0.5}, 'b': {'col1': 2.0, 'col2': 0.75}}

You can also specify the mapping type.

>>> df.to_dict(into=OrderedDict)
OrderedDict([('col1', OrderedDict([('a', 1), ('b', 2)])),
('col2', OrderedDict([('a', 0.5), ('b', 0.75)]))])

If you want a `defaultdict`, you need to initialize it:

>>> dd = defaultdict(list)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice examples!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a blank line and a preamble to this example

>>> df.to_dict('records', into=dd)
[defaultdict(<type 'list'>, {'col2': 0.5, 'col1': 1.0}),
defaultdict(<type 'list'>, {'col2': 0.75, 'col1': 2.0})]
"""
if not self.columns.is_unique:
warnings.warn("DataFrame columns are not unique, some "
"columns will be omitted.", UserWarning)
# GH16122
into_c = standardize_mapping(into)
if orient.lower().startswith('d'):
return dict((k, v.to_dict()) for k, v in compat.iteritems(self))
return into_c(
(k, v.to_dict(into)) for k, v in compat.iteritems(self))
elif orient.lower().startswith('l'):
return dict((k, v.tolist()) for k, v in compat.iteritems(self))
return into_c((k, v.tolist()) for k, v in compat.iteritems(self))
elif orient.lower().startswith('sp'):
return {'index': self.index.tolist(),
'columns': self.columns.tolist(),
'data': lib.map_infer(self.values.ravel(),
_maybe_box_datetimelike)
.reshape(self.values.shape).tolist()}
return into_c((('index', self.index.tolist()),
('columns', self.columns.tolist()),
('data', lib.map_infer(self.values.ravel(),
_maybe_box_datetimelike)
.reshape(self.values.shape).tolist())))
elif orient.lower().startswith('s'):
return dict((k, _maybe_box_datetimelike(v))
for k, v in compat.iteritems(self))
return into_c((k, _maybe_box_datetimelike(v))
for k, v in compat.iteritems(self))
elif orient.lower().startswith('r'):
return [dict((k, _maybe_box_datetimelike(v))
for k, v in zip(self.columns, row))
return [into_c((k, _maybe_box_datetimelike(v))
for k, v in zip(self.columns, row))
for row in self.values]
elif orient.lower().startswith('i'):
return dict((k, v.to_dict()) for k, v in self.iterrows())
return into_c((k, v.to_dict(into)) for k, v in self.iterrows())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct to use into here? What if the user passed an instance rather than a class? Wouldn't the values all be written into the same object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is correct. v.to_dict(into) should call standardize_mapping again. Since standardize_mapping only returns a class, I don't think there is a danger of populating the same object twice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha.nI forgot that standardize_mapping always returned a class

else:
raise ValueError("orient '%s' not understood" % orient)

Expand Down
39 changes: 33 additions & 6 deletions pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -46,7 +46,8 @@
_maybe_match_name,
SettingWithCopyError,
_maybe_box_datetimelike,
_dict_compat)
_dict_compat,
standardize_mapping)
from pandas.core.index import (Index, MultiIndex, InvalidIndexError,
Float64Index, _ensure_index)
from pandas.core.indexing import check_bool_indexer, maybe_convert_indices
Expand Down Expand Up @@ -1074,15 +1075,41 @@ def tolist(self):
""" Convert Series to a nested list """
return list(self.asobject)

def to_dict(self):
def to_dict(self, into=dict):
"""
Convert Series to {label -> value} dict
Convert Series to {label -> value} dict or dict-like object
Parameters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line between the first line and Parameters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

----------
into : class, default dict
The collections.Mapping subclass to use as the return
object. Can be the actual class or an empty
instance of the mapping type you want. If you want a
collections.defaultdict, you must pass an initialized
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you must pass it initialized. (maybe elsewhere as well)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed here and in frame.py


.. versionadded:: 0.21.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same


Returns
-------
value_dict : dict
"""
return dict(compat.iteritems(self))
value_dict : collections.Mapping
If ``into`` is collections.defaultdict, the return
value's default_factory will be None.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be removed as well I think (you updated it for the DataFrame docstring)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


Examples
--------
>>> from pandas import Series
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here as for DataFrame (leave out the import, and use pd.Series instead for consistency within the docstrings)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

>>> from collections import OrderedDict, defaultdict
>>> s = Series([1, 2, 3, 4])
>>> s.to_dict()
{0: 1, 1: 2, 2: 3, 3: 4}
>>> s.to_dict(OrderedDict)
OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)])
>>> dd = defaultdict(list)
>>> s.to_dict(dd)
defaultdict(<type 'list'>, {0: 1, 1: 2, 2: 3, 3: 4})
"""
# GH16122
into_c = standardize_mapping(into)
return into_c(compat.iteritems(self))

def to_frame(self, name=None):
"""
Expand Down
123 changes: 69 additions & 54 deletions pandas/tests/frame/test_convert_to.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# -*- coding: utf-8 -*-

import pytest
import collections
import numpy as np

from pandas import compat
Expand All @@ -13,50 +14,6 @@

class TestDataFrameConvertTo(TestData):

def test_to_dict(self):
test_data = {
'A': {'1': 1, '2': 2},
'B': {'1': '1', '2': '2', '3': '3'},
}
recons_data = DataFrame(test_data).to_dict()

for k, v in compat.iteritems(test_data):
for k2, v2 in compat.iteritems(v):
assert v2 == recons_data[k][k2]

recons_data = DataFrame(test_data).to_dict("l")

for k, v in compat.iteritems(test_data):
for k2, v2 in compat.iteritems(v):
assert v2 == recons_data[k][int(k2) - 1]

recons_data = DataFrame(test_data).to_dict("s")

for k, v in compat.iteritems(test_data):
for k2, v2 in compat.iteritems(v):
assert v2 == recons_data[k][k2]

recons_data = DataFrame(test_data).to_dict("sp")
expected_split = {'columns': ['A', 'B'], 'index': ['1', '2', '3'],
'data': [[1.0, '1'], [2.0, '2'], [np.nan, '3']]}
tm.assert_dict_equal(recons_data, expected_split)

recons_data = DataFrame(test_data).to_dict("r")
expected_records = [{'A': 1.0, 'B': '1'},
{'A': 2.0, 'B': '2'},
{'A': np.nan, 'B': '3'}]
assert isinstance(recons_data, list)
assert len(recons_data) == 3
for l, r in zip(recons_data, expected_records):
tm.assert_dict_equal(l, r)

# GH10844
recons_data = DataFrame(test_data).to_dict("i")

for k, v in compat.iteritems(test_data):
for k2, v2 in compat.iteritems(v):
assert v2 == recons_data[k2][k]

def test_to_dict_timestamp(self):

# GH11247
Expand Down Expand Up @@ -190,17 +147,75 @@ def test_to_records_with_unicode_column_names(self):
)
tm.assert_almost_equal(result, expected)

@pytest.mark.parametrize('mapping', [
dict,
collections.defaultdict(list),
collections.OrderedDict])
def test_to_dict(self, mapping):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be tested elsewhere, but can you add a test with a dataframe that has duplicate columns? Make sure to catch the warning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - I added one at the end. Let me know if that is what you were getting at.

test_data = {
'A': {'1': 1, '2': 2},
'B': {'1': '1', '2': '2', '3': '3'},
}

# GH16122
recons_data = DataFrame(test_data).to_dict(into=mapping)

for k, v in compat.iteritems(test_data):
for k2, v2 in compat.iteritems(v):
assert (v2 == recons_data[k][k2])

recons_data = DataFrame(test_data).to_dict("l", mapping)

for k, v in compat.iteritems(test_data):
for k2, v2 in compat.iteritems(v):
assert (v2 == recons_data[k][int(k2) - 1])

recons_data = DataFrame(test_data).to_dict("s", mapping)

for k, v in compat.iteritems(test_data):
for k2, v2 in compat.iteritems(v):
assert (v2 == recons_data[k][k2])

recons_data = DataFrame(test_data).to_dict("sp", mapping)
expected_split = {'columns': ['A', 'B'], 'index': ['1', '2', '3'],
'data': [[1.0, '1'], [2.0, '2'], [np.nan, '3']]}
tm.assert_dict_equal(recons_data, expected_split)

recons_data = DataFrame(test_data).to_dict("r", mapping)
expected_records = [{'A': 1.0, 'B': '1'},
{'A': 2.0, 'B': '2'},
{'A': np.nan, 'B': '3'}]
assert isinstance(recons_data, list)
assert (len(recons_data) == 3)
for l, r in zip(recons_data, expected_records):
tm.assert_dict_equal(l, r)

# GH10844
recons_data = DataFrame(test_data).to_dict("i")

for k, v in compat.iteritems(test_data):
for k2, v2 in compat.iteritems(v):
assert (v2 == recons_data[k2][k])

df = DataFrame(test_data)
df['duped'] = df[df.columns[0]]
recons_data = df.to_dict("i")
comp_data = test_data.copy()
comp_data['duped'] = comp_data[df.columns[0]]
for k, v in compat.iteritems(comp_data):
for k2, v2 in compat.iteritems(v):
assert (v2 == recons_data[k2][k])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also add a test that hits some of the errors people might encounter (you do check these in the testing of standardize_mapping), but this is an integration test. you can put a test right after this, maybe test_to_dict_errors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this test - made sure the TypeErrors were caught. Let me know if you think anything else should go in there.


@pytest.mark.parametrize('tz', ['UTC', 'GMT', 'US/Eastern'])
def test_to_records_datetimeindex_with_tz(tz):
# GH13937
dr = date_range('2016-01-01', periods=10,
freq='S', tz=tz)
@pytest.mark.parametrize('tz', ['UTC', 'GMT', 'US/Eastern'])
def test_to_records_datetimeindex_with_tz(self, tz):
# GH13937
dr = date_range('2016-01-01', periods=10,
freq='S', tz=tz)

df = DataFrame({'datetime': dr}, index=dr)
df = DataFrame({'datetime': dr}, index=dr)

expected = df.to_records()
result = df.tz_convert("UTC").to_records()
expected = df.to_records()
result = df.tz_convert("UTC").to_records()

# both converted to UTC, so they are equal
tm.assert_numpy_array_equal(result, expected)
# both converted to UTC, so they are equal
tm.assert_numpy_array_equal(result, expected)
Loading