Skip to content

ENH: Provide dict object for to_dict() #16122 #16220

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
May 16, 2017
1 change: 0 additions & 1 deletion doc/source/whatsnew/v0.20.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -515,7 +515,6 @@ Other Enhancements
- Options added to allow one to turn on/off using ``bottleneck`` and ``numexpr``, see :ref:`here <basics.accelerate>` (:issue:`16157`)
- ``DataFrame.style.bar()`` now accepts two more options to further customize the bar chart. Bar alignment is set with ``align='left'|'mid'|'zero'``, the default is "left", which is backward compatible; You can now pass a list of ``color=[color_negative, color_positive]``. (:issue:`14757`)


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reset this file

.. _ISO 8601 duration: https://en.wikipedia.org/wiki/ISO_8601#Durations


Expand Down
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ New features

Other Enhancements
^^^^^^^^^^^^^^^^^^
- ``Series.to_dict()`` and ``DataFrame.to_dict()`` now support an ``into`` keyword which allows you to specify the ``collections.Mapping`` subclass that you would like returned. The default is ``dict``, which is backwards compatible. (:issue:`16122`)



Expand Down
39 changes: 39 additions & 0 deletions pandas/core/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,8 @@
import warnings
from datetime import datetime, timedelta
from functools import partial
import inspect
import collections

import numpy as np
from pandas._libs import lib, tslib
Expand Down Expand Up @@ -479,6 +481,43 @@ def _dict_compat(d):
for key, value in iteritems(d))


def _standardize_mapping(into):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't need to be leading underscore, this is a private module. do you have a more informative name? (longer is ok).

"""
Helper function to standardize the supplied mapping so it can
be passed to the ``Series.to_dict()`` and ``DataFrame.to_dict()``

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a versionadded

Parameters
----------
into : instance or subclass of collections.Mapping
The argument supplied to ``to_dict``. Must be a class, an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove references to to_dict, this is generic about mappings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, though you can add a See Also section to point to to_dict. You can also remove references to to_dict in the summary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the references and added a See Also section.

initialized collections.defaultdict, or an empty instance
of a collections.Mapping subclass.

Returns
-------
mapping : a collections.Mapping subclass or other constructor
a callable object that can accept an iterator to create
the desired Mapping.

"""
if not inspect.isclass(into):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if its not a class, then you just look at the class itself, it doesn't matter if its empty. this should return a class only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I wrote it this way is because I imagined someone might pass a pre-populated mapping to this function and expect the series or frame to be appended to what was already in the passed mapping. The error was to make it explicit that this would not happen.

Also happy to simplify this function a little - I'll remove this logic if you don't think its worth checking an empty mapping.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are defining the interface here; this is an internal function, so it should do one thing and ideally have a limited input set. This is just determining if its a valid mapping, no need to get complicated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I would remove this check. This function should just be to ensure that we have an instance that's satisfies the Mapping interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all makes sense - I've reworked the function. It now only returns a class or a partial (if a defaultdict is passed)

if len(into) > 0:
raise ValueError(
"to_dict() only accepts empty mappings.")
elif type(into) == collections.defaultdict:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use isinstance

return partial(
collections.defaultdict, into.default_factory)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? what is this trying to do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the return partial(...) part, I'm trying to get around the way defaultdict accepts arguments. The other out-of-the-box mappings will happily accept only an iterator, which is what both of the to_dict() functions supply to the return value of this function. defaultdict also needs a default_factory, so I use partial to include the default_factory so the to_dict() functions can supply only an iterator.

For the last else, I'm just sending the class of the supplied mapping (once its been confirmed empty) back to the function to avoid duplicated logic.

If you think a different structure works better, or it should be broken up into multiple functions, let me know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok the defaultdict handling is fine (use isinstance(type, collections.defaultdict) though)

Copy link
Contributor

@jreback jreback May 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I see. the problem is you need to accept instances to deal with defaultdict, so you can now be passed an instance of defaultdict OR a class of something else (or even a class of defaultdict which is an error).

hmm. I would right off the bat detect an instance of a default dict, else only allow a mapping ABC (ex-defaultdict). Then I think you are good, no? (IOW else you would raise).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need the else (just return the partial)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return _standardize_mapping(type(into))
elif not issubclass(into, collections.Mapping):
raise TypeError('unsupported type: {}'.format(into))
elif into == collections.defaultdict:
raise TypeError(
'to_dict() only accepts initialized defaultdicts')
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just return into here, no need for the else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return into


def sentinel_factory():
class Sentinel(object):
pass
Expand Down
44 changes: 29 additions & 15 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,8 @@
_default_index,
_values_from_object,
_maybe_box_datetimelike,
_dict_compat)
_dict_compat,
_standardize_mapping)
from pandas.core.generic import NDFrame, _shared_docs
from pandas.core.index import Index, MultiIndex, _ensure_index
from pandas.core.indexing import (maybe_droplevels, convert_to_index_sliceable,
Expand Down Expand Up @@ -860,7 +861,7 @@ def from_dict(cls, data, orient='columns', dtype=None):

return cls(data, index=index, columns=columns, dtype=dtype)

def to_dict(self, orient='dict'):
def to_dict(self, orient='dict', into=dict):
"""Convert DataFrame to dictionary.

Parameters
Expand All @@ -882,32 +883,45 @@ def to_dict(self, orient='dict'):
Abbreviations are allowed. `s` indicates `series` and `sp`
indicates `split`.

into : class, default dict
The collections.Mapping subclass used for all Mappings
in the return value. Can be the actual class or an empty
instance of the mapping type you want. If you want a
collections.defaultdict, you must pass an initialized
instance.
.. versionadded:: 0.21.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line


Returns
-------
result : dict like {column -> {index -> value}}
result : collections.Mapping like {column -> {index -> value}}
If ``into`` is collections.defaultdict, the return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some examples of using this.

value's default_factory will be None.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand this one. It seems to contrast with what is said above about a defaultdict needing to be initialized. So in that case, the default_factory will just be what is provided in the initialized defaultdict? (and which can be None, if you didn't provide a default_factory, but that seems a bit besides the point to mention here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is from an earlier version, and I forgot to update the docs. This sentence is wrong, and its gone now.

"""
if not self.columns.is_unique:
warnings.warn("DataFrame columns are not unique, some "
"columns will be omitted.", UserWarning)
# GH16122
into_c = _standardize_mapping(into)
if orient.lower().startswith('d'):
return dict((k, v.to_dict()) for k, v in compat.iteritems(self))
return into_c(
(k, v.to_dict(into)) for k, v in compat.iteritems(self))
elif orient.lower().startswith('l'):
return dict((k, v.tolist()) for k, v in compat.iteritems(self))
return into_c((k, v.tolist()) for k, v in compat.iteritems(self))
elif orient.lower().startswith('sp'):
return {'index': self.index.tolist(),
'columns': self.columns.tolist(),
'data': lib.map_infer(self.values.ravel(),
_maybe_box_datetimelike)
.reshape(self.values.shape).tolist()}
return into_c((('index', self.index.tolist()),
('columns', self.columns.tolist()),
('data', lib.map_infer(self.values.ravel(),
_maybe_box_datetimelike)
.reshape(self.values.shape).tolist())))
elif orient.lower().startswith('s'):
return dict((k, _maybe_box_datetimelike(v))
for k, v in compat.iteritems(self))
return into_c((k, _maybe_box_datetimelike(v))
for k, v in compat.iteritems(self))
elif orient.lower().startswith('r'):
return [dict((k, _maybe_box_datetimelike(v))
for k, v in zip(self.columns, row))
return [into_c((k, _maybe_box_datetimelike(v))
for k, v in zip(self.columns, row))
for row in self.values]
elif orient.lower().startswith('i'):
return dict((k, v.to_dict()) for k, v in self.iterrows())
return into_c((k, v.to_dict(into)) for k, v in self.iterrows())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct to use into here? What if the user passed an instance rather than a class? Wouldn't the values all be written into the same object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is correct. v.to_dict(into) should call standardize_mapping again. Since standardize_mapping only returns a class, I don't think there is a danger of populating the same object twice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha.nI forgot that standardize_mapping always returned a class

else:
raise ValueError("orient '%s' not understood" % orient)

Expand Down
23 changes: 18 additions & 5 deletions pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -46,7 +46,8 @@
_maybe_match_name,
SettingWithCopyError,
_maybe_box_datetimelike,
_dict_compat)
_dict_compat,
_standardize_mapping)
from pandas.core.index import (Index, MultiIndex, InvalidIndexError,
Float64Index, _ensure_index)
from pandas.core.indexing import check_bool_indexer, maybe_convert_indices
Expand Down Expand Up @@ -1074,15 +1075,27 @@ def tolist(self):
""" Convert Series to a nested list """
return list(self.asobject)

def to_dict(self):
def to_dict(self, into=dict):
"""
Convert Series to {label -> value} dict
Convert Series to {label -> value} dict or dict-like object
Parameters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line between the first line and Parameters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

----------
into : class, default dict
The collections.Mapping subclass to use as the return
object. Can be the actual class or an empty
instance of the mapping type you want. If you want a
collections.defaultdict, you must pass an initialized
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you must pass it initialized. (maybe elsewhere as well)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed here and in frame.py

.. versionadded:: 0.21.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same


Returns
-------
value_dict : dict
value_dict : collections.Mapping
If ``into`` is collections.defaultdict, the return
value's default_factory will be None.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be removed as well I think (you updated it for the DataFrame docstring)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"""
return dict(compat.iteritems(self))
# GH16122
into_c = _standardize_mapping(into)
return into_c(compat.iteritems(self))

def to_frame(self, name=None):
"""
Expand Down
113 changes: 59 additions & 54 deletions pandas/tests/frame/test_convert_to.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# -*- coding: utf-8 -*-

import pytest
import collections
import numpy as np

from pandas import compat
Expand All @@ -13,50 +14,6 @@

class TestDataFrameConvertTo(TestData):

def test_to_dict(self):
test_data = {
'A': {'1': 1, '2': 2},
'B': {'1': '1', '2': '2', '3': '3'},
}
recons_data = DataFrame(test_data).to_dict()

for k, v in compat.iteritems(test_data):
for k2, v2 in compat.iteritems(v):
assert v2 == recons_data[k][k2]

recons_data = DataFrame(test_data).to_dict("l")

for k, v in compat.iteritems(test_data):
for k2, v2 in compat.iteritems(v):
assert v2 == recons_data[k][int(k2) - 1]

recons_data = DataFrame(test_data).to_dict("s")

for k, v in compat.iteritems(test_data):
for k2, v2 in compat.iteritems(v):
assert v2 == recons_data[k][k2]

recons_data = DataFrame(test_data).to_dict("sp")
expected_split = {'columns': ['A', 'B'], 'index': ['1', '2', '3'],
'data': [[1.0, '1'], [2.0, '2'], [np.nan, '3']]}
tm.assert_dict_equal(recons_data, expected_split)

recons_data = DataFrame(test_data).to_dict("r")
expected_records = [{'A': 1.0, 'B': '1'},
{'A': 2.0, 'B': '2'},
{'A': np.nan, 'B': '3'}]
assert isinstance(recons_data, list)
assert len(recons_data) == 3
for l, r in zip(recons_data, expected_records):
tm.assert_dict_equal(l, r)

# GH10844
recons_data = DataFrame(test_data).to_dict("i")

for k, v in compat.iteritems(test_data):
for k2, v2 in compat.iteritems(v):
assert v2 == recons_data[k2][k]

def test_to_dict_timestamp(self):

# GH11247
Expand Down Expand Up @@ -190,17 +147,65 @@ def test_to_records_with_unicode_column_names(self):
)
tm.assert_almost_equal(result, expected)

@pytest.mark.parametrize('mapping', [
dict,
collections.defaultdict(list),
collections.OrderedDict])
def test_to_dict(self, mapping):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be tested elsewhere, but can you add a test with a dataframe that has duplicate columns? Make sure to catch the warning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - I added one at the end. Let me know if that is what you were getting at.

test_data = {
'A': {'1': 1, '2': 2},
'B': {'1': '1', '2': '2', '3': '3'},
}
# GH16122
recons_data = DataFrame(test_data).to_dict(into=mapping)

for k, v in compat.iteritems(test_data):
for k2, v2 in compat.iteritems(v):
assert (v2 == recons_data[k][k2])

recons_data = DataFrame(test_data).to_dict("l", mapping)

for k, v in compat.iteritems(test_data):
for k2, v2 in compat.iteritems(v):
assert (v2 == recons_data[k][int(k2) - 1])

recons_data = DataFrame(test_data).to_dict("s", mapping)

for k, v in compat.iteritems(test_data):
for k2, v2 in compat.iteritems(v):
assert (v2 == recons_data[k][k2])

recons_data = DataFrame(test_data).to_dict("sp", mapping)
expected_split = {'columns': ['A', 'B'], 'index': ['1', '2', '3'],
'data': [[1.0, '1'], [2.0, '2'], [np.nan, '3']]}
tm.assert_dict_equal(recons_data, expected_split)

recons_data = DataFrame(test_data).to_dict("r", mapping)
expected_records = [{'A': 1.0, 'B': '1'},
{'A': 2.0, 'B': '2'},
{'A': np.nan, 'B': '3'}]
assert isinstance(recons_data, list)
assert (len(recons_data) == 3)
for l, r in zip(recons_data, expected_records):
tm.assert_dict_equal(l, r)

# GH10844
recons_data = DataFrame(test_data).to_dict("i")

for k, v in compat.iteritems(test_data):
for k2, v2 in compat.iteritems(v):
assert (v2 == recons_data[k2][k])

@pytest.mark.parametrize('tz', ['UTC', 'GMT', 'US/Eastern'])
def test_to_records_datetimeindex_with_tz(tz):
# GH13937
dr = date_range('2016-01-01', periods=10,
freq='S', tz=tz)
@pytest.mark.parametrize('tz', ['UTC', 'GMT', 'US/Eastern'])
def test_to_records_datetimeindex_with_tz(self, tz):
# GH13937
dr = date_range('2016-01-01', periods=10,
freq='S', tz=tz)

df = DataFrame({'datetime': dr}, index=dr)
df = DataFrame({'datetime': dr}, index=dr)

expected = df.to_records()
result = df.tz_convert("UTC").to_records()
expected = df.to_records()
result = df.tz_convert("UTC").to_records()

# both converted to UTC, so they are equal
tm.assert_numpy_array_equal(result, expected)
# both converted to UTC, so they are equal
tm.assert_numpy_array_equal(result, expected)
18 changes: 15 additions & 3 deletions pandas/tests/series/test_io.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,8 @@
# pylint: disable-msg=E1101,W0612

from datetime import datetime
import collections
import pytest

import numpy as np
import pandas as pd
Expand Down Expand Up @@ -126,9 +128,6 @@ def test_to_frame(self):
dict(testdifferent=self.ts.values), index=self.ts.index)
assert_frame_equal(rs, xp)

def test_to_dict(self):
tm.assert_series_equal(Series(self.ts.to_dict(), name='ts'), self.ts)

def test_timeseries_periodindex(self):
# GH2891
from pandas import period_range
Expand Down Expand Up @@ -167,6 +166,19 @@ class SubclassedFrame(DataFrame):
expected = SubclassedFrame({'X': [1, 2, 3]})
assert_frame_equal(result, expected)

@pytest.mark.parametrize('mapping', (
dict,
collections.defaultdict(list),
collections.OrderedDict))
def test_to_dict(self, mapping):
# GH16122
ts = TestData().ts
tm.assert_series_equal(
Series(ts.to_dict(mapping), name='ts'), ts)
from_method = Series(ts.to_dict(collections.Counter))
from_constructor = Series(collections.Counter(ts.iteritems()))
tm.assert_series_equal(from_method, from_constructor)


class TestSeriesToList(TestData):

Expand Down
Loading