DataFrame.to_dict(orient='records') datetime conversion inconsistency #11247

Closed

depristo opened this issue Oct 5, 2015 · 10 comments
Labels
Bug · Dtype Conversions (Unexpected or buggy dtype conversions) · Output-Formatting (__repr__ of pandas objects, to_string)

Comments

@depristo

depristo commented Oct 5, 2015

My original code used:

data = df.to_dict(orient='records')

but I had to replace it with a different to_dict conversion because to_dict(orient='records') doesn't properly convert numpy.datetime64 to Timestamps; it returns numpy.datetime64 values, which are hard to work with downstream. If the dataframe includes any non-datetime field, the conversion to Timestamp does happen in 'records' mode, so I think this is an edge case where the converters aren't triggered when every column is a datetime.

import pandas as pd
df = pd.DataFrame({
        'c1': pd.to_datetime(['1-1-2015', '1-2-2015', '1-3-2015']),
        'c2': pd.to_datetime(['2-2-2015', '2-3-2015', '2-4-2015']),
    })
print('raw dataframe')
print(df)
print('\nconverted to records')
print(df.to_dict(orient='records'))
print('\nconverted to dict')
print(df.to_dict(orient='dict'))
print('\nconverted to list')
print(df.to_dict(orient='list'))

produces

raw dataframe
          c1         c2
0 2015-01-01 2015-02-02
1 2015-01-02 2015-02-03
2 2015-01-03 2015-02-04

converted to records
[{'c2': numpy.datetime64('2015-02-01T16:00:00.000000000-0800'), 'c1': numpy.datetime64('2014-12-31T16:00:00.000000000-0800')}, {'c2': numpy.datetime64('2015-02-02T16:00:00.000000000-0800'), 'c1': numpy.datetime64('2015-01-01T16:00:00.000000000-0800')}, {'c2': numpy.datetime64('2015-02-03T16:00:00.000000000-0800'), 'c1': numpy.datetime64('2015-01-02T16:00:00.000000000-0800')}]

converted to dict
{'c2': {0: Timestamp('2015-02-02 00:00:00'), 1: Timestamp('2015-02-03 00:00:00'), 2: Timestamp('2015-02-04 00:00:00')}, 'c1': {0: Timestamp('2015-01-01 00:00:00'), 1: Timestamp('2015-01-02 00:00:00'), 2: Timestamp('2015-01-03 00:00:00')}}

converted to list
{'c2': [Timestamp('2015-02-02 00:00:00'), Timestamp('2015-02-03 00:00:00'), Timestamp('2015-02-04 00:00:00')], 'c1': [Timestamp('2015-01-01 00:00:00'), Timestamp('2015-01-02 00:00:00'), Timestamp('2015-01-03 00:00:00')]}
@jreback
Contributor

jreback commented Oct 5, 2015

I would guess this is some older code; yep, these should be Timestamps.

@jreback added the Bug, Prio-low, Output-Formatting, and Dtype Conversions labels Oct 5, 2015
@jreback
Contributor

jreback commented Oct 5, 2015

pull-requests are welcome.

@jreback added this to the Next Major Release milestone Oct 5, 2015
@depristo
Author

depristo commented Oct 5, 2015

Normally I'd love to fix it, but I looked at the code and it's not at all obvious to me what the underlying problem is. There's no explicit conversion code in to_dict, so I think the conversion is being triggered by some lower-level system.

@jreback
Contributor

jreback commented Oct 5, 2015

Set an example up (yours above), then step through the code until you see what's happening.

@udiboy1209

Same thing happens for orient='series' and orient='split'!

>>> print(df.to_dict(orient='series'))
{'c2': 0   2015-02-02
1   2015-02-03
2   2015-02-04
Name: c2, dtype: datetime64[ns], 'c1': 0   2015-01-01
1   2015-01-02
2   2015-01-03
Name: c1, dtype: datetime64[ns]}

>>> print(df.to_dict(orient='split'))
{'index': [0, 1, 2], 'data': [[1420070400000000000L, 1422835200000000000L], [1420156800000000000L, 1422921600000000000L], [1420243200000000000L, 1423008000000000000L]], 'columns': ['c1', 'c2']}

I agree with @depristo, this is a pretty non-trivial bug. The code for to_dict() gives no clues.
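
The long integers in the 'split' output appear to come from the same root cause: .values gives a datetime64[ns] ndarray, and calling .tolist() on a nanosecond-precision datetime64 array yields raw epoch nanoseconds rather than datetime objects. A minimal check (a hypothetical session, not from the original report):

import pandas as pd

df = pd.DataFrame({'c1': pd.to_datetime(['1-1-2015'])})
# .values is datetime64[ns]; ndarray.tolist() on ns-precision datetimes
# returns integer nanoseconds since the epoch, not datetime objects
print(df.values.dtype)     # datetime64[ns]
print(df.values.tolist())  # [[1420070400000000000]]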

@ethanluoyc
Contributor

I did some preliminary investigation. It appears that to_dict() invoked with orient='list' goes through __iter__ in series.py, which checks the type of each element and, if it is np.datetime64, returns a Timestamp object via lib.Timestamp(x). That dtype check seems to be missing for orient='records'; could that be the problem?
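
A minimal sketch of the boxing ethanluoyc describes, using ordinary Series iteration rather than the internal series.py code: iterating the Series yields Timestamp objects, while iterating the underlying ndarray yields raw numpy.datetime64 scalars.

import pandas as pd

s = pd.Series(pd.to_datetime(['1-1-2015', '1-2-2015']))
# Iterating the Series boxes each element into a pandas Timestamp
print(list(s))         # [Timestamp('2015-01-01 00:00:00'), Timestamp('2015-01-02 00:00:00')]
# Iterating the raw ndarray yields numpy.datetime64 scalars instead
print(list(s.values))  # [numpy.datetime64('2015-01-01T00:00:00.000000000'), ...]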

@ethanluoyc
Contributor

I think I found the problem. For 'dict' and 'list', each value is a Series object, so the conversion is applied. For 'records', the returned values come directly from the ndarray, so the datetime objects are never converted correctly.
From frame.py:

    def to_dict(self, orient='dict'):
        """Convert DataFrame to dictionary.

        Parameters
        ----------
        orient : str {'dict', 'list', 'series', 'split', 'records'}
            Determines the type of the values of the dictionary.

            - dict (default) : dict like {column -> {index -> value}}
            - list : dict like {column -> [values]}
            - series : dict like {column -> Series(values)}
            - split : dict like
              {index -> [index], columns -> [columns], data -> [values]}
            - records : list like
              [{column -> value}, ... , {column -> value}]

            Abbreviations are allowed. `s` indicates `series` and `sp`
            indicates `split`.

        Returns
        -------
        result : dict like {column -> {index -> value}}
        """
        if not self.columns.is_unique:
            warnings.warn("DataFrame columns are not unique, some "
                          "columns will be omitted.", UserWarning)
        if orient.lower().startswith('d'):
            return dict((k, v.to_dict()) for k, v in compat.iteritems(self))
        elif orient.lower().startswith('l'):
            return dict((k, v.tolist()) for k, v in compat.iteritems(self))
        elif orient.lower().startswith('sp'):
            return {'index': self.index.tolist(),
                    'columns': self.columns.tolist(),
                    'data': self.values.tolist()}
        elif orient.lower().startswith('s'):
            return dict((k, v) for k, v in compat.iteritems(self))
        elif orient.lower().startswith('r'):
            return [dict((k, v) for k, v in zip(self.columns, row))
                    for row in self.values]
        else:
            raise ValueError("orient '%s' not understood" % orient)

@jreback is this so?

@jreback
Contributor

jreback commented Oct 11, 2015

Using .values on a DataFrame creates a single numpy array container that holds all of the elements. If the frame is mixed-type (e.g. object and datetime64[ns]), that array will be object dtype and everything works, e.g.

In [5]: df = DataFrame({'A' : 'foo', 'B' : pd.date_range('20130101',periods=3), 'C' : 1})

In [6]: df.to_dict(orient='records')
Out[6]: 
[{'A': 'foo', 'B': Timestamp('2013-01-01 00:00:00'), 'C': 1},
 {'A': 'foo', 'B': Timestamp('2013-01-02 00:00:00'), 'C': 1},
 {'A': 'foo', 'B': Timestamp('2013-01-03 00:00:00'), 'C': 1}]

but in the case above .values creates a datetime64[ns] container, so the values come back as np.datetime64; as far as .values is concerned, this is all correct.

What you would need to do is box things; you can use: https://github.com/pydata/pandas/blob/master/pandas/core/common.py#L1848 (_maybe_box_datetimelike)
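
A user-level sketch of the boxing jreback describes, for anyone hitting this before a fix lands: the records_boxed() helper and its box() check below are illustrative names, not the actual patch; they convert raw numpy.datetime64 scalars into Timestamps while passing other values through, which is the behavior _maybe_box_datetimelike provides internally.

import numpy as np
import pandas as pd

def records_boxed(frame):
    # Build records from frame.values, boxing datetime64 scalars into Timestamps
    def box(value):
        return pd.Timestamp(value) if isinstance(value, np.datetime64) else value
    return [dict((col, box(val)) for col, val in zip(frame.columns, row))
            for row in frame.values]

df = pd.DataFrame({
    'c1': pd.to_datetime(['1-1-2015', '1-2-2015', '1-3-2015']),
    'c2': pd.to_datetime(['2-2-2015', '2-3-2015', '2-4-2015']),
})
print(records_boxed(df))
# [{'c1': Timestamp('2015-01-01 00:00:00'), 'c2': Timestamp('2015-02-02 00:00:00')}, ...]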

@jreback
Contributor

jreback commented Oct 14, 2015

should fix as I indicated above

On Oct 14, 2015, at 9:57 AM, Ethan Luo [email protected] wrote:

@jreback so do we consider this an intended effect, or do we need to fix this?

@jreback modified the milestones: 0.17.1, Next Major Release Oct 15, 2015
ethanluoyc added a commit to ethanluoyc/pandas that referenced this issue Oct 16, 2015
Fix a bug where to_dict() does not return Timestamp when there is only
datetime dtype present.
jreback pushed a commit that referenced this issue Oct 16, 2015
Fix a bug where to_dict() does not return Timestamp when there is only
datetime dtype present.
@jreback
Contributor

jreback commented Oct 16, 2015

closed by #11327

@jreback closed this as completed Oct 16, 2015
Dr-Irv added a commit to Dr-Irv/pandas that referenced this issue Oct 24, 2015