
Assign back converted multiple columns to datetime failed #20511

Closed
jesrael opened this issue Mar 28, 2018 · 8 comments · Fixed by #46982
Labels: good first issue · Indexing (Related to indexing on series/frames, not to indexes themselves) · Needs Tests (Unit test(s) needed to prevent regressions)


jesrael commented Mar 28, 2018

I want to convert multiple columns to datetimes:

df = pd.DataFrame(np.random.choice(['2015-01-01','2016-01-01'], size=(5,6))).add_prefix('date')
print (df)
        date0       date1       date2       date3       date4       date5
0  2015-01-01  2016-01-01  2015-01-01  2015-01-01  2016-01-01  2016-01-01
1  2016-01-01  2016-01-01  2016-01-01  2015-01-01  2016-01-01  2015-01-01
2  2015-01-01  2015-01-01  2016-01-01  2016-01-01  2016-01-01  2015-01-01
3  2016-01-01  2015-01-01  2015-01-01  2015-01-01  2015-01-01  2015-01-01
4  2015-01-01  2016-01-01  2015-01-01  2016-01-01  2016-01-01  2015-01-01

i = range(0, 6, 2)
df.iloc[:, i] = df.iloc[:, i].apply(lambda x: pd.to_datetime(x, errors='coerce'))
print (df.dtypes)
date0    object
date1    object
date2    object
date3    object
date4    object
date5    object
dtype: object

print (df)
                 date0       date1                date2       date3  \
0  1420070400000000000  2016-01-01  1420070400000000000  2015-01-01   
1  1451606400000000000  2016-01-01  1451606400000000000  2015-01-01   
2  1420070400000000000  2015-01-01  1451606400000000000  2016-01-01   
3  1451606400000000000  2015-01-01  1420070400000000000  2015-01-01   
4  1420070400000000000  2016-01-01  1420070400000000000  2016-01-01   

                 date4       date5  
0  1451606400000000000  2016-01-01  
1  1451606400000000000  2015-01-01  
2  1451606400000000000  2015-01-01  
3  1420070400000000000  2015-01-01  
4  1451606400000000000  2015-01-01  

I tested the conversion:

print (df.iloc[:,i].apply(lambda x: pd.to_datetime(x, errors='coerce')).dtypes)
date0    datetime64[ns]
date2    datetime64[ns]
date4    datetime64[ns]
dtype: object

But assigning back converts the datetimes to unix timestamps. I also tested loc, with the same problem.

print (pd.show_versions())

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: sk_SK
LOCALE: None.None

pandas: 0.22.0
pytest: 2.8.5
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.23.4
numpy: 1.12.1
scipy: 0.19.0
pyarrow: 0.7.0
xarray: None
IPython: 4.1.2
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: 0.4.0
matplotlib: 2.0.0
openpyxl: 2.4.8
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.4.0
None
TomAugspurger (Contributor) commented:

It's not entirely clear what you're trying to do. For

I want to convert multiple columns to datetimes:

you can use apply and to_datetime

In [15]: df2 = df.apply(pd.to_datetime, errors='coerce')

In [16]: df2.dtypes
Out[16]:
date0    datetime64[ns]
date1    datetime64[ns]
date2    datetime64[ns]
date3    datetime64[ns]
date4    datetime64[ns]
date5    datetime64[ns]
dtype: object

@TomAugspurger TomAugspurger added this to the No action milestone Mar 28, 2018
jesrael (Author) commented Mar 29, 2018

@TomAugspurger - sorry, maybe you missed this:

But assigning back converts the datetimes to unix timestamps. I also tested loc, with the same problem.

It converts nicely if assigned to a new DataFrame, but fails if assigned back to a subset: the datetime columns are converted to unix timestamps. Or is something else going on?
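
As a side note, a workaround sketch for affected pandas versions (the loop below is an illustration, not part of the original report): converting and assigning one column at a time replaces each column wholesale, so the datetime64[ns] dtype is kept.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice(['2015-01-01', '2016-01-01'], size=(5, 6))).add_prefix('date')

# Convert only the even-numbered columns, one at a time; whole-column
# assignment keeps the datetime64[ns] dtype instead of falling back to object.
for col in df.columns[::2]:
    df[col] = pd.to_datetime(df[col], errors='coerce')

print(df.dtypes)
# date0    datetime64[ns]
# date1            object
# date2    datetime64[ns]
# ...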

TomAugspurger (Contributor) commented Mar 29, 2018 via email

jesrael (Author) commented Mar 29, 2018

@TomAugspurger - Hmm, I am the user https://stackoverflow.com/users/2901002/jezrael

I will try to explain more:

df = pd.DataFrame(np.random.choice(['2015-01-01','2016-01-01'], size=(5,6))).add_prefix('date')
print (df)
        date0       date1       date2       date3       date4       date5
0  2015-01-01  2016-01-01  2015-01-01  2015-01-01  2016-01-01  2016-01-01
1  2016-01-01  2016-01-01  2016-01-01  2015-01-01  2016-01-01  2015-01-01
2  2015-01-01  2015-01-01  2016-01-01  2016-01-01  2016-01-01  2015-01-01
3  2016-01-01  2015-01-01  2015-01-01  2015-01-01  2015-01-01  2015-01-01
4  2015-01-01  2016-01-01  2015-01-01  2016-01-01  2016-01-01  2015-01-01

i = range(0, 6, 2)
df.iloc[:, i] = df.iloc[:, i].apply(lambda x: pd.to_datetime(x, errors='coerce'))

I get:

print (df.dtypes)
date0    object
date1    object
date2    object
date3    object
date4    object
date5    object
dtype: object

and I expected:

print (df.dtypes)
date0    datetime64[ns]
date1    object
date2    datetime64[ns]
date3    object
date4    datetime64[ns]
date5    object
dtype: object

shippy commented Nov 29, 2018

@TomAugspurger This really looks like a bug. Let's see if I can explain it more clearly.

The bug presents in two ways:

  1. .apply(pd.to_datetime) called on a multi-column slice converts the columns to datetime64 after the call, but not during the assignment to the same multi-column slice.
  2. Further, assignment of the result of multi-column .apply(pd.to_datetime) transforms the datetime string to a nanosecond timestamp.

This violates expectations in two ways:

  1. The column doesn't look as expected, and there's no hint as to how to get the expected format back.
  2. The column doesn't behave as expected, e.g. it cannot be subtracted from other datetime columns.

To demonstrate, let's set up a six-column DataFrame. The leftmost column will stay as-is, for side-by-side comparison; the rest will have pd.to_datetime called upon them in various forms. All five rightmost columns should be datetime64 and equal to one another by the end of the exercise. Only the three rightmost will be.

We'll recreate the bug in apply_multislice1 and apply_multislice2.

test_df = (pd.DataFrame(np.random.choice(['2016-01-01'], size=(2, 6)))
        .add_prefix('date')
        .rename(columns={
            'date0': 'original',
            'date1': 'apply_multislice1',
            'date2': 'apply_multislice2',
            'date3': 'apply_unislice',
            'date4': 'apply_noslice',
            'date5': 'assigned'}))
     original apply_multislice1 apply_multislice2 apply_unislice apply_noslice    assigned
0  2016-01-01        2016-01-01        2016-01-01     2016-01-01    2016-01-01  2016-01-01
1  2016-01-01        2016-01-01        2016-01-01     2016-01-01    2016-01-01  2016-01-01

Bug demonstration

I'll quickly demonstrate the case that succeeds without assignment:

assert (test_df.loc[:, ['apply_multislice1', 'apply_multislice2']].apply(pd.to_datetime) == pd.to_datetime('2016-01-01')).all(axis=None)

If it's assigned back to itself, however, it changes value to a nanosecond timestamp that still has dtype object:

# Assignment
test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] = test_df.loc[:, ['apply_multislice1', 'apply_multislice2']].apply(pd.to_datetime)

# Succeeds, but shouldn't:
assert (test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] == 1451606400000000000).all(axis=None)
assert test_df.loc[:, ['apply_multislice1', 'apply_multislice2']].dtypes.isin([object]).all()

# Fails, but shouldn't:
assert (test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] == pd.to_datetime('2016-01-01')).all(axis=None)
assert test_df.loc[:, ['apply_multislice1', 'apply_multislice2']].dtypes.isin(['datetime64']).all()

Variations without bug

I could think of three other ways to do the datetime conversion - directly, and through .apply with different slicing:

test_df.loc[:, 'apply_unislice'] = test_df.loc[:, ['apply_unislice']].apply(pd.to_datetime)
test_df.loc[:, 'apply_noslice'] = test_df.loc[:, 'apply_noslice'].apply(pd.to_datetime)
test_df.loc[:, 'assigned'] = pd.to_datetime(test_df.loc[:, 'assigned'])

# Succeeds, as expected:
assert (test_df.loc[:, ['apply_unislice', 'apply_noslice', 'assigned']] == pd.to_datetime('2016-01-01')).all(axis=None)
assert not test_df.loc[:, ['apply_unislice', 'apply_noslice', 'assigned']].dtypes.isin([object]).any(axis=None)

Final outcome

All five rightmost columns should have the same content and dtype, but by now we know they don't.

     original    apply_multislice1    apply_multislice2 apply_unislice apply_noslice   assigned
0  2016-01-01  1451606400000000000  1451606400000000000     2016-01-01    2016-01-01 2016-01-01
1  2016-01-01  1451606400000000000  1451606400000000000     2016-01-01    2016-01-01 2016-01-01
original                     object
apply_multislice1            object
apply_multislice2            object
apply_unislice       datetime64[ns]
apply_noslice        datetime64[ns]
assigned             datetime64[ns]

Recovery from the bug

If the bugged outcome in apply_multislice* is converted as nanoseconds and assigned back, pd.to_datetime recovers the values correctly, so at least nothing was lost.

# Continuing where we left off: Succeeds, but shouldn't:
assert (test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] == 1451606400000000000).all(axis=None)
# Call pd.to_datetime(unit='ns')
test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] = test_df.loc[:, ['apply_multislice1', 'apply_multislice2']].apply(pd.to_datetime, unit='ns')
# Operates as expected
assert (test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] == pd.to_datetime('2016-01-01')).all(axis=None)

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-137-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.4
pytest: 4.0.0
pip: 18.1
setuptools: 40.6.2
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

TomAugspurger (Contributor) commented:

@jesrael sorry I missed the fact that assignment was the buggy part.

In [48]: df = pd.DataFrame({"A": ['2015-01-01', '2015-01-02'], 'B': ['2015', '2016']})

In [49]: df2 = df.copy()

In [50]: df2.iloc[:, [0]] = pd.DataFrame({"A": pd.to_datetime(['2015', '2016'])})

In [51]: df2
Out[51]:
                     A     B
0  1420070400000000000  2015
1  1451606400000000000  2016

A few observations:

This doesn't occur when all of the columns are being updated:

In [64]: df2.iloc[:, [0, 1]] = pd.DataFrame({"A": pd.to_datetime(['2015', '2016']), 'B': pd.to_datetime(['2015', '2016'])})

In [65]: df2
Out[65]:
           A          B
0 2015-01-01 2015-01-01
1 2016-01-01 2016-01-01

The internal blocks are, unsurprisingly, incorrect:

In [69]: df2.iloc[:, [0]] = pd.DataFrame({"A": pd.to_datetime(['2015', '2016'])})

In [70]: df2._data
Out[70]:
BlockManager
Items: Index(['A', 'B'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
ObjectBlock: slice(0, 2, 1), 2 x 2, dtype: object

We'd like to split that object block so that the newly assigned column becomes a DatetimeBlock.
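
For comparison, a minimal illustration of the layout a correct assignment should end up with, assuming a pandas build of this era where ._data is still the BlockManager accessor (newer versions expose it as ._mgr); the exact block ordering in the repr may differ:

In [71]: df3 = pd.DataFrame({"A": pd.to_datetime(['2015', '2016']), 'B': ['2015', '2016']})

In [72]: df3._data
Out[72]:
BlockManager
Items: Index(['A', 'B'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
DatetimeBlock: slice(0, 1, 1), 1 x 2, dtype: datetime64[ns]
ObjectBlock: slice(1, 2, 1), 1 x 2, dtype: object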

@TomAugspurger TomAugspurger reopened this Nov 29, 2018
@TomAugspurger TomAugspurger modified the milestones: No action, Contributions Welcome Nov 29, 2018
@TomAugspurger TomAugspurger added the Indexing, Datetime, Dtype Conversions, and Difficulty Advanced labels and removed the Usage Question label Nov 29, 2018
shippy added a commit to ABCD-STUDY/fitbit-pipeline that referenced this issue Dec 13, 2018
@mroeschke mroeschke added the Bug label Apr 1, 2020
mroeschke (Member) commented:

This looks okay on master. Could use a test:

In [13]: df = pd.DataFrame({"A": ['2015-01-01', '2015-01-02'], 'B': ['2015', '2016']})
    ...: df2 = df.copy()
    ...: df2.iloc[:, [0]] = pd.DataFrame({"A": pd.to_datetime(['2015', '2016'])})

In [14]: df2
Out[14]:
                     A     B
0  2015-01-01 00:00:00  2015
1  2016-01-01 00:00:00  2016
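
For reference, a hypothetical sketch of what such a regression test could look like (the test actually merged in #46982 may differ; pandas._testing is the helper module pandas' own test suite uses):

import pandas as pd
import pandas._testing as tm


def test_iloc_setitem_datetime_frame_keeps_values():
    # GH#20511: assigning a datetime frame into a column subset must not
    # leave raw nanosecond integers behind.
    df = pd.DataFrame({"A": ["2015-01-01", "2015-01-02"], "B": ["2015", "2016"]})
    df.iloc[:, [0]] = pd.DataFrame({"A": pd.to_datetime(["2015", "2016"])})

    expected = pd.DataFrame(
        {
            # Per the output above, "A" holds Timestamps (still in an object
            # column); adjust the expected dtype if setitem semantics change.
            "A": pd.Series([pd.Timestamp("2015-01-01"), pd.Timestamp("2016-01-01")], dtype=object),
            "B": ["2015", "2016"],
        }
    )
    tm.assert_frame_equal(df, expected)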

@mroeschke mroeschke added the good first issue label and removed the Bug, Dtype Conversions, Indexing, and Datetime labels Jun 19, 2021
@mroeschke mroeschke added the Needs Tests label Jun 19, 2021
parthi-siva (Contributor) commented:

take

parthi-siva added a commit to parthi-siva/pandas that referenced this issue Mar 18, 2022
This test makes sure that when converting multiple columns to datetimes and assigning them back, they remain datetimes, not unix dates, as mentioned in GH pandas-dev#20511.
@jreback jreback modified the milestones: Contributions Welcome, 1.5 Mar 18, 2022
@jreback jreback added the Indexing label Mar 18, 2022
parthi-siva added a commit to parthi-siva/pandas that referenced this issue Mar 22, 2022
parthi-siva added a commit to parthi-siva/pandas that referenced this issue Mar 24, 2022
parthi-siva added a commit to parthi-siva/pandas that referenced this issue May 10, 2022