column type changes when assigning with loc and values #26779

blinkseb · 2019-06-11T08:56:54Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy

In [34]: df = pd.DataFrame( 
    ...:     data={ 
    ...:         'channel': [1, 2, 3], 
    ...:         'A': ['String 1', numpy.NaN, 'String 2'], 
    ...:         'B': [pd.Timestamp('2019-06-11 11:00:00'), numpy.NaN, pd.Timestamp('2019-06-11 12:00:00')] 
    ...:     } 
    ...: )                                                                                                                                                                                    

In [35]: df                                                                                                                                                                                   
Out[35]: 
   channel         A                   B
0        1  String 1 2019-06-11 11:00:00
1        2       NaN                 NaT
2        3  String 2 2019-06-11 12:00:00

In [36]: df2 = pd.DataFrame( 
    ...:     data={'A': ['String 3'], 'B': [pd.Timestamp('2019-06-11 12:00:00')]} 
    ...: )                                                                                                                                                                                    

In [37]: df2                                                                                                                                                                                  
Out[37]: 
          A                   B
0  String 3 2019-06-11 12:00:00

In [38]: df.loc[df['A'].isna() , ['A', 'B']] = df2.values                                                                                                                                     

In [39]: df                                                                                                                                                                                   
Out[39]: 
   channel         A                    B
0        1  String 1  1560250800000000000
1        2  String 3  2019-06-11 12:00:00
2        3  String 2  1560254400000000000

Problem description

Hello,

Assigning new values to rows using .loc and .values changes the dtype of the column: column B, which was datetime64, is now a mix of integers and datetimes.

Note: the exact same code produces the expected output with pandas 0.22.0 and 0.23.4, but broke in pandas 0.24.0. Also, if I remove the A column, the output is correct (no unwanted datetime-to-integer conversion).

Thanks for your help!

Expected Output

Out[39]: 
   channel         A                    B
0        1  String 1  2019-06-11 11:00:00
1        2  String 3  2019-06-11 12:00:00
2        3  String 2  2019-06-11 12:00:00

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 5.1.7-300.fc30.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: None
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


thiviyanT · 2019-06-11T14:59:21Z

Hi @blinkseb

It seems that changing one of those lines to the following makes it work:

df.loc[df['A'].isna(), ['A', 'B']] = df2.values[0]

With df.loc[df['A'].isna(), ['A', 'B']] = df2.values, you are assigning a list of rows (i.e. df2.values) to the matching rows, as opposed to just one row. In your case df2 has only one row, but pandas still does not seem to infer the dtypes from the list correctly.

In [10]: df.info()
Out[10]: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
channel    3 non-null int64
A          3 non-null object
B          3 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 152.0+ bytes
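To make the difference between the two right-hand sides concrete, here is a minimal sketch that only inspects the arrays involved; it does not reproduce the bug itself. The variable names `values_2d` and `first_row` are illustrative and do not appear in the original report.

```python
import pandas as pd

df2 = pd.DataFrame(
    {"A": ["String 3"], "B": [pd.Timestamp("2019-06-11 12:00:00")]}
)

# .values on a frame with mixed column types yields one 2-D object array:
# one row per DataFrame row, one column per DataFrame column.
values_2d = df2.values

# Indexing with [0] picks a single row, giving a 1-D array of length 2.
first_row = df2.values[0]

print(values_2d.shape, values_2d.dtype)
print(first_row.shape)
print(type(first_row[1]))  # the datetime is stored as a Timestamp object
```

In both cases the datetime travels inside an object-dtype array, so the dtype of the target column has to be worked out again during assignment.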

Sorry, I cannot provide an in-depth explanation of why this is the case, as I do not fully understand how pandas works under the hood (yet). However, I don't think this is a bug.

Tested using pandas v0.24.2

blinkseb · 2019-06-11T15:04:23Z

Thanks a lot for the feedback! In this particular case, yes, there's a single row in df2, but in my real use case I have more. I'm basically trying to replace the values of N columns with values from another dataframe. If there's another solution that doesn't involve values, I'd be happy to use it 😄

Thanks!

TomAugspurger · 2019-06-11T15:13:37Z

Looks like a casting bug. I would recommend looking in Block.setitem if you want to debug further.

I'm not sure what the expected behavior is here for the output df.B.dtype. I don't know if that should be converted to object dtype, because the array is object-type, or if we've split to just assigning a scalar by that point.

TomAugspurger · 2019-06-11T15:14:34Z

Are you replacing missing values in your real example? In that case df.combine_first(other) may work.
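As a rough sketch of the combine_first suggestion, assuming the replacement rows can be given the same index labels as the rows they should fill: combine_first aligns on the index and only fills positions that are missing in the caller. The frame name `other`, the choice of index=[1], and the restriction to columns A and B are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "channel": [1, 2, 3],
        "A": ["String 1", np.nan, "String 2"],
        "B": [
            pd.Timestamp("2019-06-11 11:00:00"),
            pd.NaT,
            pd.Timestamp("2019-06-11 12:00:00"),
        ],
    }
)

# Assumption: the replacement row carries the index label (1) of the row
# it is meant to fill, because combine_first matches rows by label.
other = pd.DataFrame(
    {"A": ["String 3"], "B": [pd.Timestamp("2019-06-11 12:00:00")]},
    index=[1],
)

# Fill only the missing entries of A and B; existing values are kept.
filled = df[["A", "B"]].combine_first(other)
print(filled)
print(filled.dtypes)
```

Because no intermediate object array from .values is involved, column B is not routed through a mixed-type array on its way into the frame.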


phofl · 2020-11-08T23:46:07Z

This returns

   channel         A                    B
0        1  String 1  2019-06-11 11:00:00
1        2  String 3  2019-06-11 12:00:00
2        3  String 2  2019-06-11 12:00:00

now. Seems to be fixed.

shiv-io · 2021-05-22T08:56:14Z

take

shiv-io · 2021-05-22T08:57:04Z

@phofl I'd like to work on adding a test for this. Is pandas/tests/frame/methods/test_dtypes.py the correct module to add the test to? Thanks!

phofl · 2021-05-22T11:08:29Z

This should go somewhere into frame/indexing
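A regression test for this report could look roughly like the sketch below. The function name `check_loc_values_keeps_datetime` is hypothetical and the body is not the test that was eventually merged; it only restates the reproducer from the top of the thread with assertions, and it assumes a pandas version in which the behaviour is fixed, as reported above.

```python
import numpy as np
import pandas as pd


def check_loc_values_keeps_datetime():
    # Hypothetical helper; mirrors the reproducer from the original report.
    df = pd.DataFrame(
        {
            "channel": [1, 2, 3],
            "A": ["String 1", np.nan, "String 2"],
            "B": [
                pd.Timestamp("2019-06-11 11:00:00"),
                pd.NaT,
                pd.Timestamp("2019-06-11 12:00:00"),
            ],
        }
    )
    df2 = pd.DataFrame(
        {"A": ["String 3"], "B": [pd.Timestamp("2019-06-11 12:00:00")]}
    )

    # Overwrite A and B with df2.values wherever A is missing.
    df.loc[df["A"].isna(), ["A", "B"]] = df2.values

    # Column B should still be datetime-like, with no integers mixed in.
    assert pd.api.types.is_datetime64_any_dtype(df["B"])
    assert df.loc[0, "B"] == pd.Timestamp("2019-06-11 11:00:00")
    assert df.loc[1, "B"] == pd.Timestamp("2019-06-11 12:00:00")
    assert df.loc[1, "A"] == "String 3"
    return df


result = check_loc_values_keeps_datetime()
```

In the pandas test suite itself, such a check would more likely build an expected frame and compare the two with the project's frame-equality helper.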

jbrockmendel added the Indexing (Related to indexing on series/frames, not to indexes themselves) label Jul 23, 2019

phofl added the Needs Tests (Unit test(s) needed to prevent regressions) label Nov 8, 2020

github-actions bot assigned shiv-io May 22, 2021

mroeschke added the good first issue label Jul 10, 2021

shiv-io added a commit to shiv-io/pandas that referenced this issue Jul 26, 2021

TST GH pandas-dev#26779

4c4bcf9

shiv-io mentioned this issue Jul 26, 2021

TST: test_column_types_consistent #42725

Merged

4 tasks

jreback added this to the 1.4 milestone Jul 26, 2021

jreback closed this as completed in #42725 Jul 28, 2021
