column type changes when assigning with loc and values #26779

Closed
blinkseb opened this issue Jun 11, 2019 · 8 comments · Fixed by #42725
Labels: good first issue; Indexing (related to indexing on series/frames, not to indexes themselves); Needs Tests (unit test(s) needed to prevent regressions)

Comments

@blinkseb

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy

In [34]: df = pd.DataFrame(
    ...:     data={
    ...:         'channel': [1, 2, 3],
    ...:         'A': ['String 1', numpy.nan, 'String 2'],
    ...:         'B': [pd.Timestamp('2019-06-11 11:00:00'), numpy.nan, pd.Timestamp('2019-06-11 12:00:00')]
    ...:     }
    ...: )

In [35]: df                                                                                                                                                                                   
Out[35]: 
   channel         A                   B
0        1  String 1 2019-06-11 11:00:00
1        2       NaN                 NaT
2        3  String 2 2019-06-11 12:00:00

In [36]: df2 = pd.DataFrame( 
    ...:     data={'A': ['String 3'], 'B': [pd.Timestamp('2019-06-11 12:00:00')]} 
    ...: )                                                                                                                                                                                    

In [37]: df2                                                                                                                                                                                  
Out[37]: 
          A                   B
0  String 3 2019-06-11 12:00:00

In [38]: df.loc[df['A'].isna() , ['A', 'B']] = df2.values                                                                                                                                     

In [39]: df                                                                                                                                                                                   
Out[39]: 
   channel         A                    B
0        1  String 1  1560250800000000000
1        2  String 3  2019-06-11 12:00:00
2        3  String 2  1560254400000000000

Problem description

Hello,

Assigning new values to rows using .loc and .values changes the type of the column (note how column B, which was datetime64, is now a mix of integers and datetimes).

Note: the exact same code produces the expected output with pandas 0.22.0 and 0.23.4, but broke in pandas 0.24.0. Also, if I remove the A column, the output is correct (no unwanted datetime-to-integer conversion).

Thanks for your help!

Expected Output

Out[39]: 
   channel         A                    B
0        1  String 1  2019-06-11 11:00:00
1        2  String 3  2019-06-11 12:00:00
2        3  String 2  2019-06-11 12:00:00

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 5.1.7-300.fc30.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: None
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@thiviyanT
Contributor

Hi @blinkseb

It seems that changing one of those lines to the following makes it work:

df.loc[df['A'].isna(), ['A', 'B']] = df2.values[0]

With df.loc[df['A'].isna(), ['A', 'B']] = df2.values, you are assigning a list of rows (i.e. df2.values) to each matching row, as opposed to just one row. In your case df2 has just one row, but pandas still does not seem to infer the datatypes from the list correctly.

In [10]: df.info()
Out[10]: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
channel    3 non-null int64
A          3 non-null object
B          3 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 152.0+ bytes
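To make the difference between the two right-hand sides concrete, here is a minimal sketch that only inspects `df2.values` and `df2.values[0]`; it does not reproduce the bug itself. The variable names `two_d` and `one_d` are introduced here for illustration and are not from the original report.

```python
import pandas as pd

df2 = pd.DataFrame(
    data={"A": ["String 3"], "B": [pd.Timestamp("2019-06-11 12:00:00")]}
)

# Mixed string/datetime columns have no common NumPy dtype,
# so .values falls back to a 2-D object array (one inner row per frame row).
two_d = df2.values

# Indexing the first row yields a 1-D object array holding the row's values.
one_d = df2.values[0]

print(two_d.shape, two_d.dtype)
print(one_d.shape, one_d.dtype)
print(type(one_d[1]))  # the datetime survives as a pd.Timestamp object
```

In both cases the timestamp is carried as a Python-level `pd.Timestamp` inside an object array, so pandas has to re-infer the datetime type when the values are written back into column B.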

Sorry, I cannot provide an in-depth explanation of why this is the case, as I do not fully understand how pandas works under the hood (yet). However, I don't think this is a bug.


Tested using pandas v0.24.2

@blinkseb
Author

Thanks a lot for the feedback! In this particular case, yes, there's a single row in df2, but in my real use case I have more. I'm basically trying to replace the values of N columns with values from another dataframe. If there's another solution that doesn't involve using values, I'd be happy to use it 😄
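One possible way to avoid `.values` on the whole frame, offered as a sketch rather than a recommendation from the thread, is to assign one column at a time so that each column's array keeps its own dtype instead of being interleaved into an object array. This assumes the number of masked rows in `df` equals the number of rows in `df2` and that rows are matched by position; the names `mask` and `col` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data={
        "channel": [1, 2, 3],
        "A": ["String 1", np.nan, "String 2"],
        "B": [
            pd.Timestamp("2019-06-11 11:00:00"),
            pd.NaT,
            pd.Timestamp("2019-06-11 12:00:00"),
        ],
    }
)
df2 = pd.DataFrame(
    data={"A": ["String 3"], "B": [pd.Timestamp("2019-06-11 12:00:00")]}
)

# Rows of df to overwrite; assumed to match df2 row-for-row by position.
mask = df["A"].isna()

# Assign column by column: each right-hand side is a 1-D array with that
# column's own dtype, so the datetime column is never mixed with strings.
for col in ["A", "B"]:
    df.loc[mask, col] = df2[col].to_numpy()

print(df)
print(df.dtypes)
```

Because `to_numpy()` drops the index, the assignment is positional; if the two frames should instead be matched on labels, the indexes would need to be aligned first.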

Thanks!

@TomAugspurger
Contributor

Looks like a casting bug. I would recommend looking in Block.setitem if you want to debug further.

I'm not sure what the expected behavior is here for the output df.B.dtype. I don't know if that should be converted to object dtype, because the array is object-type, or if we've split to just assigning a scalar by that point.

TomAugspurger commented Jun 11, 2019 via email

jbrockmendel added the Indexing label Jul 23, 2019

@phofl
Member

phofl commented Nov 8, 2020

This returns

   channel         A                    B
0        1  String 1  2019-06-11 11:00:00
1        2  String 3  2019-06-11 12:00:00
2        3  String 2  2019-06-11 12:00:00

now. Seems to be fixed.

phofl added the Needs Tests label Nov 8, 2020

@shiv-io
Contributor

shiv-io commented May 22, 2021

take

@shiv-io
Contributor

shiv-io commented May 22, 2021

@phofl I'd like to work on adding a test for this. Is pandas/tests/frame/methods/test_dtypes.py the correct module to add the test to? Thanks!

@phofl
Member

phofl commented May 22, 2021

This should go somewhere into frame/indexing
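For reference, a regression test for this issue could look roughly like the sketch below. The function name `test_loc_setitem_values_keeps_datetime_dtype` is hypothetical, and this is not necessarily the test that was eventually merged; it simply replays the report's example and checks the expected output, assuming a pandas version in which the behaviour is fixed.

```python
import numpy as np
import pandas as pd
import pandas.testing as tm


def test_loc_setitem_values_keeps_datetime_dtype():
    # Hypothetical test name; replays the example from the issue report.
    df = pd.DataFrame(
        {
            "channel": [1, 2, 3],
            "A": ["String 1", np.nan, "String 2"],
            "B": [
                pd.Timestamp("2019-06-11 11:00:00"),
                pd.NaT,
                pd.Timestamp("2019-06-11 12:00:00"),
            ],
        }
    )
    df2 = pd.DataFrame(
        {"A": ["String 3"], "B": [pd.Timestamp("2019-06-11 12:00:00")]}
    )

    # The assignment that produced integers in column B on pandas 0.24.
    df.loc[df["A"].isna(), ["A", "B"]] = df2.values

    expected = pd.DataFrame(
        {
            "channel": [1, 2, 3],
            "A": ["String 1", "String 3", "String 2"],
            "B": [
                pd.Timestamp("2019-06-11 11:00:00"),
                pd.Timestamp("2019-06-11 12:00:00"),
                pd.Timestamp("2019-06-11 12:00:00"),
            ],
        }
    )
    # Compares values and dtypes, so a datetime-to-integer cast would fail.
    tm.assert_frame_equal(df, expected)
```

Under pytest the function is collected automatically; it can also be called directly, in which case it returns silently on success and raises `AssertionError` on a mismatch.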

shiv-io added a commit to shiv-io/pandas that referenced this issue Jul 26, 2021
jreback added this to the 1.4 milestone Jul 26, 2021