dtype change when doing operations on dataframes #32822

thedimlebowski · 2020-03-19T09:28:33Z

Code Sample

    In [1]: import pandas as pd
    In [2]: import numpy as np
    In [3]: s = pd.Series([1,2,np.nan], dtype='Int64')
    In [4]: t = pd.Series([1,2,3], dtype='Int64')
    In [5]: s-t
    Out[5]:
    0       0
    1       0
    2    <NA>
    dtype: Int64
    In [6]: (s-t).dtype
    Out[6]: Int64Dtype()
    In [7]: s.to_frame() - t.to_frame()
    Out[7]:
          0
    0     0
    1     0
    2  <NA>
    In [8]: (s.to_frame() - t.to_frame())[0].dtype
    Out[8]: dtype('O')

Problem description

An arithmetic operation on two DataFrames changes the Int64 dtype to object, whereas the same operation on two Series keeps it (the expected behavior).

Expected Output

    # In [8]: (s.to_frame() - t.to_frame())[0].dtype
    # Out[8]: Int64Dtype()

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.3
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : 0.4.0
scipy : None
sqlalchemy : None
tables : None
tabulate : 0.8.6
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None


mproszewska · 2020-03-27T18:09:29Z

This happens because `sub` on DataFrames is executed differently when the DataFrames have just one column. For DataFrames with more columns, the result is calculated column by column, as on a sequence of Series.

>>> s = pd.DataFrame({'a': [1, 2, 3],'b':[1, 2, 3]}, dtype='Int64')
>>> t = pd.DataFrame({'a': [1, 2, np.nan],'b':[1, 2, np.nan]}, dtype='Int64')
>>> (s-t)['a'].dtype
Int64Dtype()

When the DataFrames contain only one column, `sub` is calculated on NumPy arrays, so the result array has object dtype (it mixes ints and NA).
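The difference between the two paths can be illustrated outside of pandas internals. The sketch below is only an approximation of the explanation above: it converts the Series to object-dtype NumPy arrays explicitly (an assumption made for the demonstration, not the actual DataFrame code path) and compares the result with plain Series arithmetic.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan], dtype="Int64")
t = pd.Series([1, 2, 3], dtype="Int64")

# Series path: the subtraction runs on the nullable integer arrays,
# so the Int64 dtype survives.
series_result = s - t
print(series_result.dtype)  # Int64

# Simulated NumPy path: once the values are plain object arrays,
# NumPy subtracts element by element and has no notion of Int64.
left = s.to_numpy(dtype=object)
right = t.to_numpy(dtype=object)
array_result = left - right
print(array_result.dtype)  # object
```

The missing value is still `pd.NA` in `array_result`, but the array itself no longer carries the nullable integer dtype.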

The function `ops.should_series_dispatch` decides whether an operation is evaluated column by column or not. I tried adding binary operations such as add, sub, and mul to the list of operations that are executed column by column on a DataFrame. Those binary operations work on Series (`Series._binop`), so it seems they should work on DataFrames too. However, the change makes multiple tests fail. I'm not sure how to fix that or how to approach this differently.

mproszewska · 2020-04-03T22:27:20Z

I have a few more observations. Series uses `pd.core.construction.extract_array` for binary operations, which casts a Series of ints to an IntegerArray, so the dtype is maintained. DataFrame casts its values to np.array, and the dtype is then determined from the values (hence the problems with nan).
IMO the best option would be to force DataFrame to execute binary operations column by column on its Series, but I'm open to discussion on that.
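As a user-level illustration of the column-by-column idea, here is a minimal sketch. The helper name `columnwise_sub` is hypothetical, it is not how pandas implements the operation, and it assumes both frames share the same unique column labels.

```python
import numpy as np
import pandas as pd


def columnwise_sub(left, right):
    """Subtract two frames one column at a time (hypothetical helper).

    Each column pair is subtracted as Series, so an extension dtype
    such as Int64 is handled by the Series arithmetic.
    """
    return pd.DataFrame({col: left[col] - right[col] for col in left.columns})


s = pd.Series([1, 2, np.nan], dtype="Int64")
t = pd.Series([1, 2, 3], dtype="Int64")

result = columnwise_sub(s.to_frame(), t.to_frame())
print(result[0].dtype)  # Int64
```

On affected versions this can serve as a workaround, at the cost of one Series operation per column.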

mroeschke · 2021-07-30T03:55:13Z

This looks to work on master now. Could use a test

In [11]: import pandas as pd
    ...: import numpy as np
    ...: s = pd.Series([1,2,np.nan], dtype='Int64')
    ...: t = pd.Series([1,2,3], dtype='Int64')
    ...: s-t
Out[11]:
0       0
1       0
2    <NA>
dtype: Int64

In [12]: (s-t).dtype
Out[12]: Int64Dtype()

In [13]: s.to_frame() - t.to_frame()
Out[13]:
      0
0     0
1     0
2  <NA>

In [14]: (s.to_frame() - t.to_frame())[0].dtype
Out[14]: Int64Dtype()

daksh-intwala · 2021-08-31T07:49:23Z

I can fix it, can I make a pull request?

mroeschke · 2021-08-31T18:26:03Z

Sure go for it @daksh-intwala

poolkit · 2021-10-06T09:24:58Z

It works fine on version 1.2.4.

vamsi231297 · 2022-02-22T06:58:51Z

Is this issue still pending, @mroeschke? I can see that it no longer occurs in newer versions.

mroeschke · 2022-02-22T17:43:54Z

We would need a unit test (or confirmation of an existing/similar one) to close out this issue to prevent regressions.
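A regression test of the kind requested here could look like the sketch below. It is built from the reproducer at the top of this issue and is not necessarily the test that was added to pandas; the test name is hypothetical, and it assumes a pandas version in which the bug is already fixed.

```python
import numpy as np
import pandas as pd
import pandas.testing as tm


def test_frame_sub_keeps_nullable_int_dtype():
    # Reproducer from the issue: single-column frames with Int64 dtype.
    s = pd.Series([1, 2, np.nan], dtype="Int64")
    t = pd.Series([1, 2, 3], dtype="Int64")

    result = s.to_frame() - t.to_frame()

    # The expected frame keeps the Int64 dtype, with <NA> in the last row.
    expected = pd.Series([0, 0, pd.NA], dtype="Int64").to_frame()
    tm.assert_frame_equal(result, expected)
```

`assert_frame_equal` checks dtypes by default, so the test fails if the column falls back to object.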

parthi-siva · 2022-03-26T14:12:59Z

take

WillAyd added Bug, ExtensionArray, NA - MaskedArrays and removed ExtensionArray labels Mar 19, 2020

mproszewska added a commit to mproszewska/pandas that referenced this issue Mar 27, 2020

BUG: Fix pandas-dev#32822 that does not pass tests

8b76a56

mroeschke added good first issue, Needs Tests and removed Bug labels Jul 30, 2021

github-actions bot assigned parthi-siva Mar 26, 2022

This was referenced Apr 22, 2022

TST: GH32822 data types in frame manip Aadharsh-Acharya/pandas#1

Merged

TST: GH32822 data types in frame manip #46836

Merged

jreback added this to the 1.5 milestone Apr 27, 2022

mroeschke closed this as completed in #46836 Jun 3, 2022
