dtype change when doing operations on dataframes #32822

Closed
thedimlebowski opened this issue Mar 19, 2020 · 9 comments · Fixed by Aadharsh-Acharya/pandas#1 or #46836
Labels: good first issue, NA - MaskedArrays (related to pd.NA and nullable extension arrays), Needs Tests (unit tests needed to prevent regressions)

@thedimlebowski

thedimlebowski commented Mar 19, 2020

Code Sample

    In [1]: import pandas as pd
    In [2]: import numpy as np
    In [3]: s = pd.Series([1,2,np.nan], dtype='Int64')
    In [4]: t = pd.Series([1,2,3], dtype='Int64')
    In [5]: s-t
    Out[5]:
    0       0
    1       0
    2    <NA>
    dtype: Int64
    In [6]: (s-t).dtype
    Out[6]: Int64Dtype()
    In [7]: s.to_frame() - t.to_frame()
    Out[7]:
          0
    0     0
    1     0
    2  <NA>
    In [8]: (s.to_frame() - t.to_frame())[0].dtype
    Out[8]: dtype('O')

Problem description

The Int64 dtype is changed when doing operations on two dataframes whereas it is kept (expected behavior) on series operations.

Expected Output

    # In [8]: (s.to_frame() - t.to_frame())[0].dtype
    # Out[8]: Int64Dtype()

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.3
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : 0.4.0
scipy : None
sqlalchemy : None
tables : None
tabulate : 0.8.6
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@WillAyd added the Bug, ExtensionArray, and NA - MaskedArrays labels and removed the ExtensionArray label on Mar 19, 2020
mproszewska added a commit to mproszewska/pandas that referenced this issue Mar 27, 2020
@mproszewska
Contributor

mproszewska commented Mar 27, 2020

This happens because the sub function is executed differently when the DataFrames have just one column. For DataFrames with more columns, the result is calculated column by column, as on a sequence of Series.

>>> s = pd.DataFrame({'a': [1, 2, 3],'b':[1, 2, 3]}, dtype='Int64')
>>> t = pd.DataFrame({'a': [1, 2, np.nan],'b':[1, 2, np.nan]}, dtype='Int64')
>>> (s-t)['a'].dtype
Int64Dtype()

When the DataFrames contain only one column, sub is calculated on NumPy arrays, so the dtype of the result array is object (the result of casting ints together with NA).
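As a rough illustration of why mixing ints with NA in a NumPy array ends up as object dtype, here is a minimal sketch. It is not the internal pandas code path; the variable names are made up for the example, and it only contrasts a plain NumPy array with the nullable `Int64` extension array.

```python
import numpy as np
import pandas as pd

# NumPy has no integer dtype that can represent pd.NA, so a plain array
# built from ints plus pd.NA falls back to object dtype.
as_numpy = np.array([0, 0, pd.NA])
print(as_numpy.dtype)   # object

# The nullable extension array stores the values together with a mask,
# so it keeps the Int64 dtype even with a missing value.
as_masked = pd.array([0, 0, pd.NA], dtype="Int64")
print(as_masked.dtype)  # Int64
```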

The function ops.should_series_dispatch decides whether an operation is evaluated column by column or not. I tried adding binary operations such as add, sub, and mul to the list of operations that are executed column by column on a DataFrame. These binary operations work on Series (Series._binop), so it seems they should work on DataFrames too. However, this change causes multiple tests to fail. I'm not sure how to fix that or how to approach the problem differently.

@mproszewska
Contributor

I have a few more observations. Series uses pd.core.construction.extract_array for binary operations, which converts a Series of ints to an IntegerArray, so the dtype is maintained. DataFrame casts its values to np.array, and the dtype is inferred from the values (hence the problems with nan).
IMO the best option would be to force DataFrame to execute binary operations column by column on its Series, but I'm open to discussion on that.
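To make the column-by-column idea concrete, here is a user-level sketch, not a patch to pandas internals. The helper `columnwise_binop` is hypothetical and assumes both frames share the same columns and index; it simply applies the operator to each pair of Series and reassembles the result.

```python
import operator

import numpy as np
import pandas as pd


def columnwise_binop(left, right, op):
    """Hypothetical helper: apply `op` to each pair of columns as Series.

    Assumes `left` and `right` have identical columns and index.
    """
    return pd.DataFrame(
        {col: op(left[col], right[col]) for col in left.columns},
        index=left.index,
    )


s = pd.Series([1, 2, np.nan], dtype="Int64")
t = pd.Series([1, 2, 3], dtype="Int64")

result = columnwise_binop(s.to_frame(), t.to_frame(), operator.sub)
print(result[0].dtype)  # Int64
```

Because each column goes through the Series code path, the nullable dtype is preserved regardless of how many columns the frames have.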

@mroeschke
Member

This appears to work on master now. It could use a test.

    In [11]: import pandas as pd
        ...: import numpy as np
        ...: s = pd.Series([1,2,np.nan], dtype='Int64')
        ...: t = pd.Series([1,2,3], dtype='Int64')
        ...: s-t
    Out[11]:
    0       0
    1       0
    2    <NA>
    dtype: Int64

    In [12]: (s-t).dtype
    Out[12]: Int64Dtype()

    In [13]: s.to_frame() - t.to_frame()
    Out[13]:
          0
    0     0
    1     0
    2  <NA>

    In [14]: (s.to_frame() - t.to_frame())[0].dtype
    Out[14]: Int64Dtype()

@mroeschke added the good first issue and Needs Tests labels and removed the Bug label on Jul 30, 2021
@daksh-intwala

I can fix it, can I make a pull request?

@mroeschke
Member

Sure go for it @daksh-intwala

@poolkit

poolkit commented Oct 6, 2021

It works fine on version 1.2.4.

@vamsi231297

Is this issue still pending, @mroeschke? I no longer see it in the newer version.

@mroeschke
Member

We would need a unit test (or confirmation of an existing/similar one) to close out this issue to prevent regressions.
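A regression test along these lines might look like the sketch below. It is only a suggestion: the test name is made up, it uses the public `pandas.testing` module rather than the internal test helpers, and it assumes a pandas version in which the behavior is already fixed.

```python
import numpy as np
import pandas as pd
import pandas.testing as pdt


def test_frame_sub_preserves_nullable_int_dtype():
    # Single-column frames built from the Series in the original report.
    s = pd.Series([1, 2, np.nan], dtype="Int64")
    t = pd.Series([1, 2, 3], dtype="Int64")

    result = s.to_frame() - t.to_frame()

    # The frame result should match the Series result, including the dtype.
    expected = pd.Series([0, 0, pd.NA], dtype="Int64").to_frame()
    pdt.assert_frame_equal(result, expected)
    assert result[0].dtype == pd.Int64Dtype()
```

Running it with pytest on a pandas version that still has the bug (such as the 1.0.3 release reported above) would be expected to fail on the dtype comparison.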

@parthi-siva
Contributor

take
