DataFrame.clip_upper does not preserve dtype per column #24162

Closed
joneugster opened this issue Dec 8, 2018 · 3 comments · Fixed by #24458

@joneugster

Code Sample

import pandas as pd

data = pd.DataFrame({'INT': [-1, 0, 10, 9],
                     'FLOAT': [-0.148, 0.2347, 38.237, 12.2233]},
                    index=pd.date_range("20180101 00:00", periods=4))

print('Original data:')
print(data.head())

print('\n(A) This is probably not a bug but my misunderstanding:')
print('(So how would I apply "clip_upper" inplace on parts of the dataframe?)')
data.loc[[True, True, True, False], ['INT']].clip_upper(8, inplace=True)
print(data.head())
# I then used:
# data.loc[[True, True, True, False], ['INT']] = data.loc[[True, True, True, False], ['INT']].clip_upper(8)

print('\n(B) It seems that clip_upper does not preserve the dtypes:')
print(data.clip_upper(8).head())

print('\n(C) Same for inplace:')
data.clip_upper(8, inplace=True)
print(data.head())
Output of this code:
Original data:
            INT    FLOAT
2018-01-01   -1  -0.1480
2018-01-02    0   0.2347
2018-01-03   10  38.2370
2018-01-04    9  12.2233

(A) This is probably not a bug but my misunderstanding:
(So how would I apply "clip_upper" inplace on parts of the dataframe?)
            INT    FLOAT
2018-01-01   -1  -0.1480
2018-01-02    0   0.2347
2018-01-03   10  38.2370
2018-01-04    9  12.2233

(B) It seems that clip_upper does not preserve the dtypes:
            INT   FLOAT
2018-01-01 -1.0 -0.1480
2018-01-02  0.0  0.2347
2018-01-03  8.0  8.0000
2018-01-04  8.0  8.0000

(C) Same for inplace:
            INT   FLOAT
2018-01-01 -1.0 -0.1480
2018-01-02  0.0  0.2347
2018-01-03  8.0  8.0000
2018-01-04  8.0  8.0000

Problem description

Calling clip_upper on a frame with both int and float columns converts the int column to float.

When calling data.clip_upper(8) with an integer threshold, I would expect the int column to stay integer and the float column to stay float. Instead, everything is converted to float (see (B) and (C)).

Moreover, clip_upper with inplace=True has no effect when applied to a .loc selection, but this might just be me misunderstanding the concept (see (A)).

Same for clip_lower.

Expected Output

For (A):

            INT    FLOAT
2018-01-01   -1  -0.1480
2018-01-02    0   0.2347
2018-01-03    8  38.2370
2018-01-04    9  12.2233
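
One way to arrive at the expected result for (A) is to assign the clipped values back through the same .loc selection, since the in-place call in the code sample acts on the temporary object returned by .loc rather than on data itself. The following is a minimal sketch, not a statement of how pandas is meant to be used here; it assumes a pandas version where Series.clip(upper=...) is available (clip_upper itself was deprecated in later releases), and the name mask is introduced only for illustration.

```python
import pandas as pd

data = pd.DataFrame(
    {"INT": [-1, 0, 10, 9], "FLOAT": [-0.148, 0.2347, 38.237, 12.2233]},
    index=pd.date_range("20180101 00:00", periods=4),
)

# Hypothetical helper name: selects the first three rows only.
mask = [True, True, True, False]

# Clip a copy of the selected values, then write them back into the frame.
data.loc[mask, "INT"] = data.loc[mask, "INT"].clip(upper=8)

print(data)
print(data.dtypes)
```

Because only integer values are written back into an integer column, the INT column should keep its integer dtype and the FLOAT column is not touched.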

For (B) and (C):

            INT   FLOAT
2018-01-01   -1 -0.1480
2018-01-02    0  0.2347
2018-01-03    8  8.0000
2018-01-04    8  8.0000

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.23.4
pytest: 4.0.1
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.11
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Dec 11, 2018

The example is very confusing and pulls in a lot of unnecessary elements. Please keep reports minimal in the future.

This is much easier to reproduce with a small sample:

In [65]: data = pd.DataFrame([[1, 2], [3, 4]], columns=['int1', 'int2'])
In [66]: data.clip_upper(1)
Out[66]: 
   int1  int2
0     1     1
1     1     1

In [67]: data['float'] = data['int1'].astype(float)                             

In [68]: data.clip_upper(1)                                                     
Out[68]: 
   int1  int2  float
0   1.0   1.0    1.0
1   1.0   1.0    1.0

The dtype should probably be preserved per column, though it appears that the mere presence of a float column casts the entire frame.

Investigation and PRs are always welcome.
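
As a possible workaround while the frame-level method upcasts, clipping each column separately keeps every column's own dtype. This is only a sketch under assumptions: it uses Series.clip(upper=...) in place of clip_upper (which was deprecated in later releases), and the name clipped is introduced here for illustration.

```python
import pandas as pd

data = pd.DataFrame([[1, 2], [3, 4]], columns=["int1", "int2"])
data["float"] = data["int1"].astype(float)

# Clip one column at a time so each Series keeps its own dtype,
# then reassemble the columns into a new frame.
clipped = pd.DataFrame({name: col.clip(upper=1) for name, col in data.items()})

print(clipped)
print(clipped.dtypes)
```

The trade-off is one clip call per column instead of a single block-wise operation, which is usually acceptable for frames with a modest number of columns.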

@WillAyd changed the title from "DataFrame.clip_upper does not preserve dtype" to "DataFrame.clip_upper does not preserve dtype per column" on Dec 11, 2018
@WillAyd added the labels Bug and Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) on Dec 11, 2018
@WillAyd added this to the Contributions Welcome milestone on Dec 11, 2018
@minggli
Contributor

minggli commented Dec 27, 2018

Hi @WillAyd 👋 ,

Happy to look at this issue 🐞 and will follow up with a PR 🚀 .

Thanks,

Ming

@cgangwar11
Contributor

cgangwar11 commented Dec 27, 2018

In [16]: data
Out[16]:
   int  float
0    1    2.0
1    3    4.0
In [17]: axes_dict = data._construct_axes_dict()
['index', 'columns'] None {'index': RangeIndex(start=0, stop=2, step=1), 'columns': Index(['int', 'float'], dtype='object')}

In [18]: result = data._constructor(data.values, **axes_dict).__finalize__(data)

In [19]: result
Out[19]:
   int  float
0  1.0    2.0
1  3.0    4.0

The underlying problem is the constructor call, which casts the dtype of the int column to float.
I will take a look at how the property decorator works and create a PR.
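
The same upcast can be observed without private helpers. The sketch below uses the public pd.DataFrame constructor as a stand-in for the _constructor call above (an assumption for illustration, not the internal code path itself): data.values is a single 2-D NumPy array, so mixed int and float columns have to share one common dtype before the frame is rebuilt.

```python
import pandas as pd

data = pd.DataFrame({"int": [1, 3], "float": [2.0, 4.0]})

# .values is one 2-D NumPy array, so the int and float columns
# are converted to a common dtype.
values = data.values
print(values.dtype)

# Rebuilding a frame from that array carries the common dtype to every column.
rebuilt = pd.DataFrame(values, index=data.index, columns=data.columns)
print(rebuilt.dtypes)
```

Any code path that round-trips a mixed frame through a single array in this way would lose the per-column dtypes, which matches the behaviour reported in this issue.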

@jreback modified the milestones: Contributions Welcome, 0.24.0 on Dec 28, 2018