regression in 0.24: np.clip does not work for some values of a_min #25066

sam-s opened this issue Feb 1, 2019 · 6 comments
Labels: Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Bug, Regression (functionality that used to work in a prior pandas version)

sam-s commented Feb 1, 2019

Code Sample

import pandas as pd
import numpy as np
print(pd.__name__,pd.__version__,np.__name__,np.__version__)
np.clip(pd.Series([0,1,0,1]), 1e-7, 1)
np.clip(pd.Series([0,1,0,1]), 1e-8, 1)
np.clip(pd.Series([0.0,1.0,0.0,1.0]), 1e-8, 1)

Problem description

With pandas 0.23.4, numpy.clip works as expected, producing a Series of type float64 for any min argument.

With pandas 0.24.0, it works for a_min values down to 1e-7 but not for 1e-8, where the Series comes back unclipped as int64:

>>> np.clip(pd.Series([0,1,0,1]), 1e-7, 1)
0    1.000000e-07
1    1.000000e+00
2    1.000000e-07
3    1.000000e+00
dtype: float64
>>> np.clip(pd.Series([0,1,0,1]), 1e-8, 1)
0    0
1    1
2    0
3    1
dtype: int64

This is an unwelcome change from the previous version.

Expected Output

>>> np.clip(pd.Series([0,1,0,1]), 1e-7, 1)
0    1.000000e-07
1    1.000000e+00
2    1.000000e-07
3    1.000000e+00
dtype: float64
>>> np.clip(pd.Series([0,1,0,1]), 1e-8, 1)
0    1.000000e-08
1    1.000000e+00
2    1.000000e-08
3    1.000000e+00
dtype: float64
>>> np.clip(pd.Series([0,1,0,1]), 1e-15, 1)
0    1.000000e-15
1    1.000000e+00
2    1.000000e-15
3    1.000000e+00
dtype: float64

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: C
LOCALE: en_US.UTF-8

pandas: 0.24.0
pytest: None
pip: 19.0.1
setuptools: 40.7.0
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: 1.1.2
lxml.etree: 4.3.0
bs4: None
html5lib: None
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jorisvandenbossche
Member

@sam-s thanks for the report! I can confirm the issue.
I suppose this is related to trying to preserve the dtype: #24458.

IMO it would make sense to follow numpy behaviour here to only preserve the dtype if the passed lower/upper match the dtype.
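As a small illustration of the numpy behaviour referred to here (a sketch on a plain ndarray, assuming a recent NumPy where np.clip follows the usual promotion rules; the variable names are made up for the example): integer bounds keep the integer dtype, while a float bound promotes the result to float64.

```python
import numpy as np

ints = np.array([0, 1, 0, 1], dtype=np.int64)

# Integer bounds fit the array dtype, so the integer dtype is preserved.
same_kind = np.clip(ints, 0, 1)

# A float bound does not fit an integer dtype, so the result is promoted.
promoted = np.clip(ints, 1e-8, 1)

print(same_kind.dtype, promoted.dtype)
print(promoted)
```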

@jorisvandenbossche jorisvandenbossche added this to the 0.24.1 milestone Feb 1, 2019
@gfyoung gfyoung added Regression Functionality that used to work in a prior pandas version Numeric Operations Arithmetic, Comparison, and Logical operations labels Feb 2, 2019
@gfyoung
Member

gfyoung commented Feb 2, 2019

@sam-s @jorisvandenbossche : I tracked down the regression, and it boils down to this:

When tiny floats and integers are involved in .where, numpy uses the exact numbers and does not round. That's why your examples worked before #24458.

In pandas, in the post-processing portion of .where, we actually attempt to downcast the result to int if the values are "sufficiently" close to their integer counterparts (via np.allclose). Turns out you hit the tolerance on the nose with 1e-8 (we pass in atol=1e-8). Thus, the result of our internal where functionality is correct, but it is the post-processing of said result that is the issue.

Relevant files: generic.py, managers.py, blocks.py --> cast.py (downcasting occurs at block level)
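The mechanism described above can be reproduced with plain numpy. This is only a sketch, not the pandas code path: the helper name where_then_check is made up, and rtol=0 is an assumption (the comment only mentions atol=1e-8).

```python
import numpy as np

values = np.array([0, 1, 0, 1], dtype=np.int64)

def where_then_check(lower):
    # Step 1: the "where" step keeps the tiny float exactly; result is float64.
    clipped = np.where(values < lower, lower, values)
    # Step 2: candidate downcast back to the integer dtype (truncates 1e-8 to 0).
    as_int = clipped.astype(np.int64)
    # Step 3: tolerance check; atol=1e-8 as quoted above, rtol=0 is assumed.
    close = bool(np.allclose(as_int, clipped, rtol=0, atol=1e-8))
    return clipped, close

clipped_8, close_8 = where_then_check(1e-8)  # difference equals the tolerance
clipped_7, close_7 = where_then_check(1e-7)  # difference exceeds the tolerance
print(close_8, close_7)
```

A difference of exactly 1e-8 passes the "less than or equal to atol" test, so the float result is considered safe to downcast, whereas 1e-7 fails it and stays float.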

@gfyoung
Member

gfyoung commented Feb 2, 2019

Reading the docs, this rounding / downcasting behavior is not documented in the docstring for .where, and while it won't manifest itself in many situations, it does in this one. Thus, there are a couple of options here on how to proceed:

  1. Revert BUG: clip doesn't preserve dtype by column #24458 - Not sure I really want to do this though.

  2. Don't downcast results in post-processing. It's not documented, but I see the benefits (memory).

  3. Stricter downcasting procedures so that, like numpy, we don't round results at all. This, however, would involve modifying maybe_downcast_to_dtype to make this work.

Thoughts?

cc @jreback
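To make option 3 concrete, here is a minimal sketch of what a stricter check could look like. The helper downcast_if_exact is hypothetical and is not how maybe_downcast_to_dtype is written; it only shows the idea of replacing the tolerance test with exact equality.

```python
import numpy as np

def downcast_if_exact(result, dtype):
    """Hypothetical helper: cast to ``dtype`` only if no value changes."""
    candidate = result.astype(dtype)
    # Exact comparison instead of a tolerance: 1e-8 != 0, so nothing is rounded.
    if np.array_equal(candidate, result):
        return candidate
    return result

whole = downcast_if_exact(np.array([0.0, 1.0, 0.0, 1.0]), np.int64)
tiny = downcast_if_exact(np.array([1e-8, 1.0, 1e-8, 1.0]), np.int64)
print(whole.dtype, tiny.dtype)
```

Whole-valued floats would still be downcast (keeping the memory benefit), while any value that does not survive the round trip keeps the float result.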

@TomAugspurger
Contributor

Pushing to 0.24.2, since this looks a little tricky.

@TomAugspurger TomAugspurger modified the milestones: 0.24.1, 0.24.2 Feb 2, 2019
@jorisvandenbossche
Member

@gfyoung I didn't look into this in detail, but from your explanation, I was thinking: couldn't we check the type of the scalar clip value? If the clipping value you provide is a float, then I would say it is OK that the output is also float (even if your original column was int).

On the other hand, long term, we could also try to go to a behaviour where we always preserve the dtype (basically meaning we would cast the clip value to the dtype of the column, so any float clip value would be converted to int). So if you want to clip integers with a clip value like 0.5, you first need to cast them to floats.
But that would require a deprecation cycle (if we would want this of course).
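The two behaviours discussed in this comment can be contrasted on a plain ndarray. This is a sketch only: clip_preserve_dtype is a hypothetical helper, not a pandas API, and it assumes the "always preserve" variant would simply cast the bounds to the column dtype.

```python
import numpy as np

def clip_preserve_dtype(values, lower, upper):
    """Hypothetical: cast the bounds to the array dtype so the dtype never changes."""
    lo = np.asarray(lower).astype(values.dtype)  # 0.5 becomes 0 for an int array
    hi = np.asarray(upper).astype(values.dtype)
    return np.clip(values, lo, hi)

ints = np.array([0, 1, 2, 3], dtype=np.int64)

# Float bound decides the output: result is float64 and 0 is raised to 0.5.
follow_bounds = np.clip(ints, 0.5, 2)

# Bounds cast to int first: dtype stays int64, but the 0.5 bound is lost.
preserved = clip_preserve_dtype(ints, 0.5, 2)

print(follow_bounds, follow_bounds.dtype)
print(preserved, preserved.dtype)
```

The second variant shows why a caller who really wants a bound like 0.5 would first have to cast the integers to floats.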

@jorisvandenbossche
Member

> a behaviour where we always preserve the dtype

Although, an argument against it: numpy doesn't do this (but numpy does always produce a dtype compatible with the clipping values).

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.24.2, 0.24.3 Mar 11, 2019
@jreback jreback modified the milestones: 0.24.3, Contributions Welcome Apr 20, 2019
@mroeschke mroeschke added the Bug label May 11, 2020
@mroeschke mroeschke added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff and removed Numeric Operations Arithmetic, Comparison, and Logical operations labels Jun 26, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022