Rank(pct=True) behaves strangely on big data #23676

AShoydokova · 2018-11-13T21:40:53Z

Code Sample, a copy-pastable example if possible

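import pandas as pd

# 13 rows: the percentile ranks fall in (0, 1] as expected
smallData = pd.DataFrame({'a': [0]*10 + [1,2,3]})
print(smallData.a.rank(pct=True).tail())

# 100,000,003 rows: the returned "percentile" ranks exceed 1
bigData = pd.DataFrame({'a': [0]*100000000 + [1,2,3]})
print(bigData.a.rank(pct=True).tail())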

When I use pd.DataFrame().rank(pct=True) on small data (see the first example), it gives me percentile ranks between 0 and 1. However, when the data is big (second example), the values it returns are no longer percentages: they exceed 1. Maybe this is the expected output, but I just want to calculate percentiles on big data.


Output

8 0.423077
9 0.423077
10 0.846154
11 0.923077
12 1.000000

99999998 2.980232
99999999 2.980232
100000000 5.960465
100000001 5.960465
100000002 5.960465

Expected Output

I would expect something close to 0.5 for all 0 and something close to 1 for all other values
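
A possible workaround on affected versions, sketched here under the assumption that plain `rank()` (without `pct=True`) still returns correct float64 ranks, is to do the division by the row count yourself. The helper name `manual_pct_rank` is hypothetical and not part of pandas.

```python
import pandas as pd


def manual_pct_rank(s: pd.Series) -> pd.Series:
    """Percentile rank computed as average rank / number of non-null values.

    Hypothetical helper: it mirrors what rank(pct=True) returns for the
    default method='average', but performs the division in float64.
    """
    ranks = s.rank()          # average ranks, returned as float64
    return ranks / s.count()  # count() is an exact Python/NumPy integer


small = pd.Series([0] * 10 + [1, 2, 3])
print(manual_pct_rank(small).tail())
```

On the 13-row example this gives the same values as `rank(pct=True)`; the ten zeros share the average rank 5.5, so their percentile rank is 5.5 / 13.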

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


jschendel · 2018-11-14T00:15:32Z

I can confirm this on master. For a more exact cutoff, a Series of length 16777216 looks to produce valid results, and a Series of length 16777217 looks to be invalid:

In [1]: import pandas as pd; import numpy as np; pd.__version__
Out[1]: '0.24.0.dev0+989.g2d4dd508f'

In [2]: def check_rank_pct(n):
   ...:     data = np.concatenate([np.repeat(0, n-3), np.array([1, 2, 3])])
   ...:     return pd.Series(data).rank(pct=True).tail().values

In [3]: check_rank_pct(16777216)
Out[3]: array([ 0.49999994,  0.49999994,  0.99999988,  0.99999994,  1.        ])

In [4]: check_rank_pct(16777217)
Out[4]: array([ 0.49999997,  0.49999997,  0.99999994,  1.        ,  1.00000006])

This is interesting because 16777217 = 2**24 + 1, which is the first integer that can't be exactly represented with 32-bit floating point. I don't know this part of the codebase well enough to say where something like that comes into play, but hopefully this is a good clue for someone more knowledgeable. At the very least, if we're able to get this up to the 64-bit floating point limit, we'd increase the cutoff to 2**53 + 1.
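
The 32-bit limit mentioned above can be checked directly with NumPy. The sketch below also tests one hypothesis that is an assumption, not something confirmed in this thread: that the divisor used for `pct=True` is a float32 count that stops growing at 2**24. The helper `pct_with_stuck_count` is hypothetical and only illustrates the arithmetic.

```python
import numpy as np

LIMIT = 2**24  # 16777216

# float32 has a 24-bit significand, so 2**24 + 1 rounds back to 2**24,
# while float64 (53-bit significand) still represents it exactly.
print(np.float32(LIMIT + 1) == np.float32(LIMIT))  # True
print(np.float64(LIMIT + 1) == LIMIT + 1)          # True

# Adding 1.0 to a float32 that already holds 2**24 leaves it unchanged.
count = np.float32(LIMIT)
count = np.float32(count + np.float32(1.0))
print(count == LIMIT)                              # True


def pct_with_stuck_count(rank: float, n: int) -> float:
    """Hypothetical model: rank divided by a row count capped at 2**24."""
    return rank / min(n, LIMIT)


# 100,000,003 rows: the zeros share the average rank 50000000.5,
# and the last row has rank 100000003.
n = 100_000_003
print(round(pct_with_stuck_count(50_000_000.5, n), 6))  # 2.980232
print(round(pct_with_stuck_count(100_000_003, n), 6))   # 5.960465
```

Under that assumption the model reproduces the values reported in the original post, and for a Series of length 16777217 it gives 16777217 / 16777216, which is about 1.00000006 for the last row, in line with `Out[4]` above.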

jschendel · 2018-11-14T07:22:36Z

This looks to be a dupe of #18271, so closing in favor of that issue. Will open a PR to fix this soon though.

WillAyd · 2018-11-14T07:24:35Z

@jschendel was simultaneously looking at this. Should post a PR in a few minutes

jschendel · 2018-11-14T07:28:24Z

@WillAyd : Oops, just saw this after posting a PR of my own!

WillAyd · 2018-11-14T07:28:54Z

Ha no worries. Changes are exactly the same so gets to the same spot. Let's stick with yours

jschendel added the Bug, Numeric Operations, and Algos labels Nov 14, 2018

jschendel added this to the Contributions Welcome milestone Nov 14, 2018

jschendel closed this as completed Nov 14, 2018

jschendel added the Duplicate Report label Nov 14, 2018

jschendel modified the milestones: Contributions Welcome, No action Nov 14, 2018

jbencina mentioned this issue Sep 21, 2021 in "BUG: to_numeric incorrectly converting values >= 16,777,217 with downcast='float'" #43693 (closed)
