Rank(pct=True) behaves strangely on big data #23676

Closed
AShoydokova opened this issue Nov 13, 2018 · 5 comments
Labels: Algos, Bug, Duplicate Report, Numeric Operations

Comments

AShoydokova commented Nov 13, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

smallData = pd.DataFrame({'a': [0]*10 + [1,2,3]})
print(smallData.a.rank(pct=True).tail())

bigData = pd.DataFrame({'a': [0]*100000000 + [1,2,3]})
print(bigData.a.rank(pct=True).tail())

When I use pd.DataFrame().rank(pct=True) on small data (see the first example), it returns percentile ranks between 0 and 1. However, when the data is big, the returned values are no longer percentages: they exceed 1. Maybe this is the expected output, but I just want to calculate percentiles on big data.

Output

8 0.423077
9 0.423077
10 0.846154
11 0.923077
12 1.000000

99999998 2.980232
99999999 2.980232
100000000 5.960465
100000001 5.960465
100000002 5.960465

Expected Output

I would expect something close to 0.5 for all 0 and something close to 1 for all other values
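
For reference, the expectation above can be made concrete with a minimal pure-Python sketch of a percentile rank: the average rank of each value divided by the number of values. This is an illustration of the expected arithmetic only, not pandas' implementation; the helper name `pct_rank_average` is hypothetical, and missing values are not handled.

```python
def pct_rank_average(values):
    """Average-method rank of each value divided by the number of values."""
    n = len(values)
    # Indices of the values in ascending order.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Tied values share the mean of their 1-based positions.
        average = ((start + 1) + (end + 1)) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = average
        start = end + 1
    return [r / n for r in ranks]


small = [0] * 10 + [1, 2, 3]
tail = [round(p, 6) for p in pct_rank_average(small)[-5:]]
print(tail)  # [0.423077, 0.423077, 0.846154, 0.923077, 1.0]
```

By the same arithmetic, 100000000 zeros followed by 1, 2, 3 should give (100000001 / 2) / 100000003 for every zero, which is just under 0.5, and values approaching 1 for the last three rows.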

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jschendel (Member)

I can confirm this on master. For a more exact cutoff, a Series of length 16777216 appears to produce valid results, while a Series of length 16777217 appears to produce invalid ones:

In [1]: import pandas as pd; import numpy as np; pd.__version__
Out[1]: '0.24.0.dev0+989.g2d4dd508f'

In [2]: def check_rank_pct(n):
   ...:     data = np.concatenate([np.repeat(0, n-3), np.array([1, 2, 3])])
   ...:     return pd.Series(data).rank(pct=True).tail().values

In [3]: check_rank_pct(16777216)
Out[3]: array([ 0.49999994,  0.49999994,  0.99999988,  0.99999994,  1.        ])

In [4]: check_rank_pct(16777217)
Out[4]: array([ 0.49999997,  0.49999997,  0.99999994,  1.        ,  1.00000006])

This is interesting because 16777217 = 2**24 + 1, which is the smallest positive integer that can't be represented exactly in 32-bit floating point. I don't know this part of the codebase well enough to say where something like that comes into play, but hopefully this is a good clue for someone more knowledgeable. At the very least, if we're able to get this up to the 64-bit floating point limit, we'd increase the cutoff to 2**53 + 1.
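
The precision limits mentioned above can be checked with the standard library alone. The sketch below rounds integers through single precision using `struct` (the helper `to_float32` is hypothetical, not a pandas or NumPy function). The final lines rest on an assumption that is not confirmed from the pandas source here: that the divisor used for `pct=True` gets stuck at 2**24 because it is held in a 32-bit float. Under that assumption the arithmetic reproduces the values reported in this thread.

```python
import struct


def to_float32(x):
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]


LIMIT32 = 2 ** 24  # 16777216
LIMIT64 = 2 ** 53

# 2**24 survives a float32 round trip; 2**24 + 1 rounds back down to 2**24.
print(int(to_float32(LIMIT32)) == LIMIT32)          # True
print(int(to_float32(LIMIT32 + 1)) == LIMIT32 + 1)  # False

# Python floats are 64-bit, so the same thing happens at 2**53 + 1.
print(int(float(LIMIT64)) == LIMIT64)          # True
print(int(float(LIMIT64 + 1)) == LIMIT64 + 1)  # False

# ASSUMPTION: a float32 counter incremented by 1 stops growing at 2**24.
stalled_count = to_float32(to_float32(LIMIT32) + 1)
print(stalled_count == LIMIT32)  # True

# Dividing the true ranks by the stalled count matches the reported output.
print(round((LIMIT32 + 1) / stalled_count, 8))  # 1.00000006
print(round(100000003 / stalled_count, 6))      # 5.960465
print(round(50000000.5 / stalled_count, 6))     # 2.980232
```

The last three lines correspond to the final value of `check_rank_pct(16777217)` and to the last row and the zero rows of the original `bigData` output, respectively.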

jschendel added the Bug, Numeric Operations, and Algos labels Nov 14, 2018
jschendel added this to the Contributions Welcome milestone Nov 14, 2018
@jschendel (Member)

This looks to be a dupe of #18271, so closing in favor of that issue. Will open a PR to fix this soon though.

jschendel added the Duplicate Report label Nov 14, 2018
jschendel modified the milestones: Contributions Welcome, No action Nov 14, 2018
WillAyd (Member) commented Nov 14, 2018

@jschendel was simultaneously looking at this. Should post a PR in a few minutes

@jschendel (Member)

@WillAyd : Oops, just saw this after posting a PR of my own!

WillAyd (Member) commented Nov 14, 2018

Ha no worries. Changes are exactly the same so gets to the same spot. Let's stick with yours
