
BUG: Series.rank(pct=True).max() != 1 for a large series of floats #18271

Closed
proinsias opened this issue Nov 13, 2017 · 1 comment · Fixed by #23688
Labels: Bug, Numeric Operations (Arithmetic, Comparison, and Logical operations)

Comments


proinsias commented Nov 13, 2017

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

rs = np.random.RandomState(seed=0)

# 20 million random floats, sorted so the ranks increase down the frame.
len_df = 20000000
df = pd.DataFrame(data=rs.rand(len_df), columns=['abc']).sort_values('abc')

# Percentile rank via pct=True vs. dividing the plain rank by the length.
df['Rank'] = df['abc'].rank()
df['Rank_Pct'] = df['abc'].rank(pct=True)
df['Rank_Pct_Manual'] = df['Rank'] / len_df

df.describe()

Output:

                abc          Rank      Rank_Pct  Rank_Pct_Manual
count  2.000000e+07  2.000000e+07  2.000000e+07     2.000000e+07
mean   4.999223e-01  1.000000e+07  5.960465e-01     5.000000e-01
std    2.886891e-01  5.773503e+06  3.441276e-01     2.886751e-01
min    1.036192e-08  1.000000e+00  5.960464e-08     5.000000e-08
25%    2.498756e-01  5.000001e+06  2.980233e-01     2.500000e-01
50%    4.999781e-01  1.000000e+07  5.960465e-01     5.000000e-01
75%    7.499111e-01  1.500000e+07  8.940697e-01     7.500000e-01
max    1.000000e+00  2.000000e+07  1.192093e+00     1.000000e+00

Problem description

I have a set of 20 million floats, and I am trying to follow this StackOverflow example, which computes the percentile ranking of a column either with the pct=True option of rank(), or by manually dividing the output of rank() by the length of the Series.
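
(For reference, the two approaches compared there boil down to the following; the tiny Series is only an illustration of the pattern.)

import pandas as pd

s = pd.Series([0.3, 0.1, 0.7])

pct_builtin = s.rank(pct=True)  # rank() with pct=True scales the ranks into (0, 1]
pct_manual = s.rank() / len(s)  # the same calculation done by hand

print(pct_builtin.tolist())  # [0.666..., 0.333..., 1.0]
print(pct_manual.tolist())   # expected to match the line above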

I noticed that the former values have a maximum that is not 1, while the latter have the expected maximum of 1.

I have tried this with the latest (0.21.0) version of pandas, and can replicate it with an array of random floats.

It seems to be related to the number of rows being greater than 2^24 – you can see this by comparing the output for len_df=16770000 (just below 2^24 = 16777216) and len_df=16780000 (just above it).
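
For what it's worth, 2^24 is where float32 stops representing consecutive integers exactly, and the Rank_Pct extremes in the output above happen to equal len_df/2^24 and 1/2^24. A quick check of both points (my own illustration, not code from pandas):

import numpy as np

# float32 represents every integer up to 2**24 exactly, but not beyond it.
print(np.float32(2**24))      # 16777216.0
print(np.float32(2**24 + 1))  # 16777216.0 -- rounds back down, the extra 1 is lost

# The reported Rank_Pct extremes match a divisor of 2**24 instead of len_df.
len_df = 20000000
print(len_df / 2**24)  # 1.1920928955078125, cf. the reported Rank_Pct max of 1.192093e+00
print(1 / 2**24)       # ~5.960464e-08, cf. the reported Rank_Pct min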

I believe the responsible code is rank_1d_float64() in pandas/_libs/algos.pyx.

I'm working on a PR now.

Expected Output

I would expect Rank_Pct and Rank_Pct_Manual to contain the same values, and the maximum of both to be 1.
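
As a sketch of what I would expect to hold (illustrative only, not a test from the pandas suite; given the output above, the first assertion presumably fails on 0.21.0 once the Series is longer than 2^24):

import numpy as np
import pandas as pd

n = 2**24 + 100  # just past the boundary where the problem shows up
s = pd.Series(np.random.RandomState(0).rand(n))

pct = s.rank(pct=True)

# Percentile ranks should top out at exactly 1 ...
assert pct.max() == 1.0
# ... and agree with dividing the plain rank by the length of the Series.
np.testing.assert_allclose(pct, s.rank() / n)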

Output of pd.show_versions()

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.3.0
Cython: None
numpy: 1.13.3
scipy: 0.19.1
pyarrow: 0.7.1
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback added the Numeric Operations (Arithmetic, Comparison, and Logical operations) label on Nov 14, 2017

jreback commented Nov 15, 2017

xref to #15630.
