Code Sample, a copy-pastable example if possible
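A minimal sketch of a reproduction, written from the description in this report rather than taken from the reporter's own code. `len_df`, `Rank_Pct` and `Rank_Pct_Manual` come from the text; the helper name `compare_rank_pct`, the `Value` column and the fixed seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def compare_rank_pct(len_df, seed=0):
    """Return (max of Rank_Pct, max of Rank_Pct_Manual) for len_df random floats."""
    rng = np.random.default_rng(seed)
    # "Value" is a hypothetical column name holding the random floats.
    df = pd.DataFrame({"Value": rng.random(len_df)})

    # Percentile rank computed by pandas itself.
    df["Rank_Pct"] = df["Value"].rank(pct=True)
    # Percentile rank computed by hand: ordinary rank divided by the row count.
    df["Rank_Pct_Manual"] = df["Value"].rank() / len(df)

    return df["Rank_Pct"].max(), df["Rank_Pct_Manual"].max()


# Small, quick call; the lengths from the report need far more time and memory.
pct_max, manual_max = compare_rank_pct(1_000)
print(pct_max, manual_max)
```

Calling `compare_rank_pct(16770000)` and `compare_rank_pct(16780000)` would be the comparison described in this report; whether the two maxima differ will depend on the installed pandas version.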
Output:
Problem description
I have a set of 20 million floats, and I am trying to follow this StackOverflow example. The example discusses calculating the percentile ranking of a column using either the `pct=True` option of the `rank()` function, or by manually dividing the output of `rank()` by the length of the `Series`.

I noticed that the former values have a maximum that is not 1, while the latter have the expected maximum of 1.

I have tried this with the latest (0.21.0) version of `pandas`, and can replicate it with an array of random floats.

It seems to be related to the number of rows being greater than 2^24 (16,777,216): you can see this by comparing the output when `len_df=16770000` and `len_df=16780000`.

I believe the responsible code is `pandas/_libs/algos.pyx/rank_1d_float64()`.

I'm working on a PR now.
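A note on the threshold: the two lengths quoted above sit on either side of 16,777,216 (2^24), which is the largest value up to which a single-precision float can represent every integer exactly. One possible explanation, which this report does not confirm, is that a count or rank is held in single precision somewhere in the ranking code. The sketch below only demonstrates the float32 behaviour itself, using the standard library; `to_float32` and `increments_in_float32` are hypothetical helper names, not pandas functions.

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def increments_in_float32(n):
    """Return True if adding 1 to n still changes the value in float32."""
    base = to_float32(float(n))
    return to_float32(base + 1.0) != base


# Below 2**24 a float32 counter still advances by one.
print(increments_in_float32(16_770_000))   # True
# At 2**24 and beyond, adding 1 rounds back to the same value.
print(increments_in_float32(2 ** 24))      # False
print(increments_in_float32(16_780_000))   # False
```

If a row counter were accumulated this way, it would stop growing at 16,777,216, so any percentile computed from it would no longer top out at exactly 1 for longer inputs.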
Expected Output
I would expect the values of `Rank_Pct` and `Rank_Pct_Manual` to be the same, and that the maximum of both should be 1.

Output of `pd.show_versions()`
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.21.0
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.3.0
Cython: None
numpy: 1.13.3
scipy: 0.19.1
pyarrow: 0.7.1
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None