-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ERR: qcut uniquess checking (try 2) #15000
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
My second try on #14455 -- I wasn't able to rebase this one. |
Current coverage is 84.78% (diff: 87.50%)@@ master #15000 diff @@
==========================================
Files 145 145
Lines 51090 51095 +5
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43315 43319 +4
- Misses 7775 7776 +1
Partials 0 0
|
@@ -151,6 +151,8 @@ def qcut(x, q, labels=None, retbins=False, precision=3): | |||
as a scalar. | |||
precision : int | |||
The precision at which to store and display the bins labels | |||
duplicates : {'raise', 'drop'}, optional | |||
If binned edges are not unique, raise ValueError or drop non-uniques. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a versionadded tag (0.20.0)
raise ValueError("invalid value for 'duplicates' parameter, " | ||
+ "valid options are: raise, drop") | ||
|
||
if len(algos.unique(bins)) < len(bins): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do
bins = algos.unique(bins)
first (so you have it already for below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand -- if I do that before line 213, how would I compare len of bins before and after in line 207 and how would I output the original non unique edges in 209 in case duplicates='raise'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call it something else so it doesn't overwrite this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Is this for performance reasons?
pls add a whatsnew in 0.20.0 (other enhancements) |
@ashishsingal1 For future reference, it is not needed to start a new PR, you can always clean up locally and just force push to the same branch in your fork, and this PR will be updated automatically. In fact, we rather much prefer that instead of opening a new PR, to keep discussions gathered at one place. If you need help with those git aspects, don't hesitate to ask! |
Add unique_bins for performance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good already! Added some more comments.
@@ -151,6 +151,9 @@ def qcut(x, q, labels=None, retbins=False, precision=3): | |||
as a scalar. | |||
precision : int | |||
The precision at which to store and display the bins labels | |||
duplicates : {'raise', 'drop'}, optional | |||
If binned edges are not unique, raise ValueError or drop non-uniques. | |||
.. versionadded:: 0.20.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There needs to be a whiteline above this one (rst specifics ...)
@@ -151,6 +151,9 @@ def qcut(x, q, labels=None, retbins=False, precision=3): | |||
as a scalar. | |||
precision : int | |||
The precision at which to store and display the bins labels | |||
duplicates : {'raise', 'drop'}, optional | |||
If binned edges are not unique, raise ValueError or drop non-uniques. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"binned edges" -> bin edges ?
values = [0, 0, 0, 0, 1, 2, 3] | ||
cats = qcut(values, 3, duplicates='drop') | ||
ex_levels = ['[0, 1]', '(1, 3]'] | ||
self.assertTrue((cats.categories == ex_levels).all()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test for duplicates='raise'
(the default) as well? (assert that it raises a ValueError)
|
||
if duplicates not in ['raise', 'drop']: | ||
raise ValueError("invalid value for 'duplicates' parameter, " | ||
+ "valid options are: raise, drop") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the +
is not needed here. The strings will automatically be concatenated due to Python's implied line continuation inside parentheses
if duplicates == 'raise': | ||
raise ValueError('Bin edges must be unique: %s' % repr(bins) + | ||
' You can drop duplicate edges ' + | ||
'by setting \'duplicates\' param') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here about the +
's. Further, if you use "
for the string, you don not need to escape the '
inside the string.
Lastly, can you put the injection of "repr(bins)" at the end of the string (and maybe also use .format
instead of %)
Line too long code formatting fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ashishsingal1 Thanks for the update. Looking good! Just added some small comments on the docs
@@ -151,6 +151,9 @@ def qcut(x, q, labels=None, retbins=False, precision=3): | |||
as a scalar. | |||
precision : int | |||
The precision at which to store and display the bins labels | |||
duplicates : {'raise', 'drop'}, optional | |||
If bin edges are not unique, raise ValueError or drop non-uniques. | |||
.. versionadded:: 0.20.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There needs to be a blank line above .. versionadded ...
(rst specifics ..)
@@ -151,6 +151,9 @@ def qcut(x, q, labels=None, retbins=False, precision=3): | |||
as a scalar. | |||
precision : int | |||
The precision at which to store and display the bins labels | |||
duplicates : {'raise', 'drop'}, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add "default 'raise'"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or you can also add it in the description like "Default is to raise an error" (one of both is good)
Docstring update
@@ -105,6 +105,7 @@ Other enhancements | |||
of sorting or an incorrect key. See :ref:`here <advanced.unsorted>` | |||
|
|||
- ``pd.cut`` and ``pd.qcut`` now support datetime64 and timedelta64 dtypes (:issue:`14714`, :issue:`14798`) | |||
- ``pd.qcut`` can optionally remove duplicate edges instead of throwing an error (:issue:`7751`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pd.cut/qcut
have gained the duplicates
kw to control whether to raise on duplicated edges.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this relevant for cut or just qcut?
Deleting whitespace.
@jreback @jorisvandenbossche anything left for this PR on my side? |
thanks! |
git diff upstream/master | flake8 --diff
Add option to drop non-unique bins.