-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix describe(): percentiles (#13104), col index (#13288) #13298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
out[~int_idx] = pcts[~int_idx].round(prec).astype(str) | ||
return [i + '%' for i in out] | ||
|
||
pretty_percentiles = prettify(percentiles) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks a bit ugly here when percentiles
is passed to functions as an argument. But it needs to be evaluated only once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's move this function (the prettify), maybe call it slightly more descriptive to core.common
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't very clear about the "ugliness" - see my comment to describe_numeric_1d
.
Anyway, I don't mind the function prettify
is here. But if you think it could be useful for other purposes, I can make it more general without efficiency loss in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its fine; just normally like to hide things like this away into an external function, much easier to read this way; so maybe more this instead to pandas/formats.format.py
somewhere, so you are just calling format_percentiles
or something
@pijucha looks really good. just a couple of comments. |
|
||
def describe_numeric_1d(series, percentiles): | ||
stat_index = (['count', 'mean', 'std', 'min'] + | ||
[pretty_name(x) for x in percentiles] + ['max']) | ||
pretty_percentiles + ['max']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, pretty_percentiles
is global to describe_numeric_1d
but percentiles
is local. So if someone changes something and calls it with a different argument then it'll break. I don't like it but didn't want to put prettify()
here and run it for every column.
(Or maybe we should get rid of a parameter percentiles
? Just a thought.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I would just remove percentiles (as everything else is in the closure anyhow)
39c0676
to
773a4a3
Compare
Current coverage is 84.23%@@ master #13298 diff @@
==========================================
Files 138 138
Lines 50681 50681
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 42682 42687 +5
+ Misses 7999 7994 -5
Partials 0 0
|
raise NotImplementedError(msg) | ||
elif self.ndim == 2 and self.columns.size == 0: | ||
raise ValueError("Cannot describe a DataFrame without columns") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose this solution for a data frame without columns. Is it ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, does this break anything if you raise on a completely empty frame?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's a non-breaking change. pd.concat()
inside describe()
raises anyway (I put it in one of my earlier comments). The purpose of this one is just to give a more meaningful message (and skip some code).
unique_pcts = np.unique(percentiles) | ||
if len(unique_pcts) < len(percentiles): | ||
raise ValueError("percentiles cannot contain duplicates") | ||
percentiles = unique_pcts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this might be slightly better than the suggested:
if len(set(percentiles)) < len(percentiles):
raise...
percentile = np.unique(percentiles) # can't use pd.unique here - we need percentiles sorted
but I can change it if it's less idiomatic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok that's fine, yea a feature of pd.unique
is that it doesn't sort!
@@ -318,3 +319,5 @@ Bug Fixes | |||
|
|||
|
|||
- Bug in ``Categorical.remove_unused_categories()`` changes ``.codes`` dtype to platform int (:issue:`13261`) | |||
- Bug in ``DataFrame.describe()`` where percentile identifiers in index were always rounded to at most 1 decimal place (:issue:`13104`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move all of theses describe issues to a separate sub-section (.describe() changes
), where you show an example of the previous behavior (e.g. give the example from the original issue), and then show the new way of displaying it.
Even though these are 'bug' fixes, a user would want to know that this is happening.
@pijucha looks good. just some more validation tests and some doc changes. |
…s-dev#13288) BUG pandas-dev#13104: - Percentile identifiers are now rounded to the least precision that keeps them unique. - Supplying duplicates in percentiles will raise ValueError. BUG pandas-dev#13288 - Fixed a column index of the output data frame. Previously, if a data frame had a column index of object type and the index contained numeric values, the output column index could be corrupt. It led to ValueError if the output was displayed. - describe() will raise ValueError with an informative message on DataFrame without columns.
@jreback I've made the changes. (Tests still pending.) |
thanks @pijucha nice PR! I slightly edited the docs so take a look when built. |
@jreback Thanks |
git diff upstream/master | flake8 --diff
BUG #13104:
them unique.
BUG #13288
Previously, if a data frame had a column index of object type and
the index contained numeric values, the output column index could
be corrupt. It led to ValueError if the output was displayed.