Skip to content

BUG: Fix describe(): percentiles (#13104), col index (#13288) #13298

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

pijucha
Copy link
Contributor

@pijucha pijucha commented May 26, 2016

BUG #13104:

  • Percentiles are now rounded to the least precision that keeps
    them unique.
  • Supplying duplicates in percentiles will raise ValueError.

BUG #13288

  • Fixed a column index of the output data frame.
    Previously, if a data frame had a column index of object type and
    the index contained numeric values, the output column index could
    be corrupt. It led to ValueError if the output was displayed.

@pijucha pijucha changed the title BUG: Fix describe(): percentiles (#13104), col index (#13228) BUG: Fix describe(): percentiles (#13104), col index (#13288) May 26, 2016
out[~int_idx] = pcts[~int_idx].round(prec).astype(str)
return [i + '%' for i in out]

pretty_percentiles = prettify(percentiles)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks a bit ugly here when percentiles is passed to functions as an argument. But it needs to be evaluated only once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's move this function (the prettify), maybe call it slightly more descriptive to core.common

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't very clear about the "ugliness" - see my comment to describe_numeric_1d.

Anyway, I don't mind the function prettify is here. But if you think it could be useful for other purposes, I can make it more general without efficiency loss in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its fine; just normally like to hide things like this away into an external function, much easier to read this way; so maybe more this instead to pandas/formats.format.py somewhere, so you are just calling format_percentiles or something

@jreback jreback added Bug Output-Formatting __repr__ of pandas objects, to_string labels May 27, 2016
@jreback jreback added this to the 0.18.2 milestone May 27, 2016
@jreback
Copy link
Contributor

jreback commented May 27, 2016

@pijucha looks really good. just a couple of comments.


def describe_numeric_1d(series, percentiles):
stat_index = (['count', 'mean', 'std', 'min'] +
[pretty_name(x) for x in percentiles] + ['max'])
pretty_percentiles + ['max'])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, pretty_percentiles is global to describe_numeric_1d but percentiles is local. So if someone changes something and calls it with a different argument then it'll break. I don't like it but didn't want to put prettify() here and run it for every column.

(Or maybe we should get rid of a parameter percentiles? Just a thought.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I would just remove percentiles (as everything else is in the closure anyhow)

@pijucha pijucha force-pushed the bug13104 branch 2 times, most recently from 39c0676 to 773a4a3 Compare May 29, 2016 03:25
@codecov-io
Copy link

codecov-io commented May 29, 2016

Current coverage is 84.23%

Merging #13298 into master will increase coverage by <.01%

@@             master     #13298   diff @@
==========================================
  Files           138        138          
  Lines         50681      50681          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42682      42687     +5   
+ Misses         7999       7994     -5   
  Partials          0          0          

Powered by Codecov. Last updated by 9e7bfdd...39c0676

raise NotImplementedError(msg)
elif self.ndim == 2 and self.columns.size == 0:
raise ValueError("Cannot describe a DataFrame without columns")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose this solution for a data frame without columns. Is it ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, does this break anything if you raise on a completely empty frame?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's a non-breaking change. pd.concat() inside describe()raises anyway (I put it in one of my earlier comments). The purpose of this one is just to give a more meaningful message (and skip some code).

unique_pcts = np.unique(percentiles)
if len(unique_pcts) < len(percentiles):
raise ValueError("percentiles cannot contain duplicates")
percentiles = unique_pcts
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this might be slightly better than the suggested:

if len(set(percentiles)) < len(percentiles):
    raise...
percentile = np.unique(percentiles) # can't use pd.unique here  - we need percentiles sorted

but I can change it if it's less idiomatic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok that's fine, yea a feature of pd.unique is that it doesn't sort!

@@ -318,3 +319,5 @@ Bug Fixes


- Bug in ``Categorical.remove_unused_categories()`` changes ``.codes`` dtype to platform int (:issue:`13261`)
- Bug in ``DataFrame.describe()`` where percentile identifiers in index were always rounded to at most 1 decimal place (:issue:`13104`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move all of theses describe issues to a separate sub-section (.describe() changes), where you show an example of the previous behavior (e.g. give the example from the original issue), and then show the new way of displaying it.

Even though these are 'bug' fixes, a user would want to know that this is happening.

@jreback
Copy link
Contributor

jreback commented May 30, 2016

@pijucha looks good. just some more validation tests and some doc changes.

…s-dev#13288)

BUG pandas-dev#13104:
- Percentile identifiers are now rounded to the least precision
that keeps them unique.
- Supplying duplicates in percentiles will raise ValueError.

BUG pandas-dev#13288
- Fixed a column index of the output data frame.
Previously, if a data frame had a column index of object type and
the index contained numeric values, the output column index could
be corrupt. It led to ValueError if the output was displayed.

- describe() will raise ValueError with an informative message
on DataFrame without columns.
@pijucha
Copy link
Contributor Author

pijucha commented May 31, 2016

@jreback I've made the changes. (Tests still pending.)

@jreback jreback closed this in 132c1c5 May 31, 2016
@jreback
Copy link
Contributor

jreback commented May 31, 2016

thanks @pijucha nice PR! I slightly edited the docs so take a look when built.

@pijucha
Copy link
Contributor Author

pijucha commented May 31, 2016

@jreback Thanks

@pijucha pijucha deleted the bug13104 branch June 9, 2016 04:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pd.DataFrame.describe percentile string precision
3 participants