BUG: Fix describe(): percentiles (#13104), col index (#13288) #13298

pijucha · 2016-05-26T20:39:43Z

closes pd.DataFrame.describe percentile string precision #13104, BUG: DataFrame.describe() breaks with a column index of object type and numeric entries #13288
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

Percentiles are now rounded to the least precision that keeps
them unique.
Supplying duplicates in percentiles will raise ValueError.

Fixed a column index of the output data frame.
Previously, if a data frame had a column index of object type and
the index contained numeric values, the output column index could
be corrupt. It led to ValueError if the output was displayed.

pijucha · 2016-05-26T20:46:30Z

pandas/core/generic.py

+            out[~int_idx] = pcts[~int_idx].round(prec).astype(str)
+            return [i + '%' for i in out]
+
+        pretty_percentiles = prettify(percentiles)


It looks a bit ugly here when percentiles is passed to functions as an argument. But it needs to be evaluated only once.

let's move this function (the prettify), maybe call it slightly more descriptive to core.common

I wasn't very clear about the "ugliness" - see my comment to describe_numeric_1d.

Anyway, I don't mind the function prettify is here. But if you think it could be useful for other purposes, I can make it more general without efficiency loss in this case.

its fine; just normally like to hide things like this away into an external function, much easier to read this way; so maybe more this instead to pandas/formats.format.py somewhere, so you are just calling format_percentiles or something

jreback · 2016-05-27T00:03:39Z

@pijucha looks really good. just a couple of comments.

pijucha · 2016-05-27T01:44:26Z

pandas/core/generic.py


        def describe_numeric_1d(series, percentiles):
            stat_index = (['count', 'mean', 'std', 'min'] +
-                          [pretty_name(x) for x in percentiles] + ['max'])
+                          pretty_percentiles + ['max'])


Here, pretty_percentiles is global to describe_numeric_1d but percentiles is local. So if someone changes something and calls it with a different argument then it'll break. I don't like it but didn't want to put prettify() here and run it for every column.

(Or maybe we should get rid of a parameter percentiles? Just a thought.)

yes I would just remove percentiles (as everything else is in the closure anyhow)

codecov-io · 2016-05-29T03:25:25Z

Current coverage is 84.23%

Merging #13298 into master will increase coverage by <.01%

@@             master     #13298   diff @@
==========================================
  Files           138        138          
  Lines         50681      50681          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42682      42687     +5   
+ Misses         7999       7994     -5   
  Partials          0          0

Powered by Codecov. Last updated by 9e7bfdd...39c0676

pijucha · 2016-05-29T14:08:40Z

pandas/core/generic.py

            raise NotImplementedError(msg)
+        elif self.ndim == 2 and self.columns.size == 0:
+            raise ValueError("Cannot describe a DataFrame without columns")


I chose this solution for a data frame without columns. Is it ok?

hmm, does this break anything if you raise on a completely empty frame?

No, it's a non-breaking change. pd.concat() inside describe()raises anyway (I put it in one of my earlier comments). The purpose of this one is just to give a more meaningful message (and skip some code).

pijucha · 2016-05-30T05:15:33Z

pandas/core/generic.py

+        unique_pcts = np.unique(percentiles)
+        if len(unique_pcts) < len(percentiles):
+            raise ValueError("percentiles cannot contain duplicates")
+        percentiles = unique_pcts


I thought this might be slightly better than the suggested:

if len(set(percentiles)) < len(percentiles): raise... percentile = np.unique(percentiles) # can't use pd.unique here - we need percentiles sorted

but I can change it if it's less idiomatic.

ok that's fine, yea a feature of pd.unique is that it doesn't sort!

jreback · 2016-05-30T20:58:50Z

doc/source/whatsnew/v0.18.2.txt

@@ -318,3 +319,5 @@ Bug Fixes


 - Bug in ``Categorical.remove_unused_categories()`` changes ``.codes`` dtype to platform int (:issue:`13261`)
+- Bug in ``DataFrame.describe()`` where percentile identifiers in index were always rounded to at most 1 decimal place (:issue:`13104`)


can you move all of theses describe issues to a separate sub-section (.describe() changes), where you show an example of the previous behavior (e.g. give the example from the original issue), and then show the new way of displaying it.

Even though these are 'bug' fixes, a user would want to know that this is happening.

jreback · 2016-05-30T20:59:21Z

@pijucha looks good. just some more validation tests and some doc changes.

…s-dev#13288) BUG pandas-dev#13104: - Percentile identifiers are now rounded to the least precision that keeps them unique. - Supplying duplicates in percentiles will raise ValueError. BUG pandas-dev#13288 - Fixed a column index of the output data frame. Previously, if a data frame had a column index of object type and the index contained numeric values, the output column index could be corrupt. It led to ValueError if the output was displayed. - describe() will raise ValueError with an informative message on DataFrame without columns.

pijucha · 2016-05-31T01:30:57Z

@jreback I've made the changes. (Tests still pending.)

jreback · 2016-05-31T14:13:12Z

thanks @pijucha nice PR! I slightly edited the docs so take a look when built.

pijucha · 2016-05-31T15:48:24Z

@jreback Thanks

pijucha changed the title ~~BUG: Fix describe(): percentiles (#13104), col index (#13228)~~ BUG: Fix describe(): percentiles (#13104), col index (#13288) May 26, 2016

pijucha force-pushed the bug13104 branch from 48d7529 to e5290b7 Compare May 26, 2016 20:43

pijucha reviewed May 26, 2016
View reviewed changes

jreback added Bug Output-Formatting __repr__ of pandas objects, to_string labels May 27, 2016

jreback added this to the 0.18.2 milestone May 27, 2016

pijucha reviewed May 27, 2016
View reviewed changes

pijucha force-pushed the bug13104 branch 2 times, most recently from 39c0676 to 773a4a3 Compare May 29, 2016 03:25

pijucha reviewed May 29, 2016
View reviewed changes

pijucha force-pushed the bug13104 branch from 773a4a3 to be3ade7 Compare May 30, 2016 02:58

pijucha reviewed May 30, 2016
View reviewed changes

jreback reviewed May 30, 2016
View reviewed changes

pijucha force-pushed the bug13104 branch from be3ade7 to 9a6bd6e Compare May 31, 2016 01:16

jreback closed this in 132c1c5 May 31, 2016

jreback mentioned this pull request May 31, 2016

pd.DataFrame.describe percentile string precision #13132

Closed

4 tasks

pijucha deleted the bug13104 branch June 9, 2016 04:04

		@@ -318,3 +319,5 @@ Bug Fixes


		- Bug in ``Categorical.remove_unused_categories()`` changes ``.codes`` dtype to platform int (:issue:`13261`)
		- Bug in ``DataFrame.describe()`` where percentile identifiers in index were always rounded to at most 1 decimal place (:issue:`13104`)

Uh oh!

BUG: Fix describe(): percentiles (#13104), col index (#13288) #13298

BUG: Fix describe(): percentiles (#13104), col index (#13288) #13298

Uh oh!

Conversation

pijucha commented May 26, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented May 27, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov-io commented May 29, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Current coverage is 84.23%

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented May 30, 2016

Uh oh!

pijucha commented May 31, 2016

Uh oh!

jreback commented May 31, 2016

Uh oh!

pijucha commented May 31, 2016

Uh oh!

Uh oh!

pijucha commented May 26, 2016 •

edited

Loading

codecov-io commented May 29, 2016 •

edited

Loading