DOC: Clarified and expanded describe documentation #14995

palewire · 2016-12-27T04:49:17Z

passes git diff upstream/master | flake8 --diff

As discussed in #14483.

codecov-io · 2016-12-27T08:41:06Z

Current coverage is 84.77% (diff: 100%)

Merging #14995 into master will decrease coverage by 0.01%

@@             master     #14995   diff @@
==========================================
  Files           145        145          
  Lines         51090      51129    +39   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43316      43343    +27   
- Misses         7774       7786    +12   
  Partials          0          0

Powered by Codecov. Last update eecfa88...8880a89

jreback · 2016-12-27T11:43:32Z

pandas/core/generic.py

@@ -5201,60 +5201,209 @@ def abs(self):
        """
        return np.abs(self)



why are you redefining things here???

this is just a very small edit to the _shared_docs['describe']

@palewire Typically we reuse docstring on several places, eg for Series/DataFrame/Panel definitions, that's the reason of the use of _shared_docs.

But, @jreback, was just looking in this specific case, this is the only place where this docstring is used, so it is actually not really needed to put it in _shared_docs I think? (maybe a leftover from when the definitions where in multiple places)

this is used in both Series & DataFrame, so needs to stay as shared docs

But it is only defined here (the function is not redefined in series or dataframe, so the shared docstrings is not used anywhere else)

oh, ok then.

I'd be happy to try that correction, though I think it's worth pointing out that was a pre-existing bug in the describe documentation and nothing introduced by this pull request. Could you point me to example of a similar shared method I could model the fix on?

as I said, pretty much any function in series or dataframe that has a shared doc

I am not sure if it is worth including describe defs in both Series and DataFrame (that of course just simply passes args to its super method), just for customizing this single word. For docstrings that include more variables to be changed, that would be OK. But IMO in this case it is not worth it.

It's a bit of a problem with how our handling of shared docstrings currently works, as it does not work perfectly for all cases that we use it for. But having a better approach for functions like this (i.e. functions that have only a definition in generic, and not in series/frame.py) is a whole other/larger issue that can be left for another issue/PR to discuss.

@jorisvandenbossche, if that's how you feel I can hold off on pursuing that route. Are there other modifications you'd like to see?

@jreback are you OK with this in its current form (so not using the _shared_docs). I agree that we should try to have accurate docstrings for both Series and DataFrame making use of our decorator machinery, but in this case it did not make use of that machinery.

jorisvandenbossche

Nice examples! Thanks for that

jorisvandenbossche · 2016-12-27T15:15:51Z

pandas/core/generic.py

-        Returns
-        -------
-        summary: %(klass)s of summary statistics
+        For ``numeric`` data, the result's index will include ``count``,


I would leave these details in the notes section as it was before. Reason is that otherwise it 'takes a long time' before you get to the 'Parameters'. But maybe we could include in the previous sentence something like "Output format depends on the data type, see Notes for more details"

Got it. I'll move it back down. I had decided to move it up after consulting numpy's HOWTO, which says that Notes should be reserved for background information. But if pandas is consistent about having the extended description there instead of at the top of the docstring that seems a-okay to me. I'll move it down.

Yes, to follow the numpy docstring standard, the first sentence should be shorter. And can then be followed by a longer description. But ideally this longer description is only a couple of lines, and IMO if this is longer, it should better go into Notes (as you typically want to show parameters rather prominently as well).
That your long description has to be so long is actually also an indication that the scope of your function is not well defined IMO, but that is another issue .. (and for describe as a convenience function it is OK to do something different for different dtypes)

jorisvandenbossche · 2016-12-27T15:23:20Z

pandas/core/generic.py

-              documentation. eg. df.describe(include=['O'])
-            - If include is the string 'all', the output column-set will
-              match the input one.
+        Analyzes both ``numeric`` and ``object`` series, as well


'numeric' and 'object' do not really refer to actual code or keywords in this case, so I would leave them in 'normal' text (not back tick quoted)

For DataFrame (line below), you can also leave out the quotes I think (for consistency, although we certainly are not consistent in all places)

Got it. I struggled with how to properly use the single and double backticks. I tried to model my behavior on the recommendations of the numpy HOWTO.

I'll take another pass through and try to be more sparing with their use.

We follow the suggestion for the numpy howto to refer to keywords with single backticks, but use typically double backticks for 'code fragments' or other variables/functions apart from keywords.

Got it. I will take another pass to try to tighten things up to meet your standard.

jorisvandenbossche · 2016-12-27T15:27:57Z

pandas/core/generic.py

+              objects submit the data type ``object``. Strings
+              can also be used in the style of
+              `select_dtypes` (e.g. df.describe(include=['O']))
+        exclude : None (default) or a list of dtypes or strings, optional,


Not a strong opinion here, but I also kind of liked the combined explanation. What is most clear (less repetition vs being more explicit) is always a bit of a difficult/subjective balance ..

I agree this is a balancing act. I decided to separate them after consulting the source code, where I found that include and exclude actually have slightly different behavior. From what I can tell, all is not an acceptable input for exclude. Separating them also allowed me space to try to write more explicit descriptions that contrast the two keywords.

… to the Notes section

palewire · 2016-12-27T18:48:10Z

I've just pushed an update that tried to address your requests, @jorisvandenbossche. Please let me know if there's more I can do.

jorisvandenbossche · 2016-12-28T15:19:40Z

pandas/core/generic.py

+        """
+        Generates descriptive statistics that summarize the central tendency,
+        dispersion and shape of a dataset's distribution, excluding
+        ``NaN`` values.



I would add here the sentence from the notes with something like "Analyzes both numeric and object series, as well
as DataFrame column sets of mixed data types." + that output depends on data type + refer to notes for more details on this

jorisvandenbossche · 2016-12-28T15:22:01Z

pandas/core/generic.py

-              documentation. eg. df.describe(include=['O'])
-            - If include is the string 'all', the output column-set will
-              match the input one.
+        percentiles : list of numbers, optional


Strictly spoken, the change from array-like to list of numbers is not really a correction, as arrays are also accepted, but I agree list is more clear to users. Maybe "list-like of numbers"?

jorisvandenbossche · 2016-12-28T15:24:31Z

pandas/core/generic.py

+            fall between 0 and 1. The default is
+            ``[.25, .5, .75]``, which returns the 25th, 50th, and
+            75th percentiles.
+        include : None (default), 'all', or list of dtypes or strings, optional


We typically put the "default None" at the end.

I will flip them to the bottom.

jorisvandenbossche · 2016-12-28T15:25:14Z

pandas/core/generic.py

+            for `Series`. Here are the options:
+
+            - None (default). The result will include all numeric columns.
+            - 'all'. All columns on the input will be included in the output.


" on the input" -> "of the input" ?

That is my mistake. I will correct it.

jorisvandenbossche · 2016-12-28T15:27:17Z

pandas/core/generic.py


        Returns
        -------
-        summary: %(klass)s of summary statistics


NDFrame -> Series/DataFrame

jorisvandenbossche · 2016-12-28T15:30:40Z

pandas/core/generic.py

+        ``mean``, ``std``, ``min``, ``max`` as well as lower, ``50`` and
+        upper percentiles. By default the lower percentile is ``25`` and the
+        upper percentile is ``75``. The ``50`` percentile is typically the
+        same as the median.


"The 50 percentile is typically the same as the median." -> when is this not the case?

I was thinking of the alternative methods of returning medians when there are an even number of values that might result in differing expectations among users. But that's probably unnecessary. I will remove the qualification.

jorisvandenbossche · 2016-12-28T15:32:24Z

pandas/core/generic.py

        among those with the highest count.

-        The include, exclude arguments are ignored for Series.
+        For mixed data types provided via a DataFrame, the result will
+        include a union of attributes of each type.


Maybe mention first that the default is only to show numeric columns. And if, using include='all', different types are included, then the index is union of the attributes of each type.

Good point. I think that will make things more clear. I will change it.

palewire · 2016-12-28T16:32:12Z

@jorisvandenbossche, I have just pushed a number of new changes to try to conform with your second round of requests. Please let me know if there is more I can do. Thank you for your thorough review.

jorisvandenbossche · 2016-12-31T11:36:15Z

@palewire Looks good! You only have a few PEP8 issues (which causes travis to fail):

pandas/core/generic.py:5222:80: E501 line too long (85 > 79 characters)
pandas/core/generic.py:5239:80: E501 line too long (81 > 79 characters)
pandas/core/generic.py:5320:80: E501 line too long (80 > 79 characters)
pandas/core/generic.py:5353:80: E501 line too long (81 > 79 characters)

If those are fixed this can be merged!

jreback · 2016-12-31T15:47:04Z

I am ok with this as the docs for series are so close to dataframe its fine to combine.

palewire · 2016-12-31T19:22:13Z

@jorisvandenbossche I have just now made and pushed the PEP8 corrections you requested.

…ition/pandas into describe-docs

jorisvandenbossche · 2017-01-02T09:05:55Z

@palewire Thanks!

palewire added 3 commits December 26, 2016 20:43

DOC: Clarified and expanded describe documentation

1d6aa0e

PEP8 fix

a445d2a

Slight change for consistency

55cf4ec

jreback reviewed Dec 27, 2016

View reviewed changes

jreback added the Docs label Dec 27, 2016

jorisvandenbossche reviewed Dec 27, 2016

View reviewed changes

Removed a bunch of tick marks and moved the extended description down…

0161a57

… to the Notes section

Typo fix

38015ce

jorisvandenbossche reviewed Dec 28, 2016

View reviewed changes

Further improvements to describe documentation

86dd44a

jorisvandenbossche added this to the 0.20.0 milestone Dec 31, 2016

PEP8 fixes

8880a89

palewire added 3 commits December 31, 2016 17:55

Fixed whitespace

dff88bb

A third dot

d97df49

Merge branch 'describe-docs' of github.com:california-civic-data-coal…

a61dda1

…ition/pandas into describe-docs

jorisvandenbossche merged commit 8a78a2d into pandas-dev:master Jan 2, 2017

seberg mentioned this pull request Jan 29, 2020

Pandas Tests rely on inconsistent array coercion #29978

Open

		@@ -5201,60 +5201,209 @@ def abs(self):
		"""
		return np.abs(self)

DOC: Clarified and expanded describe documentation #14995

DOC: Clarified and expanded describe documentation #14995

Conversation

palewire commented Dec 27, 2016

codecov-io commented Dec 27, 2016 • edited Loading

Current coverage is 84.77% (diff: 100%)

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

palewire commented Dec 27, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

palewire commented Dec 28, 2016

jorisvandenbossche commented Dec 31, 2016

jreback commented Dec 31, 2016

palewire commented Dec 31, 2016

jorisvandenbossche commented Jan 2, 2017

codecov-io commented Dec 27, 2016 •

edited

Loading

palewire commented Dec 27, 2016 •

edited

Loading