ENH: show percentiles in timestamp describe (#30164) #30209

david-cortes · 2019-12-11T13:27:36Z

Calling describe on a pandas Series of timestamps does not show the percentiles of the data, even when they are passed explicitly (s.describe(percentiles = [0.25, 0.5, 0.75])).

This PR solves the issue by introducing a new describe logic for timestamps that would:

- Add the percentiles in the same way as for numeric.
- Rename first and last to min and max in order to match numeric types.
- Add the mean of the series.
- ~~Show datetime columns alongside numerics by default in the describe method for DataFrames.~~ (just realized there were other issues in which this was deemed to be the desirable behavior)

After this, it more or less matches the functionality of R's summary on POSIXct and Date types.
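The statistics listed above can be approximated with public Series methods. The following is a minimal sketch rather than the code in this PR: the helper name describe_timestamp_sketch and its default percentiles are made up for illustration, and it assumes a pandas version in which Series.mean and Series.quantile accept datetime data.

```python
import pandas as pd


def describe_timestamp_sketch(data, percentiles=(0.25, 0.5, 0.75)):
    """Hypothetical helper: numeric-style summary of a datetime Series."""
    # Label percentiles like numeric describe does (0.25 -> "25%").
    # This simple formatting only suits whole-number percentiles.
    names = ["count", "mean", "min"]
    names += [f"{p:.0%}" for p in percentiles]
    names += ["max"]

    values = [data.count(), data.mean(), data.min()]
    values += [data.quantile(p) for p in percentiles]
    values += [data.max()]

    # The result mixes an integer count with Timestamps, hence object dtype.
    return pd.Series(values, index=names, name=data.name, dtype=object)


s = pd.to_datetime(
    pd.Series(
        [
            "2019-01-01 00:01:00",
            "2019-01-01 00:02:00",
            "2019-01-01 00:02:10",
            "2019-01-01 00:10:01",
            "2019-01-01 00:11:00",
        ]
    )
)
summary = describe_timestamp_sketch(s)
```

Here min and max take the place of first and last, and the percentile rows sit between them just as they do for numeric columns.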

Why I think this is a good idea: I often inspect tables of mixed types and want a quick overview of the ranges and variation in each column, including timestamps. This is currently inconvenient in pandas: timestamp columns are not displayed alongside numeric columns by default, and their statistic names don't match the numeric ones when passing include = "all", which is what I want to see when I call DataFrame.describe.

Examples:

import numpy as np, pandas as pd
dt_list = [
    "2019-01-01 00:01:00",
    "2019-01-01 00:02:00",
    "2019-01-01 00:02:10",
    "2019-01-01 00:10:01",
    "2019-01-01 00:11:00"
]
s = pd.to_datetime(pd.Series(dt_list))
s.describe()

pd.DataFrame({
    "col1" : s,
    "col2" : np.arange(s.shape[0])
}).describe(include = "all")

Update: corrected the coding style to comply with black and pass the automatic test.

gfyoung · 2019-12-12T08:55:18Z

pandas/core/generic.py

@@ -9806,11 +9788,28 @@ def describe_categorical_1d(data):

            return pd.Series(result, index=names, name=data.name, dtype=dtype)

+        def describe_timestamp_1d(data):
+            tz = data.dt.tz


Reference issue number above this line.

pep8speaks · 2019-12-12T13:19:30Z

Hello @david-cortes! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-20 21:54:50 UTC

david-cortes · 2019-12-12T13:20:50Z

Added the issue number in the code, and also modified it to handle cases with empty inputs. Modified the tests on describe in Series and DataFrame too.

EDIT: updated yet again for code style checks.

EDIT2: updated again for alphabetical order of imports.

MarcoGorelli · 2020-01-15T18:48:01Z

Hi @david-cortes - seems like one of the Azure jobs is failing (but the build isn't there anymore so we can't check it), and that there are now some conflicts with master. Is this something you're still working on?

david-cortes · 2020-01-18T14:25:33Z

@MarcoGorelli : No, not working on anything else here, the PR is finished. Solved the merge conflicts now (they were caused by some tests moving to separate files).

From what I can tell, the automatic checks that were failing were caused by something unrelated to this PR.

jreback

@david-cortes can you create a new issue to move all of the describe helpers to pandas/io/algorithms.py (or really we should just create pandas/io/algos/describe.py) and then move everything else.

can do before or after this PR

jreback · 2020-01-18T16:27:19Z

pandas/core/generic.py

+        def describe_timestamp_1d(data):
+            # GH-30164
+            tz = data.dt.tz
+            asint = data.dropna().values.view("i8")


you don't need the .values.view, just use len(data.dropna()) for the shape check

data.min() and so on all work and return NaT correctly, none of these gymnastics are needed

I know that's what the current code does, but let's simplify
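The behavior this review comment relies on can be checked directly. This is a small sketch, assuming a reasonably recent pandas; the variable names are illustrative and do not come from the PR.

```python
import numpy as np
import pandas as pd

empty = pd.Series([], dtype="datetime64[ns]")
all_missing = pd.Series([pd.NaT, pd.NaT], dtype="datetime64[ns]")

# The shape check suggested in the review: count the non-missing values.
n_valid = len(all_missing.dropna())

# Reductions on datetime data skip NaT and return NaT when nothing is left.
empty_min = empty.min()
missing_max = all_missing.max()

# Viewed as int64, NaT is the smallest representable integer, which is why
# integer-based code has to drop missing values before reducing.
nat_as_int = all_missing.values.view("i8")[0]
```

Because the datetime reductions already handle the empty and all-missing cases, the integer view adds nothing except the need to call dropna first.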

jreback

this also needs a whatsnew note for 1.1, put in enhancements

david-cortes · 2020-01-20T13:17:43Z

Yes, after the latest changes the internal datetime methods now work just fine with empty inputs and missing values. Changed the code to use those instead, and added the changes to the 1.1.0 what’s new entries. Will now open another one for moving the describe methods to a new file.

EDIT: corrected for linting now.

david-cortes · 2020-01-20T14:17:45Z

@jreback : Did you mean to say to move them to pandas/core/algorithms.py? Would seem out of place in io.

jreback · 2020-01-20T14:20:33Z

> @jreback : Did you mean to say to move them to pandas/core/algorithms.py? Would seem out of place in io.

yes, though I'd like to create

pandas/core/algos/describe.py

and split up algorithms (so this can be a step there)

jreback

lgtm. small comment ping on green.

jreback · 2020-01-20T16:35:31Z

doc/source/whatsnew/v1.1.0.rst

@@ -18,6 +18,8 @@ Enhancements
 Other enhancements
 ^^^^^^^^^^^^^^^^^^

+- :meth:`Series.describe` will now show distribution percentiles for ``datetime`` dtypes, statistics ``first`` and ``last``


can you move to api changes section.

jreback · 2020-01-20T16:36:23Z

pandas/tests/frame/methods/test_describe.py

-                    3,
-                    4,
-                ],
+                "s1": [5, 2, 0, 1, 2, 3, 4, 1.581139],


do we have sufficient tests for timedeltas?

can you create a test for Period (which likely doesn't work)? it's fine to just xfail it (alternatively, create an issue)

david-cortes · 2020-01-20T21:55:11Z

Moved the changes in the what’s new and added tests for Period and Timedelta. The describe for Timedelta is already handled as a numeric dtype, and the describe for Period as object dtype, so those are not affected by this PR.
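The dispatch described in this comment can be observed from the row labels that describe returns. This is a sketch under the assumption that the installed pandas still routes Timedelta through the numeric path and Period through the object path, as stated above; the sample data is invented.

```python
import pandas as pd

td = pd.Series(pd.to_timedelta(["1 days", "2 days", "4 days"]))
per = pd.Series(pd.period_range("2019-01", periods=3, freq="M"))

# Timedelta is summarized like a numeric column (mean, std, percentiles).
td_stats = list(td.describe().index)

# Period is summarized like an object column (count, unique, top, freq).
per_stats = list(per.describe().index)
```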

jreback · 2020-01-20T23:37:10Z

pandas/tests/series/methods/test_describe.py

@@ -29,6 +29,36 @@ def test_describe(self):
        )
        tm.assert_series_equal(result, expected)

+        s = Series(


ok for now; in a follow-on we want to parameterize these.

jreback · 2020-01-20T23:38:33Z

thanks @david-cortes

if you wouldn't mind parameterizing / separating out the other dtypes' tests, that would be great (could be combined with moving the code out from pandas/core/algorithms.py)

david-cortes force-pushed the master branch from 2802dce to 1c6a0e0 Compare December 11, 2019 14:27

gfyoung added the Enhancement, Numeric Operations (Arithmetic, Comparison, and Logical operations), and Datetime (Datetime data dtype) labels Dec 12, 2019

gfyoung reviewed Dec 12, 2019


david-cortes force-pushed the master branch from 1c6a0e0 to c72764a Compare December 12, 2019 13:19

david-cortes force-pushed the master branch 2 times, most recently from b35e73f to b7ce422 Compare December 12, 2019 15:25

david-cortes closed this Jan 18, 2020

david-cortes force-pushed the master branch from b7ce422 to f873fb9 Compare January 18, 2020 14:20

ENH: show percentiles in timestamp describe (pandas-dev#30164)

2f57671

david-cortes reopened this Jan 18, 2020

jreback requested changes Jan 18, 2020


david-cortes added a commit to david-cortes/pandas that referenced this pull request Jan 20, 2020

use internal series methods for datetime describe (pandas-dev#30209)

ff73c78

use internal series methods for datetime describe (pandas-dev#30209)

0082f81

david-cortes force-pushed the master branch from ff73c78 to 0082f81 Compare January 20, 2020 13:48

david-cortes mentioned this pull request Jan 20, 2020

REF: move 'describe' functions to new file #31154

Closed

jreback requested changes Jan 20, 2020


jreback added this to the 1.1 milestone Jan 20, 2020

more tests for datetimes describe

57a77db

jreback reviewed Jan 20, 2020


jreback approved these changes Jan 20, 2020


jreback merged commit f1bbb21 into pandas-dev:master Jan 20, 2020

TomAugspurger mentioned this pull request Apr 30, 2020

API: Revert changes to describe #33903

Closed
