-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: show percentiles in timestamp describe (#30164) #30209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/generic.py
Outdated
@@ -9806,11 +9788,28 @@ def describe_categorical_1d(data): | |||
|
|||
return pd.Series(result, index=names, name=data.name, dtype=dtype) | |||
|
|||
def describe_timestamp_1d(data): | |||
tz = data.dt.tz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reference issue number above this line.
Hello @david-cortes! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-01-20 21:54:50 UTC |
Added the issue number in the code, and also modified it to handle cases with empty inputs. Modified the tests on EDIT: updated yet again for code style checks. EDIT2: updated again for alphabetical order of imports. |
b35e73f
to
b7ce422
Compare
Hi @david-cortes - seems like one of the Azure jobs is failing (but the build isn't there anymore so we can't check it), and that there are now some conflicts with master. Is this something you're still working on? |
@MarcoGorelli : No, not working on anything else here, the PR is finished. Solved the merge conflicts now (they were caused by some tests moving to separate files). From what I get of the automatic checks that were failing, they were caused by something unrelated to this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@david-cortes can you create a new issue to move all of the describe helpers to pandas/io/alogorithms.py (or really we should just create pandas/io/algos/describe.py) and then move everything else.
can do before or after this PR
pandas/core/generic.py
Outdated
def describe_timestamp_1d(data): | ||
# GH-30164 | ||
tz = data.dt.tz | ||
asint = data.dropna().values.view("i8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need the .values.view, just use len(data.dropna())
for the shape check
data.min() and so on all work and return NaT correctly, none of these gymnastics are needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that's what the current code does, but let's simplify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also needs a whats note for 1.1, put in enhancements
Yes, after the latest changes the internal datetime methods now work just fine with empty inputs and missing values. Changed the code to use those instead, and added the changes to the 1.1.0 what’s new entries. Will now open another one for moving the EDIT: corrected for linting now. |
@jreback : Did you mean to say to move them to |
yes though i’d like to create pandas/core/algos/describe.py and split up algorithms (so this can. e a step there) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. small comment ping on green.
@@ -18,6 +18,8 @@ Enhancements | |||
Other enhancements | |||
^^^^^^^^^^^^^^^^^^ | |||
|
|||
- :meth:`Series.describe` will now show distribution percentiles for ``datetime`` dtypes, statistics ``first`` and ``last`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move to api changes section.
3, | ||
4, | ||
], | ||
"s1": [5, 2, 0, 1, 2, 3, 4, 1.581139], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have sufficient tests for timedeltas?
can you create a test for Period (which likely don't work), if you'd just xfail it (alt if you'd create an issue)
Moved the changes in the what’s new and added tests for |
@@ -29,6 +29,36 @@ def test_describe(self): | |||
) | |||
tm.assert_series_equal(result, expected) | |||
|
|||
s = Series( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok for now, in a followon we want to parameterize these.
thanks @david-cortes if you woudn't mind parameterize / separating out the other dtypes tests would be great (could be combined with moving the code out from pandas/core/algorithms.py |
closes #30164
When creating a pandas Series of timestamps and calling
describe
on it, it will not show the percentiles of the data, even if these are specified as arguments (s.describe(percentiles = [0.25, 0.5, 0.75])
).This PR solves the issue by introducing a new describe logic for timestamps that would:
first
andlast
tomin
andmax
in order to match numeric types.Show datetime columns alongside numerics by default in the describe method for DataFrames.(just realized there were other issues in which this was deemed to be the desirable behavior)After this, it more or less matches the functionality of R's
summary
onPOSIXct
andDate
types.Why I think this is a good idea: I oftentimes find myself trying to inspect tables of mixed types, and want to get a quick overview of which kind of ranges and variations are there in each column, including timestamps. Currently this is very inconvenient in pandas, and they are not displayed alongside numeric columns by default and don't match the names if passing
include = "all"
, which is what I want to see when I callDataFrame.describe
.Examples:
Update: corrected the coding style to comply with
black
and pass the automatic test.