DOC: Clarify `df.describe()` behavior with Timestamp columns #56918

sfc-gh-joshi · 2024-01-17T00:36:48Z

Pandas version checks

I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

Documentation problem

The Notes section for describe states the following (emphasis mine):

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

Since pandas 2.0 began treating Timestamps as numeric data, as far as I can tell, calling describe on a Series/DF with Timestamp data no longer yields the first or last rows. In fact, the example included in the documentation also has this behavior:

>>> s = pd.Series([
...     np.datetime64("2000-01-01"),
...     np.datetime64("2010-01-01"),
...     np.datetime64("2010-01-01")
... ])
>>> s.describe()
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Suggested fix for documentation

Assuming this behavior is intended: remove the mention of the first and last items, and of timestamps as object data.

For object data (such as strings), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency.
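To illustrate the distinction the suggested wording draws, here is a minimal sketch that compares the index labels `describe()` returns for string data and for Timestamp data. It assumes pandas 2.0 or later; the variable names are illustrative only, and the string Series is given `dtype=object` explicitly so the result does not depend on the default string dtype.

```python
import numpy as np
import pandas as pd

# Object (string) data: describe() gives a categorical-style summary.
strings = pd.Series(["a", "b", "a"], dtype=object)
string_rows = list(strings.describe().index)

# Timestamp data: describe() gives a numeric-style summary
# (assumption: pandas >= 2.0, where datetimes are treated as numeric).
stamps = pd.Series(
    [
        np.datetime64("2000-01-01"),
        np.datetime64("2010-01-01"),
        np.datetime64("2010-01-01"),
    ]
)
stamp_rows = list(stamps.describe().index)

print(string_rows)
print(stamp_rows)
```

Under that assumption, the Timestamp summary has no `first` or `last` entries, which matches the documented example output quoted above.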


luke396 · 2024-01-17T09:16:02Z

I have confirmed that it still exists in the main branch. If fixes are indeed needed, I can take it.

import numpy as np
import pandas as pd

s = pd.Series(
    [
        np.datetime64("2020-01-01"),
        np.datetime64("2020-01-02"),
        np.datetime64("2020-01-03"),
    ]
)

df = pd.DataFrame(
    {
        "time": [
            np.datetime64("2020-01-01"),
            np.datetime64("2020-01-02"),
            np.datetime64("2020-01-03"),
        ],
    }
)
print(s.describe())
print(df.describe())

[1/1] Generating write_version_file with a custom command
count                      3
mean     2020-01-02 00:00:00
min      2020-01-01 00:00:00
25%      2020-01-01 12:00:00
50%      2020-01-02 00:00:00
75%      2020-01-02 12:00:00
max      2020-01-03 00:00:00
dtype: object
                      time
count                    3
mean   2020-01-02 00:00:00
min    2020-01-01 00:00:00
25%    2020-01-01 12:00:00
50%    2020-01-02 00:00:00
75%    2020-01-02 12:00:00
max    2020-01-03 00:00:00

rhshadrach · 2024-01-17T21:29:00Z

In pandas.core.methods.describe there is the function describe_timestamp_as_categorical_1d that as far as I can tell isn't used anywhere. I'd guess that it went unused due to some refactors, but haven't checked.

Are first and last the same as min, max, or more like .iloc[0] and .iloc[-1]? In the former case, we can just update the documentation. In the latter case, I also think maybe we just update the documentation as well.

rhshadrach · 2024-01-17T21:30:25Z

It appears to me that first and last would give the same values as min and max:

pandas/pandas/core/methods/describe.py

Lines 322 to 323 in 9e0b655

    Timestamp(asint.min(), tz=tz),
    Timestamp(asint.max(), tz=tz),

I think we're good just updating the docs.
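The min/max versus `.iloc[0]`/`.iloc[-1]` question can also be checked from the user side. The following is a rough sketch, assuming pandas 2.0 or later, using a deliberately unsorted Series with made-up dates; it only inspects the current `describe()` output and says nothing about the unused helper function mentioned above.

```python
import pandas as pd

# Unsorted timestamps, so positional first/last differ from the extremes.
s = pd.Series(pd.to_datetime(["2010-01-01", "2000-01-01", "2005-06-15"]))

summary = s.describe()

# The "min" and "max" rows are value-based extremes.
min_matches = summary["min"] == s.min()
max_matches = summary["max"] == s.max()

# The positional first element is not the minimum here.
first_is_min = s.iloc[0] == s.min()

print(min_matches, max_matches, first_is_min)
```

With this input, the `min` and `max` rows agree with `s.min()` and `s.max()`, while `s.iloc[0]` does not equal the minimum, so the summary rows behave as extremes rather than as positional first and last items.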

alpakpinar · 2024-01-18T01:02:27Z

take

sfc-gh-joshi added Docs Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 17, 2024

rhshadrach removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jan 17, 2024

github-actions bot assigned alpakpinar Jan 18, 2024

alpakpinar mentioned this issue Jan 18, 2024

DOC: Remove inconsistency with timestamp data in describe() method docs #56937

Merged

5 tasks

mroeschke closed this as completed in #56937 Jan 18, 2024
