DOC: Clarify df.describe() behavior with Timestamp columns #56918

Closed
1 task done
sfc-gh-joshi opened this issue Jan 17, 2024 · 4 comments · Fixed by #56937

@sfc-gh-joshi

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

Documentation problem

The Notes section for describe states the following (emphasis mine):

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

As far as I can tell, since pandas 2.0 began treating Timestamps as numeric data, calling describe on a Series or DataFrame with Timestamp data no longer includes the first and last entries in the result. In fact, the example in the documentation itself shows this behavior:

>>> s = pd.Series([
...     np.datetime64("2000-01-01"),
...     np.datetime64("2010-01-01"),
...     np.datetime64("2010-01-01")
... ])
>>> s.describe()
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Suggested fix for documentation

Assuming this behavior is intended, remove the mention of the first and last entries, and of timestamps as object data:

For object data (such as strings), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency.
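
To make the distinction concrete, here is a minimal sketch comparing the result index for string data and for Timestamp data. It assumes pandas 2.0 or later (older versions may still report first and last for timestamps); the variable names are illustrative only.

```python
import pandas as pd

# Object-like data: a Series of strings.
strings = pd.Series(["a", "b", "b"])

# Timestamp data, mirroring the values used in the docs example.
stamps = pd.Series(pd.to_datetime(["2000-01-01", "2010-01-01", "2010-01-01"]))

string_summary = strings.describe()
stamp_summary = stamps.describe()

# Strings are summarized categorically: count, unique, top, freq.
print(list(string_summary.index))

# Assuming pandas >= 2.0, timestamps are summarized numerically
# (count, mean, min, percentiles, max), with no first/last entries.
print(list(stamp_summary.index))
print("first" in stamp_summary.index, "last" in stamp_summary.index)
```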

sfc-gh-joshi added the Docs and Needs Triage labels Jan 17, 2024
@luke396
Contributor

luke396 commented Jan 17, 2024

I have confirmed that it still exists in the main branch. If fixes are indeed needed, I can take it.

import numpy as np
import pandas as pd

s = pd.Series(
    [
        np.datetime64("2020-01-01"),
        np.datetime64("2020-01-02"),
        np.datetime64("2020-01-03"),
    ]
)

df = pd.DataFrame(
    {
        "time": [
            np.datetime64("2020-01-01"),
            np.datetime64("2020-01-02"),
            np.datetime64("2020-01-03"),
        ],
    }
)
print(s.describe())
print(df.describe())
[1/1] Generating write_version_file with a custom command
count                      3
mean     2020-01-02 00:00:00
min      2020-01-01 00:00:00
25%      2020-01-01 12:00:00
50%      2020-01-02 00:00:00
75%      2020-01-02 12:00:00
max      2020-01-03 00:00:00
dtype: object
                      time
count                    3
mean   2020-01-02 00:00:00
min    2020-01-01 00:00:00
25%    2020-01-01 12:00:00
50%    2020-01-02 00:00:00
75%    2020-01-02 12:00:00
max    2020-01-03 00:00:00

@rhshadrach
Member

In pandas.core.methods.describe there is the function describe_timestamp_as_categorical_1d that as far as I can tell isn't used anywhere. I'd guess that it went unused due to some refactors, but haven't checked.

Are first and last the same as min and max, or more like .iloc[0] and .iloc[-1]? In the former case, we can just update the documentation. In the latter case, I think we should probably just update the documentation as well.

@rhshadrach
Member

rhshadrach commented Jan 17, 2024

It appears to me that first and last would give the same values as min and max:

Timestamp(asint.min(), tz=tz),
Timestamp(asint.max(), tz=tz),

I think we're good just updating the docs.
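
For anyone who wants to check the min/max versus .iloc[0]/.iloc[-1] question on their own install, a small sketch with deliberately unsorted data (the values are made up for illustration, and pandas 2.0 or later is assumed):

```python
import pandas as pd

# Unsorted timestamps, so positional first/last differ from min/max.
s = pd.Series(pd.to_datetime(["2010-01-01", "2000-01-01", "2005-06-15"]))
summary = s.describe()

# The describe() entries agree with the chronological extremes...
print(summary["min"] == s.min())
print(summary["max"] == s.max())

# ...and not with the positional first and last elements.
print(s.iloc[0] == s.min())
print(s.iloc[-1] == s.max())
```

If the first two comparisons hold and the last two do not, the existing min and max entries already cover what first and last reported, which is consistent with a docs-only fix.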

rhshadrach removed the Needs Triage label Jan 17, 2024
@alpakpinar
Contributor

take
