Skip to content

FIX value_counts should skip NaT #7424

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jun 17, 2014
Merged

Conversation

hayd
Copy link
Contributor

@hayd hayd commented Jun 11, 2014

fixes #7423
fixes #5569.

@hayd
Copy link
Contributor Author

hayd commented Jun 11, 2014

Hmm this seems to be explicitly tested for in test_base and test_timeseries. I still think correct behaviour is to drop the NaTs...

@sinhrks
Copy link
Member

sinhrks commented Jun 11, 2014

The test was added based on the discussion in #6734. I agree it will be clearer if NaT is treated as the same as nan.

values = values.view(np.int64)
keys, counts = htable.value_count_int64(values)

from pandas.lib import NaT
msk = keys != NaT.value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should be comparing vs iNaT no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's iNaT? (NaT.value == -9223372036854775808).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its the same, but just use: pandas.tslib.iNaT is the convention

@jreback
Copy link
Contributor

jreback commented Jun 11, 2014

so this makes this consistent with nan (which is excluded), separately, maybe have an option to value_counts to include nan/NaT?

@jreback jreback added this to the 0.14.1 milestone Jun 11, 2014
@jreback jreback added Timedelta and removed Numeric labels Jun 11, 2014
@hayd
Copy link
Contributor Author

hayd commented Jun 11, 2014

@jreback yes, shouldn't be too trick to put something together. I was thinking the same when I was doing this... of some analogous include_na (maybe get_dummies... called dummy_na)... I suppose dropna is more standard kwarg.

@jreback
Copy link
Contributor

jreback commented Jun 11, 2014

yes, dropna=True (as the default) is prob correct. Pls add this. I think we might want to change the default to dropna=False in 0.15.0 but that is a separate issue.

@jreback
Copy link
Contributor

jreback commented Jun 14, 2014

@hayd looks good...squash and merge

jreback added a commit that referenced this pull request Jun 17, 2014
FIX value_counts should skip NaT
@jreback jreback merged commit faf471d into pandas-dev:master Jun 17, 2014
@jreback
Copy link
Contributor

jreback commented Jun 17, 2014

thanks!

@@ -184,6 +184,8 @@ def value_counts(values, sort=True, ascending=False, normalize=False,
bins : integer, optional
Rather than count values, group them into half-open bins,
convenience for pd.cut, only works with numeric data
dropna : boolean, default False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be True?

@jorisvandenbossche
Copy link
Member

@hayd some comments, but will fix it in a PR

@hayd
Copy link
Contributor Author

hayd commented Jun 17, 2014

@jorisvandenbossche thanks! good points!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype Timedelta Timedelta data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

value counts NaT doesn't count as NaN Feature request: Option to include NaNs in value_counts()
4 participants