Skip to content

BUG: make tz_localize operate on values rather than categories #28300

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Nov 2, 2019

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Sep 5, 2019

@@ -324,7 +324,9 @@ def __new__(cls, data):

orig = data if is_categorical_dtype(data) else None
if orig is not None:
data = Series(orig.values.categories, name=orig.name, copy=False)
data = Series(
orig.values.astype("datetime64[ns]"), name=orig.name, copy=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the astype? what if we get here with e.g. timedelta64 or Period?

Copy link
Member Author

@MarcoGorelli MarcoGorelli Sep 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it was just a quick attempt based on what I'd read in the issue (but without having fully understood the issue), sorry for having put this up so soon, will work on it further

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No sweat. ping me when this is ready for review

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel thanks - I'm happy to have this reviewed, whenever you have time

@MarcoGorelli MarcoGorelli changed the title WIP: make tz_localize operate on values rather than categories BUG: make tz_localize operate on values rather than categories Sep 7, 2019
@@ -324,8 +321,7 @@ def __new__(cls, data):

orig = data if is_categorical_dtype(data) else None
if orig is not None:
data = Series(orig.values.categories, name=orig.name, copy=False)

data = Series(orig.values.to_list(), name=orig.name, copy=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we don't want to convert to a list, I think rather we should still operate on the categories
and just return a new Categorical

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand - is data supposed to be a pandas.Categorical? If so, then it won't pass the

        if not isinstance(data, ABCSeries):
            raise TypeError(
                "cannot convert an object of type {0} to a "
                "datetimelike index".format(type(data))
            )

test when we get to

        if is_datetime64_dtype(data.dtype):
            return DatetimeProperties(data, orig)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is data supposed to be a pandas.Categorical?

At this point data is a Series with CategoricalDtype.

I think @jreback's point is that we don't want to call to_list() because that is costly

@@ -373,6 +373,17 @@ def test_tz_localize_convert_copy_inplace_mutate(self, copy, method, tz):
)
tm.assert_series_equal(result, expected)

def test_tz_localize_categorical(self, tz_aware_fixture):
# GH 27952
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be in the test_categorical tests

@@ -88,7 +88,7 @@ Categorical
^^^^^^^^^^^

- Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`)
-
- Bug in :meth:`CombinedDatetimelikeProperties.__new__` and :meth:`Properties._delegate_property_get` was resulting in properties being applied to categories rather than values (:issue:`27952`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not clear to a user what you are actually doing, try to use a little more obvious language, like

using date accessors on a categorical dtyped series of datetimes was incorrectly returning a ....

@jreback
Copy link
Contributor

jreback commented Sep 7, 2019

this also needs a bit more general testing, try to add a few more accessor functions that you are testing.

@jreback jreback added Categorical Categorical Data Type Datetime Datetime data dtype labels Sep 7, 2019
@MarcoGorelli
Copy link
Member Author

Thanks @jbrockmendel and @jreback for your comments.

I've moved the location of the tests to indexes/test_category and have removed the to_list call. However, I am using .__array__(). Is this also expensive?

I can't see how to keep data as a Series with CategoricalDtype, because then I'm blocked by #28448 .

@@ -324,7 +321,12 @@ def __new__(cls, data):

orig = data if is_categorical_dtype(data) else None
if orig is not None:
data = Series(orig.values.categories, name=orig.name, copy=False)
data = Series(
orig.values.__array__(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont use __array__() directly. this is meant to be used as np.asarray(orig.values)

expected = datetimes.dt.tz_localize(tz)
tm.assert_series_equal(result, expected)

@pytest.mark.parametrize("accessor", [("year"), ("month"), ("day")])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parens around year/month/day look unnecessary

categorical = datetimes.astype("category")
result = categorical.dt.tz_localize(tz)
expected = datetimes.dt.tz_localize(tz)
tm.assert_series_equal(result, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do the same with tz_convert (starting from a tz-aware dtype)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the result here be datetime64tz or categorical[datetime64tz]? My initial intuition is that it would stay categorical

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel thanks for your feedback! I've made the changes you suggested.

Regarding the result dtype - the example from #27952 returns

dtype: datetime64[ns]

and so this isn't affected by my changes. Perhaps it should be a separate issue?

Copy link
Member Author

@MarcoGorelli MarcoGorelli Sep 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the docs say

Note The returned Series (or DataFrame) is of the same type as if you used the .str. / .dt. on a Series of that type (and not of type category!).

so this is consistent with that - but I'm happy to work on changing this if that's what you prefer

@@ -110,7 +110,7 @@ Categorical
^^^^^^^^^^^

- Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`)
-
- Using date accessors on a categorical dtyped series of datetimes was incorrectly operating only on the unique categories (:issue: `27952`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mention Series.dt.tz_localize here as an example

tm.assert_series_equal(result, expected)

def test_dt_tz_convert(self, tz_aware_fixture):
# GH 27952
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think these should go in pandas/tests/series/tests_datetime_values.py

@@ -135,7 +135,9 @@ Categorical

- Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`)
- Bug in :meth:`Categorical.astype` where ``NaN`` values were handled incorrectly when casting to int (:issue:`28406`)
-
- Using date accessors on a categorical dtyped series of datetimes was not returning a Series (or DataFrame) of the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a hard time following this. Is the "not returning a Series / DataFrame" the same issue as the "skipping duplicates?"

If they're distinct issue, break it into multiple lines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger the only issue is that it's acting on the unique categories rather than on the values. I've tried rephrasing this more clearly

@@ -165,7 +165,9 @@ Categorical

- Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`)
- Bug in :meth:`Categorical.astype` where ``NaN`` values were handled incorrectly when casting to int (:issue:`28406`)
-
- Using date accessors on a categorical dtyped :class:`Series` of datetimes was not returning the same object that would
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These :meth: aren't correct, since they don't resolve to a real method.

How about

Suggested change
- Using date accessors on a categorical dtyped :class:`Series` of datetimes was not returning the same object that would
- Bug in :class:`Series.str` and :class:`Series.dt` accessor for categorical-dtype returning results with the incorrect length (:issue:`27952`)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure - I've rephrased the first part as you've suggested, but have kept the Series.dt.tz_localize example as suggested by @jreback

@MarcoGorelli MarcoGorelli force-pushed the tz-localize-categories branch from 5d903a0 to 4b0c2a8 Compare October 8, 2019 16:03
@MarcoGorelli
Copy link
Member Author

@TomAugspurger @jbrockmendel @jreback thanks for your reviews. Now that #28762 has been closed, I've been able to push a solution to this which doesn't involve np.asarray and IMO is a marginally cleaner.

Is this something you're still interested in? Should the return type (and thus the documentation) be changed as well?

If not, no worries, thanks for the time you've already dedicated to helping me, feel free to close it :)

@jreback
Copy link
Contributor

jreback commented Oct 8, 2019

sure this still makes sense

@jbrockmendel
Copy link
Member

@MarcoGorelli can you rebase and we'll take a fresh look and try to get this through

@MarcoGorelli MarcoGorelli force-pushed the tz-localize-categories branch from 4b0c2a8 to 4d0320c Compare November 2, 2019 09:48
@MarcoGorelli
Copy link
Member Author

Sure, done

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. some stylistic commments, ping on green.

@@ -324,7 +321,12 @@ def __new__(cls, data):

orig = data if is_categorical_dtype(data) else None
if orig is not None:
data = Series(orig.values.categories, name=orig.name, copy=False)
data = Series(
orig.values,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this orig.array (doesn't make a practical difference), but is more idiomatic to our usecase here.

@@ -344,6 +344,39 @@ def test_dt_namespace_accessor_categorical(self):
expected = Series([2017, 2017, 2018, 2018], name="foo")
tm.assert_series_equal(result, expected)

def test_dt_tz_localize(self, tz_aware_fixture):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add categorical to these test names to indicate what is going onhere.

@jreback jreback added this to the 1.0 milestone Nov 2, 2019
@MarcoGorelli
Copy link
Member Author

@jreback Sure, here you go

@jreback jreback merged commit bf5848f into pandas-dev:master Nov 2, 2019
@jreback
Copy link
Contributor

jreback commented Nov 2, 2019

thanks @MarcoGorelli nice patch!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

series.dt.tz_localize() on Categorical operates on categories, not values
4 participants