Skip to content

TYPING: Partial typing of Categorical #27318

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 25, 2019

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Jul 10, 2019

Adds some typing to Categorical.

This PR does the simpler typing of Categorical. In addition there are some more thorny typing that I've stayed away from for now to see if others can implement the basic types (list-likes etc.).

@@ -26,8 +26,11 @@
)
ArrayLike = TypeVar("ArrayLike", ABCExtensionArray, np.ndarray)
DatetimeLikeScalar = TypeVar("DatetimeLikeScalar", Period, Timestamp, Timedelta)
Dtype = Union[str, np.dtype, ExtensionDtype]
Dtype = Union[str, np.dtype, "dtypes.ExtensionDtype"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I don't do it like this, I get a circular import error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this actually resolve back to "ExtensionDtype" or does it replace it with Any?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it works in PyCharm + I tried setting the string to a random string, which mypy won't accept.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know. Wondering if we can extend that to solve #27146 (separate PR obviously)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that should be possible, though it wold be even better if the ABCs could be used...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't you use 'ExtendionDtype' here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I guess maybe revert to what you had before is more clear (e.g. do't use the string)

@@ -519,6 +521,8 @@ def astype(self, dtype, copy=True):

"""
if is_categorical_dtype(dtype):
dtype = cast(Union[str, CategoricalDtype], dtype)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy is not able to infer the dtype type here, so I have to do a cast.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a typing dummy, can you explain why this is needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add something to the contributing guide about this

@topper-123 topper-123 added Typing type annotations, mypy/pyright type checking Categorical Categorical Data Type labels Jul 10, 2019
@topper-123 topper-123 added this to the 0.25.0 milestone Jul 10, 2019
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for keeping these going

@@ -26,8 +26,11 @@
)
ArrayLike = TypeVar("ArrayLike", ABCExtensionArray, np.ndarray)
DatetimeLikeScalar = TypeVar("DatetimeLikeScalar", Period, Timestamp, Timedelta)
Dtype = Union[str, np.dtype, ExtensionDtype]
Dtype = Union[str, np.dtype, "dtypes.ExtensionDtype"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this actually resolve back to "ExtensionDtype" or does it replace it with Any?

FilePathOrBuffer = Union[str, Path, IO[AnyStr]]

FrameOrSeries = TypeVar("FrameOrSeries", ABCSeries, ABCDataFrame)
Scalar = Union[str, int, float]

# TODO(GH26403): Replace with Optional[bool] or bool
OrderedType = Union[None, bool, object]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always go for Optional instead of Union[None...].

Also object is very rarely what you want in Typing so would want to use Any instead, though the Union of Any and another object also doesn't make sense since Any encompasses everything...

Can't this just be Optional[bool]?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also reading through the entire diff I see you just moved this from elsewhere so no worries but makes sense to correct here

Copy link
Contributor Author

@topper-123 topper-123 Jul 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional only works with a single argument, so we have to use Union here.

The object type will also be removed in 1.0, so this type can just be bool. I think @TomAugspurger will deal with that. If you look at the comment in core/dtypes/dtypes.py, line 21 to see the reasoning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha but the sentinel never actually gets returned from def ordered below right? If that's correct then I think should be Ordered = Optional[bool] and in the init for CategoricalDtype should be Union[Ordered, object]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I can't get that to work. Categorical.ordered's type is same as CategoricalDtype._ordered which typewise may be a object (but actually can't be).

I don't have a good solution, except waiting for ordered_sentinel to be removed in 1.0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you share the error it gives you? If the problem is that the ordered_sentinel is in the constructor of CategoricalDtype I think could just explicitly annotate that parameter to be Optional[bool] as well, which would be cleanest in the long run

Copy link
Contributor Author

@topper-123 topper-123 Jul 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas\core\arrays\categorical.py:475: error: Incompatible return value type (got "object", expected "Optional[bool]")

It all goes back to the ordered parameter being of type Union[None, bool, object], while the ordered attribute should be of type Optional[bool] (but isn't, as its type comes from the parameter's type).

Don't know if there's a pattern for having different parameter and attribute types...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just add # type: ignore to that one line? I think that is the lesser of all evils and easiest to revert whenever that change gets made

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nitpicky but can you also change the name from OrderedType? That would seem to imply that this is something like a TypeVar or NewType.

Either OrderedAlias or just plain Ordered would be a better fit

@@ -26,8 +26,11 @@
)
ArrayLike = TypeVar("ArrayLike", ABCExtensionArray, np.ndarray)
DatetimeLikeScalar = TypeVar("DatetimeLikeScalar", Period, Timestamp, Timedelta)
Dtype = Union[str, np.dtype, ExtensionDtype]
Dtype = Union[str, np.dtype, "dtypes.ExtensionDtype"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know. Wondering if we can extend that to solve #27146 (separate PR obviously)

FilePathOrBuffer = Union[str, Path, IO[AnyStr]]

FrameOrSeries = TypeVar("FrameOrSeries", ABCSeries, ABCDataFrame)
Scalar = Union[str, int, float]

# TODO(GH26403): Replace with Optional[bool] or bool
OrderedType = Union[None, bool, object]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha but the sentinel never actually gets returned from def ordered below right? If that's correct then I think should be Ordered = Optional[bool] and in the init for CategoricalDtype should be Union[Ordered, object]

FilePathOrBuffer = Union[str, Path, IO[AnyStr]]

FrameOrSeries = TypeVar("FrameOrSeries", ABCSeries, ABCDataFrame)
Scalar = Union[str, int, float]

# TODO(GH26403): Replace with Optional[bool] or bool
OrderedType = Union[None, bool, object]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just add # type: ignore to that one line? I think that is the lesser of all evils and easiest to revert whenever that change gets made

FilePathOrBuffer = Union[str, Path, IO[AnyStr]]

FrameOrSeries = TypeVar("FrameOrSeries", ABCSeries, ABCDataFrame)
Scalar = Union[str, int, float]

# TODO(GH26403): Replace with Optional[bool] or bool
OrderedType = Union[None, bool, object]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nitpicky but can you also change the name from OrderedType? That would seem to imply that this is something like a TypeVar or NewType.

Either OrderedAlias or just plain Ordered would be a better fit

@@ -547,6 +548,8 @@ def update_dtype(self, dtype: "CategoricalDtype") -> "CategoricalDtype":
).format(dtype=dtype)
raise ValueError(msg)

dtype = cast(CategoricalDtype, dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this below subsequent comment? Seems to read better and more logically fit there

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an aside I was kind of curious why this couldn't be inferred more easily since it's really just a str and a CategoricalDtype that are allowed here. So ideally I thought isinstance(str) and isinstance(self) would be two possible branches.

I think the reason this wasn't laid out that way is to allow Categorical objects to be passed as well, but doing so actually causes an error:

t = pd.CategoricalDtype(categories=['b', 'a'], ordered=True)
c = pd.Categorical(['a','b','c','a','b','c'], ordered=True, categories=['c', 'b', 'a'])
>>> t.update_dtype(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/williamayd/clones/pandas/pandas/core/dtypes/dtypes.py", line 561, in update_dtype
    new_ordered = dtype._ordered
AttributeError: 'Categorical' object has no attribute '_ordered'

cc @TomAugspurger for intent. If a bug maybe we drop this annotation for now and clean up with bug resolution

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a bug in update_dtype due to the changes we recently made with ordered cc @jschendel

The code in there is assuming we have a CategoricalDtype (which has a _ordered now), but it can apparently also be a Categorical (or should that raise?), because is_dtype allows that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool thanks - for now I think OK to remove annotation from this PR then and can be addressed with whatever bug fix is there

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #27338 to address this

@jreback
Copy link
Contributor

jreback commented Jul 12, 2019

looks good, small comment, merge master and ping on green.

@WillAyd
Copy link
Member

WillAyd commented Jul 12, 2019

Can we address my comment on OrderedType before merging? I don't think the definition in pandas._typing of that should add object because one method in the code base uses that as a temporary workaround

@jreback jreback removed this from the 0.25.0 milestone Jul 17, 2019
@jreback
Copy link
Contributor

jreback commented Jul 23, 2019

can you rebase

@topper-123 topper-123 force-pushed the Categorical_typing branch from c3919fe to 5a6de11 Compare July 24, 2019 22:58
@topper-123 topper-123 force-pushed the Categorical_typing branch from 5a6de11 to b76f27e Compare July 24, 2019 23:41
@topper-123
Copy link
Contributor Author

Rebased.

@jreback jreback added this to the 1.0 milestone Jul 25, 2019
@jreback
Copy link
Contributor

jreback commented Jul 25, 2019

lgtm. over to you @WillAyd

@WillAyd WillAyd merged commit 0a712fa into pandas-dev:master Jul 25, 2019
@WillAyd
Copy link
Member

WillAyd commented Jul 25, 2019

Great thanks a lot @topper-123

quintusdias pushed a commit to quintusdias/pandas_dev that referenced this pull request Aug 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants