Skip to content

TYP: hashing #39949

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Mar 8, 2021
Merged

TYP: hashing #39949

merged 8 commits into from
Mar 8, 2021

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

return _hash_ndarray(vals, encoding, hash_key, categorize)


def _hash_ndarray(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had to separate this out, as i couldn't convince mypy to recognize vals = cast(np.ndarray, vals) or assert isinstance(vals, np.ndarray)

@jreback jreback added the Typing type annotations, mypy/pyright type checking label Feb 23, 2021
@jreback jreback added this to the 1.3 milestone Feb 23, 2021
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
Copy link
Contributor

jreback commented Feb 27, 2021

lgtm. @simonjayhawkins if you have a chance.

@jbrockmendel
Copy link
Member Author

if theres no interest here we can close it to clear the queue

@jreback
Copy link
Contributor

jreback commented Mar 5, 2021

if theres no interest here we can close it to clear the queue

its fine, this is worth merging.

@@ -9150,7 +9150,7 @@ def count(

return result.astype("int64")

def _count_level(self, level: Level, axis: Axis = 0, numeric_only=False):
def _count_level(self, level: Level, axis: int = 0, numeric_only: bool = False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change axis? DataFrame._get_axis accepts Axis.

from https://github.com/microsoft/pyright/blob/master/docs/typed-libraries.md#wide-vs-narrow-types

In general, a function input parameter should be annotated with the widest possible type supported by the implementation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its only called from count, and count calls _get_axis_number just before doing so

@@ -139,25 +167,25 @@ def hash_pandas_object(
return h


def hash_tuples(vals, encoding="utf8", hash_key: str = _default_hash_key):
def hash_tuples(
vals: Union[MultiIndex, List[Tuple]],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List -> Iterable

again from https://github.com/microsoft/pyright/blob/master/docs/typed-libraries.md#wide-vs-narrow-types

As a specific application of the “use the widest type possible” rule, libraries should generally use immutable forms of container types instead of mutable forms (unless the function needs to modify the container). Use Sequence rather than List, Mapping rather than Dict, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats the rule for Sequence vs Iterable?

In this case, we only get here with MultiIndex and List[Tuple]. I get the wider-is-better for public APIs, but for purely internal code strictness seems beneficial. is that distinction not relevant?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK mypy works by checking each function is isolation. i.e. it checks that all the code within the function is compatible with the function annotation. i.e. checks the types in function calls within the body with the type parameters of the called functions.

so doing the reverse of this, i.e. annotating with the actual types passed seems like extra work imo. (and could be simply achieved using MonkeyType)

Iterable because c_is_list_like accepts an iterable as list-like

Using List is bad practice. function code could modify the contents of the List. If the parameter is typed as an immutable container, mypy won't allow it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, updated

@@ -139,25 +167,25 @@ def hash_pandas_object(
return h


def hash_tuples(vals, encoding="utf8", hash_key: str = _default_hash_key):
def hash_tuples(
vals: Union[MultiIndex, List[Tuple]],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type parameters for Tuple.


Parameters
----------
vals : MultiIndex, list-of-tuples, or single tuple
vals : MultiIndex or list-of-tuples
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list-like of tuples

"""
Hash an MultiIndex / list-of-tuples efficiently
Hash an MultiIndex / list-of-tuples efficiently.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list-like-of-tuples

@@ -224,11 +252,11 @@ def _hash_categorical(c, encoding: str, hash_key: str):


def hash_array(
vals,
vals: ArrayLike,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vals in docstring also needs updating.

@@ -255,11 +283,26 @@ def hash_array(
# hash values. (This check is above the complex check so that we don't ask
# numpy if categorical is a subdtype of complex, as it will choke).
if is_categorical_dtype(dtype):
vals = cast("Categorical", vals)
return _hash_categorical(vals, encoding, hash_key)
elif is_extension_array_dtype(dtype):
vals, _ = vals._values_for_factorize()
dtype = vals.dtype
Copy link
Member

@simonjayhawkins simonjayhawkins Mar 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dtype = vals.dtype not now needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, will update

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jbrockmendel merge on green

@jbrockmendel jbrockmendel merged commit c29facc into pandas-dev:master Mar 8, 2021
@jbrockmendel jbrockmendel deleted the typ-hashing branch March 8, 2021 18:12
jbrockmendel added a commit to jbrockmendel/pandas that referenced this pull request Mar 11, 2021
* TYP: hashing

* platform compat, mypy fixup

* setops fix

* port more annotations

* update annotations, docstring

* List-> Iterable
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants