
Add null object, and update top-level API specification #157


Merged
merged 5 commits into from
May 18, 2023
Changes from 3 commits
52 changes: 49 additions & 3 deletions spec/API_specification/dataframe_api/__init__.py
@@ -3,7 +3,7 @@
"""
from __future__ import annotations

-from typing import Mapping, Sequence
+from typing import Mapping, Sequence, Any

from .column_object import *
from .dataframe_object import *
@@ -14,8 +14,9 @@

__dataframe_api_version__: str = "YYYY.MM"
"""
-String representing the version of the DataFrame API specification to which the
-conforming implementation adheres.
+String representing the version of the DataFrame API specification to which
+the conforming implementation adheres. Set to a concrete value for a stable
+implementation of the dataframe API standard.
"""

def concat(dataframes: Sequence[DataFrame]) -> DataFrame:
@@ -73,3 +74,48 @@ def dataframe_from_dict(data: Mapping[str, Column]) -> DataFrame:
    DataFrame
    """
    ...

class null:
    """
    A `null` object to represent missing data.

    ``null`` is a scalar, and may be used when constructing a `Column` from a
    Python sequence with `column_from_sequence`. It does not support ``is``,
    ``==`` or ``bool``.

    Raises
    ------
    TypeError
        From ``__eq__`` and from ``__bool__``.

        For ``__eq__``: a missing value must not be compared for equality
        directly. Instead, use `DataFrame.isnull` or `Column.isnull` to check
        for presence of missing values.
Collaborator

How can one reliably check whether a scalar is null, then? Do we need a namespace-level `isnull` function?

E.g., someone runs `my_column.sum(skip_null=False)` and it yields `null` as the result.

Member Author

Good question, I see a few options:

  • Use `is`. This runs into the "that does not allow duck typing" problem.
  • Use `==`. This may also not work with duck typing, or at least is not desirable: `null_duck_on_device == null` is not going to do what you want if `null` lives on the host. It may also lead people to write `column == null`, which probably works if one implements `__eq__`.
  • Add a free `isnull` function. Perhaps better; the only cost is a new object in the namespace.
  • Add a comparison method, e.g. `null.is_equal(val)`. Doesn't read all that well.

Member

I don't think the first two options are viable: `is` doesn't work for type-specific nulls (and I don't think we should disallow those if an implementation uses them), and `==` shouldn't give the behaviour you want here (it shouldn't necessarily return True, just as `nan == nan` doesn't return True).

A free `is_null` function might be the easiest? (A method on the generic `namespace.null` object probably works as well, but a top-level function feels a bit more natural to me.)

Member Author

Added a free isnull function which checks for a scalar null. That should address this issue.
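
To make the objections to `is` and `==` in this thread concrete, here is a minimal sketch. `DuckNull` is a hypothetical stand-in for a duck-typed null (it is not part of the specification), and NaN is used only as the analogy raised above.

```python
import math


class DuckNull:
    """Hypothetical duck-typed null, e.g. one that lives on a device (illustration only)."""


host_null = DuckNull()
device_null = DuckNull()

# Identity fails for duck-typed nulls: both objects mean "missing",
# but they are distinct Python objects.
identity_matches = host_null is device_null

# Equality is not a reliable missing-value test either: NaN is unequal to itself.
nan = float("nan")
nan_equals_itself = nan == nan

# A dedicated predicate is the reliable check, as math.isnan is for NaN.
nan_detected = math.isnan(nan)

print(identity_matches, nan_equals_itself, nan_detected)
```

A free predicate function sidesteps both problems because the library decides what counts as null, instead of relying on object identity or on `__eq__`.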


        For ``__bool__``: truthiness of a missing value is ambiguous.

    Notes
    -----
    Like for Python scalars, the ``null`` object may be duck typed so it can
    reside on (e.g.) a GPU. Hence, the builtin ``is`` keyword should not be
    used to check if an object *is* the ``null`` object.

    """
    ...

def isnull(value: Any, /) -> bool:
Contributor

Should we use `value: object` instead? `Any` turns off the type checker, and if we're saying that any input type is valid, then we probably want to refrain from using any type-specific methods on `value`?

Contributor

can always address this later though - I'd say let's just get this in now

Member Author

sure, let's do that now. and you made that review comment on another PR before, sorry I forgot about it. will update in a minute.

Member Author

> and if we're saying that any input type is valid, then we probably want to refrain from using any type-specific methods on value?

Sure. There's nothing saying that those exist or must be used, right? This seems like a detail that an implementer has to get right.

Contributor

Yeah, it's just easier for the implementer to get it right if the type checker flags when they use something that not all Python objects have.
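
The difference between the two annotations can be sketched as follows. `FakeNull`, `isnull_any` and `isnull_object` are hypothetical names used only for illustration; the behaviour attributed to type checkers is described in comments, since it does not show up at runtime.

```python
from typing import Any


class FakeNull:
    """Hypothetical null type carrying a type-specific attribute (illustration only)."""

    is_null = True


def isnull_any(value: Any, /) -> bool:
    # Type checkers accept this: attribute access on `Any` is never checked.
    # At runtime it raises AttributeError for objects without `is_null`.
    return value.is_null


def isnull_object(value: object, /) -> bool:
    # With `object`, a type checker would reject `value.is_null`, which nudges
    # the implementer towards a check that is valid for every Python object.
    return isinstance(value, FakeNull)


print(isnull_object(FakeNull()), isnull_object(3))
```

Both functions agree on a `FakeNull` instance; they differ on arbitrary inputs such as `3`, where only the `object`-annotated version returns a result instead of raising.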

    """
    Check if an object is a `null` scalar.

    Parameters
    ----------
    value : Any
        Any input type is valid.

    Returns
    -------
    bool
        True if the input is a `null` object from the same library which
        implements the dataframe API standard, False otherwise.

    """
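
The specification leaves the implementation of `null` and `isnull` to each library. As a rough sketch of one way a library might satisfy the docstrings above: the `NullType` class and the choice to expose `null` as a singleton instance are assumptions made here, not something the spec prescribes, and the `object` annotation follows the review suggestion.

```python
from __future__ import annotations


class NullType:
    """Hypothetical type of the missing-data scalar (illustration only)."""

    def __eq__(self, other: object) -> bool:
        # The spec says equality comparison must raise TypeError.
        raise TypeError("null does not support ==; use isnull() instead")

    def __bool__(self) -> bool:
        # The spec says truthiness of a missing value is ambiguous.
        raise TypeError("truthiness of null is ambiguous")

    def __repr__(self) -> str:
        return "null"


# Assumption: the library exposes a single instance under the name `null`.
null = NullType()


def isnull(value: object, /) -> bool:
    """Return True only if `value` is this library's null scalar."""
    # isinstance works for any Python object and avoids both `is` and `==`.
    return isinstance(value, NullType)


print(isnull(null), isnull(None), isnull(float("nan")))
```

With such an implementation, a scalar returned from a reduction could be tested with `isnull(result)` rather than `result == null` or `result is null`.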
15 changes: 15 additions & 0 deletions spec/API_specification/index.rst
@@ -5,6 +5,21 @@ API specification

.. currentmodule:: dataframe_api

The API consists of dataframe, column and groupby classes, plus a small number
of objects and functions in the top-level namespace. The latter are:

.. autosummary::
   :toctree: generated
   :template: attribute.rst
   :nosignatures:

   __dataframe_api_version__
   isnull
   null

The ``DataFrame``, ``Column`` and ``GroupBy`` objects have the following
methods and attributes:

.. toctree::
   :maxdepth: 3
