Skip to content

Add a fill_null method to dataframe and column #173

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 7, 2023

Conversation

rgommers
Copy link
Member

Follow-up to gh-167, which added fill_nan, and closes gh-142.

As discussed in gh-142, this does only the scalar case now. I think the main thing that could be done differently is column_names. This seemed like the most logical thing to do, keeping value to a scalar only. It matches what Vaex does. Pandas instead allows value to be: scalar, dict, Series, or DataFrame. Polars doesn't seem to have any way of dealing with non-uniform dtypes of columns.

Copy link
Contributor

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment, but generally in favour

Comment on lines 733 to 735
This method can only be used if all columns that are to be filled are
of the same dtype kind (e.g., all floating-point, all integer, all
string or all datetime dtypes). If that is not the case, it is not
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we clarify whether we expect them to all be the same bit-width? I.E. how this current reads is that a mix of int32 and int64 is potentially allowed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did intend to allow that, because the input is a Python int or float either way. I don't feel strongly about it though; I think it works as is, but if we want to be more strict then we can do that (loosening later is always possible).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did intend to allow that, because the input is a Python int or float either way.

It's a ducktyped Python int or float that is potentially strongly typed under the hood and the casting could be potentially lossy in the case of float64 --> float32 or datetime resolution downgrade ns --> us.

I'd prefer to be more strict now and figure out the path towards loosening later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move forwards with this? is the solution just to replace "dtype kind" with "dtype"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the solution just to replace "dtype kind" with "dtype"?

Yes I think so - done now.

value : Scalar
Value used to replace any ``null`` values in the dataframe with.
Must be of the Python scalar type matching the dtype(s) of the dataframe.
column_names : list[str] | None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential alternative name could be subset, to indicate you want to apply the filling operation only to a subset of the columns

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently don't use either of column_names or subset. I'm happy either way, I think the former is a bit more explicit about what it refers to while the latter expresses better what is done with the provided names.

Anyone else have a preference?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems fine

@@ -641,3 +641,16 @@ def fill_nan(self, value: float | 'null', /) -> Column:

"""
...

def fill_null(self, value: Scalar, /) -> Column:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the PR to discuss it but it would be really nice if for typing purposes we could type this as:

    def fill_null(self[T], value: Scalar[T], /) -> Column[T]:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gh-187 makes a big step in this direction. That said, the generics don't quite work here I think, since the T in Scalar[T] is a Python scalar while in Column[T] it's a dtype. We happened to have a good discussion about gh-187 yesterday, and striking a balance between making the spec fully type-checkable and keeping it readable. I suspect this kind of case requires overloads, which may be useful at some point but can't be rendered as html - so it divorces static typing a bit from spec writing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should be a way to make Scalar[T] work, and every library can define what their scalar type is corresponding to each of the standard's types. But that's for another issue, let's pick it up again after #187 is in (I'll update it today)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and every library can define what their scalar type is corresponding to each of the standard's types

Quick comment: they are, or at least include, Python builtin types, which one has no control over. So it'd be int) -> Column[Int64] or some such thing.

But that's for another issue, let's pick it up again after #187 is in (I'll update it today)

Sounds good, thanks.

@MarcoGorelli MarcoGorelli merged commit a0f757b into data-apis:main Jul 7, 2023
@rgommers rgommers deleted the add-fill_null branch July 7, 2023 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nan to null strategy?
4 participants