Add a `fill_null` method to dataframe and column #173
Conversation
small comment, but generally in favour
> This method can only be used if all columns that are to be filled are
> of the same dtype kind (e.g., all floating-point, all integer, all
> string or all datetime dtypes). If that is not the case, it is not
Should we clarify whether we expect them to all be the same bit-width? I.e., as this currently reads, a mix of `int32` and `int64` is potentially allowed.
I did intend to allow that, because the input is a Python `int` or `float` either way. I don't feel strongly about it though; I think it works as is, but if we want to be more strict then we can do that (loosening later is always possible).
> I did intend to allow that, because the input is a Python `int` or `float` either way.

It's a duck-typed Python `int` or `float` that is potentially strongly typed under the hood, and the casting could potentially be lossy, e.g. `float64` --> `float32`, or a datetime resolution downgrade `ns` --> `us`.
I'd prefer to be more strict now and figure out the path towards loosening later.
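To illustrate the lossiness concern, here is a small sketch (using NumPy as the backing library, which is an assumption for illustration, not part of the spec) showing that a Python `float` does not survive a round trip through a 32-bit column dtype:

```python
import numpy as np

# A Python float is 64-bit under the hood; downcasting it to a
# float32 column's dtype can silently lose precision.
fill_value = 0.1

downcast = np.float32(fill_value)  # what an implicit float64 -> float32 cast does
lossy = float(downcast) != fill_value

print(lossy)  # True: float32(0.1) round-trips to 0.100000001490116...
```

The same shape of argument applies to the `ns` --> `us` datetime case: a fill value carrying more resolution than the column dtype gets truncated silently.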
Can we move forward with this? Is the solution just to replace "dtype kind" with "dtype"?
> is the solution just to replace "dtype kind" with "dtype"?

Yes, I think so - done now.
```
value : Scalar
    Value used to replace any ``null`` values in the dataframe with.
    Must be of the Python scalar type matching the dtype(s) of the dataframe.
column_names : list[str] | None
```
A potential alternative name could be `subset`, to indicate you want to apply the filling operation only to a subset of the columns.
We currently don't use either of `column_names` or `subset`. I'm happy either way; I think the former is a bit more explicit about what it refers to, while the latter expresses better what is done with the provided names.

Anyone else have a preference?
this seems fine
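As a sketch of the semantics under discussion (the name `column_names`, the plain-dict "dataframe", and this whole function are illustrative only, not the standard's actual API):

```python
from __future__ import annotations

def fill_null(
    df: dict[str, list],
    value,
    column_names: list[str] | None = None,
) -> dict[str, list]:
    """Replace None entries with `value` in the named columns only,
    or in all columns when `column_names` is None."""
    targets = column_names if column_names is not None else list(df)
    return {
        name: [value if (name in targets and x is None) else x for x in col]
        for name, col in df.items()
    }

df = {"a": [1, None], "b": [None, 2]}
print(fill_null(df, 0, column_names=["a"]))  # {'a': [1, 0], 'b': [None, 2]}
print(fill_null(df, 0))                      # {'a': [1, 0], 'b': [0, 2]}
```

With this reading, `column_names=None` means "fill everywhere", which is the case the same-dtype restriction in the docstring above is guarding.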
Follow-up to data-apisgh-167, which added `fill_nan`, and closes data-apisgh-142.
```diff
@@ -641,3 +641,16 @@ def fill_nan(self, value: float | 'null', /) -> Column:
        """
        ...

    def fill_null(self, value: Scalar, /) -> Column:
```
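A minimal sketch of what an implementation of this method might look like in a pandas-backed library; `PandasColumn` is a hypothetical wrapper, not part of the standard:

```python
import pandas as pd

class PandasColumn:
    """Hypothetical wrapper implementing the standard's Column.fill_null."""

    def __init__(self, series: pd.Series):
        self._s = series

    def fill_null(self, value, /) -> "PandasColumn":
        # pd.Series.fillna replaces missing entries (NaN/None/NA) with `value`
        return PandasColumn(self._s.fillna(value))

col = PandasColumn(pd.Series([1.0, None, 3.0]))
print(col.fill_null(0.0)._s.tolist())  # [1.0, 0.0, 3.0]
```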
Not the PR to discuss it, but it would be really nice if, for typing purposes, we could type this as:

```python
def fill_null(self[T], value: Scalar[T], /) -> Column[T]:
```
gh-187 makes a big step in this direction. That said, the generics don't quite work here, I think, since the `T` in `Scalar[T]` is a Python scalar while in `Column[T]` it's a dtype. We happened to have a good discussion about gh-187 yesterday, about striking a balance between making the spec fully type-checkable and keeping it readable. I suspect this kind of case requires overloads, which may be useful at some point but can't be rendered as HTML - so it divorces static typing a bit from spec writing.
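To make the overload point concrete, here is one sketch of how overloads could tie the Python scalar type of `value` to the column's dtype; the dtype class names and the whole shape are assumptions for illustration, not anything the spec defines:

```python
from typing import Generic, TypeVar, overload

T = TypeVar("T")

# Hypothetical dtype marker classes
class Int64: ...
class Float64: ...

class Column(Generic[T]):
    @overload
    def fill_null(self: "Column[Int64]", value: int, /) -> "Column[Int64]": ...
    @overload
    def fill_null(self: "Column[Float64]", value: float, /) -> "Column[Float64]": ...
    def fill_null(self, value, /):
        ...  # implementation elided; only the overloads carry the typing story
```

This checks statically, but each dtype needs its own overload line, which is the readability cost mentioned above: the overloads exist purely for the type checker and clutter the rendered spec.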
I think there should be a way to make `Scalar[T]` work, and every library can define what their scalar type is corresponding to each of the standard's types. But that's for another issue; let's pick it up again after #187 is in (I'll update it today).
> and every library can define what their scalar type is corresponding to each of the standard's types

Quick comment: they are, or at least include, Python builtin types, which one has no control over. So it'd be `int) -> Column[Int64]` or some such thing.

> But that's for another issue, let's pick it up again after #187 is in (I'll update it today)

Sounds good, thanks.
Follow-up to gh-167, which added `fill_nan`, and closes gh-142.

As discussed in gh-142, this does only the scalar case now. I think the main thing that could be done differently is `column_names`. This seemed like the most logical thing to do, keeping `value` to a scalar only. It matches what Vaex does. Pandas instead allows `value` to be a `scalar, dict, Series, or DataFrame`. Polars doesn't seem to have any way of dealing with non-uniform dtypes of columns.
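For comparison, pandas' richer `fillna` mentioned above accepts, among other things, a dict mapping column names to per-column fill values, which is one way of handling non-uniform dtypes without a `column_names` parameter:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None], "b": [None, "x"]})

scalar_filled = df.fillna(0)                  # one value for every column
per_column = df.fillna({"a": 0.0, "b": ""})   # a different value per column

print(per_column["a"].tolist())  # [1.0, 0.0]
print(per_column["b"].tolist())  # ['', 'x']
```

The scalar-plus-`column_names` design in this PR trades that flexibility for keeping `value` uniformly typed, which is what makes the same-dtype restriction enforceable.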