Add a `fill_null` method to dataframe and column #173
Conversation
small comment, but generally in favour
> This method can only be used if all columns that are to be filled are
> of the same dtype kind (e.g., all floating-point, all integer, all
> string or all datetime dtypes). If that is not the case, it is not
Should we clarify whether we expect them to all be the same bit-width? I.e., as this currently reads, a mix of `int32` and `int64` is potentially allowed.
I did intend to allow that, because the input is a Python `int` or `float` either way. I don't feel strongly about it though; I think it works as is, but if we want to be more strict then we can do that (loosening later is always possible).
> I did intend to allow that, because the input is a Python `int` or `float` either way.

It's a duck-typed Python `int` or `float` that is potentially strongly typed under the hood, and the casting could potentially be lossy, e.g. `float64` --> `float32`, or a datetime resolution downgrade `ns` --> `us`.
I'd prefer to be more strict now and figure out the path towards loosening later.
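To illustrate the lossiness concern, here is a small sketch (using NumPy as the backing library, which is an assumption for illustration, not part of the spec) showing that a Python `float` does not survive a round trip through a 32-bit column dtype:

```python
import numpy as np

# A Python float is 64-bit under the hood; downcasting it to a
# float32 column's dtype can silently lose precision.
fill_value = 0.1

downcast = np.float32(fill_value)  # what an implicit float64 -> float32 cast does
lossy = float(downcast) != fill_value

print(lossy)  # True: float32(0.1) round-trips to 0.100000001490116...
```

The same shape of argument applies to the `ns` --> `us` datetime case: a fill value carrying more resolution than the column dtype gets truncated silently.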
Can we move forward with this? Is the solution just to replace "dtype kind" with "dtype"?
> is the solution just to replace "dtype kind" with "dtype"?

Yes, I think so - done now.
```
value : Scalar
    Value used to replace any ``null`` values in the dataframe with.
    Must be of the Python scalar type matching the dtype(s) of the dataframe.
column_names : list[str] | None
```
A potential alternative name could be `subset`, to indicate you want to apply the filling operation only to a subset of the columns.
We currently don't use either of `column_names` or `subset`. I'm happy either way; I think the former is a bit more explicit about what it refers to, while the latter expresses better what is done with the provided names.

Anyone else have a preference?
this seems fine
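As a sketch of the semantics under discussion (the name `column_names`, the plain-dict "dataframe", and this whole function are illustrative only, not the standard's actual API):

```python
from __future__ import annotations

def fill_null(
    df: dict[str, list],
    value,
    column_names: list[str] | None = None,
) -> dict[str, list]:
    """Replace None entries with `value` in the named columns only,
    or in all columns when `column_names` is None."""
    targets = column_names if column_names is not None else list(df)
    return {
        name: [value if (name in targets and x is None) else x for x in col]
        for name, col in df.items()
    }

df = {"a": [1, None], "b": [None, 2]}
print(fill_null(df, 0, column_names=["a"]))  # {'a': [1, 0], 'b': [None, 2]}
print(fill_null(df, 0))                      # {'a': [1, 0], 'b': [0, 2]}
```

With this reading, `column_names=None` means "fill everywhere", which is the case the same-dtype restriction in the docstring above is guarding.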
Follow-up to data-apisgh-167, which added `fill_nan`, and closes data-apisgh-142.
```diff
@@ -641,3 +641,16 @@ def fill_nan(self, value: float | 'null', /) -> Column:
        """
        ...

    def fill_null(self, value: Scalar, /) -> Column:
```
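A minimal sketch of what an implementation of this method might look like in a pandas-backed library; `PandasColumn` is a hypothetical wrapper, not part of the standard:

```python
import pandas as pd

class PandasColumn:
    """Hypothetical wrapper implementing the standard's Column.fill_null."""

    def __init__(self, series: pd.Series):
        self._s = series

    def fill_null(self, value, /) -> "PandasColumn":
        # pd.Series.fillna replaces missing entries (NaN/None/NA) with `value`
        return PandasColumn(self._s.fillna(value))

col = PandasColumn(pd.Series([1.0, None, 3.0]))
print(col.fill_null(0.0)._s.tolist())  # [1.0, 0.0, 3.0]
```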
Not the PR to discuss it, but it would be really nice if, for typing purposes, we could type this as:

```python
def fill_null(self[T], value: Scalar[T], /) -> Column[T]:
```
gh-187 makes a big step in this direction. That said, the generics don't quite work here, I think, since the `T` in `Scalar[T]` is a Python scalar while in `Column[T]` it's a dtype. We happened to have a good discussion about gh-187 yesterday, about striking a balance between making the spec fully type-checkable and keeping it readable. I suspect this kind of case requires overloads, which may be useful at some point but can't be rendered as HTML - so it divorces static typing a bit from spec writing.
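To make the overload point concrete, here is one sketch of how overloads could tie the Python scalar type of `value` to the column's dtype; the dtype class names and the whole shape are assumptions for illustration, not anything the spec defines:

```python
from typing import Generic, TypeVar, overload

T = TypeVar("T")

# Hypothetical dtype marker classes
class Int64: ...
class Float64: ...

class Column(Generic[T]):
    @overload
    def fill_null(self: "Column[Int64]", value: int, /) -> "Column[Int64]": ...
    @overload
    def fill_null(self: "Column[Float64]", value: float, /) -> "Column[Float64]": ...
    def fill_null(self, value, /):
        ...  # implementation elided; only the overloads carry the typing story
```

This checks statically, but each dtype needs its own overload line, which is the readability cost mentioned above: the overloads exist purely for the type checker and clutter the rendered spec.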
I think there should be a way to make `Scalar[T]` work, and every library can define what their scalar type is corresponding to each of the standard's types. But that's for another issue; let's pick it up again after #187 is in (I'll update it today).
> and every library can define what their scalar type is corresponding to each of the standard's types

Quick comment: they are, or at least include, Python builtin types, which one has no control over. So it'd be `int) -> Column[Int64]` or some such thing.

> But that's for another issue, let's pick it up again after #187 is in (I'll update it today)

Sounds good, thanks.
Follow-up to gh-167, which added `fill_nan`, and closes gh-142.

As discussed in gh-142, this does only the scalar case now. I think the main thing that could be done differently is `column_names`. This seemed like the most logical thing to do, keeping `value` to a scalar only. It matches what Vaex does. Pandas instead allows `value` to be a `scalar, dict, Series, or DataFrame`. Polars doesn't seem to have any way of dealing with non-uniform dtypes of columns.
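For comparison, pandas' richer `fillna` mentioned above accepts, among other things, a dict mapping column names to per-column fill values, which is one way of handling non-uniform dtypes without a `column_names` parameter:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None], "b": [None, "x"]})

scalar_filled = df.fillna(0)                  # one value for every column
per_column = df.fillna({"a": 0.0, "b": ""})   # a different value per column

print(per_column["a"].tolist())  # [1.0, 0.0]
print(per_column["b"].tolist())  # ['', 'x']
```

The scalar-plus-`column_names` design in this PR trades that flexibility for keeping `value` uniformly typed, which is what makes the same-dtype restriction enforceable.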