nan to null strategy? #142

Closed
MarcoGorelli opened this issue Apr 18, 2023 · 9 comments · Fixed by #173

Comments

@MarcoGorelli
Contributor

cudf and polars both have null and nan (and there's some discussion about having both in pandas too), and both have methods for converting nans to nulls:

I think we need something similar here. Perhaps fill_nans and fill_nulls would be most generic?

@rgommers
Member

That seems like useful/necessary functionality indeed.

> Perhaps fill_nans and fill_nulls would be most generic?

Those both seem useful in principle. That's for actually filling values though (see the polars fill_nan doc), so they seem to be orthogonal to nans_to_nulls (it's not clear that polars allows fill_nan(null)).

I also note that in polars, fill_nan is a much simpler API than fill_null; the latter has a bunch of strategies that are potentially complex to implement. cuDF's fillna has different strategies.

Given that fill_null is complex, maybe start with just converting nans into (a) nulls, and (b) scalar values?

@kkraus14
Collaborator

nans_to_nulls could be represented more flexibly as fill_nan(null). Is there an API for a more general scalar value replacement, i.e. replace value a with value b in a Column? If so, it would probably make more sense to ensure that both null and NaN are supported in that function instead.
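
To illustrate the point that one method can cover both cases, here is a minimal sketch in plain Python. It assumes a column is just a list and that `None` stands in for null; the `fill_nan` helper below is hypothetical and is not the API from the standard or from any library.

```python
import math


def fill_nan(column, value):
    """Return a new list with every float NaN replaced by `value`.

    `None` stands in for null here (an assumption for this sketch);
    existing nulls are left untouched.
    """
    return [
        value if isinstance(x, float) and math.isnan(x) else x
        for x in column
    ]


col = [1.0, float("nan"), None, 4.0]
nans_to_nulls = fill_nan(col, None)  # NaN becomes null
nans_to_zero = fill_nan(col, 0.0)    # NaN becomes a scalar
```

With this shape, converting NaN to null is just the special case where the fill value is null, so a separate nans_to_nulls method would not be needed.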

@rgommers
Member

I think I like fill_nan(null).

> Is there an API for a more general scalar value replacement? I.E. replace value a with value b in a Column?

I don't think there is yet. It'd usually be done with indexing like x[x == 3] = 99, or with a where function/method. df.where(df == float('nan'), null) seems unintuitive though and creates a large temporary, so a fill_nan method seems preferred.
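
A rough sketch of the mask-based replacement pattern mentioned above, using plain Python lists. The `replace_where` helper is hypothetical and only mimics the idea of a where-style call; note that the boolean mask is a full-length temporary, which is the overhead a dedicated fill method avoids.

```python
def replace_where(values, mask, replacement):
    """Return a new list with `replacement` wherever `mask` is True."""
    return [replacement if m else v for v, m in zip(values, mask)]


x = [1, 3, 5, 3]
mask = [v == 3 for v in x]  # full-length temporary boolean mask
result = replace_where(x, mask, 99)
```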

@MarcoGorelli
Contributor Author

sure, let's do fill_nan then

just a comment on df == float('nan'): there'd be no guarantees about this, right? I think people should use df.isnan()

@rgommers
Member

> there'd be no guarantees about this, right? I think people should use df.isnan()

you're completely right there. I remembered we had __eq__, but it's only defined for df-df comparisons.

@MarcoGorelli
Contributor Author

__eq__ also accepts df-scalar comparisons, I just meant that using float('nan') shouldn't be the supported way of comparing to nan
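
For context on why an equality check is not a reliable NaN test: under IEEE 754, NaN compares unequal to everything, including itself. A small standard-library illustration (plain Python floats, not the dataframe API):

```python
import math

nan = float("nan")
values = [1.0, nan, 3.0]

# Equality never matches NaN, so this mask misses the NaN entry.
eq_mask = [v == nan for v in values]

# An explicit isnan check finds it.
isnan_mask = [math.isnan(v) for v in values]
```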


Anyway, if we have Column.fillna, then we can do:

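# Remember the column's position so it can be put back in the same place.
idx = df.get_columns().index(label)
# Replace NaN with null in that column.
col = df.get_column_by_name(label).fillna(None)
# Swap the original column for the filled one.
df = df.drop_column(label).insert(idx, label, col)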

rgommers added a commit to rgommers/dataframe-api that referenced this issue May 10, 2023
Addresses half of data-apisgh-142 (`fill_null` is more complex, and not included
here).
@rgommers
Member

gh-167 adds fill_nan, which I think is pretty straightforward.

For fill_null, things are a little more complex, see for example:

Adding "fill with scalar value" is easy, anything else I'm less sure about.

@jorisvandenbossche
Member

I think it would be fine to start with the case of filling with a scalar.

@rgommers
Member

After the discussion today, folks seemed to agree that adding fill_null with only scalar support is fine for now.

The strategy that was discussed as perhaps useful was filling nulls with the corresponding element of another column - but there weren't any active users of that, so we can punt on it for now.
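
To make the two options concrete, here is a sketch in plain Python, again assuming list-based columns with `None` as null. Both helper names are hypothetical: the first shows the scalar-only behaviour that was agreed on, the second the deferred strategy of filling from another column.

```python
def fill_null(column, value):
    """Scalar fill: replace every null (None) with `value`."""
    return [value if x is None else x for x in column]


def fill_null_from(column, other):
    """Deferred strategy: take the element at the same position in `other`."""
    if len(column) != len(other):
        raise ValueError("columns must have the same length")
    return [o if x is None else x for x, o in zip(column, other)]


a = [1, None, 3, None]
b = [10, 20, 30, 40]
filled_scalar = fill_null(a, 0)
filled_from_b = fill_null_from(a, b)
```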

rgommers added a commit to rgommers/dataframe-api that referenced this issue May 18, 2023
Addresses half of data-apisgh-142 (`fill_null` is more complex, and not included
here).
rgommers added a commit that referenced this issue May 22, 2023
Addresses half of gh-142 (`fill_null` is more complex, and not included here).
rgommers added a commit to rgommers/dataframe-api that referenced this issue May 22, 2023
Follow-up to data-apisgh-167, which added `fill_nan`, and closes data-apisgh-142.
rgommers added a commit to rgommers/dataframe-api that referenced this issue Jul 5, 2023
Follow-up to data-apisgh-167, which added `fill_nan`, and closes data-apisgh-142.
MarcoGorelli pushed a commit that referenced this issue Jul 7, 2023
* Add a `fill_null` method to dataframe and column

Follow-up to gh-167, which added `fill_nan`, and closes gh-142.

* Address review comments about `DataFrame.fill_null`