Skip to content

ENH: Add na_value argument to DataFrame.to_numpy #33857

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 32 commits into from
May 13, 2020
Merged

ENH: Add na_value argument to DataFrame.to_numpy #33857

merged 32 commits into from
May 13, 2020

Conversation

dsaxton
Copy link
Member

@dsaxton dsaxton commented Apr 29, 2020

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dsaxton maybe some tests for error messages and mixed dataframes

@@ -1337,6 +1344,10 @@ def to_numpy(self, dtype=None, copy: bool = False) -> np.ndarray:
[2, 4.5, Timestamp('2000-01-02 00:00:00')]], dtype=object)
"""
result = np.array(self.values, dtype=dtype, copy=copy)

if na_value is not lib.no_default:
result[isna(result)] = na_value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> df = pd.DataFrame({"A": [1, 2, None]})
>>> df.to_numpy(na_value=pd.NA)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\pandas\pandas\core\frame.py", line 1349, in to_numpy
    result[isna(result)] = na_value
TypeError: float() argument must be a string or a number, not 'NAType'
>>>

do we want a more user friendly message, or perhaps validate the fill_value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will be difficult in general, I think (since it depends on the behaviour of numpy on what it accepts as value to set to the array).
For this specific case, pd.NA is basically only possible for object dtype (so that could be checked specifically. Although it is also not the typical use case for to_numpy, since you typically want to get rid of pd.NA, since that doesn't work nicely with numpy).

Now, for an extension array, you already get a slightly better error message:

In [18]: pd.array([1, 2, None]).to_numpy(float, na_value=pd.NA)   
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-3c4196341619> in <module>
----> 1 pd.array([1, 2, None]).to_numpy(float, na_value=pd.NA)

~/scipy/pandas/pandas/core/arrays/masked.py in to_numpy(self, dtype, copy, na_value)
    145             ):
    146                 raise ValueError(
--> 147                     f"cannot convert to '{dtype}'-dtype NumPy array "
    148                     "with missing values. Specify an appropriate 'na_value' "
    149                     "for this dtype."

ValueError: cannot convert to '<class 'float'>'-dtype NumPy array with missing values. Specify an appropriate 'na_value' for this dtype.

@@ -1337,6 +1344,10 @@ def to_numpy(self, dtype=None, copy: bool = False) -> np.ndarray:
[2, 4.5, Timestamp('2000-01-02 00:00:00')]], dtype=object)
"""
result = np.array(self.values, dtype=dtype, copy=copy)

if na_value is not lib.no_default:
result[isna(result)] = na_value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will be difficult in general, I think (since it depends on the behaviour of numpy on what it accepts as value to set to the array).
For this specific case, pd.NA is basically only possible for object dtype (so that could be checked specifically. Although it is also not the typical use case for to_numpy, since you typically want to get rid of pd.NA, since that doesn't work nicely with numpy).

Now, for an extension array, you already get a slightly better error message:

In [18]: pd.array([1, 2, None]).to_numpy(float, na_value=pd.NA)   
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-3c4196341619> in <module>
----> 1 pd.array([1, 2, None]).to_numpy(float, na_value=pd.NA)

~/scipy/pandas/pandas/core/arrays/masked.py in to_numpy(self, dtype, copy, na_value)
    145             ):
    146                 raise ValueError(
--> 147                     f"cannot convert to '{dtype}'-dtype NumPy array "
    148                     "with missing values. Specify an appropriate 'na_value' "
    149                     "for this dtype."

ValueError: cannot convert to '<class 'float'>'-dtype NumPy array with missing values. Specify an appropriate 'na_value' for this dtype.

@dsaxton

This comment has been minimized.

@jorisvandenbossche
Copy link
Member

So I think you need to deal with this deeper into the internals, as I mentioned before. It is there that the custom pandas logic to decide on the resulting dtype is done (which is indeed different from numpy's result dtype)

@jorisvandenbossche
Copy link
Member

Check BlockManager.as_array -> this also already handles the case of a single Block (to keep the no copy behaviour there).

I think we can expand that method to support na_value

@@ -365,7 +365,7 @@ def test_to_numpy_copy(self):
df = pd.DataFrame(arr)
assert df.values.base is arr
assert df.to_numpy(copy=False).base is arr
assert df.to_numpy(copy=True).base is None
assert df.to_numpy(copy=True).base is not arr
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This had to change because now the output array is a view, but only on the transpose of itself (which is a copy of the original data)

Co-authored-by: Joris Van den Bossche <[email protected]>
@@ -798,24 +812,43 @@ def as_array(self, transpose: bool = False) -> np.ndarray:
arr = np.empty(self.shape, dtype=float)
return arr.transpose() if transpose else arr

should_copy = copy or na_value is not lib.no_default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to avoid making a second copy in certain cases where we can detect this (eg with multiple dtypes (going through _interleave), the constructed array is never the original values, so an additional copy is not needed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then i would just make this

copy = copy or na_value is not lib.no_default

and add an explanation line

if self._is_single_block and self.blocks[0].is_datetimetz:
# TODO(Block.get_values): Make DatetimeTZBlock.get_values
# always be object dtype. Some callers seem to want the
# DatetimeArray (previously DTI)
arr = self.blocks[0].get_values(dtype=object)
elif self._is_single_block and self.blocks[0].is_extension:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might be able to remove the block aboev (is_datetimetz) as that is an extension block already

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, the DatetimeArray.to_numpy behaviour seems correct for the default (having object dtype with timestamps instead of datetime64)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 👍

@jreback jreback added this to the 1.1 milestone May 13, 2020
@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels May 13, 2020
@jreback jreback merged commit 4bd6905 into pandas-dev:master May 13, 2020
@jreback
Copy link
Contributor

jreback commented May 13, 2020

thanks @dsaxton

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Add na_value to DataFrame.to_numpy
4 participants