ENH: Add na_value argument to DataFrame.to_numpy #33857
Thanks @dsaxton maybe some tests for error messages and mixed dataframes
pandas/core/frame.py
Outdated
@@ -1337,6 +1344,10 @@ def to_numpy(self, dtype=None, copy: bool = False) -> np.ndarray:
               [2, 4.5, Timestamp('2000-01-02 00:00:00')]], dtype=object)
        """
        result = np.array(self.values, dtype=dtype, copy=copy)

        if na_value is not lib.no_default:
            result[isna(result)] = na_value
>>> df = pd.DataFrame({"A": [1, 2, None]})
>>> df.to_numpy(na_value=pd.NA)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\simon\pandas\pandas\core\frame.py", line 1349, in to_numpy
result[isna(result)] = na_value
TypeError: float() argument must be a string or a number, not 'NAType'
>>>
Do we want a more user-friendly message, or perhaps to validate the fill value (na_value)?
That will be difficult in general, I think, since it depends on which values numpy accepts when setting into the array.
For this specific case, pd.NA is basically only possible for object dtype, so that could be checked specifically. It is also not the typical use case for to_numpy, though, since you typically want to get rid of pd.NA because it doesn't work nicely with numpy.
Now, for an extension array, you already get a slightly better error message:
In [18]: pd.array([1, 2, None]).to_numpy(float, na_value=pd.NA)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-3c4196341619> in <module>
----> 1 pd.array([1, 2, None]).to_numpy(float, na_value=pd.NA)
~/scipy/pandas/pandas/core/arrays/masked.py in to_numpy(self, dtype, copy, na_value)
145 ):
146 raise ValueError(
--> 147 f"cannot convert to '{dtype}'-dtype NumPy array "
148 "with missing values. Specify an appropriate 'na_value' "
149 "for this dtype."
ValueError: cannot convert to '<class 'float'>'-dtype NumPy array with missing values. Specify an appropriate 'na_value' for this dtype.
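One way to get the friendlier message discussed in this thread is to wrap the assignment and re-raise. The following is only a minimal sketch: `fill_na_checked` is a hypothetical helper, not part of pandas, and the message wording is an assumption.

```python
import numpy as np
import pandas as pd


def fill_na_checked(values, na_value):
    """Hypothetical helper (not pandas API): fill missing entries in a copy
    of ``values`` and re-raise NumPy's error with a clearer message."""
    result = np.array(values, copy=True)
    mask = pd.isna(result)
    try:
        result[mask] = na_value
    except (TypeError, ValueError) as err:
        # NumPy decides what it accepts; we only rephrase its complaint.
        raise ValueError(
            f"na_value {na_value!r} cannot be stored in a NumPy array "
            f"of dtype {result.dtype}"
        ) from err
    return result


filled = fill_na_checked(np.array([1.0, 2.0, np.nan]), 0.0)
```

This keeps the validation on the NumPy side, as suggested above, rather than trying to predict up front which values each dtype accepts.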
So I think you need to deal with this deeper in the internals, as I mentioned before. That is where the custom pandas logic for deciding the resulting dtype lives (which is indeed different from numpy's result dtype).
Check BlockManager.as_array: it also already handles the case of a single Block (to keep the no-copy behaviour there). I think we can expand that method to support na_value.
Otherwise this can raise for an all integer DataFrame without any missing values
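The integer case can be sidestepped by assigning only when something is actually missing. A sketch under that assumption, where `fill_if_needed` is a hypothetical name and not the code in this PR:

```python
import numpy as np
import pandas as pd


def fill_if_needed(result, na_value):
    """Hypothetical helper: write na_value only where values are missing,
    and skip the assignment entirely when nothing is missing."""
    mask = pd.isna(result)
    if mask.any():
        # Only here does na_value need to be representable in result.dtype.
        result[mask] = na_value
    return result


ints = np.array([[1, 2], [3, 4]])
out = fill_if_needed(ints, np.nan)  # no missing values, so no assignment
```

With the guard in place, an all-integer array passes through untouched even when `na_value` (here NaN) could not be stored in it.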
Co-authored-by: Joris Van den Bossche <[email protected]>
@@ -365,7 +365,7 @@ def test_to_numpy_copy(self):
         df = pd.DataFrame(arr)
         assert df.values.base is arr
         assert df.to_numpy(copy=False).base is arr
-        assert df.to_numpy(copy=True).base is None
+        assert df.to_numpy(copy=True).base is not arr
This had to change because the output array is now a view, but only of its own transpose (which is a copy of the original data), so its base is no longer None.
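The `.base` behaviour behind this test change can be reproduced with plain NumPy. This is an illustration of the general mechanism, not the pandas code path itself:

```python
import numpy as np

arr = np.zeros((2, 3))      # owns its data, so arr.base is None
copied = arr.copy()         # fresh buffer, independent of arr
out = copied.T              # a view onto `copied`, not onto `arr`

print(out.base is None)             # False: out is a view of `copied`
print(out.base is copied)           # True
print(np.shares_memory(out, arr))   # False: still a real copy of arr's data
```

So `base is None` is too strict a check for "this is a copy", while `base is not arr` still confirms the result is detached from the original.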
pandas/core/internals/managers.py
Outdated
@@ -798,24 +812,43 @@ def as_array(self, transpose: bool = False) -> np.ndarray:
            arr = np.empty(self.shape, dtype=float)
            return arr.transpose() if transpose else arr

        should_copy = copy or na_value is not lib.no_default
why is this needed?
It's to avoid making a second copy in the cases where we can detect this (e.g. with multiple dtypes, going through _interleave, the constructed array is never the original values, so an additional copy is not needed).
Then I would just make this
copy = copy or na_value is not lib.no_default
and add an explanation line
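The suggested line can be sketched in isolation as follows. `_NO_DEFAULT` stands in for pandas' internal `lib.no_default` sentinel, `as_array_sketch` is a hypothetical single-block simplification, and float input is assumed so that `np.isnan` applies:

```python
import numpy as np

_NO_DEFAULT = object()  # stand-in for pandas' internal lib.no_default sentinel


def as_array_sketch(values, copy=False, na_value=_NO_DEFAULT):
    """Simplified single-block illustration, not the BlockManager code."""
    # Filling na_value mutates the result, so force a copy to protect `values`.
    copy = copy or na_value is not _NO_DEFAULT
    arr = values.copy() if copy else values
    if na_value is not _NO_DEFAULT:
        arr[np.isnan(arr)] = na_value  # assumes a float array
    return arr


data = np.array([1.0, np.nan])
view = as_array_sketch(data)                  # no copy requested
filled = as_array_sketch(data, na_value=0.0)  # copy forced by na_value
```

A sentinel default is used instead of `None` so that callers can still pass `None` as a legitimate fill value.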
pandas/core/internals/managers.py
Outdated
        if self._is_single_block and self.blocks[0].is_datetimetz:
            # TODO(Block.get_values): Make DatetimeTZBlock.get_values
            # always be object dtype. Some callers seem to want the
            # DatetimeArray (previously DTI)
            arr = self.blocks[0].get_values(dtype=object)
        elif self._is_single_block and self.blocks[0].is_extension:
You might be able to remove the block above (is_datetimetz), as that is already an extension block.
Indeed, the DatetimeArray.to_numpy behaviour seems correct for the default (object dtype with Timestamps instead of datetime64).
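The default behaviour referred to here can be checked directly on a timezone-aware array. A small sketch using the public API, with the dates and timezone chosen arbitrarily for illustration:

```python
import pandas as pd

# tz-aware datetimes are stored in a DatetimeArray (an extension array)
dta = pd.array(pd.date_range("2000-01-01", periods=2, tz="UTC"))
out = dta.to_numpy()

print(out.dtype)        # object
print(type(out[0]))     # pandas Timestamp, with the timezone preserved
```

Because the extension-array path already yields object dtype here, a separate is_datetimetz branch would be redundant for this default.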
Nice 👍
thanks @dsaxton
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff