-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Value returns ndarray for dataframes with a single column with datetime64 tz-aware #15736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Value returns ndarray for dataframes with a single column with datetime64 tz-aware #15736
Conversation
…sistency on values
…rray instead of DateTimeIndex
- _where was not special casing pd.Panels - np.putmask was used instead of rs.mask
… get_values of father class NonConsolidatableMixIn)
… get_values of father class NonConsolidatableMixIn)
pandas/core/generic.py
Outdated
@@ -4850,7 +4850,7 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None, | |||
|
|||
msg = "Boolean array expected for the condition, not {dtype}" | |||
|
|||
if not isinstance(cond, pd.DataFrame): | |||
if not isinstance(cond, (pd.DataFrame, pd.Panel)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't change unrelated things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this fix the test test pct_change in pandas/pandas/tests/test_panel.py:1901 that failed after making the other changes (the rabbit hole...).
Reading the comment "This is a single-dimensional object." and observing the code was passing through these lines for Panel, I guess it was the change to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take this out it is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that this code is relatively new (not in 0.19.2)
Can you explain why my adaptation is wrong ? the cond argument can be a Panel, for instance when calling Panel.mask(cond=some other panel), and if you do not adapt the test as I propose, it will run the two following lines that uses cond.dtype
that does not exist for Panel (test fails)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are not supporting Panels much anymore (its going to be deprecated soon anyhow). I don't want to have to deal with this here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then we need to remove the test with Panel that fails, correct ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just xfail
it for now and can look later
@@ -5783,7 +5784,7 @@ def pct_change(self, periods=1, fill_method='pad', limit=None, freq=None, | |||
**kwargs)) - 1) | |||
if freq is None: | |||
mask = isnull(_values_from_object(self)) | |||
np.putmask(rs.values, mask, np.nan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rs[mask] = np.nan
is enough here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, indeed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted back to rs.mask(mask, np.nan, inplace=True)
as rs[mask] = np.nan
fails the test test_pct_change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again you are not fixing the underlying issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you confirm the documentation on mask is correct ? if so, I do not see the issue with fixing the previously buggy code (you confirm this np.putmask(rs.values, mask, np.nan)
is buggy?) as otherwise test_pct_change
fails.
Can you also confirm that your proposal rs[mask] = np.nan
is not the equivalent to rs.mask(mask, np.nan,inplace!True)
which is equivalent to the original np.putmask(rs.values, mask, np.nan)
but without assuming rs.values is a view ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.putmask
assumes this is a view. We don't use inplace
anywhere internally in the code, it is completely non-idiomatic.
rs[mask] = np.nan
looks fine to me, what is the problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't work ;-) I have tried.
Isn't boolean indexation only working for vectors ? Or did you mean rs.loc[mask] (I haven't tried)?
If rs.loc[mask] works, I can replace it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rs.mask replaced by rs.iloc[mask]
self.values.ravel(), f).reshape(self.values.shape) | ||
return self.values | ||
|
||
if values.ndim == self.ndim - 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what the heck is all this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without this, df[["B"]].values returns an ndarray with 1 dim, e.g. (3,) while we want (3,1).
I saw the logic in NonConsolidatableMixIn.get_values and as DatetimeTZBlock inherits from NonConsolidatableMixIn, I suspected the logic should also apply to DatetimeTZBlock. And it indeed fixes the issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change however breaks another test, the test_apply_non_numpy_dtype, that enters in a stack overflow due to the fact that https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L4196 now returns an ndarray with (3,1) shape instead of a (3,) shape. On this https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L4196, what is self.values expected to return ? Seeing the creation of a Series that follows, I would guess it is a ndim=1 ndarray but then why is self.values called, with isinstance(self, DataFrame), as this is expected to return a ndim=2 ndarray ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you are doing reshaping of any kind then this is wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you explain shortly why reshaping is not wrong in NonConsolidatableMixIn
(as it is today) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can reshape, but that doesn't belong here. we are trying to make the code simpler, not more complex.
The issue is that you are dealing with a categorical (and for that matter a DatetimeIndex), which by-definition are always 1-d objects.
The blocks need to emulate a 2-d structure for compat.
If you find that you need to reshape, need to find another way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so reshaping in NonConsolidatableMixIn is also wrong ?
have_numeric = have_float or have_complex or have_int | ||
has_non_numeric = have_dt64 or have_dt64_tz or have_td64 or have_cat | ||
have_numeric = have_float + have_complex + have_int | ||
have_dt = have_dt64 + have_dt64_tz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is all of this changing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reworked the logic to be able to take care the case where there was a single column (that was bypassed before as there was a shortcut in get_matrix)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
conceptually this is fine here. Ideally have simpler logic though. It maybe that we shouldn'tchange _interleave_dtype
at all; I thought it might be simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should _interleave_dtype be called when there is a single Block ? because, it is essentially what I did (make it work for the single Block case)
return np.dtype(bool) | ||
return np.dtype("bool") | ||
elif have_cat: | ||
# return blocks[0].get_values().dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what the heck is this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as now in we always go through the method mgt._interleave (even for the case "_is_single_block"), we must ensure that the function _interleaved_dtype takes care of the case where there is only 1 dtype which is why I added the "else have_cat:" block (which is very identical to the block following it except for CategoricalBlock <> IntBlock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above comment
pandas/tests/frame/test_dtypes.py
Outdated
arr = df[col].values | ||
fst_elem = arr[0, 0] | ||
|
||
self.assertEqual(type(arr), np.ndarray) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test an exactly output, IOW, use assert_numpy_array_equal
with an appropriate expected array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, done
pandas/tests/frame/test_dtypes.py
Outdated
else: | ||
self.assertEqual(type(fst_elem), np.datetime64) | ||
|
||
def test_values_dtypes_with_datetime64tz(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
much better to write this as a pytest.mark.parametrized version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seeing all the self.assertEqual, I though we were using "nosetests" style and avoided to use any pytest features. BTW, in http://pandas.pydata.org/pandas-docs/stable/contributing.html#running-the-test-suite, it is mentionned to run the tests suite with nosetests and not pytest. I could never run the suite with nosetests but well with pytest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stable docs still mention nose, as we only switched recently. In the dev docs it is already updated: http://pandas-docs.github.io/pandas-docs-travis/contributing.html#running-the-test-suite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche Tx Joris, good to know ! specially the test_fast ;-)
pandas/tests/frame/test_dtypes.py
Outdated
) | ||
for col in cols: | ||
|
||
df_sub = df[list(col)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what exactly are you trying to test here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testing the logic of _interleaved_dtype by asking to test all cases of df.values on all combination of dtypes and verifying the expected dtype. I have adapted the test comments to make it more explicit
…e more explicity that we test _interleave_dtype
Codecov Report
@@ Coverage Diff @@
## master #15736 +/- ##
==========================================
- Coverage 91.01% 90.99% -0.03%
==========================================
Files 143 143
Lines 49380 49402 +22
==========================================
+ Hits 44944 44953 +9
- Misses 4436 4449 +13
Continue to review full report at Codecov.
|
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -792,6 +792,7 @@ Performance Improvements | |||
Bug Fixes | |||
~~~~~~~~~ | |||
|
|||
- Bug in ``DataFrame.values`` now returns an ``numpy.ndarray`` of ``pandas.Timestamp`` for tz-aware columns; previously this returned ``pandas.DateTimeIndex`` (:issue:`14052`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
say object
dtyped numpy array of Timestamp
you don't need to say pandas.*
that's implied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
pandas/core/generic.py
Outdated
will result in a flot64 dtype. | ||
will result in a float64 dtype. | ||
|
||
Unlike ``Series.values``, Timezone aware datetime data are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just say that for tz-aware dtypes you will be upcast to an object
array.
pandas/core/generic.py
Outdated
@@ -4850,7 +4850,7 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None, | |||
|
|||
msg = "Boolean array expected for the condition, not {dtype}" | |||
|
|||
if not isinstance(cond, pd.DataFrame): | |||
if not isinstance(cond, (pd.DataFrame, pd.Panel)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take this out it is wrong.
@@ -5783,7 +5784,7 @@ def pct_change(self, periods=1, fill_method='pad', limit=None, freq=None, | |||
**kwargs)) - 1) | |||
if freq is None: | |||
mask = isnull(_values_from_object(self)) | |||
np.putmask(rs.values, mask, np.nan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again you are not fixing the underlying issue
self.values.ravel(), f).reshape(self.values.shape) | ||
return self.values | ||
|
||
if values.ndim == self.ndim - 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you are doing reshaping of any kind then this is wrong
|
||
# return 1 bigger on the itemsize if unsinged | ||
if lcd.kind == 'u': | ||
return np.dtype('int%s' % (lcd.itemsize * 8 * 2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a Categorical is a 1-d object. So what you are doing is fundamentally wrong. You don't reshape things like. You must convert categoricals to object arrays. That logic was done before, not sure why you trying to jump thru hoops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what code was doing that conversion to array previously ? this line https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals.py#L3443 ? or somewhere else ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.array(cat)
does this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my general point is to try follow the existing code (yes there is a lot, so I'll give you tips along the way).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overal, it would be nice for internal functions/methods to have even a minimal docstring or comments to know the input/output expected and the objective ... not easy to understand otherwise the "contract" of the function.
I have the feeling the _interleaved_dtype function is similar in purpose to the _find_common_type function but limited to numpy dtypes (vs pandas dtypes), is that correct ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love to do it, but I currently fail to understand the real purpose of each of these functions. I propose that I first add the doc (and you both chceck/confirm) before I continue trying to fix the issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sdementen certainly welcome more doc-strings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sdementen yes _interleave_dtype
is similar to _find_common_type
; it might work / can-re replaced by it. (something like this I would try on an independent PR; if it works, submit separately, and then you can rebase on top of it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback thank you for this ! I will work again on the PR based on this but not immediately as I have personal constraints this week. Should the fix be simple now that you (or someone else) want to move forward with another PR than the mine, I will not be offended (I learned already a lot!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sdementen nope, open for you.
remove bug fix on Panel support
@sdementen I merged #15765 so I think this will be quite a lot simpler now. |
xref pandas-dev#15736 xref pandas-dev#12780 Author: Jeff Reback <[email protected]> Closes pandas-dev#15765 from jreback/common_types and squashes the following commits: d472646 [Jeff Reback] try removing restriction on windows 8d07cae [Jeff Reback] CLN: replace _interleave_dtype with _find_common_type
xref pandas-dev#15736 xref pandas-dev#12780 Author: Jeff Reback <[email protected]> Closes pandas-dev#15765 from jreback/common_types and squashes the following commits: d472646 [Jeff Reback] try removing restriction on windows 8d07cae [Jeff Reback] CLN: replace _interleave_dtype with _find_common_type
@sdementen rebase / update? |
can you rebase and update? |
@jreback I am currently overloaded, I'll have to wait for a better period with some free time ! |
good ideas, needs a rebase / update. pls comment if you'd like to continue. |
git diff upstream/master | flake8 --diff