Value returns ndarray for dataframes with a single column with datetime64 tz-aware #15736

sdementen · 2017-03-19T20:08:37Z

closes Inconsistent .values for tz-aware columns #14052
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

…sistency on values

…rray instead of DateTimeIndex

- _where was not special casing pd.Panels - np.putmask was used instead of rs.mask

…erived dtypes

… get_values of father class NonConsolidatableMixIn)

jreback · 2017-03-20T00:19:02Z

pandas/core/generic.py

@@ -4850,7 +4850,7 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None,

        msg = "Boolean array expected for the condition, not {dtype}"

-        if not isinstance(cond, pd.DataFrame):
+        if not isinstance(cond, (pd.DataFrame, pd.Panel)):


don't change unrelated things

this fix the test test pct_change in pandas/pandas/tests/test_panel.py:1901 that failed after making the other changes (the rabbit hole...).
Reading the comment "This is a single-dimensional object." and observing the code was passing through these lines for Panel, I guess it was the change to do.

take this out it is wrong.

I see that this code is relatively new (not in 0.19.2)
Can you explain why my adaptation is wrong ? the cond argument can be a Panel, for instance when calling Panel.mask(cond=some other panel), and if you do not adapt the test as I propose, it will run the two following lines that uses cond.dtype that does not exist for Panel (test fails)

we are not supporting Panels much anymore (its going to be deprecated soon anyhow). I don't want to have to deal with this here.

But then we need to remove the test with Panel that fails, correct ?

just xfail it for now and can look later

jreback · 2017-03-20T00:20:45Z

pandas/core/generic.py

@@ -5783,7 +5784,7 @@ def pct_change(self, periods=1, fill_method='pad', limit=None, freq=None,
                                  **kwargs)) - 1)
        if freq is None:
            mask = isnull(_values_from_object(self))
-            np.putmask(rs.values, mask, np.nan)


rs[mask] = np.nan is enough here

yes, indeed

I reverted back to rs.mask(mask, np.nan, inplace=True) as rs[mask] = np.nan fails the test test_pct_change

again you are not fixing the underlying issue

can you confirm the documentation on mask is correct ? if so, I do not see the issue with fixing the previously buggy code (you confirm this np.putmask(rs.values, mask, np.nan) is buggy?) as otherwise test_pct_change fails.
Can you also confirm that your proposal rs[mask] = np.nan is not the equivalent to rs.mask(mask, np.nan,inplace!True) which is equivalent to the original np.putmask(rs.values, mask, np.nan) but without assuming rs.values is a view ?

np.putmask assumes this is a view. We don't use inplace anywhere internally in the code, it is completely non-idiomatic.

rs[mask] = np.nan looks fine to me, what is the problem?

It doesn't work ;-) I have tried.

Isn't boolean indexation only working for vectors ? Or did you mean rs.loc[mask] (I haven't tried)?

If rs.loc[mask] works, I can replace it.

rs.mask replaced by rs.iloc[mask]

jreback · 2017-03-20T00:21:19Z

pandas/core/internals.py

                self.values.ravel(), f).reshape(self.values.shape)
-        return self.values
+
+            if values.ndim == self.ndim - 1:


what the heck is all this?

without this, df[["B"]].values returns an ndarray with 1 dim, e.g. (3,) while we want (3,1).
I saw the logic in NonConsolidatableMixIn.get_values and as DatetimeTZBlock inherits from NonConsolidatableMixIn, I suspected the logic should also apply to DatetimeTZBlock. And it indeed fixes the issue

this change however breaks another test, the test_apply_non_numpy_dtype, that enters in a stack overflow due to the fact that https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L4196 now returns an ndarray with (3,1) shape instead of a (3,) shape. On this https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L4196, what is self.values expected to return ? Seeing the creation of a Series that follows, I would guess it is a ndim=1 ndarray but then why is self.values called, with isinstance(self, DataFrame), as this is expected to return a ndim=2 ndarray ?

if you are doing reshaping of any kind then this is wrong

can you explain shortly why reshaping is not wrong in NonConsolidatableMixIn (as it is today) ?

you can reshape, but that doesn't belong here. we are trying to make the code simpler, not more complex.

The issue is that you are dealing with a categorical (and for that matter a DatetimeIndex), which by-definition are always 1-d objects.

The blocks need to emulate a 2-d structure for compat.

If you find that you need to reshape, need to find another way.

so reshaping in NonConsolidatableMixIn is also wrong ?

jreback · 2017-03-20T00:22:00Z

pandas/core/internals.py

-    have_numeric = have_float or have_complex or have_int
-    has_non_numeric = have_dt64 or have_dt64_tz or have_td64 or have_cat
+    have_numeric = have_float + have_complex + have_int
+    have_dt = have_dt64 + have_dt64_tz


why is all of this changing?

I reworked the logic to be able to take care the case where there was a single column (that was bypassed before as there was a shortcut in get_matrix)

conceptually this is fine here. Ideally have simpler logic though. It maybe that we shouldn'tchange _interleave_dtype at all; I thought it might be simpler.

should _interleave_dtype be called when there is a single Block ? because, it is essentially what I did (make it work for the single Block case)

jreback · 2017-03-20T00:22:11Z

pandas/core/internals.py

-        return np.dtype(bool)
+        return np.dtype("bool")
+    elif have_cat:
+        # return blocks[0].get_values().dtype


what the heck is this?

as now in we always go through the method mgt._interleave (even for the case "_is_single_block"), we must ensure that the function _interleaved_dtype takes care of the case where there is only 1 dtype which is why I added the "else have_cat:" block (which is very identical to the block following it except for CategoricalBlock <> IntBlock

see above comment

jreback · 2017-03-20T00:23:08Z

pandas/tests/frame/test_dtypes.py

+            arr = df[col].values
+            fst_elem = arr[0, 0]
+
+            self.assertEqual(type(arr), np.ndarray)


test an exactly output, IOW, use assert_numpy_array_equal with an appropriate expected array.

jreback · 2017-03-20T00:23:45Z

pandas/tests/frame/test_dtypes.py

+            else:
+                self.assertEqual(type(fst_elem), np.datetime64)
+
+    def test_values_dtypes_with_datetime64tz(self):


much better to write this as a pytest.mark.parametrized version.

seeing all the self.assertEqual, I though we were using "nosetests" style and avoided to use any pytest features. BTW, in http://pandas.pydata.org/pandas-docs/stable/contributing.html#running-the-test-suite, it is mentionned to run the tests suite with nosetests and not pytest. I could never run the suite with nosetests but well with pytest

The stable docs still mention nose, as we only switched recently. In the dev docs it is already updated: http://pandas-docs.github.io/pandas-docs-travis/contributing.html#running-the-test-suite

@jorisvandenbossche Tx Joris, good to know ! specially the test_fast ;-)

jreback · 2017-03-20T00:24:09Z

pandas/tests/frame/test_dtypes.py

+        )
+        for col in cols:
+
+            df_sub = df[list(col)]


what exactly are you trying to test here?

testing the logic of _interleaved_dtype by asking to test all cases of df.values on all combination of dtypes and verifying the expected dtype. I have adapted the test comments to make it more explicit

…e more explicity that we test _interleave_dtype

codecov · 2017-03-20T05:22:33Z

Codecov Report

Merging #15736 into master will decrease coverage by 0.02%.
The diff coverage is 86.48%.

@@            Coverage Diff             @@
##           master   #15736      +/-   ##
==========================================
- Coverage   91.01%   90.99%   -0.03%     
==========================================
  Files         143      143              
  Lines       49380    49402      +22     
==========================================
+ Hits        44944    44953       +9     
- Misses       4436     4449      +13

Impacted Files	Coverage Δ
pandas/core/generic.py	`96.24% <100%> (ø)`	⬆️
pandas/core/internals.py	`93.49% <85.71%> (-0.1%)`	⬇️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.86% <0%> (-0.1%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9ab57dc...538af78. Read the comment docs.

…mns + fix typo

…pct_change

jreback · 2017-03-20T12:27:04Z

doc/source/whatsnew/v0.20.0.txt

@@ -792,6 +792,7 @@ Performance Improvements
 Bug Fixes
 ~~~~~~~~~

+- Bug in ``DataFrame.values`` now returns an ``numpy.ndarray`` of ``pandas.Timestamp`` for tz-aware columns; previously this returned ``pandas.DateTimeIndex`` (:issue:`14052`)


say object dtyped numpy array of Timestamp you don't need to say pandas.* that's implied.

jreback · 2017-03-20T12:28:01Z

pandas/core/generic.py

-        will result in a flot64 dtype.
+        will result in a float64 dtype.
+
+        Unlike ``Series.values``, Timezone aware datetime data are


just say that for tz-aware dtypes you will be upcast to an object array.

jreback · 2017-03-20T12:28:23Z

pandas/core/generic.py

@@ -4850,7 +4850,7 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None,

        msg = "Boolean array expected for the condition, not {dtype}"

-        if not isinstance(cond, pd.DataFrame):
+        if not isinstance(cond, (pd.DataFrame, pd.Panel)):


take this out it is wrong.

jreback · 2017-03-20T12:28:38Z

pandas/core/generic.py

@@ -5783,7 +5784,7 @@ def pct_change(self, periods=1, fill_method='pad', limit=None, freq=None,
                                  **kwargs)) - 1)
        if freq is None:
            mask = isnull(_values_from_object(self))
-            np.putmask(rs.values, mask, np.nan)


again you are not fixing the underlying issue

jreback · 2017-03-20T12:28:56Z

pandas/core/internals.py

                self.values.ravel(), f).reshape(self.values.shape)
-        return self.values
+
+            if values.ndim == self.ndim - 1:


if you are doing reshaping of any kind then this is wrong

jreback · 2017-03-20T12:31:48Z

pandas/core/internals.py

+
+        # return 1 bigger on the itemsize if unsinged
+        if lcd.kind == 'u':
+            return np.dtype('int%s' % (lcd.itemsize * 8 * 2))


a Categorical is a 1-d object. So what you are doing is fundamentally wrong. You don't reshape things like. You must convert categoricals to object arrays. That logic was done before, not sure why you trying to jump thru hoops.

what code was doing that conversion to array previously ? this line https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals.py#L3443 ? or somewhere else ?

np.array(cat) does this

my general point is to try follow the existing code (yes there is a lot, so I'll give you tips along the way).

overal, it would be nice for internal functions/methods to have even a minimal docstring or comments to know the input/output expected and the objective ... not easy to understand otherwise the "contract" of the function.
I have the feeling the _interleaved_dtype function is similar in purpose to the _find_common_type function but limited to numpy dtypes (vs pandas dtypes), is that correct ?

I would love to do it, but I currently fail to understand the real purpose of each of these functions. I propose that I first add the doc (and you both chceck/confirm) before I continue trying to fix the issue

@sdementen certainly welcome more doc-strings

@sdementen yes _interleave_dtypeis similar to _find_common_type; it might work / can-re replaced by it. (something like this I would try on an independent PR; if it works, submit separately, and then you can rebase on top of it)

@jreback thank you for this ! I will work again on the PR based on this but not immediately as I have personal constraints this week. Should the fix be simple now that you (or someone else) want to move forward with another PR than the mine, I will not be offended (I learned already a lot!)

@sdementen nope, open for you.

remove bug fix on Panel support

xref #15736 xref #12780 Author: Jeff Reback <[email protected]> Closes #15765 from jreback/common_types and squashes the following commits: d472646 [Jeff Reback] try removing restriction on windows 8d07cae [Jeff Reback] CLN: replace _interleave_dtype with _find_common_type

jreback · 2017-03-21T15:08:11Z

@sdementen I merged #15765 so I think this will be quite a lot simpler now.

xref pandas-dev#15736 xref pandas-dev#12780 Author: Jeff Reback <[email protected]> Closes pandas-dev#15765 from jreback/common_types and squashes the following commits: d472646 [Jeff Reback] try removing restriction on windows 8d07cae [Jeff Reback] CLN: replace _interleave_dtype with _find_common_type

jreback · 2017-05-07T13:57:32Z

@sdementen rebase / update?

jreback · 2017-06-10T19:05:00Z

can you rebase and update?

sdementen · 2017-06-12T07:56:35Z

@jreback I am currently overloaded, I'll have to wait for a better period with some free time !

jreback · 2017-07-26T23:57:53Z

good ideas, needs a rebase / update. pls comment if you'd like to continue.

sdementen added 18 commits March 18, 2017 06:01

add tests that succeed with current pandas version but show the incon…

cac51c6

…sistency on values

BUG: fix df.values when df is of type datetime64[ns,tz] to return nda…

afc4303

…rray instead of DateTimeIndex

fix small bugs:

8b46fad

- _where was not special casing pd.Panels - np.putmask was used instead of rs.mask

BUG: fix interleave and as_matrix by reworking logic

78f931d

add test for issue pandas-dev#14052

6fa7426

BUG: fix interleave and as_matrix by reworking logic

49b0769

return datetime64[ns] for values for any block with only datetime64 d…

415d120

…erived dtypes

fix test on dtype of values for datetim64 with 2 different tz

b535cf4

remove test on dframes as buggy (just to try CI)

59e4ca5

use dtype in get_values for special case

59c6358

reshape values in get_values of DatetimeTZBlock (follow same logic of…

46fe072

… get_values of father class NonConsolidatableMixIn)

reshape values in get_values of DatetimeTZBlock (follow same logic of…

fd32043

… get_values of father class NonConsolidatableMixIn)

revert changes

69cec8e

revert changes

e83bc0a

clean tests

ce9ef4f

update whatsnew

7dbd3a7

fix for flake8

479b16d

add control of type of first element of df.values

99e71d4

sdementen changed the title ~~Value returns ndarray for series with datetime64 tz~~ Value returns ndarray for dataframes with a single column with datetime64 tz-aware Mar 19, 2017

fix call of reshape on dti, only call on ndarray

d4a2554

jreback requested changes Mar 20, 2017

View reviewed changes

jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype labels Mar 20, 2017

sdementen added 2 commits March 20, 2017 06:20

use rs[mask] instead of rs.mask(mask,...

48e5e79

use assert_numpy_array_equal when testing result of df.values and mak…

767218f

…e more explicity that we test _interleave_dtype

sdementen added 3 commits March 20, 2017 06:34

add test for categorical with integer values

8e874ea

add comment to clarify difference with Series.values re tz aware colu…

9a463a6

…mns + fix typo

revert change from commit 48e5e79 as the latter breaks the test test_…

6cd4832

…pct_change

jreback requested changes Mar 20, 2017

View reviewed changes

sdementen added 3 commits March 20, 2017 18:26

clarify doc following review

a72a01a

use rs.iloc[mask] instead of rs.mask

5158f2b

remove bug fix on Panel support

simplify docstring

538af78

jreback mentioned this pull request Mar 21, 2017

CLN: replace _interleave_dtype with _find_common_type #15765

Closed

jreback closed this Jul 26, 2017

Uh oh!

Value returns ndarray for dataframes with a single column with datetime64 tz-aware #15736

Value returns ndarray for dataframes with a single column with datetime64 tz-aware #15736

Uh oh!

Conversation

sdementen commented Mar 19, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

codecov bot commented Mar 20, 2017 •

edited

Loading