
.values returns ndarray for dataframes with a single column with datetime64 tz-aware #15736


Closed
Changes from all commits (27 commits)
cac51c6
add tests that succeed with current pandas version but show the incon…
sdementen Mar 18, 2017
afc4303
BUG: fix df.values when df is of type datetime64[ns,tz] to return nda…
sdementen Mar 18, 2017
8b46fad
fix small bugs:
sdementen Mar 18, 2017
78f931d
BUG: fix interleave and as_matrix by reworking logic
sdementen Mar 18, 2017
6fa7426
add test for issue #14052
sdementen Mar 18, 2017
49b0769
BUG: fix interleave and as_matrix by reworking logic
sdementen Mar 18, 2017
415d120
return datetime64[ns] for values for any block with only datetime64 d…
sdementen Mar 19, 2017
b535cf4
fix test on dtype of values for datetim64 with 2 different tz
sdementen Mar 19, 2017
59e4ca5
remove test on dframes as buggy (just to try CI)
sdementen Mar 19, 2017
59c6358
use dtype in get_values for special case
sdementen Mar 19, 2017
46fe072
reshape values in get_values of DatetimeTZBlock (follow same logic of…
sdementen Mar 19, 2017
fd32043
reshape values in get_values of DatetimeTZBlock (follow same logic of…
sdementen Mar 19, 2017
69cec8e
revert changes
sdementen Mar 19, 2017
e83bc0a
revert changes
sdementen Mar 19, 2017
ce9ef4f
clean tests
sdementen Mar 19, 2017
7dbd3a7
update whatsnew
sdementen Mar 19, 2017
479b16d
fix for flake8
sdementen Mar 19, 2017
99e71d4
add control of type of first element of df.values
sdementen Mar 19, 2017
d4a2554
fix call of reshape on dti, only call on ndarray
sdementen Mar 19, 2017
48e5e79
use rs[mask] instead of rs.mask(mask,...
sdementen Mar 20, 2017
767218f
use assert_numpy_array_equal when testing result of df.values and mak…
sdementen Mar 20, 2017
8e874ea
add test for categorical with integer values
sdementen Mar 20, 2017
9a463a6
add comment to clarify difference with Series.values re tz aware colu…
sdementen Mar 20, 2017
6cd4832
revert change from commit 48e5e7942d4edb962c4fa51d69d070f87f985372 as…
sdementen Mar 20, 2017
a72a01a
clarify doc following review
sdementen Mar 20, 2017
5158f2b
use rs.iloc[mask] instead of rs.mask
sdementen Mar 21, 2017
538af78
simplify docstring
sdementen Mar 21, 2017
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.20.0.txt
@@ -792,6 +792,7 @@ Performance Improvements
Bug Fixes
~~~~~~~~~

- Bug in ``DataFrame.values`` now returns an object-dtype numpy array of ``Timestamp`` for tz-aware columns; previously this returned ``DatetimeIndex`` (:issue:`14052`)
- Bug in ``Timestamp.replace`` now raises ``TypeError`` when incorrect argument names are given; previously this raised ``ValueError`` (:issue:`15240`)
- Bug in ``Index`` power operations with reversed operands (:issue:`14973`)
- Bug in ``TimedeltaIndex`` addition where overflow was being allowed without error (:issue:`14816`)
7 changes: 5 additions & 2 deletions pandas/core/generic.py
@@ -3076,7 +3076,9 @@ def values(self):
e.g. If the dtypes are float16 and float32, dtype will be upcast to
float32. If dtypes are int32 and uint8, dtype will be upcast to
int32. By numpy.find_common_type convention, mixing int64 and uint64
will result in a flot64 dtype.
will result in a float64 dtype.

Unlike ``Series.values``, tz-aware dtypes will be upcasted to object.
"""
return self.as_matrix()
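An illustrative sketch of the behaviour the added docstring sentence describes (assuming a single tz-aware column; current pandas matches the behaviour this change establishes):

import pandas as pd

df = pd.DataFrame({"B": pd.date_range("20130101", periods=3, tz="US/Eastern")})

# DataFrame.values upcasts the tz-aware column to an object array of Timestamps
print(df.values.dtype)        # object
print(type(df.values[0, 0]))  # Timestamp, tz preserved

# Series.values, by contrast, returns a tz-naive datetime64[ns] ndarray (UTC)
print(df["B"].values.dtype)   # datetime64[ns]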

@@ -5098,6 +5100,7 @@ def where(self, cond, other=np.nan, inplace=False, axis=None, level=None,
try_cast=False, raise_on_error=True):

other = com._apply_if_callable(other, self)

return self._where(cond, other, inplace, axis, level, try_cast,
raise_on_error)

@@ -5783,7 +5786,7 @@ def pct_change(self, periods=1, fill_method='pad', limit=None, freq=None,
**kwargs)) - 1)
if freq is None:
mask = isnull(_values_from_object(self))
np.putmask(rs.values, mask, np.nan)
Contributor:

rs[mask] = np.nan is enough here

Contributor (author):

yes, indeed

Contributor (author):

I reverted back to rs.mask(mask, np.nan, inplace=True) as rs[mask] = np.nan fails the test test_pct_change

Contributor:

again you are not fixing the underlying issue

Contributor (author):

Can you confirm the documentation on mask is correct? If so, I do not see the issue with fixing the previously buggy code (you confirm that np.putmask(rs.values, mask, np.nan) is buggy?), as otherwise test_pct_change fails.
Can you also confirm that your proposal rs[mask] = np.nan is not equivalent to rs.mask(mask, np.nan, inplace=True), which is equivalent to the original np.putmask(rs.values, mask, np.nan) but without assuming rs.values is a view?

Contributor:

np.putmask assumes this is a view. We don't use inplace anywhere internally in the code; it is completely non-idiomatic.

rs[mask] = np.nan looks fine to me, what is the problem?

Contributor (author):

It doesn't work ;-) I have tried.

Isn't boolean indexing only working for vectors? Or did you mean rs.loc[mask] (I haven't tried)?

If rs.loc[mask] works, I can replace it.

Contributor (author):

rs.mask replaced by rs.iloc[mask]

rs.iloc[mask] = np.nan
return rs
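To illustrate the view concern discussed above, a rough sketch (the frame and mask are made up for the example): when .values has to build a fresh array, as it does for a tz-aware column, np.putmask only writes into that temporary copy, whereas writing through an indexer mutates the frame itself.

import numpy as np
import pandas as pd

df = pd.DataFrame({"B": pd.date_range("20130101", periods=3, tz="US/Eastern")})
mask = np.array([False, True, False])

# df.values is a freshly built object array here, not a view of the block data,
# so this write is lost together with the temporary copy:
np.putmask(df.values, mask.reshape(-1, 1), np.nan)
print(df.isna().any().any())   # False -- the frame is unchanged

# writing through an indexer modifies the frame (the masked row becomes NaT):
df.iloc[mask] = np.nan
print(df.isna().any().any())   # True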

def _agg_by_level(self, name, axis=0, level=0, skipna=True, **kwargs):
72 changes: 55 additions & 17 deletions pandas/core/internals.py
@@ -2388,9 +2388,15 @@ def get_values(self, dtype=None):
# return object dtype as Timestamps with the zones
if is_object_dtype(dtype):
f = lambda x: lib.Timestamp(x, tz=self.values.tz)
return lib.map_infer(
values = lib.map_infer(
self.values.ravel(), f).reshape(self.values.shape)
return self.values

if values.ndim == self.ndim - 1:
Contributor:

what the heck is all this?

Contributor (author):

Without this, df[["B"]].values returns an ndarray with 1 dim, e.g. (3,), while we want (3, 1).
I saw the logic in NonConsolidatableMixIn.get_values, and as DatetimeTZBlock inherits from NonConsolidatableMixIn, I suspected the logic should also apply to DatetimeTZBlock. And it indeed fixes the issue.
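A small sketch of the shape problem described here, assuming the fixed behaviour:

import pandas as pd

df = pd.DataFrame({"B": pd.date_range("20130101", periods=3, tz="US/Eastern")})

# the underlying tz-aware data is inherently 1-D ...
print(df["B"].shape)            # (3,)
# ... but a one-column DataFrame must still expose a 2-D array:
print(df[["B"]].values.shape)   # (3, 1)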

Contributor (author):

This change, however, breaks another test, test_apply_non_numpy_dtype, which runs into a stack overflow because https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L4196 now returns an ndarray of shape (3, 1) instead of (3,). On that line, what is self.values expected to return? Seeing the creation of a Series that follows, I would guess a ndim=1 ndarray, but then why is self.values called when isinstance(self, DataFrame), as that is expected to return a ndim=2 ndarray?
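For context on why the (3,) vs (3, 1) distinction matters on that frame.py line, a quick illustration (my own example, not from the PR): constructing a Series from a 2-D ndarray is rejected outright.

import numpy as np
import pandas as pd

pd.Series(np.zeros(3))            # fine: 1-D input
try:
    pd.Series(np.zeros((3, 1)))   # 2-D input is rejected
except ValueError as exc:
    print(exc)                    # "Data must be 1-dimensional ..."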

Contributor:

if you are doing reshaping of any kind then this is wrong

Contributor (author):

Can you explain briefly why reshaping is not wrong in NonConsolidatableMixIn (as it is today)?

Contributor:

You can reshape, but that doesn't belong here. We are trying to make the code simpler, not more complex.

The issue is that you are dealing with a categorical (and for that matter a DatetimeIndex), which by definition are always 1-d objects.

The blocks need to emulate a 2-d structure for compat.

If you find that you need to reshape, you need to find another way.
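A short sketch of the point being made (illustrative example): the categorical itself stays 1-D, and it is only the consolidated DataFrame view that has to look 2-D, by densifying the categorical into an object column.

import pandas as pd

df = pd.DataFrame({"cat": pd.Categorical(["a", "b", "b"]),
                   "i": [1, 2, 3]})

print(df["cat"].values.ndim)              # 1 -- the Categorical is 1-D
print(df.values.shape, df.values.dtype)   # (3, 2) object -- densified 2-D view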

Contributor (author):

So reshaping in NonConsolidatableMixIn is also wrong?

values = values.reshape((1,) + values.shape)
else:
return self.values

return values

def to_object_block(self, mgr):
"""
@@ -3424,10 +3430,7 @@ def as_matrix(self, items=None):
else:
mgr = self

if self._is_single_block or not self.is_mixed_type:
return mgr.blocks[0].get_values()
else:
return mgr._interleave()
return mgr._interleave()

def _interleave(self):
"""
@@ -3436,6 +3439,10 @@
"""
dtype = _interleaved_dtype(self.blocks)

if self._is_single_block or not self.is_mixed_type:
return np.array(self.blocks[0].get_values(dtype),
dtype=dtype, copy=False)

result = np.empty(self.shape, dtype=dtype)

if result.shape[0] == 0:
@@ -4485,33 +4492,64 @@ def _interleaved_dtype(blocks):
for x in blocks:
counts[type(x)].append(x)

have_int = len(counts[IntBlock]) > 0
have_bool = len(counts[BoolBlock]) > 0
have_object = len(counts[ObjectBlock]) > 0
have_int = len(counts[IntBlock]) > 0
have_float = len(counts[FloatBlock]) > 0
have_complex = len(counts[ComplexBlock]) > 0
have_dt64 = len(counts[DatetimeBlock]) > 0
have_dt64_tz = len(counts[DatetimeTZBlock]) > 0
have_td64 = len(counts[TimeDeltaBlock]) > 0
have_cat = len(counts[CategoricalBlock]) > 0
have_cat = len(counts[CategoricalBlock])
# TODO: have_sparse is not used
have_sparse = len(counts[SparseBlock]) > 0 # noqa
have_numeric = have_float or have_complex or have_int
has_non_numeric = have_dt64 or have_dt64_tz or have_td64 or have_cat
have_numeric = have_float + have_complex + have_int
have_dt = have_dt64 + have_dt64_tz
Contributor:

why is all of this changing?

Contributor (author):

I reworked the logic to take care of the case where there is a single column (which was bypassed before, as there was a shortcut in as_matrix).

Contributor:

Conceptually this is fine here. Ideally have simpler logic, though. It may be that we shouldn't change _interleaved_dtype at all; I thought it might be simpler.

Contributor (author):

Should _interleaved_dtype be called when there is a single Block? Because that is essentially what I did (make it work for the single-Block case).

have_non_numeric = have_dt64 + have_dt64_tz + have_td64 + have_cat
have_non_dt = have_td64 + have_cat
have_mixed = bool(have_numeric) + bool(have_non_dt) + bool(have_dt)

if (have_object or
(have_bool and
(have_numeric or have_dt64 or have_dt64_tz or have_td64)) or
(have_numeric and has_non_numeric) or have_cat or have_dt64 or
have_dt64_tz or have_td64):
(have_non_numeric > 1) or  # more than one type of non-numeric
(have_bool and have_mixed) or  # mix of numeric and non-numeric
(have_mixed > 1) or  # mix of numeric and non-numeric
have_dt64_tz or
(have_cat > 1)):
return np.dtype(object)
elif have_dt64:
return np.dtype("datetime64[ns]")
elif have_td64:
return np.dtype("timedelta64[ns]")
elif have_bool:
return np.dtype(bool)
return np.dtype("bool")
elif have_cat:
# return blocks[0].get_values().dtype
Contributor:

what the heck is this?

Contributor (author):

As we now always go through the method mgr._interleave (even for the "_is_single_block" case), we must ensure that the function _interleaved_dtype takes care of the case where there is only one dtype, which is why I added the "elif have_cat:" block (which is nearly identical to the block following it, except for CategoricalBlock vs. IntBlock).

Contributor:

see above comment

# if we are mixing unsigned and signed, then return
# the next biggest int type (if we can)

dts = [b.get_values().dtype for b in counts[CategoricalBlock]]
lcd = _find_common_type(dts)
kinds = set([_dt.kind for _dt in dts])

if len(kinds) == 1:
return lcd

if lcd == 'uint64' or lcd == 'int64':
return np.dtype('int64')

# return 1 bigger on the itemsize if unsigned
if lcd.kind == 'u':
return np.dtype('int%s' % (lcd.itemsize * 8 * 2))
Contributor:

A Categorical is a 1-d object, so what you are doing is fundamentally wrong. You don't reshape things like that. You must convert categoricals to object arrays. That logic was done before; not sure why you are trying to jump through hoops.

Contributor (author):

What code was doing that conversion to an array previously? This line https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals.py#L3443, or somewhere else?

Contributor:

np.array(cat) does this
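A quick illustration of that conversion (example values mirror the 'cat' and 'cati' columns used in the tests below):

import numpy as np
import pandas as pd

print(np.array(pd.Categorical(["a", "b", "b"])))    # ['a' 'b' 'b'] (object dtype)
print(np.array(pd.Categorical([100, 4, 3])).dtype)  # int64 -- integer categories densify to ints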

Contributor:

My general point is to try to follow the existing code (yes, there is a lot, so I'll give you tips along the way).

Contributor (author):

Overall, it would be nice for internal functions/methods to have even a minimal docstring or comments describing the expected input/output and the objective ... otherwise the "contract" of the function is not easy to understand.
I have the feeling the _interleaved_dtype function is similar in purpose to the _find_common_type function but limited to numpy dtypes (vs. pandas dtypes); is that correct?

Contributor (author):

I would love to do it, but I currently fail to understand the real purpose of each of these functions. I propose that I first add the doc (and you both check/confirm) before I continue trying to fix the issue.

Contributor:

@sdementen certainly welcome more doc-strings

Contributor:

@sdementen yes, _interleaved_dtype is similar to _find_common_type; it might work / can be replaced by it. (Something like this I would try in an independent PR; if it works, submit it separately, and then you can rebase on top of it.)

Contributor (author):

@jreback thank you for this! I will work on the PR again based on this, but not immediately, as I have personal constraints this week. Should the fix turn out to be simple and you (or someone else) want to move forward with a PR other than mine, I will not be offended (I have already learned a lot!).

Contributor:

@sdementen nope, open for you.

return lcd

elif have_int and not have_float and not have_complex:
# if we are mixing unsigned and signed, then return
# the next biggest int type (if we can)
lcd = _find_common_type([b.dtype for b in counts[IntBlock]])
kinds = set([i.dtype.kind for i in counts[IntBlock]])

dts = [b.dtype for b in counts[IntBlock]]
lcd = _find_common_type(dts)
kinds = set([_dt.kind for _dt in dts])

if len(kinds) == 1:
return lcd
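For comparison, the numpy promotion convention that the values() docstring refers to and that this signed/unsigned branch mirrors; a rough sketch only, since the pandas helper has its own special cases:

import numpy as np

print(np.result_type(np.int32, np.uint8))    # int32   -- signed type already wide enough
print(np.result_type(np.int32, np.uint32))   # int64   -- next bigger signed integer
print(np.result_type(np.int64, np.uint64))   # float64 -- no wider integer available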

87 changes: 81 additions & 6 deletions pandas/tests/frame/test_dtypes.py
@@ -1,23 +1,25 @@
# -*- coding: utf-8 -*-

from __future__ import print_function

import itertools
from datetime import timedelta

import numpy as np

import pandas as pd
import pandas.util.testing as tm
from pandas import (DataFrame, Series, date_range, Timedelta, Timestamp,
compat, concat, option_context)
from pandas.compat import u
from pandas.types.dtypes import DatetimeTZDtype
from pandas.tests.frame.common import TestData
from pandas.types.dtypes import DatetimeTZDtype
from pandas.util.testing import (assert_series_equal,
assert_frame_equal,
makeCustomDataframe as mkdf)
import pandas.util.testing as tm
import pandas as pd


class TestDataFrameDataTypes(tm.TestCase, TestData):

def test_concat_empty_dataframe_dtypes(self):
df = DataFrame(columns=list("abc"))
df['a'] = df['a'].astype(np.bool_)
@@ -198,7 +200,7 @@ def test_select_dtypes_not_an_attr_but_still_valid_dtype(self):
def test_select_dtypes_empty(self):
df = DataFrame({'a': list('abc'), 'b': list(range(1, 4))})
with tm.assertRaisesRegexp(ValueError, 'at least one of include or '
'exclude must be nonempty'):
'exclude must be nonempty'):
df.select_dtypes()

def test_select_dtypes_raises_on_string(self):
@@ -536,7 +538,6 @@ def test_arg_for_errors_in_astype(self):


class TestDataFrameDatetimeWithTZ(tm.TestCase, TestData):

def test_interleave(self):

# interleave with object
@@ -622,3 +623,77 @@ def test_astype_str(self):
'NaT NaT' in result)
self.assertTrue('2 2013-01-03 2013-01-03 00:00:00-05:00 '
'2013-01-03 00:00:00+01:00' in result)

def test_values_is_ndarray_with_datetime64tz(self):
df = DataFrame({
'A': date_range('20130101', periods=3),
'B': date_range('20130101', periods=3, tz='US/Eastern'),
})

for col in [
["A"],
["A", "A"],
["A", "B"],
["B", "B"],
["B"],
]:
arr = df[col].values
dtype_expected = "<M8[ns]" if "B" not in col else object
arr_expected = np.array(list(df[col].itertuples(index=False)),
dtype=dtype_expected)

tm.assert_numpy_array_equal(arr, arr_expected)

def test_values_dtypes_with_datetime64tz(self):
df = DataFrame({'dt': date_range('20130101', periods=3),
'dttz': date_range('20130101', periods=3,
tz='US/Eastern'),
'td': (date_range('20130102', periods=3) -
date_range('20130101', periods=3)),
'cat': pd.Categorical(['a', 'b', 'b']),
'cati': pd.Categorical([100, 4, 3]),
'b': [True, False, False],
'i': [1, 2, 3],
'f': [1.3, 2, 3],
'c': [1j, 2, 3],
})

cols = itertools.chain(
itertools.combinations_with_replacement(df.columns, 1),
itertools.combinations_with_replacement(df.columns, 2)
)
for col in cols:
df_sub = df[list(col)]
dts = df_sub.dtypes.values

# calculate dtype_expected as a function of the dtypes of the dataframe
# (testing the logic of the _interleaved_dtype
# function in pandas/core/internals.py)

# all columns of the same type
if len(set(dts)) == 1:
if dts[0] in ("M8[ns]", "m8[ns]",
bool, complex, int, float):
dtype_expected = dts[0]
else:
if col == ("cati", ):
dtype_expected = 'int64'
else:
dtype_expected = object

# different type of columns
else:
# all numeric and complex
if all(np.in1d(dts, (complex, int, float))) and complex in dts:
dtype_expected = complex
# all numeric and float
elif all(np.in1d(dts, (complex, int, float))) and float in dts:
dtype_expected = float
else:
dtype_expected = object

arr = df_sub.values
arr_expected = np.array(list(df_sub.itertuples(index=False)),
dtype=dtype_expected)

tm.assert_numpy_array_equal(arr, arr_expected)
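A note on how the expected arrays in these two tests are built (illustrative snippet, not part of the test file): itertuples(index=False) yields one tuple per row, and numpy assembles those tuples into a 2-D array of the requested dtype, which is then compared against df.values.

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": pd.date_range("20130101", periods=3),
                   "B": pd.date_range("20130101", periods=3, tz="US/Eastern")})

expected = np.array(list(df.itertuples(index=False)), dtype=object)
print(expected.shape, expected.dtype)   # (3, 2) object -- same layout as df.values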
1 change: 1 addition & 0 deletions pandas/tests/io/json/test_pandas.py
@@ -938,6 +938,7 @@ def test_tz_range_is_utc(self):
'"1":"2013-01-02T05:00:00.000Z"}}')

tz_range = pd.date_range('2013-01-01 05:00:00Z', periods=2)

self.assertEqual(exp, dumps(tz_range, iso_dates=True))
dti = pd.DatetimeIndex(tz_range)
self.assertEqual(exp, dumps(dti, iso_dates=True))