Value returns ndarray for dataframes with a single column with datetime64 tz-aware #15736
@@ -3076,7 +3076,10 @@ def values(self):
         e.g. If the dtypes are float16 and float32, dtype will be upcast to
         float32. If dtypes are int32 and uint8, dtype will be upcast to
         int32. By numpy.find_common_type convention, mixing int64 and uint64
-        will result in a flot64 dtype.
+        will result in a float64 dtype.
+
+        Unlike ``Series.values``, Timezone aware datetime data are
+        converted to ``pandas.Timestamp`` objects and not to UTC datetime64.
         """
         return self.as_matrix()

Review thread:
- just say that for tz-aware dtypes you will be upcast to an
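The new docstring sentence can be checked directly; a minimal sketch against a recent pandas (the column name is illustrative):

```python
import pandas as pd

# One tz-aware datetime column: .values gives an object-dtype ndarray of
# Timestamp objects rather than UTC datetime64[ns] values.
df = pd.DataFrame({
    'B': pd.date_range('20130101', periods=3, tz='US/Eastern'),
})

arr = df.values
print(arr.dtype)   # object
print(arr.shape)   # (3, 1)
print(isinstance(arr[0, 0], pd.Timestamp))  # True
```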
@@ -4850,7 +4853,7 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None,

         msg = "Boolean array expected for the condition, not {dtype}"

-        if not isinstance(cond, pd.DataFrame):
+        if not isinstance(cond, (pd.DataFrame, pd.Panel)):
             # This is a single-dimensional object.
             if not is_bool_dtype(cond):
                 raise ValueError(msg.format(dtype=cond.dtype))

Review thread:
- don't change unrelated things
- this fixes the test test_pct_change in pandas/pandas/tests/test_panel.py:1901 that failed after making the other changes (the rabbit hole...).
- take this out, it is wrong.
- I see that this code is relatively new (not in 0.19.2)
- we are not supporting Panels much anymore (they are going to be deprecated soon anyhow). I don't want to have to deal with this here.
- But then we need to remove the test with Panel that fails, correct?
- just
@@ -5098,6 +5101,7 @@ def where(self, cond, other=np.nan, inplace=False, axis=None, level=None,
               try_cast=False, raise_on_error=True):

+        other = com._apply_if_callable(other, self)

         return self._where(cond, other, inplace, axis, level, try_cast,
                            raise_on_error)
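For context, the added `com._apply_if_callable` line is what lets `where` accept a callable for `other`, evaluated against the calling object. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, -2, 3]})

# Where the condition is False, values are replaced by `other`; a callable
# `other` receives the DataFrame itself.
res = df.where(df > 0, lambda d: d * 10)
print(res['a'].tolist())  # [1, -20, 3]
```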
@@ -5783,7 +5787,7 @@ def pct_change(self, periods=1, fill_method='pad', limit=None, freq=None,
                                **kwargs)) - 1)
         if freq is None:
             mask = isnull(_values_from_object(self))
-            np.putmask(rs.values, mask, np.nan)
+            rs.mask(mask, np.nan, inplace=True)
         return rs

     def _agg_by_level(self, name, axis=0, level=0, skipna=True, **kwargs):

Review thread:
- yes, indeed
- I reverted back to
- again you are not fixing the underlying issue
- can you confirm the documentation on mask is correct? if so, I do not see the issue with fixing the previously buggy code (you confirm this
- It doesn't work ;-) I have tried. Isn't boolean indexing only working for vectors? Or did you mean rs.loc[mask] (I haven't tried)? If rs.loc[mask] works, I can replace it.
- rs.mask replaced by rs.iloc[mask]
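For readers unfamiliar with it, `DataFrame.mask` is the frame-level counterpart of the removed `np.putmask` call: it replaces values where the condition is True, going through the pandas API instead of writing into `rs.values` (which may be a fresh copy for non-consolidated blocks). A sketch with made-up data:

```python
import numpy as np
import pandas as pd

rs = pd.DataFrame({'x': [0.1, 0.2, 0.3]})
mask = rs > 0.15

# Same intent as np.putmask(rs.values, mask, np.nan), but applied through
# the DataFrame machinery, so it also works when .values is a copy.
rs.mask(mask, np.nan, inplace=True)
print(rs['x'].isna().tolist())  # [False, True, True]
```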
@@ -2388,9 +2388,15 @@ def get_values(self, dtype=None):
         # return object dtype as Timestamps with the zones
         if is_object_dtype(dtype):
             f = lambda x: lib.Timestamp(x, tz=self.values.tz)
-            return lib.map_infer(
+            values = lib.map_infer(
                 self.values.ravel(), f).reshape(self.values.shape)
-        return self.values
+
+            if values.ndim == self.ndim - 1:
+                values = values.reshape((1,) + values.shape)
+        else:
+            return self.values
+
+        return values

     def to_object_block(self, mgr):
         """

Review thread:
- what the heck is all this?
- without this, df[["B"]].values returns an ndarray with 1 dim, e.g. (3,), while we want (3, 1).
- this change however breaks another test, test_apply_non_numpy_dtype, which enters a stack overflow because https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L4196 now returns an ndarray with shape (3, 1) instead of (3,). On that line, what is self.values expected to return? Seeing the creation of a Series that follows, I would guess a ndim=1 ndarray, but then why is self.values called with isinstance(self, DataFrame), as that is expected to return a ndim=2 ndarray?
- if you are doing reshaping of any kind then this is wrong
- can you explain shortly why reshaping is not wrong in
- you can reshape, but that doesn't belong here. we are trying to make the code simpler, not more complex. The issue is that you are dealing with a Categorical (and for that matter a DatetimeIndex), which by definition are always 1-d objects. The blocks need to emulate a 2-d structure for compat. If you find that you need to reshape, you need to find another way.
- so reshaping in NonConsolidatableMixIn is also wrong?
@@ -3424,10 +3430,7 @@ def as_matrix(self, items=None):
         else:
             mgr = self

-        if self._is_single_block or not self.is_mixed_type:
-            return mgr.blocks[0].get_values()
-        else:
-            return mgr._interleave()
+        return mgr._interleave()

     def _interleave(self):
         """
@@ -3436,6 +3439,10 @@ def _interleave(self):
         """
         dtype = _interleaved_dtype(self.blocks)

+        if self._is_single_block or not self.is_mixed_type:
+            return np.array(self.blocks[0].get_values(dtype),
+                            dtype=dtype, copy=False)
+
         result = np.empty(self.shape, dtype=dtype)

         if result.shape[0] == 0:
@@ -4485,33 +4492,64 @@ def _interleaved_dtype(blocks):
     for x in blocks:
         counts[type(x)].append(x)

-    have_int = len(counts[IntBlock]) > 0
     have_bool = len(counts[BoolBlock]) > 0
     have_object = len(counts[ObjectBlock]) > 0
+    have_int = len(counts[IntBlock]) > 0
     have_float = len(counts[FloatBlock]) > 0
     have_complex = len(counts[ComplexBlock]) > 0
     have_dt64 = len(counts[DatetimeBlock]) > 0
     have_dt64_tz = len(counts[DatetimeTZBlock]) > 0
     have_td64 = len(counts[TimeDeltaBlock]) > 0
-    have_cat = len(counts[CategoricalBlock]) > 0
+    have_cat = len(counts[CategoricalBlock])
     # TODO: have_sparse is not used
     have_sparse = len(counts[SparseBlock]) > 0  # noqa
-    have_numeric = have_float or have_complex or have_int
-    has_non_numeric = have_dt64 or have_dt64_tz or have_td64 or have_cat
+    have_numeric = have_float + have_complex + have_int
+    have_dt = have_dt64 + have_dt64_tz
+    have_non_numeric = have_dt64 + have_dt64_tz + have_td64 + have_cat
+    have_non_dt = have_td64 + have_cat
+    have_mixed = bool(have_numeric) + bool(have_non_dt) + bool(have_dt)

     if (have_object or
-        (have_bool and
-         (have_numeric or have_dt64 or have_dt64_tz or have_td64)) or
-        (have_numeric and has_non_numeric) or have_cat or have_dt64 or
-        have_dt64_tz or have_td64):
+        (have_non_numeric > 1) or  # more than one type of non numeric
+        (have_bool and have_mixed) or  # mix of a numeric and non numeric
+        (have_mixed > 1) or  # mix of a numeric and non numeric
+        have_dt64_tz or
+        (have_cat > 1)):
         return np.dtype(object)
     elif have_dt64:
         return np.dtype("datetime64[ns]")
     elif have_td64:
         return np.dtype("timedelta64[ns]")
     elif have_bool:
-        return np.dtype(bool)
+        return np.dtype("bool")
+    elif have_cat:
+        # return blocks[0].get_values().dtype
+
+        # if we are mixing unsigned and signed, then return
+        # the next biggest int type (if we can)
+
+        dts = [b.get_values().dtype for b in counts[CategoricalBlock]]
+        lcd = _find_common_type(dts)
+        kinds = set([_dt.kind for _dt in dts])
+
+        if len(kinds) == 1:
+            return lcd
+
+        if lcd == 'uint64' or lcd == 'int64':
+            return np.dtype('int64')
+
+        # return 1 bigger on the itemsize if unsigned
+        if lcd.kind == 'u':
+            return np.dtype('int%s' % (lcd.itemsize * 8 * 2))
+
+        return lcd
+
     elif have_int and not have_float and not have_complex:
         # if we are mixing unsigned and signed, then return
         # the next biggest int type (if we can)
-        lcd = _find_common_type([b.dtype for b in counts[IntBlock]])
-        kinds = set([i.dtype.kind for i in counts[IntBlock]])
+        dts = [b.dtype for b in counts[IntBlock]]
+        lcd = _find_common_type(dts)
+        kinds = set([_dt.kind for _dt in dts])

         if len(kinds) == 1:
             return lcd

Review thread:
- why is all of this changing?
- I reworked the logic to be able to take care of the case where there was a single column (that was bypassed before, as there was a shortcut in get_matrix).
- conceptually this is fine here. Ideally have simpler logic though. It may be that we shouldn't change
- should _interleave_dtype be called when there is a single Block? because that is essentially what I did (make it work for the single-Block case)
- what the heck is this?
- as we now always go through the method mgr._interleave (even for the "_is_single_block" case), we must ensure that the function _interleaved_dtype takes care of the case where there is only 1 dtype, which is why I added the "elif have_cat:" block (which is nearly identical to the block following it, except for CategoricalBlock vs IntBlock)
- see above comment
- a Categorical is a 1-d object. So what you are doing is fundamentally wrong. You don't reshape things like this. You must convert categoricals to object arrays. That logic was done before; not sure why you are trying to jump through hoops.
- what code was doing that conversion to array previously? this line https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals.py#L3443? or somewhere else?
- my general point is to try to follow the existing code (yes, there is a lot, so I'll give you tips along the way).
- overall, it would be nice for internal functions/methods to have even a minimal docstring or comments giving the expected input/output and the objective ... it is not easy to understand the "contract" of a function otherwise.
- I would love to do it, but I currently fail to understand the real purpose of each of these functions. I propose that I first add the doc (and you both check/confirm) before I continue trying to fix the issue.
- @sdementen certainly welcome more doc-strings
- @sdementen yes
- @jreback thank you for this! I will work again on the PR based on this, but not immediately, as I have personal constraints this week. Should the fix be simple now and you (or someone else) want to move forward with another PR than mine, I will not be offended (I learned a lot already!)
- @sdementen nope, open for you.
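The dtype rules encoded by `_interleaved_dtype` surface through `DataFrame.values`; a few illustrative cases (a sketch of the behaviour of a recent pandas, not of this PR's code path):

```python
import numpy as np
import pandas as pd

# int64 mixed with uint64 has no common integer wide enough: upcast to float64
df = pd.DataFrame({'i': np.array([1, 2], dtype='int64'),
                   'u': np.array([1, 2], dtype='uint64')})
print(df.values.dtype)   # float64

# bool mixed with a numeric dtype falls back to object
df2 = pd.DataFrame({'b': [True, False], 'i': [1, 2]})
print(df2.values.dtype)  # object

# int mixed with float upcasts to float64
df3 = pd.DataFrame({'i': [1, 2], 'f': [1.5, 2.5]})
print(df3.values.dtype)  # float64
```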
@@ -1,23 +1,25 @@
 # -*- coding: utf-8 -*-

 from __future__ import print_function

+import itertools
 from datetime import timedelta

 import numpy as np

-import pandas as pd
-import pandas.util.testing as tm
 from pandas import (DataFrame, Series, date_range, Timedelta, Timestamp,
                     compat, concat, option_context)
 from pandas.compat import u
-from pandas.types.dtypes import DatetimeTZDtype
 from pandas.tests.frame.common import TestData
+from pandas.types.dtypes import DatetimeTZDtype
 from pandas.util.testing import (assert_series_equal,
                                 assert_frame_equal,
                                 makeCustomDataframe as mkdf)
+import pandas.util.testing as tm
+import pandas as pd


 class TestDataFrameDataTypes(tm.TestCase, TestData):

     def test_concat_empty_dataframe_dtypes(self):
         df = DataFrame(columns=list("abc"))
         df['a'] = df['a'].astype(np.bool_)
@@ -198,7 +200,7 @@ def test_select_dtypes_not_an_attr_but_still_valid_dtype(self):
     def test_select_dtypes_empty(self):
         df = DataFrame({'a': list('abc'), 'b': list(range(1, 4))})
         with tm.assertRaisesRegexp(ValueError, 'at least one of include or '
-                                   'exclude must be nonempty'):
+                                              'exclude must be nonempty'):
             df.select_dtypes()

     def test_select_dtypes_raises_on_string(self):
@@ -536,7 +538,6 @@ def test_arg_for_errors_in_astype(self):

-
 class TestDataFrameDatetimeWithTZ(tm.TestCase, TestData):

     def test_interleave(self):

         # interleave with object

@@ -622,3 +623,77 @@ def test_astype_str(self):
                         'NaT NaT' in result)
         self.assertTrue('2 2013-01-03 2013-01-03 00:00:00-05:00 '
                         '2013-01-03 00:00:00+01:00' in result)
+
+    def test_values_is_ndarray_with_datetime64tz(self):
+        df = DataFrame({
+            'A': date_range('20130101', periods=3),
+            'B': date_range('20130101', periods=3, tz='US/Eastern'),
+        })
+
+        for col in [
+            ["A"],
+            ["A", "A"],
+            ["A", "B"],
+            ["B", "B"],
+            ["B"],
+        ]:
+            arr = df[col].values
+            dtype_expected = "<M8[ns]" if "B" not in col else object
+            arr_expected = np.array(list(df[col].itertuples(index=False)),
+                                    dtype=dtype_expected)
+
+            tm.assert_numpy_array_equal(arr, arr_expected)
+    def test_values_dtypes_with_datetime64tz(self):
+        df = DataFrame({'dt': date_range('20130101', periods=3),
+                        'dttz': date_range('20130101', periods=3,
+                                           tz='US/Eastern'),
+                        'td': (date_range('20130102', periods=3) -
+                               date_range('20130101', periods=3)),
+                        'cat': pd.Categorical(['a', 'b', 'b']),
+                        'cati': pd.Categorical([100, 4, 3]),
+                        'b': [True, False, False],
+                        'i': [1, 2, 3],
+                        'f': [1.3, 2, 3],
+                        'c': [1j, 2, 3],
+                        })
+
+        cols = itertools.chain(
+            itertools.combinations_with_replacement(df.columns, 1),
+            itertools.combinations_with_replacement(df.columns, 2)
+        )
+        for col in cols:
+            df_sub = df[list(col)]
+
+            dts = df_sub.dtypes.values
+
+            # calculate dtype_expected as a function of the dtypes of the
+            # dataframe (testing the logic of the _interleaved_dtype
+            # function in pandas/core/internals.py)
+
+            # all columns of the same type
+            if len(set(dts)) == 1:
+                if dts[0] in ("M8[ns]", "m8[ns]",
+                              bool, complex, int, float):
+                    dtype_expected = dts[0]
+                else:
+                    if col == ("cati", ):
+                        dtype_expected = 'int64'
+                    else:
+                        dtype_expected = object
+
+            # different types of columns
+            else:
+                # all numeric, including a complex column
+                if all(np.in1d(dts, (complex, int, float))) and complex in dts:
+                    dtype_expected = complex
+                # all numeric, including a float column
+                elif all(np.in1d(dts, (complex, int, float))) and float in dts:
+                    dtype_expected = float
+                else:
+                    dtype_expected = object
+
+            arr = df_sub.values
+            arr_expected = np.array(list(df_sub.itertuples(index=False)),
+                                    dtype=dtype_expected)
+
+            tm.assert_numpy_array_equal(arr, arr_expected)

Review thread:
- much better to write this as a pytest.mark.parametrize'd version.
- seeing all the self.assertEqual, I thought we were using "nosetests" style and avoiding any pytest features. BTW, http://pandas.pydata.org/pandas-docs/stable/contributing.html#running-the-test-suite says to run the test suite with nosetests, not pytest. I could never run the suite with nosetests, but it works fine with pytest.
- The stable docs still mention nose, as we only switched recently. In the dev docs it is already updated: http://pandas-docs.github.io/pandas-docs-travis/contributing.html#running-the-test-suite
- @jorisvandenbossche Tx Joris, good to know! especially the test_fast ;-)
- what exactly are you trying to test here?
- testing the logic of _interleaved_dtype by exercising df.values on all combinations of dtypes and verifying the expected dtype. I have adapted the test comments to make it more explicit.
Review thread:
- say ``object``-dtyped numpy array of ``Timestamp``; you don't need to say ``pandas.*``, that's implied.
- ok
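The reviewer's `pytest.mark.parametrize` suggestion for the first new test could look roughly like this (a sketch, not code from this PR; `numpy.testing.assert_array_equal` stands in for `pandas.util.testing`, whose import path has since changed):

```python
import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize("cols", [
    ["A"], ["A", "A"], ["A", "B"], ["B", "B"], ["B"],
])
def test_values_is_ndarray_with_datetime64tz(cols):
    df = pd.DataFrame({
        'A': pd.date_range('20130101', periods=3),
        'B': pd.date_range('20130101', periods=3, tz='US/Eastern'),
    })
    arr = df[cols].values
    # any tz-aware column forces an object-dtype result
    dtype_expected = "<M8[ns]" if "B" not in cols else object
    expected = np.array(list(df[cols].itertuples(index=False)),
                        dtype=dtype_expected)
    np.testing.assert_array_equal(arr, expected)
```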