-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Value returns ndarray for dataframes with a single column with datetime64 tz-aware #15736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
cac51c6
afc4303
8b46fad
78f931d
6fa7426
49b0769
415d120
b535cf4
59e4ca5
59c6358
46fe072
fd32043
69cec8e
e83bc0a
ce9ef4f
7dbd3a7
479b16d
99e71d4
d4a2554
48e5e79
767218f
8e874ea
9a463a6
6cd4832
a72a01a
5158f2b
538af78
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2388,9 +2388,15 @@ def get_values(self, dtype=None): | |
# return object dtype as Timestamps with the zones | ||
if is_object_dtype(dtype): | ||
f = lambda x: lib.Timestamp(x, tz=self.values.tz) | ||
return lib.map_infer( | ||
values = lib.map_infer( | ||
self.values.ravel(), f).reshape(self.values.shape) | ||
return self.values | ||
|
||
if values.ndim == self.ndim - 1: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. what the heck is all this? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. without this, df[["B"]].values returns an ndarray with 1 dim, e.g. (3,) while we want (3,1). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this change however breaks another test, the test_apply_non_numpy_dtype, that enters in a stack overflow due to the fact that https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L4196 now returns an ndarray with (3,1) shape instead of a (3,) shape. On this https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L4196, what is self.values expected to return ? Seeing the creation of a Series that follows, I would guess it is a ndim=1 ndarray but then why is self.values called, with isinstance(self, DataFrame), as this is expected to return a ndim=2 ndarray ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if you are doing reshaping of any kind then this is wrong There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you explain shortly why reshaping is not wrong in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you can reshape, but that doesn't belong here. we are trying to make the code simpler, not more complex. The issue is that you are dealing with a categorical (and for that matter a DatetimeIndex), which by-definition are always 1-d objects. The blocks need to emulate a 2-d structure for compat. If you find that you need to reshape, need to find another way. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so reshaping in NonConsolidatableMixIn is also wrong ? |
||
values = values.reshape((1,) + values.shape) | ||
else: | ||
return self.values | ||
|
||
return values | ||
|
||
def to_object_block(self, mgr): | ||
""" | ||
|
@@ -3424,10 +3430,7 @@ def as_matrix(self, items=None): | |
else: | ||
mgr = self | ||
|
||
if self._is_single_block or not self.is_mixed_type: | ||
return mgr.blocks[0].get_values() | ||
else: | ||
return mgr._interleave() | ||
return mgr._interleave() | ||
|
||
def _interleave(self): | ||
""" | ||
|
@@ -3436,6 +3439,10 @@ def _interleave(self): | |
""" | ||
dtype = _interleaved_dtype(self.blocks) | ||
|
||
if self._is_single_block or not self.is_mixed_type: | ||
return np.array(self.blocks[0].get_values(dtype), | ||
dtype=dtype, copy=False) | ||
|
||
result = np.empty(self.shape, dtype=dtype) | ||
|
||
if result.shape[0] == 0: | ||
|
@@ -4485,33 +4492,64 @@ def _interleaved_dtype(blocks): | |
for x in blocks: | ||
counts[type(x)].append(x) | ||
|
||
have_int = len(counts[IntBlock]) > 0 | ||
have_bool = len(counts[BoolBlock]) > 0 | ||
have_object = len(counts[ObjectBlock]) > 0 | ||
have_int = len(counts[IntBlock]) > 0 | ||
have_float = len(counts[FloatBlock]) > 0 | ||
have_complex = len(counts[ComplexBlock]) > 0 | ||
have_dt64 = len(counts[DatetimeBlock]) > 0 | ||
have_dt64_tz = len(counts[DatetimeTZBlock]) > 0 | ||
have_td64 = len(counts[TimeDeltaBlock]) > 0 | ||
have_cat = len(counts[CategoricalBlock]) > 0 | ||
have_cat = len(counts[CategoricalBlock]) | ||
# TODO: have_sparse is not used | ||
have_sparse = len(counts[SparseBlock]) > 0 # noqa | ||
have_numeric = have_float or have_complex or have_int | ||
has_non_numeric = have_dt64 or have_dt64_tz or have_td64 or have_cat | ||
have_numeric = have_float + have_complex + have_int | ||
have_dt = have_dt64 + have_dt64_tz | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why is all of this changing? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I reworked the logic to be able to take care the case where there was a single column (that was bypassed before as there was a shortcut in get_matrix) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. conceptually this is fine here. Ideally have simpler logic though. It maybe that we shouldn'tchange There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should _interleave_dtype be called when there is a single Block ? because, it is essentially what I did (make it work for the single Block case) |
||
have_non_numeric = have_dt64 + have_dt64_tz + have_td64 + have_cat | ||
have_non_dt = have_td64 + have_cat | ||
have_mixed = bool(have_numeric) + bool(have_non_dt) + bool(have_dt) | ||
|
||
if (have_object or | ||
(have_bool and | ||
(have_numeric or have_dt64 or have_dt64_tz or have_td64)) or | ||
(have_numeric and has_non_numeric) or have_cat or have_dt64 or | ||
have_dt64_tz or have_td64): | ||
(have_non_numeric > 1) or # more than one type of non numeric | ||
(have_bool and have_mixed) or # mix of a numeric et non numeric | ||
(have_mixed > 1) or # mix of a numeric et non numeric | ||
have_dt64_tz or | ||
(have_cat > 1)): | ||
return np.dtype(object) | ||
elif have_dt64: | ||
return np.dtype("datetime64[ns]") | ||
elif have_td64: | ||
return np.dtype("timedelta64[ns]") | ||
elif have_bool: | ||
return np.dtype(bool) | ||
return np.dtype("bool") | ||
elif have_cat: | ||
# return blocks[0].get_values().dtype | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. what the heck is this? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. as now in we always go through the method mgt._interleave (even for the case "_is_single_block"), we must ensure that the function _interleaved_dtype takes care of the case where there is only 1 dtype which is why I added the "else have_cat:" block (which is very identical to the block following it except for CategoricalBlock <> IntBlock There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. see above comment |
||
# if we are mixing unsigned and signed, then return | ||
# the next biggest int type (if we can) | ||
|
||
dts = [b.get_values().dtype for b in counts[CategoricalBlock]] | ||
lcd = _find_common_type(dts) | ||
kinds = set([_dt.kind for _dt in dts]) | ||
|
||
if len(kinds) == 1: | ||
return lcd | ||
|
||
if lcd == 'uint64' or lcd == 'int64': | ||
return np.dtype('int64') | ||
|
||
# return 1 bigger on the itemsize if unsinged | ||
if lcd.kind == 'u': | ||
return np.dtype('int%s' % (lcd.itemsize * 8 * 2)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. a Categorical is a 1-d object. So what you are doing is fundamentally wrong. You don't reshape things like. You must convert categoricals to object arrays. That logic was done before, not sure why you trying to jump thru hoops. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. what code was doing that conversion to array previously ? this line https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals.py#L3443 ? or somewhere else ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. my general point is to try follow the existing code (yes there is a lot, so I'll give you tips along the way). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. overal, it would be nice for internal functions/methods to have even a minimal docstring or comments to know the input/output expected and the objective ... not easy to understand otherwise the "contract" of the function. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would love to do it, but I currently fail to understand the real purpose of each of these functions. I propose that I first add the doc (and you both chceck/confirm) before I continue trying to fix the issue There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @sdementen certainly welcome more doc-strings There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @sdementen yes There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jreback thank you for this ! I will work again on the PR based on this but not immediately as I have personal constraints this week. Should the fix be simple now that you (or someone else) want to move forward with another PR than the mine, I will not be offended (I learned already a lot!) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @sdementen nope, open for you. |
||
return lcd | ||
|
||
elif have_int and not have_float and not have_complex: | ||
# if we are mixing unsigned and signed, then return | ||
# the next biggest int type (if we can) | ||
lcd = _find_common_type([b.dtype for b in counts[IntBlock]]) | ||
kinds = set([i.dtype.kind for i in counts[IntBlock]]) | ||
|
||
dts = [b.dtype for b in counts[IntBlock]] | ||
lcd = _find_common_type(dts) | ||
kinds = set([_dt.kind for _dt in dts]) | ||
|
||
if len(kinds) == 1: | ||
return lcd | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rs[mask] = np.nan
is enough hereThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, indeed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted back to
rs.mask(mask, np.nan, inplace=True)
asrs[mask] = np.nan
fails the test test_pct_changeThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again you are not fixing the underlying issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you confirm the documentation on mask is correct ? if so, I do not see the issue with fixing the previously buggy code (you confirm this
np.putmask(rs.values, mask, np.nan)
is buggy?) as otherwisetest_pct_change
fails.Can you also confirm that your proposal
rs[mask] = np.nan
is not the equivalent tors.mask(mask, np.nan,inplace!True)
which is equivalent to the originalnp.putmask(rs.values, mask, np.nan)
but without assuming rs.values is a view ?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.putmask
assumes this is a view. We don't useinplace
anywhere internally in the code, it is completely non-idiomatic.rs[mask] = np.nan
looks fine to me, what is the problem?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't work ;-) I have tried.
Isn't boolean indexation only working for vectors ? Or did you mean rs.loc[mask] (I haven't tried)?
If rs.loc[mask] works, I can replace it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rs.mask replaced by rs.iloc[mask]