Skip to content

Value returns ndarray for dataframes with a single column with datetime64 tz-aware #15736

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

sdementen
Copy link
Contributor

@sdementen sdementen changed the title Value returns ndarray for series with datetime64 tz Value returns ndarray for dataframes with a single column with datetime64 tz-aware Mar 19, 2017
@@ -4850,7 +4850,7 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None,

msg = "Boolean array expected for the condition, not {dtype}"

if not isinstance(cond, pd.DataFrame):
if not isinstance(cond, (pd.DataFrame, pd.Panel)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't change unrelated things

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fix the test test pct_change in pandas/pandas/tests/test_panel.py:1901 that failed after making the other changes (the rabbit hole...).
Reading the comment "This is a single-dimensional object." and observing the code was passing through these lines for Panel, I guess it was the change to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take this out it is wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this code is relatively new (not in 0.19.2)
Can you explain why my adaptation is wrong ? the cond argument can be a Panel, for instance when calling Panel.mask(cond=some other panel), and if you do not adapt the test as I propose, it will run the two following lines that uses cond.dtype that does not exist for Panel (test fails)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are not supporting Panels much anymore (its going to be deprecated soon anyhow). I don't want to have to deal with this here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then we need to remove the test with Panel that fails, correct ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just xfail it for now and can look later

@@ -5783,7 +5784,7 @@ def pct_change(self, periods=1, fill_method='pad', limit=None, freq=None,
**kwargs)) - 1)
if freq is None:
mask = isnull(_values_from_object(self))
np.putmask(rs.values, mask, np.nan)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rs[mask] = np.nan is enough here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, indeed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted back to rs.mask(mask, np.nan, inplace=True) as rs[mask] = np.nan fails the test test_pct_change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again you are not fixing the underlying issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you confirm the documentation on mask is correct ? if so, I do not see the issue with fixing the previously buggy code (you confirm this np.putmask(rs.values, mask, np.nan) is buggy?) as otherwise test_pct_change fails.
Can you also confirm that your proposal rs[mask] = np.nan is not the equivalent to rs.mask(mask, np.nan,inplace!True) which is equivalent to the original np.putmask(rs.values, mask, np.nan) but without assuming rs.values is a view ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.putmask assumes this is a view. We don't use inplace anywhere internally in the code, it is completely non-idiomatic.

rs[mask] = np.nan looks fine to me, what is the problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't work ;-) I have tried.

Isn't boolean indexation only working for vectors ? Or did you mean rs.loc[mask] (I haven't tried)?

If rs.loc[mask] works, I can replace it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rs.mask replaced by rs.iloc[mask]

self.values.ravel(), f).reshape(self.values.shape)
return self.values

if values.ndim == self.ndim - 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what the heck is all this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without this, df[["B"]].values returns an ndarray with 1 dim, e.g. (3,) while we want (3,1).
I saw the logic in NonConsolidatableMixIn.get_values and as DatetimeTZBlock inherits from NonConsolidatableMixIn, I suspected the logic should also apply to DatetimeTZBlock. And it indeed fixes the issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change however breaks another test, the test_apply_non_numpy_dtype, that enters in a stack overflow due to the fact that https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L4196 now returns an ndarray with (3,1) shape instead of a (3,) shape. On this https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L4196, what is self.values expected to return ? Seeing the creation of a Series that follows, I would guess it is a ndim=1 ndarray but then why is self.values called, with isinstance(self, DataFrame), as this is expected to return a ndim=2 ndarray ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are doing reshaping of any kind then this is wrong

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain shortly why reshaping is not wrong in NonConsolidatableMixIn (as it is today) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can reshape, but that doesn't belong here. we are trying to make the code simpler, not more complex.

The issue is that you are dealing with a categorical (and for that matter a DatetimeIndex), which by-definition are always 1-d objects.

The blocks need to emulate a 2-d structure for compat.

If you find that you need to reshape, need to find another way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so reshaping in NonConsolidatableMixIn is also wrong ?

have_numeric = have_float or have_complex or have_int
has_non_numeric = have_dt64 or have_dt64_tz or have_td64 or have_cat
have_numeric = have_float + have_complex + have_int
have_dt = have_dt64 + have_dt64_tz
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is all of this changing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworked the logic to be able to take care the case where there was a single column (that was bypassed before as there was a shortcut in get_matrix)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conceptually this is fine here. Ideally have simpler logic though. It maybe that we shouldn'tchange _interleave_dtype at all; I thought it might be simpler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should _interleave_dtype be called when there is a single Block ? because, it is essentially what I did (make it work for the single Block case)

return np.dtype(bool)
return np.dtype("bool")
elif have_cat:
# return blocks[0].get_values().dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what the heck is this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as now in we always go through the method mgt._interleave (even for the case "_is_single_block"), we must ensure that the function _interleaved_dtype takes care of the case where there is only 1 dtype which is why I added the "else have_cat:" block (which is very identical to the block following it except for CategoricalBlock <> IntBlock

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above comment

arr = df[col].values
fst_elem = arr[0, 0]

self.assertEqual(type(arr), np.ndarray)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test an exactly output, IOW, use assert_numpy_array_equal with an appropriate expected array.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done

else:
self.assertEqual(type(fst_elem), np.datetime64)

def test_values_dtypes_with_datetime64tz(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much better to write this as a pytest.mark.parametrized version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seeing all the self.assertEqual, I though we were using "nosetests" style and avoided to use any pytest features. BTW, in http://pandas.pydata.org/pandas-docs/stable/contributing.html#running-the-test-suite, it is mentionned to run the tests suite with nosetests and not pytest. I could never run the suite with nosetests but well with pytest

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stable docs still mention nose, as we only switched recently. In the dev docs it is already updated: http://pandas-docs.github.io/pandas-docs-travis/contributing.html#running-the-test-suite

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche Tx Joris, good to know ! specially the test_fast ;-)

)
for col in cols:

df_sub = df[list(col)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what exactly are you trying to test here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testing the logic of _interleaved_dtype by asking to test all cases of df.values on all combination of dtypes and verifying the expected dtype. I have adapted the test comments to make it more explicit

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype labels Mar 20, 2017
@codecov
Copy link

codecov bot commented Mar 20, 2017

Codecov Report

Merging #15736 into master will decrease coverage by 0.02%.
The diff coverage is 86.48%.

@@            Coverage Diff             @@
##           master   #15736      +/-   ##
==========================================
- Coverage   91.01%   90.99%   -0.03%     
==========================================
  Files         143      143              
  Lines       49380    49402      +22     
==========================================
+ Hits        44944    44953       +9     
- Misses       4436     4449      +13
Impacted Files Coverage Δ
pandas/core/generic.py 96.24% <100%> (ø) ⬆️
pandas/core/internals.py 93.49% <85.71%> (-0.1%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.86% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9ab57dc...538af78. Read the comment docs.

@@ -792,6 +792,7 @@ Performance Improvements
Bug Fixes
~~~~~~~~~

- Bug in ``DataFrame.values`` now returns an ``numpy.ndarray`` of ``pandas.Timestamp`` for tz-aware columns; previously this returned ``pandas.DateTimeIndex`` (:issue:`14052`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say object dtyped numpy array of Timestamp you don't need to say pandas.* that's implied.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

will result in a flot64 dtype.
will result in a float64 dtype.

Unlike ``Series.values``, Timezone aware datetime data are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just say that for tz-aware dtypes you will be upcast to an object array.

@@ -4850,7 +4850,7 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None,

msg = "Boolean array expected for the condition, not {dtype}"

if not isinstance(cond, pd.DataFrame):
if not isinstance(cond, (pd.DataFrame, pd.Panel)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take this out it is wrong.

@@ -5783,7 +5784,7 @@ def pct_change(self, periods=1, fill_method='pad', limit=None, freq=None,
**kwargs)) - 1)
if freq is None:
mask = isnull(_values_from_object(self))
np.putmask(rs.values, mask, np.nan)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again you are not fixing the underlying issue

self.values.ravel(), f).reshape(self.values.shape)
return self.values

if values.ndim == self.ndim - 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are doing reshaping of any kind then this is wrong


# return 1 bigger on the itemsize if unsinged
if lcd.kind == 'u':
return np.dtype('int%s' % (lcd.itemsize * 8 * 2))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a Categorical is a 1-d object. So what you are doing is fundamentally wrong. You don't reshape things like. You must convert categoricals to object arrays. That logic was done before, not sure why you trying to jump thru hoops.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what code was doing that conversion to array previously ? this line https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals.py#L3443 ? or somewhere else ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.array(cat) does this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my general point is to try follow the existing code (yes there is a lot, so I'll give you tips along the way).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overal, it would be nice for internal functions/methods to have even a minimal docstring or comments to know the input/output expected and the objective ... not easy to understand otherwise the "contract" of the function.
I have the feeling the _interleaved_dtype function is similar in purpose to the _find_common_type function but limited to numpy dtypes (vs pandas dtypes), is that correct ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love to do it, but I currently fail to understand the real purpose of each of these functions. I propose that I first add the doc (and you both chceck/confirm) before I continue trying to fix the issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdementen certainly welcome more doc-strings

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdementen yes _interleave_dtypeis similar to _find_common_type; it might work / can-re replaced by it. (something like this I would try on an independent PR; if it works, submit separately, and then you can rebase on top of it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback thank you for this ! I will work again on the PR based on this but not immediately as I have personal constraints this week. Should the fix be simple now that you (or someone else) want to move forward with another PR than the mine, I will not be offended (I learned already a lot!)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdementen nope, open for you.

jreback added a commit that referenced this pull request Mar 21, 2017
xref #15736   xref #12780

Author: Jeff Reback <[email protected]>

Closes #15765 from jreback/common_types and squashes the following commits:

d472646 [Jeff Reback] try removing restriction on windows
8d07cae [Jeff Reback] CLN: replace _interleave_dtype with _find_common_type
@jreback
Copy link
Contributor

jreback commented Mar 21, 2017

@sdementen I merged #15765 so I think this will be quite a lot simpler now.

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
xref pandas-dev#15736   xref pandas-dev#12780

Author: Jeff Reback <[email protected]>

Closes pandas-dev#15765 from jreback/common_types and squashes the following commits:

d472646 [Jeff Reback] try removing restriction on windows
8d07cae [Jeff Reback] CLN: replace _interleave_dtype with _find_common_type
mattip pushed a commit to mattip/pandas that referenced this pull request Apr 3, 2017
xref pandas-dev#15736   xref pandas-dev#12780

Author: Jeff Reback <[email protected]>

Closes pandas-dev#15765 from jreback/common_types and squashes the following commits:

d472646 [Jeff Reback] try removing restriction on windows
8d07cae [Jeff Reback] CLN: replace _interleave_dtype with _find_common_type
@jreback
Copy link
Contributor

jreback commented May 7, 2017

@sdementen rebase / update?

@jreback
Copy link
Contributor

jreback commented Jun 10, 2017

can you rebase and update?

@sdementen
Copy link
Contributor Author

@jreback I am currently overloaded, I'll have to wait for a better period with some free time !

@jreback
Copy link
Contributor

jreback commented Jul 26, 2017

good ideas, needs a rebase / update. pls comment if you'd like to continue.

@jreback jreback closed this Jul 26, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent .values for tz-aware columns
3 participants