Skip to content

WIP: make _holder changeover #24540

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 12 commits into from
Closed

Conversation

jbrockmendel
Copy link
Member

cc @jreback @TomAugspurger

changes _holder from DTI/TDI to DTA/TDA, pretty much all the diff from blocks.internals and core.dtypes

Not yet green locally, but I'm presenting this as a viable alternative to the all-at-once still in #24024

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 1, 2019 via email

@jbrockmendel
Copy link
Member Author

I hear where you're coming from (and am definitely ready to spend some time away from DTA/TDA), but I don't see a resolution to the constructor discussion, and that's a blocker for me on #24024.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 1, 2019 via email

@jbrockmendel
Copy link
Member Author

An actual blocker, as in you'll -1 the merge until it's addressed?

I know I'm not the one who needs to be convinced, but yah.

FYI, I think the handling of i8values has been addressed. Working on freq
validation now.

That would totally get me to stop complaining!

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 1, 2019 via email

@jbrockmendel
Copy link
Member Author

well, mostly. There should still be a private, inference-free, validation-free constructor. We could call it "easy_init"

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 1, 2019 via email

@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Jan 2, 2019
@jbrockmendel
Copy link
Member Author

I've gotten the test failures down to almost entirely pickle-based. @TomAugspurger IIRC you had to solve that in the other branch. Was there a trick there?

@TomAugspurger
Copy link
Contributor

Define DatetimeArray.__init__ ;)

@jbrockmendel
Copy link
Member Author

Define DatetimeArray.init ;)

Yep, that did it. And plays nicely with _simple_new to boot.

@jreback
Copy link
Contributor

jreback commented Jan 2, 2019

well, mostly. There should still be a private, inference-free, validation-free constructor. We could call it "easy_init"

Why can't we simply make the DTA (et all) __init__ just be unfriendly if you don't pass accordning the guarantees; yes they will be public but will have very constrained inputs. ?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 2, 2019 via email

@jbrockmendel
Copy link
Member Author

Why can't we simply make the DTA (et all) init just be unfriendly if you don't pass accordning the guarantees; yes they will be public but will have very constrained inputs. ?

My immediate claim is that this discussion is large enough that it should be separated out from #24024. That is the purpose of this PR.

The answer to the question: because it's public! because people will use it! because having it behave differently from DTI.__new__ is confusing! becuase having it fail to do freq validation* means users are going to shift themselves to absurd outputs!

*possibly out of date, but even if so, there should be non-validating (performant) private constructor!

@jreback
Copy link
Contributor

jreback commented Jan 2, 2019

ok, I am fine with exploring alternative ways of doing this. But at this point, would be more productive to merge #24024 yes?

@jbrockmendel
Copy link
Member Author

But at this point, would be more productive to merge #24024 yes?

If the concensus is that smaller bits-for-better-review is no longer worthwhile, then we can close this and I'll do one more mondo pass on 24024.

That said, the only tests that are still failing here are in test_packers, so if we can fix those and do this indepndently, we cut 40% off the remaining diff in 24024

@jreback
Copy link
Contributor

jreback commented Jan 2, 2019

@jbrockmendel #24540 (comment)

ok let's do that (merge this), then rebase and merge #24024

@jreback jreback added this to the 0.24.0 milestone Jan 2, 2019
@jbrockmendel
Copy link
Member Author

ok let's do that (merge this)

I like the attitude, but a lot of the point of smaller pieces is making it feasible to fine-tooth-comb it.

@@ -228,6 +228,11 @@ static PyObject *get_values(PyObject *obj) {
PRINTMARK();

if (values && !PyArray_CheckExact(values)) {

if (PyObject_HasAttrString(values, "to_numpy")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here about what this is doing, and . TODO to see if we can remove the .values below

class DatetimeTZDtype(PandasExtensionDtype):

@register_extension_dtype
class DatetimeTZDtype(PandasExtensionDtype, ExtensionDtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can now move PandasExtensionDtype to subclass ExtensionDtype, see the note in pandas.core.dtypes.base for _DtypeMixin

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good as a followup item I think. IIRC I created an issue when I had to create PandasExtensionDtype in the first place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, i agree, do we have an issue about fixing this?

def get_values(self, dtype=None):
"""
return object dtype as boxed values, such as Timestamps/Timedelta
"""
if is_object_dtype(dtype):
return lib.map_infer(self.values.ravel(),
self._box_func).reshape(self.values.shape)
values = self.values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't you always call .ravel()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For starters because DTA/TDA doesn't have ravel() (though I think it should (ditto transpose) that just returns self)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I agree, can you make an issue about this


values = lib.map_infer(values, self._box_func)

if self.ndim > 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this reshape always true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. The gymnastics in internals to get around the fact that we can't reshape to (nrows, 1) is over my head.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, it was ok before, puzzled why it needs changing now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Series ndim will be 1

In [2]: ser = pd.Series(pd.date_range('2000', periods=4, tz="UTC"))

In [3]: ser._data.blocks[0].ndim
Out[3]: 1

Or perhaps I misunderstood the question.

@@ -2198,13 +2216,15 @@ class TimeDeltaBlock(DatetimeLikeBlockMixin, IntBlock):
def __init__(self, values, placement, ndim=None):
if values.dtype != _TD_DTYPE:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in theory this is not needed anymore

@@ -2771,6 +2794,10 @@ def _maybe_coerce_values(self, values):
"""
if values.dtype != _NS_DTYPE:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comment above

@@ -2885,12 +2914,16 @@ def set(self, locs, values, check=False):

self.values[locs] = values

def external_values(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't this be on EA base block? (parmetrized on the dtype)

value, limit=limit, inplace=inplace, downcast=downcast
)

def setitem(self, indexer, value):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be in the EA block class?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, but I don't think we have a good story for casting extension arrays yet. I'd prefer to keep that behavior undefined for now, until we can do better than "fall back to object".

@@ -414,6 +414,13 @@ def _formatter(self, boxed=False):
# ----------------------------------------------------------------
# Array-Like / EA-Interface Methods

@property
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might as well add ravel for now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to avoid adding these if possible.

What requires this? Was it for pytables IO?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

transpose was for a pytables test, yah. we discussed it briefly over in 24024. I prefer adding it, but not gonna fuss over it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also until i get the packers tests and one last feather test fixed I'm completely stuck)

@@ -240,11 +240,14 @@ def _simple_new(cls, values, freq=None, tz=None):
result._dtype = dtype
return result

def __new__(cls, values, freq=None, tz=None, dtype=None, copy=False,
def __init__(self, values, freq=None, tz=None, dtype=None, copy=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The init should be values, freq, dtype, copy to match PeriodArray.

@@ -1568,6 +1573,8 @@ def sequence_to_dt64ns(data, dtype=None, copy=False,
copy = False
elif isinstance(data, ABCSeries):
data = data._values
elif isinstance(data, ABCPandasArray):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if this is necessary? Why not fall through to the usual sequence handling?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there was a test in tests.series.test_missing that ended up calling lib.infer_dtype and raising an AttributeError because PandasDType doesn't have a base attribute

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That probably indicates a bug that this is hiding. Shouldn't all non ndarray sequences be converted to ndarrays before calling infer_dtype?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i seem to recall something about extension dtypes being registered in such a way that infer_dtype could pick up on them. will take a look

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel
infer_dtype does not know about extension dtypes per se (and cannot because the point is to figure out whats hiding behind a dtype (mostly object), so dispatching based on dtype is not possible).

The only ones that are directly implemented are Period and Interval, but no other EA type, including Datetime64TZ.

class DatetimeTZDtype(PandasExtensionDtype):

@register_extension_dtype
class DatetimeTZDtype(PandasExtensionDtype, ExtensionDtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good as a followup item I think. IIRC I created an issue when I had to create PandasExtensionDtype in the first place.


values = lib.map_infer(values, self._box_func)

if self.ndim > 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Series ndim will be 1

In [2]: ser = pd.Series(pd.date_range('2000', periods=4, tz="UTC"))

In [3]: ser._data.blocks[0].ndim
Out[3]: 1

Or perhaps I misunderstood the question.

value, limit=limit, inplace=inplace, downcast=downcast
)

def setitem(self, indexer, value):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, but I don't think we have a good story for casting extension arrays yet. I'd prefer to keep that behavior undefined for now, until we can do better than "fall back to object".


class TestGroupby(BaseDatetimeTests, base.BaseGroupbyTests):
pass

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just merged #23255, so add in

class TestParsing(base.BaseParsingTests):
    pass

here

@jbrockmendel
Copy link
Member Author

So it looks like the packers tests are resolved if I change DatetimeArray.__init__ to treat M8[ns] values like int64 values, which I'd really rather not do. Any thoughts on another way to resolve this?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 2, 2019 via email

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 2, 2019 via email

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 2, 2019 via email

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 2, 2019 via email

@jorisvandenbossche
Copy link
Member

What is currently the status of this PR in relation to #24024 ?

@jbrockmendel
Copy link
Member Author

What is currently the status of this PR in relation to #24024 ?

Many of the comments here need to be addressed in 24024 as well, but on its own this PR is dead in the water since I can't get the packers tests passing.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 2, 2019 via email

@jbrockmendel jbrockmendel deleted the race branch January 2, 2019 21:23
@jorisvandenbossche
Copy link
Member

Many of the comments here need to be addressed in 24024 as well,

@jbrockmendel were they addressed in the end?

@jbrockmendel
Copy link
Member Author

were they addressed in the end?

I'll double-check. Still have a browser tab open trying to keep track of which issues from 24024 are still open. Will see if one of the follow-up Issues can serve as a tracker.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants