-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
clean up PeriodIndex constructor #13277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/tseries/period.py
Outdated
@@ -41,10 +41,13 @@ def f(self): | |||
|
|||
def _get_ordinals(data, freq): | |||
f = lambda x: Period(x, freq=freq).ordinal | |||
if len(data) == 0: | |||
return np.array([], dtype='int64') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can do if/elsif here (really more a stylistic comment)
@jreback One of the reasons the Here's an example: https://github.com/SixtyCapital/pandas/blob/master/pandas/tseries/tests/test_period.py#L2216 What do you think about not allowing that? |
@MaximilianR I think that's convenient. The issue is you need a constructor that only deals with codes (otherwise its ambiguous), ok; let's rename |
OK, but then you have these being different results:
Agree on separation of concerns there - I have But we still need |
hmm, those should all do the same thing. |
Right. So we can either allow supplying years as ints OR those all the same thing: choose one! |
actually looking closer I don't think integers are EVER interpreted as ordinals directly, instead the integers are intrepreted according to the frequency. Can you give a counter-example? hmm, maybe that's not true. but in any event an example would help. |
From some test failures earlier, but without much exploration on its frequency): in some indexing operations,
You need the freq to convert an ordinal to a Period, for sure |
IOW separating out ordinal construction. |
OK, let me reflect a bit. My current thinking depends what you mean by 'internal access methods' - absolutely fine if that's within the PeriodIndex machinery. My fear is that there's a lot of pandas that relies on assertion above, and that going down this road would make One option would be to move towards the Index interface relying on |
You mean that people are actually constructing this like If you can fix up the internal consistency, does anything actually break? (e.g. create |
No! But there are places in pandas this happens outside of the PeriodIndex class. I need to turn this off and see what breaks - I suspect a reasonable amount but not a torrent |
b252774
to
274bf06
Compare
pandas/tseries/period.py
Outdated
raise ValueError('PeriodIndex() must be called with a ' | ||
'collection of some kind, %s was passed' | ||
% repr(data)) | ||
def _simple_new(cls, data, name=None, freq=None, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be called data
rather than values
in order for the current implementation of pickling to work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More generally: I haven't got my head around why we need all the custom pickling infrastructure around pickling with indexes - isn't reconstructing the object's __dict__
enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there isn't any current infrastructure, Index.__reduce__/__setstate__
does everything generally. Except for the back-compat stuff (which at some point should just take out). Everythings done as a dict now (was a tuple a few versions ago).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By 'infrastructure' I mean the Index.__reduce__/__setstate__
methods themselves.
Or you're saying they're only there to allow back compat? And that we could remove those, default to the normal python machinery, and it would work the same for everything from recent versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no I don't think it would work. You have to tell pickle what to do. E.g. we want to pickle only certain attributes and the data, nothing more. I don't think it works correctly otherwise (it tries to do too much).
You can try removing them but (aside from back-compat) I don't think it would work, but I could be wrong.
In any event I don't see it as a big deal; you want custom actions and that's what the special methods are for.
otherwise its too presumptous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a quick go at simplifying it - this takes out all of the custom stuff except for back-compat + a one line __getstate__
. Tests pass. But also agree, not a big deal
max-sixty/pandas@period-float...SixtyCapital:use-getstate
d44d1f0
to
e6695fa
Compare
pandas/core/algorithms.py
Outdated
index = PeriodIndex(values) | ||
freq = index.freq | ||
values = index.values | ||
|
||
from pandas.core.series import Series | ||
values = Series(values).values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe reorg this a bit, we should catch the datetime/timedelta/period ones before we do these coercion lines
@MaximilianR ok does look a lot cleaner |
I've been remiss about the final fixes here. Will need a couple of weeks on other commitments but I will get to this Unless @sinhrks is going to spoil all my hard work with moving PeriodIndex to be a proper type |
I'm using |
@MaximilianR do you have time to update this? (the period dtype for index PR has been merged) Then we could get it in 0.19 |
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -283,6 +283,7 @@ Bug Fixes | |||
- Bug in ``astype()`` where ``inf`` values were incorrectly converted to integers. Now raises error now with ``astype()`` for Series and DataFrames (:issue:`14265`) | |||
- Bug in ``DataFrame(..).apply(to_numeric)`` when values are of type decimal.Decimal. (:issue:`14827`) | |||
- Bug in ``describe()`` when passing a numpy array which does not contain the median to the ``percentiles`` keyword argument (:issue:`14908`) | |||
- Cleaned up ``PeriodIndex`` constructor, including raising on floats more consistently (:issue:`13277`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we put it in the api section, there should be a clear 'change' documented for users to be aware of. Is the raising on floats not more a bug fix?
setup.cfg
Outdated
@@ -13,6 +13,7 @@ parentdir_prefix = pandas- | |||
|
|||
[flake8] | |||
ignore = E731 | |||
max-line-length = 79 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this the default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some people will have local defaults that are different. This will override those
@MaximilianR Can you run some of the asv benchmarks to check there is no impact on performance? (also the plotting benchmarks, we have see regressions there in the past due to changes in period index) |
f9a05e3
to
d0e744d
Compare
@jorisvandenbossche Good call. There is one significant slow down, by about 30x: I've tried to debug this, but am struggling a bit and could do with some help (and general guidance on how to approach these problems). So far:
0.183 __new__ pandas/tseries/period.py:183
├─ 0.168 wrapper pandas/util/decorators.py:65
│ └─ 0.150 to_offset pandas/tseries/frequencies.py:453
│ ├─ 0.089 get_stride_from_decimal pandas/tseries/frequencies.py:186
│ │ └─ 0.086 isclose numpy/core/numeric.py:2375
│ │ ├─ 0.032 within_tol numpy/core/numeric.py:2434
│ │ └─ 0.021 all numpy/core/fromnumeric.py:1987
│ │ ├─ 0.013 _all numpy/core/_methods.py:40
│ │ └─ 0.003 asanyarray numpy/core/numeric.py:484
│ ├─ 0.019 get_offset pandas/tseries/frequencies.py:605
│ │ └─ 0.014 copy pandas/tseries/offsets.py:306
│ │ └─ 0.013 __init__ pandas/tseries/offsets.py:192
│ ├─ 0.014 __mul__ pandas/tseries/offsets.py:412
│ │ └─ 0.006 __init__ pandas/tseries/offsets.py:192
│ │ └─ 0.004 _determine_offset pandas/tseries/offsets.py:198
│ │ └─ 0.003 <genexpr> pandas/tseries/offsets.py:203
│ └─ 0.008 split re.py:195
└─ 0.014 get_freq_code pandas/tseries/frequencies.py:300
└─ 0.003 _period_str_to_code pandas/tseries/frequencies.py:732 Apart from exactly how the code traces through:
CC @sinhrks |
d0e744d
to
d04692f
Compare
I think I figured it out - |
Performance issues solved! I think all all asv changes are from external compute interference given none are period related. Here are all <50% or >200% for transparency:
|
@jreback green! |
pandas/tseries/period.py
Outdated
if not isinstance(data, (np.ndarray, PeriodIndex, | ||
DatetimeIndex, Int64Index)): | ||
if is_scalar(data) or isinstance(data, Period): | ||
raise ValueError('PeriodIndex() must be called with a ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is generally a TypeError, in fact changing this to cls._scalar_data_error(data)
is appropriate (I am changing on merge)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually why don't you make this change (as I have a couple of more comments).
diff --git a/pandas/tseries/period.py b/pandas/tseries/period.py
index 903b4e3..cefe1cf 100644
--- a/pandas/tseries/period.py
+++ b/pandas/tseries/period.py
@@ -239,9 +239,7 @@ class PeriodIndex(DatelikeOps, DatetimeIndexOpsMixin, Int64Index):
if not isinstance(data, (np.ndarray, PeriodIndex,
DatetimeIndex, Int64Index)):
if is_scalar(data) or isinstance(data, Period):
- raise ValueError('PeriodIndex() must be called with a '
- 'collection of some kind, %s was passed'
- % repr(data))
+ cls._scalar_data_error(data)
# other iterable of some kind
if not isinstance(data, (list, tuple)):
diff --git a/pandas/tseries/tests/test_period.py b/pandas/tseries/tests/test_period.py
index c7fd4ae..8ec2fae 100644
--- a/pandas/tseries/tests/test_period.py
+++ b/pandas/tseries/tests/test_period.py
@@ -1753,7 +1753,7 @@ class TestPeriodIndex(tm.TestCase):
self.assertRaises(ValueError, PeriodIndex, idx._values)
self.assertRaises(ValueError, PeriodIndex, list(idx._values))
- self.assertRaises(ValueError, PeriodIndex,
+ self.assertRaises(TypeError, PeriodIndex,
data=Period('2007', freq='A'))
result = PeriodIndex(iter(idx))
pandas/tseries/period.py
Outdated
data = np.asarray(data) | ||
|
||
# datetime other than period | ||
if np.issubdtype(data.dtype, np.datetime64): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be is_datetime64_dtype
. But not sure what should happend if you have tz-aware or datetime64 (but not ns) passed. We may not have tests for this. You could handle datetime64tz by pretty easily though (happy to have this done in another PR as well).
pandas/tseries/period.py
Outdated
if not is_integer_dtype(values): | ||
values = np.array(values, copy=False) | ||
if (len(values) > 0 and is_float_dtype(values)): | ||
if len(values) > 0 and is_float_dtype(values): | ||
raise TypeError("PeriodIndex can't take floats") | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can eliminate the else clause (and just do the return)
pandas/tseries/period.py
Outdated
@@ -325,7 +322,6 @@ def _shallow_copy_with_infer(self, values=None, **kwargs): | |||
|
|||
def _shallow_copy(self, values=None, **kwargs): | |||
if kwargs.get('freq') is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would explicitly list freq=None
I think in the signature (then pass it directly) as its so crucial to the PeriodIndex nice to be noted
couple of comments. lgtm otherwise. if you'd fix / create a new issue (for the handling of datetime-aware tz, other datetimes), then ok with merging. |
can you rebase / update |
@MaximilianR You have a bunch of errors, all seems related to the use of |
this lgtm. any perf impacts? |
thanks @MaximilianR figured out to push this in so no rebases are needed! |
closes pandas-dev#13232 Material clean up of PeriodIndex constructor, which was doing a few weird things (pandas-dev#13232 (comment) nt-220788816), and generally getting messy. Author: Maximilian Roos <[email protected]> Closes pandas-dev#13277 from MaximilianR/period-float and squashes the following commits: 5cae7aa [Maximilian Roos] @jreback changes 75ff54d [Maximilian Roos] _new_PeriodIndex for unpickling 240172f [Maximilian Roos] coerce freq object earlier for perf ba5133b [Maximilian Roos] documentation b0fc0a7 [Maximilian Roos] final changes fa0fa9d [Maximilian Roos] clean up PeriodIndex constructor
git diff upstream/master | flake8 --diff
Material clean up of PeriodIndex constructor, which was doing a few weird things (#13232 (comment)), and generally getting messy.