clean up PeriodIndex constructor #13277

max-sixty · 2016-05-25T04:08:22Z

closes PeriodIndex with float input inconsistency #13232
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

Material clean up of PeriodIndex constructor, which was doing a few weird things (#13232 (comment)), and generally getting messy.

jreback · 2016-05-25T12:16:05Z

pandas/tseries/period.py

@@ -41,10 +41,13 @@ def f(self):

 def _get_ordinals(data, freq):
    f = lambda x: Period(x, freq=freq).ordinal
+    if len(data) == 0:
+        return np.array([], dtype='int64')


you can do if/elsif here (really more a stylistic comment)

max-sixty · 2016-05-25T17:19:41Z

@jreback One of the reasons the PeriodIndex constructer is so convoluted is there are some odd cases supported (I imagine through many bugfixes & iterations!). One of them is a list of ints representing years. An array of ints are interpreted as ordinals, but a list is coerced to years within the Period constructor.

Here's an example: https://github.com/SixtyCapital/pandas/blob/master/pandas/tseries/tests/test_period.py#L2216

What do you think about not allowing that?

jreback · 2016-05-25T17:41:45Z

@MaximilianR I think that's convenient. The issue is you need a constructor that only deals with codes (otherwise its ambiguous), ok; let's rename _from_arraylike to from_codes and make it public; that will solve the ambiguity problem then (and also provide the internal mechanism for creation which is THE way to create from codes).

max-sixty · 2016-05-25T17:56:44Z

OK, but then you have these being different results:

pd.PeriodIndex([2007, 2008], freq='A')  # years
pd.PeriodIndex(np.array([2007, 2008]), freq='A')  # ordinals 
pd.PeriodIndex(np.array([2007, 2008], dtype='object'), freq='A')  # ?

Agree on separation of concerns there - I have ._simple_new handling ordinals, and __new__ handling the other cases now.

But we still need pd.PeriodIndex(idx.values, idx.freq) to treat the values as ordinals rather than years (otherwise would break a fundamental tenet of indexes). So we're left with the above.
I can add back the branching logic for lists vs. arrays if you really think it's a helpful use case?

jreback · 2016-05-25T18:10:36Z

hmm, those should all do the same thing.

max-sixty · 2016-05-25T18:27:43Z

hmm, those should all do the same thing.

Right. So we can either allow supplying years as ints OR those all the same thing: choose one!
Or a creative solution?

jreback · 2016-05-25T18:30:02Z

actually looking closer I don't think integers are EVER interpreted as ordinals directly, instead the integers are intrepreted according to the frequency. Can you give a counter-example?

hmm, maybe that's not true. but in any event an example would help.

max-sixty · 2016-05-25T18:59:29Z

actually looking closer I don't think integers are EVER interpreted as ordinals directly,

From some test failures earlier, but without much exploration on its frequency): in some indexing operations, PeriodIndex(some_values_from_a_periodindex) is called.
There may be ways around that but are you sure we want to violate the idx._constructor(idx.values) == idx consistency?

instead the integers are intrepreted according to the frequency.

You need the freq to convert an ordinal to a Period, for sure

jreback · 2016-05-25T19:26:30Z

idx._constructor(idx.values) == idx

idx.values is an implementation detail. I have no problem with forcing US (meaning you!) to change internal access methods to PeriodIndex.from_ordinals(idx.values) if you want to construct from ordinals and leave PeriodIndex([2000,2001],freq='A-JUNE') to work as the public contstructor.

IOW separating out ordinal construction.

max-sixty · 2016-05-25T22:41:46Z

OK, let me reflect a bit.

My current thinking depends what you mean by 'internal access methods' - absolutely fine if that's within the PeriodIndex machinery. My fear is that there's a lot of pandas that relies on assertion above, and that going down this road would make PeriodIndex need more special casing throughout the library. And I would love to see less, both for PeriodIndex and everything else!

One option would be to move towards the Index interface relying on ._simple_new to interpret any output of the values of an Index. I think that probably isn't a big change.

jreback · 2016-05-26T16:18:49Z

My current thinking depends what you mean by 'internal access methods- absolutely fine if that's within the PeriodIndex machinery. My fear is that there's a lot of pandas that relies on assertion above, and that going down this road would makePeriodIndexneed more special casing throughout the library. And I would love to see less, both forPeriodIndex` and everything else!

You mean that people are actually constructing this like PeriodIndex(idx.values)? and expecting it to work? in external code, really?

If you can fix up the internal consistency, does anything actually break? (e.g. create PeriodIndex.from_ordinals then move all internal code to use that).

max-sixty · 2016-05-26T17:00:51Z

You mean that people are actually constructing this like PeriodIndex(idx.values)? and expecting it to work? in external code, really?

No! But there are places in pandas this happens outside of the PeriodIndex class. I need to turn this off and see what breaks - I suspect a reasonable amount but not a torrent

max-sixty · 2016-05-27T20:15:29Z

pandas/tseries/period.py

-                raise ValueError('PeriodIndex() must be called with a '
-                                 'collection of some kind, %s was passed'
-                                 % repr(data))
+    def _simple_new(cls, data, name=None, freq=None, **kwargs):


This needs to be called data rather than values in order for the current implementation of pickling to work

More generally: I haven't got my head around why we need all the custom pickling infrastructure around pickling with indexes - isn't reconstructing the object's __dict__ enough?

there isn't any current infrastructure, Index.__reduce__/__setstate__ does everything generally. Except for the back-compat stuff (which at some point should just take out). Everythings done as a dict now (was a tuple a few versions ago).

By 'infrastructure' I mean the Index.__reduce__/__setstate__ methods themselves.
Or you're saying they're only there to allow back compat? And that we could remove those, default to the normal python machinery, and it would work the same for everything from recent versions?

no I don't think it would work. You have to tell pickle what to do. E.g. we want to pickle only certain attributes and the data, nothing more. I don't think it works correctly otherwise (it tries to do too much).

You can try removing them but (aside from back-compat) I don't think it would work, but I could be wrong.

In any event I don't see it as a big deal; you want custom actions and that's what the special methods are for.

otherwise its too presumptous.

Had a quick go at simplifying it - this takes out all of the custom stuff except for back-compat + a one line __getstate__. Tests pass. But also agree, not a big deal
max-sixty/pandas@period-float...SixtyCapital:use-getstate

jreback · 2016-05-27T23:08:03Z

pandas/core/algorithms.py

+        index = PeriodIndex(values)
+        freq = index.freq
+        values = index.values
+
    from pandas.core.series import Series
    values = Series(values).values


maybe reorg this a bit, we should catch the datetime/timedelta/period ones before we do these coercion lines

jreback · 2016-05-27T23:11:19Z

@MaximilianR ok does look a lot cleaner

max-sixty · 2016-07-14T18:24:29Z

I've been remiss about the final fixes here. Will need a couple of weeks on other commitments but I will get to this

Unless @sinhrks is going to spoil all my hard work with moving PeriodIndex to be a proper type

sinhrks · 2016-07-14T21:02:04Z

I'm using PeriodIndex as internal of PeriodDtype (like DatetimeTZ), the improvement of PeriodIndex should not be spoiled:)

jorisvandenbossche · 2016-08-21T18:26:15Z

@MaximilianR do you have time to update this? (the period dtype for index PR has been merged) Then we could get it in 0.19

jorisvandenbossche · 2016-12-23T23:25:02Z

doc/source/whatsnew/v0.20.0.txt

@@ -283,6 +283,7 @@ Bug Fixes
 - Bug in ``astype()`` where ``inf`` values were incorrectly converted to integers. Now raises error now with ``astype()`` for Series and DataFrames (:issue:`14265`)
 - Bug in ``DataFrame(..).apply(to_numeric)`` when values are of type decimal.Decimal. (:issue:`14827`)
 - Bug in ``describe()`` when passing a numpy array which does not contain the median to the ``percentiles`` keyword argument (:issue:`14908`)
+- Cleaned up ``PeriodIndex`` constructor, including raising on floats more consistently (:issue:`13277`)


if we put it in the api section, there should be a clear 'change' documented for users to be aware of. Is the raising on floats not more a bug fix?

jorisvandenbossche · 2016-12-23T23:29:32Z

setup.cfg

@@ -13,6 +13,7 @@ parentdir_prefix = pandas-

 [flake8]
 ignore = E731
+max-line-length = 79


Isn't this the default?

Some people will have local defaults that are different. This will override those

jorisvandenbossche · 2016-12-28T15:36:21Z

@MaximilianR Can you run some of the asv benchmarks to check there is no impact on performance? (also the plotting benchmarks, we have see regressions there in the past due to changes in period index)

max-sixty · 2017-01-02T10:55:28Z

@jorisvandenbossche Good call. There is one significant slow down, by about 30x: time_from_pydatetime, which cooerces python datetimes to Periods.

I've tried to debug this, but am struggling a bit and could do with some help (and general guidance on how to approach these problems). So far:

The Period constructor is being called by each object within period.extract_ordinals, and that's what's taking all the time
The Period constructor then calls get_period_ordinal, which somewhere calls into to_offset:

 0.183 __new__  pandas/tseries/period.py:183
├─ 0.168 wrapper  pandas/util/decorators.py:65
│  └─ 0.150 to_offset  pandas/tseries/frequencies.py:453
│     ├─ 0.089 get_stride_from_decimal  pandas/tseries/frequencies.py:186
│     │  └─ 0.086 isclose  numpy/core/numeric.py:2375
│     │     ├─ 0.032 within_tol  numpy/core/numeric.py:2434
│     │     └─ 0.021 all  numpy/core/fromnumeric.py:1987
│     │        ├─ 0.013 _all  numpy/core/_methods.py:40
│     │        └─ 0.003 asanyarray  numpy/core/numeric.py:484
│     ├─ 0.019 get_offset  pandas/tseries/frequencies.py:605
│     │  └─ 0.014 copy  pandas/tseries/offsets.py:306
│     │     └─ 0.013 __init__  pandas/tseries/offsets.py:192
│     ├─ 0.014 __mul__  pandas/tseries/offsets.py:412
│     │  └─ 0.006 __init__  pandas/tseries/offsets.py:192
│     │     └─ 0.004 _determine_offset  pandas/tseries/offsets.py:198
│     │        └─ 0.003 <genexpr>  pandas/tseries/offsets.py:203
│     └─ 0.008 split  re.py:195
└─ 0.014 get_freq_code  pandas/tseries/frequencies.py:300
   └─ 0.003 _period_str_to_code  pandas/tseries/frequencies.py:732

Apart from exactly how the code traces through:

I'm not sure how these changes should make an impact on this code path
Does this matter? Coercing from python datetime isn't a common use case

CC @sinhrks

max-sixty · 2017-01-02T11:18:37Z

I think I figured it out - Period._maybe_convert_freq needs to be called early, otherwise it the pyx functions will take a slow route to get the appropriate freq. Running ASV to confirm now

max-sixty · 2017-01-02T13:05:56Z

Performance issues solved! I think all all asv changes are from external compute interference given none are period related. Here are all <50% or >200% for transparency:

-  112.12ms    54.32ms      0.48  sparse.sparse_frame_constructor.time_sparse_frame_constructor
-    7.34ms     3.48ms      0.47  reshape.reshape_stack_simple.time_reshape_stack_simple
-    5.97ms     2.70ms      0.45  reshape.melt_dataframe.time_melt_dataframe
-    1.00ms   448.65μs      0.45  reindex.FillMethod.time_pad
-    1.87ms   825.54μs      0.44  period.Algorithms.time_value_counts_pindex
-    1.15ms   485.69μs      0.42  reindex.Reindexing.time_reindex_dates
-  429.95μs   173.53μs      0.40  series_methods.series_constructor_no_data_datetime_index.time_series_constructor_no_data_datetime_index
-   24.72ms     4.49ms      0.18  reindex.Duplicates.time_frame_drop_dups_na_inplace
-   18.22ms     2.74ms      0.15  reindex.Reindexing.time_reindex_multiindex

---

+    6.13ms    14.80ms      2.41  hdfstore_bench.HDF5.time_read_store_mixed
+   26.90ms    60.93ms      2.27  groupby.GroupBySuite.time_rank('int', 100)
+    1.48ms     3.23ms      2.19  frame_methods.Interpolate.time_interpolate_some_good
+  537.00μs     1.16ms      2.16  strings.StringEncode.time_encode_decode
+    2.62ms     5.43ms      2.07  strings.StringMethods.time_contains_few_noregex

max-sixty · 2017-01-02T14:54:11Z

@jreback green!

jreback · 2017-01-02T15:51:31Z

pandas/tseries/period.py

+        if not isinstance(data, (np.ndarray, PeriodIndex,
+                                 DatetimeIndex, Int64Index)):
+            if is_scalar(data) or isinstance(data, Period):
+                raise ValueError('PeriodIndex() must be called with a '


this is generally a TypeError, in fact changing this to cls._scalar_data_error(data) is appropriate (I am changing on merge)

actually why don't you make this change (as I have a couple of more comments).

diff --git a/pandas/tseries/period.py b/pandas/tseries/period.py index 903b4e3..cefe1cf 100644 --- a/pandas/tseries/period.py +++ b/pandas/tseries/period.py @@ -239,9 +239,7 @@ class PeriodIndex(DatelikeOps, DatetimeIndexOpsMixin, Int64Index): if not isinstance(data, (np.ndarray, PeriodIndex, DatetimeIndex, Int64Index)): if is_scalar(data) or isinstance(data, Period): - raise ValueError('PeriodIndex() must be called with a ' - 'collection of some kind, %s was passed' - % repr(data)) + cls._scalar_data_error(data) # other iterable of some kind if not isinstance(data, (list, tuple)): diff --git a/pandas/tseries/tests/test_period.py b/pandas/tseries/tests/test_period.py index c7fd4ae..8ec2fae 100644 --- a/pandas/tseries/tests/test_period.py +++ b/pandas/tseries/tests/test_period.py @@ -1753,7 +1753,7 @@ class TestPeriodIndex(tm.TestCase): self.assertRaises(ValueError, PeriodIndex, idx._values) self.assertRaises(ValueError, PeriodIndex, list(idx._values)) - self.assertRaises(ValueError, PeriodIndex, + self.assertRaises(TypeError, PeriodIndex, data=Period('2007', freq='A')) result = PeriodIndex(iter(idx))

jreback · 2017-01-02T15:55:03Z

pandas/tseries/period.py

+            data = np.asarray(data)
+
+        # datetime other than period
+        if np.issubdtype(data.dtype, np.datetime64):


this should be is_datetime64_dtype. But not sure what should happend if you have tz-aware or datetime64 (but not ns) passed. We may not have tests for this. You could handle datetime64tz by pretty easily though (happy to have this done in another PR as well).

jreback · 2017-01-02T15:56:14Z

pandas/tseries/period.py

        if not is_integer_dtype(values):
            values = np.array(values, copy=False)
-            if (len(values) > 0 and is_float_dtype(values)):
+            if len(values) > 0 and is_float_dtype(values):
                raise TypeError("PeriodIndex can't take floats")
            else:


can eliminate the else clause (and just do the return)

jreback · 2017-01-02T15:57:47Z

pandas/tseries/period.py

@@ -325,7 +322,6 @@ def _shallow_copy_with_infer(self, values=None, **kwargs):

    def _shallow_copy(self, values=None, **kwargs):
        if kwargs.get('freq') is None:


I would explicitly list freq=None I think in the signature (then pass it directly) as its so crucial to the PeriodIndex nice to be noted

jreback · 2017-01-02T15:59:12Z

couple of comments. lgtm otherwise. if you'd fix / create a new issue (for the handling of datetime-aware tz, other datetimes), then ok with merging.

jreback · 2017-02-27T16:01:46Z

can you rebase / update

jorisvandenbossche · 2017-03-03T09:58:09Z

@MaximilianR You have a bunch of errors, all seems related to the use of np.issubdtype(data.dtype, is_datetime64_dtype) (not sure what changed here)

jreback · 2017-03-04T21:03:22Z

this lgtm. any perf impacts?

jreback · 2017-03-04T21:16:03Z

thanks @MaximilianR

figured out to push this in so no rebases are needed!

@jreback

closes pandas-dev#13232 Material clean up of PeriodIndex constructor, which was doing a few weird things (pandas-dev#13232 (comment) nt-220788816), and generally getting messy. Author: Maximilian Roos <[email protected]> Closes pandas-dev#13277 from MaximilianR/period-float and squashes the following commits: 5cae7aa [Maximilian Roos] @jreback changes 75ff54d [Maximilian Roos] _new_PeriodIndex for unpickling 240172f [Maximilian Roos] coerce freq object earlier for perf ba5133b [Maximilian Roos] documentation b0fc0a7 [Maximilian Roos] final changes fa0fa9d [Maximilian Roos] clean up PeriodIndex constructor

jreback added Period Period data type Compat pandas objects compatability with Numpy or Python functions labels May 25, 2016

jreback reviewed May 25, 2016
View reviewed changes

max-sixty force-pushed the period-float branch 3 times, most recently from b252774 to 274bf06 Compare May 27, 2016 20:14

max-sixty reviewed May 27, 2016
View reviewed changes

max-sixty force-pushed the period-float branch 2 times, most recently from d44d1f0 to e6695fa Compare May 27, 2016 22:42

jreback reviewed May 27, 2016
View reviewed changes

jorisvandenbossche reviewed Dec 23, 2016

View reviewed changes

max-sixty force-pushed the period-float branch 3 times, most recently from f9a05e3 to d0e744d Compare January 2, 2017 09:59

max-sixty force-pushed the period-float branch from d0e744d to d04692f Compare January 2, 2017 11:16

jreback reviewed Jan 2, 2017

View reviewed changes

jreback added this to the 0.20.0 milestone Jan 2, 2017

max-sixty added 5 commits March 2, 2017 21:18

clean up PeriodIndex constructor

fa0fa9d

final changes

b0fc0a7

documentation

ba5133b

coerce freq object earlier for perf

240172f

_new_PeriodIndex for unpickling

75ff54d

max-sixty force-pushed the period-float branch from 8f721e3 to 75ff54d Compare March 3, 2017 02:34

@jreback changes

5cae7aa

max-sixty force-pushed the period-float branch from 858b00b to 5cae7aa Compare March 3, 2017 23:25

jreback closed this in ca6d88b Mar 4, 2017

		@@ -325,7 +322,6 @@ def _shallow_copy_with_infer(self, values=None, **kwargs):

		def _shallow_copy(self, values=None, **kwargs):
		if kwargs.get('freq') is None:

clean up PeriodIndex constructor #13277

clean up PeriodIndex constructor #13277

Conversation

max-sixty commented May 25, 2016 • edited Loading

Choose a reason for hiding this comment

max-sixty commented May 25, 2016

jreback commented May 25, 2016

max-sixty commented May 25, 2016

jreback commented May 25, 2016

max-sixty commented May 25, 2016

jreback commented May 25, 2016 • edited Loading

max-sixty commented May 25, 2016

jreback commented May 25, 2016 • edited Loading

max-sixty commented May 25, 2016 • edited by jorisvandenbossche Loading

jreback commented May 26, 2016

max-sixty commented May 26, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

max-sixty May 27, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented May 27, 2016

max-sixty commented Jul 14, 2016

sinhrks commented Jul 14, 2016

jorisvandenbossche commented Aug 21, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Dec 28, 2016

max-sixty commented Jan 2, 2017 • edited Loading

max-sixty commented Jan 2, 2017

max-sixty commented Jan 2, 2017

max-sixty commented Jan 2, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jan 2, 2017

jreback commented Feb 27, 2017

jorisvandenbossche commented Mar 3, 2017

jreback commented Mar 4, 2017

jreback commented Mar 4, 2017

max-sixty commented May 25, 2016 •

edited

Loading

jreback commented May 25, 2016 •

edited

Loading

jreback commented May 25, 2016 •

edited

Loading

max-sixty commented May 25, 2016 •

edited by jorisvandenbossche

Loading

max-sixty May 27, 2016 •

edited

Loading

jorisvandenbossche commented Aug 21, 2016 •

edited

Loading

max-sixty commented Jan 2, 2017 •

edited

Loading