Skip to content

clean up PeriodIndex constructor #13277

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

max-sixty
Copy link
Contributor

@max-sixty max-sixty commented May 25, 2016

Material clean up of PeriodIndex constructor, which was doing a few weird things (#13232 (comment)), and generally getting messy.

@jreback jreback added Period Period data type Compat pandas objects compatability with Numpy or Python functions labels May 25, 2016
@@ -41,10 +41,13 @@ def f(self):

def _get_ordinals(data, freq):
f = lambda x: Period(x, freq=freq).ordinal
if len(data) == 0:
return np.array([], dtype='int64')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can do if/elsif here (really more a stylistic comment)

@max-sixty
Copy link
Contributor Author

@jreback One of the reasons the PeriodIndex constructer is so convoluted is there are some odd cases supported (I imagine through many bugfixes & iterations!). One of them is a list of ints representing years. An array of ints are interpreted as ordinals, but a list is coerced to years within the Period constructor.

Here's an example: https://github.com/SixtyCapital/pandas/blob/master/pandas/tseries/tests/test_period.py#L2216

What do you think about not allowing that?

@jreback
Copy link
Contributor

jreback commented May 25, 2016

@MaximilianR I think that's convenient. The issue is you need a constructor that only deals with codes (otherwise its ambiguous), ok; let's rename _from_arraylike to from_codes and make it public; that will solve the ambiguity problem then (and also provide the internal mechanism for creation which is THE way to create from codes).

@max-sixty
Copy link
Contributor Author

OK, but then you have these being different results:

pd.PeriodIndex([2007, 2008], freq='A')  # years
pd.PeriodIndex(np.array([2007, 2008]), freq='A')  # ordinals 
pd.PeriodIndex(np.array([2007, 2008], dtype='object'), freq='A')  # ?

Agree on separation of concerns there - I have ._simple_new handling ordinals, and __new__ handling the other cases now.

But we still need pd.PeriodIndex(idx.values, idx.freq) to treat the values as ordinals rather than years (otherwise would break a fundamental tenet of indexes). So we're left with the above.
I can add back the branching logic for lists vs. arrays if you really think it's a helpful use case?

@jreback
Copy link
Contributor

jreback commented May 25, 2016

hmm, those should all do the same thing.

@max-sixty
Copy link
Contributor Author

hmm, those should all do the same thing.

Right. So we can either allow supplying years as ints OR those all the same thing: choose one!
Or a creative solution?

@jreback
Copy link
Contributor

jreback commented May 25, 2016

actually looking closer I don't think integers are EVER interpreted as ordinals directly, instead the integers are intrepreted according to the frequency. Can you give a counter-example?

hmm, maybe that's not true. but in any event an example would help.

@max-sixty
Copy link
Contributor Author

actually looking closer I don't think integers are EVER interpreted as ordinals directly,

From some test failures earlier, but without much exploration on its frequency): in some indexing operations, PeriodIndex(some_values_from_a_periodindex) is called.
There may be ways around that but are you sure we want to violate the idx._constructor(idx.values) == idx consistency?

instead the integers are intrepreted according to the frequency.

You need the freq to convert an ordinal to a Period, for sure

@jreback
Copy link
Contributor

jreback commented May 25, 2016

idx._constructor(idx.values) == idx

idx.values is an implementation detail. I have no problem with forcing US (meaning you!) to change internal access methods to PeriodIndex.from_ordinals(idx.values) if you want to construct from ordinals and leave PeriodIndex([2000,2001],freq='A-JUNE') to work as the public contstructor.

IOW separating out ordinal construction.

@max-sixty
Copy link
Contributor Author

max-sixty commented May 25, 2016

OK, let me reflect a bit.

My current thinking depends what you mean by 'internal access methods' - absolutely fine if that's within the PeriodIndex machinery. My fear is that there's a lot of pandas that relies on assertion above, and that going down this road would make PeriodIndex need more special casing throughout the library. And I would love to see less, both for PeriodIndex and everything else!

One option would be to move towards the Index interface relying on ._simple_new to interpret any output of the values of an Index. I think that probably isn't a big change.

@jreback
Copy link
Contributor

jreback commented May 26, 2016

My current thinking depends what you mean by 'internal access methods- absolutely fine if that's within the PeriodIndex machinery. My fear is that there's a lot of pandas that relies on assertion above, and that going down this road would makePeriodIndexneed more special casing throughout the library. And I would love to see less, both forPeriodIndex` and everything else!

You mean that people are actually constructing this like PeriodIndex(idx.values)? and expecting it to work? in external code, really?

If you can fix up the internal consistency, does anything actually break? (e.g. create PeriodIndex.from_ordinals then move all internal code to use that).

@max-sixty
Copy link
Contributor Author

You mean that people are actually constructing this like PeriodIndex(idx.values)? and expecting it to work? in external code, really?

No! But there are places in pandas this happens outside of the PeriodIndex class. I need to turn this off and see what breaks - I suspect a reasonable amount but not a torrent

@max-sixty max-sixty force-pushed the period-float branch 3 times, most recently from b252774 to 274bf06 Compare May 27, 2016 20:14
raise ValueError('PeriodIndex() must be called with a '
'collection of some kind, %s was passed'
% repr(data))
def _simple_new(cls, data, name=None, freq=None, **kwargs):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be called data rather than values in order for the current implementation of pickling to work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More generally: I haven't got my head around why we need all the custom pickling infrastructure around pickling with indexes - isn't reconstructing the object's __dict__ enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there isn't any current infrastructure, Index.__reduce__/__setstate__ does everything generally. Except for the back-compat stuff (which at some point should just take out). Everythings done as a dict now (was a tuple a few versions ago).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By 'infrastructure' I mean the Index.__reduce__/__setstate__ methods themselves.
Or you're saying they're only there to allow back compat? And that we could remove those, default to the normal python machinery, and it would work the same for everything from recent versions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no I don't think it would work. You have to tell pickle what to do. E.g. we want to pickle only certain attributes and the data, nothing more. I don't think it works correctly otherwise (it tries to do too much).

You can try removing them but (aside from back-compat) I don't think it would work, but I could be wrong.

In any event I don't see it as a big deal; you want custom actions and that's what the special methods are for.

otherwise its too presumptous.

Copy link
Contributor Author

@max-sixty max-sixty May 27, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a quick go at simplifying it - this takes out all of the custom stuff except for back-compat + a one line __getstate__. Tests pass. But also agree, not a big deal
max-sixty/pandas@period-float...SixtyCapital:use-getstate

@max-sixty max-sixty force-pushed the period-float branch 2 times, most recently from d44d1f0 to e6695fa Compare May 27, 2016 22:42
index = PeriodIndex(values)
freq = index.freq
values = index.values

from pandas.core.series import Series
values = Series(values).values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe reorg this a bit, we should catch the datetime/timedelta/period ones before we do these coercion lines

@jreback
Copy link
Contributor

jreback commented May 27, 2016

@MaximilianR ok does look a lot cleaner

@max-sixty
Copy link
Contributor Author

I've been remiss about the final fixes here. Will need a couple of weeks on other commitments but I will get to this

Unless @sinhrks is going to spoil all my hard work with moving PeriodIndex to be a proper type

@sinhrks
Copy link
Member

sinhrks commented Jul 14, 2016

I'm using PeriodIndex as internal of PeriodDtype (like DatetimeTZ), the improvement of PeriodIndex should not be spoiled:)

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 21, 2016

@MaximilianR do you have time to update this? (the period dtype for index PR has been merged) Then we could get it in 0.19

@@ -283,6 +283,7 @@ Bug Fixes
- Bug in ``astype()`` where ``inf`` values were incorrectly converted to integers. Now raises error now with ``astype()`` for Series and DataFrames (:issue:`14265`)
- Bug in ``DataFrame(..).apply(to_numeric)`` when values are of type decimal.Decimal. (:issue:`14827`)
- Bug in ``describe()`` when passing a numpy array which does not contain the median to the ``percentiles`` keyword argument (:issue:`14908`)
- Cleaned up ``PeriodIndex`` constructor, including raising on floats more consistently (:issue:`13277`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we put it in the api section, there should be a clear 'change' documented for users to be aware of. Is the raising on floats not more a bug fix?

setup.cfg Outdated
@@ -13,6 +13,7 @@ parentdir_prefix = pandas-

[flake8]
ignore = E731
max-line-length = 79
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some people will have local defaults that are different. This will override those

@jorisvandenbossche
Copy link
Member

@MaximilianR Can you run some of the asv benchmarks to check there is no impact on performance? (also the plotting benchmarks, we have see regressions there in the past due to changes in period index)

@max-sixty max-sixty force-pushed the period-float branch 3 times, most recently from f9a05e3 to d0e744d Compare January 2, 2017 09:59
@max-sixty
Copy link
Contributor Author

max-sixty commented Jan 2, 2017

@jorisvandenbossche Good call. There is one significant slow down, by about 30x: time_from_pydatetime, which cooerces python datetimes to Periods.

I've tried to debug this, but am struggling a bit and could do with some help (and general guidance on how to approach these problems). So far:

  • The Period constructor is being called by each object within period.extract_ordinals, and that's what's taking all the time
  • The Period constructor then calls get_period_ordinal, which somewhere calls into to_offset:
 0.183 __new__  pandas/tseries/period.py:183
├─ 0.168 wrapper  pandas/util/decorators.py:65
│  └─ 0.150 to_offset  pandas/tseries/frequencies.py:453
│     ├─ 0.089 get_stride_from_decimal  pandas/tseries/frequencies.py:186
│     │  └─ 0.086 isclose  numpy/core/numeric.py:2375
│     │     ├─ 0.032 within_tol  numpy/core/numeric.py:2434
│     │     └─ 0.021 all  numpy/core/fromnumeric.py:1987
│     │        ├─ 0.013 _all  numpy/core/_methods.py:40
│     │        └─ 0.003 asanyarray  numpy/core/numeric.py:484
│     ├─ 0.019 get_offset  pandas/tseries/frequencies.py:605
│     │  └─ 0.014 copy  pandas/tseries/offsets.py:306
│     │     └─ 0.013 __init__  pandas/tseries/offsets.py:192
│     ├─ 0.014 __mul__  pandas/tseries/offsets.py:412
│     │  └─ 0.006 __init__  pandas/tseries/offsets.py:192
│     │     └─ 0.004 _determine_offset  pandas/tseries/offsets.py:198
│     │        └─ 0.003 <genexpr>  pandas/tseries/offsets.py:203
│     └─ 0.008 split  re.py:195
└─ 0.014 get_freq_code  pandas/tseries/frequencies.py:300
   └─ 0.003 _period_str_to_code  pandas/tseries/frequencies.py:732

Apart from exactly how the code traces through:

  • I'm not sure how these changes should make an impact on this code path
  • Does this matter? Coercing from python datetime isn't a common use case

CC @sinhrks

@max-sixty
Copy link
Contributor Author

I think I figured it out - Period._maybe_convert_freq needs to be called early, otherwise it the pyx functions will take a slow route to get the appropriate freq. Running ASV to confirm now

@max-sixty
Copy link
Contributor Author

Performance issues solved! I think all all asv changes are from external compute interference given none are period related. Here are all <50% or >200% for transparency:

-  112.12ms    54.32ms      0.48  sparse.sparse_frame_constructor.time_sparse_frame_constructor
-    7.34ms     3.48ms      0.47  reshape.reshape_stack_simple.time_reshape_stack_simple
-    5.97ms     2.70ms      0.45  reshape.melt_dataframe.time_melt_dataframe
-    1.00ms   448.65μs      0.45  reindex.FillMethod.time_pad
-    1.87ms   825.54μs      0.44  period.Algorithms.time_value_counts_pindex
-    1.15ms   485.69μs      0.42  reindex.Reindexing.time_reindex_dates
-  429.95μs   173.53μs      0.40  series_methods.series_constructor_no_data_datetime_index.time_series_constructor_no_data_datetime_index
-   24.72ms     4.49ms      0.18  reindex.Duplicates.time_frame_drop_dups_na_inplace
-   18.22ms     2.74ms      0.15  reindex.Reindexing.time_reindex_multiindex

---

+    6.13ms    14.80ms      2.41  hdfstore_bench.HDF5.time_read_store_mixed
+   26.90ms    60.93ms      2.27  groupby.GroupBySuite.time_rank('int', 100)
+    1.48ms     3.23ms      2.19  frame_methods.Interpolate.time_interpolate_some_good
+  537.00μs     1.16ms      2.16  strings.StringEncode.time_encode_decode
+    2.62ms     5.43ms      2.07  strings.StringMethods.time_contains_few_noregex

@max-sixty
Copy link
Contributor Author

@jreback green!

if not isinstance(data, (np.ndarray, PeriodIndex,
DatetimeIndex, Int64Index)):
if is_scalar(data) or isinstance(data, Period):
raise ValueError('PeriodIndex() must be called with a '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is generally a TypeError, in fact changing this to cls._scalar_data_error(data) is appropriate (I am changing on merge)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually why don't you make this change (as I have a couple of more comments).

diff --git a/pandas/tseries/period.py b/pandas/tseries/period.py
index 903b4e3..cefe1cf 100644
--- a/pandas/tseries/period.py
+++ b/pandas/tseries/period.py
@@ -239,9 +239,7 @@ class PeriodIndex(DatelikeOps, DatetimeIndexOpsMixin, Int64Index):
         if not isinstance(data, (np.ndarray, PeriodIndex,
                                  DatetimeIndex, Int64Index)):
             if is_scalar(data) or isinstance(data, Period):
-                raise ValueError('PeriodIndex() must be called with a '
-                                 'collection of some kind, %s was passed'
-                                 % repr(data))
+                cls._scalar_data_error(data)
 
             # other iterable of some kind
             if not isinstance(data, (list, tuple)):
diff --git a/pandas/tseries/tests/test_period.py b/pandas/tseries/tests/test_period.py
index c7fd4ae..8ec2fae 100644
--- a/pandas/tseries/tests/test_period.py
+++ b/pandas/tseries/tests/test_period.py
@@ -1753,7 +1753,7 @@ class TestPeriodIndex(tm.TestCase):
 
         self.assertRaises(ValueError, PeriodIndex, idx._values)
         self.assertRaises(ValueError, PeriodIndex, list(idx._values))
-        self.assertRaises(ValueError, PeriodIndex,
+        self.assertRaises(TypeError, PeriodIndex,
                           data=Period('2007', freq='A'))
 
         result = PeriodIndex(iter(idx))

data = np.asarray(data)

# datetime other than period
if np.issubdtype(data.dtype, np.datetime64):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be is_datetime64_dtype. But not sure what should happend if you have tz-aware or datetime64 (but not ns) passed. We may not have tests for this. You could handle datetime64tz by pretty easily though (happy to have this done in another PR as well).

if not is_integer_dtype(values):
values = np.array(values, copy=False)
if (len(values) > 0 and is_float_dtype(values)):
if len(values) > 0 and is_float_dtype(values):
raise TypeError("PeriodIndex can't take floats")
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can eliminate the else clause (and just do the return)

@@ -325,7 +322,6 @@ def _shallow_copy_with_infer(self, values=None, **kwargs):

def _shallow_copy(self, values=None, **kwargs):
if kwargs.get('freq') is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would explicitly list freq=None I think in the signature (then pass it directly) as its so crucial to the PeriodIndex nice to be noted

@jreback jreback added this to the 0.20.0 milestone Jan 2, 2017
@jreback
Copy link
Contributor

jreback commented Jan 2, 2017

couple of comments. lgtm otherwise. if you'd fix / create a new issue (for the handling of datetime-aware tz, other datetimes), then ok with merging.

@jreback
Copy link
Contributor

jreback commented Feb 27, 2017

can you rebase / update

@jorisvandenbossche
Copy link
Member

@MaximilianR You have a bunch of errors, all seems related to the use of np.issubdtype(data.dtype, is_datetime64_dtype) (not sure what changed here)

@jreback
Copy link
Contributor

jreback commented Mar 4, 2017

this lgtm. any perf impacts?

@jreback jreback closed this in ca6d88b Mar 4, 2017
@jreback
Copy link
Contributor

jreback commented Mar 4, 2017

thanks @MaximilianR

figured out to push this in so no rebases are needed!

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#13232

Material clean up of PeriodIndex constructor, which was doing a few
weird things (pandas-dev#13232 (comment)
nt-220788816), and generally getting messy.

Author: Maximilian Roos <[email protected]>

Closes pandas-dev#13277 from MaximilianR/period-float and squashes the following commits:

5cae7aa [Maximilian Roos] @jreback changes
75ff54d [Maximilian Roos] _new_PeriodIndex for unpickling
240172f [Maximilian Roos] coerce freq object earlier for perf
ba5133b [Maximilian Roos] documentation
b0fc0a7 [Maximilian Roos] final changes
fa0fa9d [Maximilian Roos] clean up PeriodIndex constructor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions Period Period data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PeriodIndex with float input inconsistency
5 participants