PERF: PeriodEngine.get_loc accept narrow data type #29234

proost · 2019-10-26T05:34:34Z

closes REF/PERF: PeriodEngine covers narrow type #29038
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

xref #28628.
I want to refactor PeriodIndex.get_loc. Because PeriodIndex.get_loc compute twice Int64index.get_loc and PeriodEngine.get_loc. Problem is only PeriodEngine.get_loc accepts Period.ordinal.
I try to change PeriodIndex.get_loc in two ways.
First one, I try to remove PeriodEngine.get_loc in PeriodIndex.get_loc. so just compute .get_loc once. I change several ways But always stuck in test cases. So I give it up this way.
Second one, I try to change PeriodEngine.get_loc to accept Period not only Period.ordinal. This way is success in that work successfully and pass all test cases. However Problem is most cases users don't use Period using PeriodIndex.get_loc and PeriodEngine.get_loc can't accept "intolerance" argument. Not using Int64index.get_loc can't be possible.
But I send PR. Little improvement is also improvement.

improvement : memory usage with Period.get_loc

pep8speaks · 2019-10-26T05:34:37Z

Hello @proost! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-15 16:30:18 UTC

proost · 2019-10-26T16:45:26Z

@jreback
Here is what i intent to change

jbrockmendel · 2019-10-27T23:52:34Z

Is there any simplficiation of PeriodIndex.get_loc that this makes possible?

proost · 2019-10-28T01:35:40Z

@jbrockmendel

Yes. I'm working on it.

proost · 2019-10-29T09:18:18Z

@jbrockmendel
Could you take a look at it?

doc/source/whatsnew/v1.0.0.rst

pandas/core/indexes/period.py

jbrockmendel · 2019-10-29T15:47:24Z

pandas/tests/indexes/common.py

@@ -332,7 +332,7 @@ def test_memory_usage(self, indices):
        # RangeIndex, IntervalIndex
        # don't have engines
        if not isinstance(indices, (RangeIndex, IntervalIndex)):
-            assert result2 > result


is it just PeriodIndex for which this inequality is no longer strict?

@jbrockmendel
Yes. for PeriodIndex

jbrockmendel · 2019-10-29T15:49:42Z

It looks like nothing in _libs.index is changed now. I thought the idea was that something was wrong there that would be fixed, and that would allow simplification in PeriodIndex.get_loc. Am I confused with something else?

proost · 2019-10-30T02:11:50Z

@jbrockmendel

In terms of simplication, Yes. I was wrong.
How about memory Usage?
Before,

In [2]: pi = pd.period_range(start=2010,end=2019,freq="m")                      
In [3]: pi.memory_usage()                                                       
Out[3]: 872
In [4]: pi.get_loc(pi[0])                                                       
Out[4]: 0
In [5]: pi.memory_usage()                                                       
Out[5]: 11984

After

In [3]: pi = pd.period_range(start=2010,end=2019,freq="m")                                                                                      
In [4]: pi.memory_usage()                                                                                                                       
Out[4]: 872
In [5]: pi.get_loc(pi[0])                           
Out[5]: 0
In [6]: pi.memory_usage()                           
Out[6]: 6864

I don't intent to it but i found it running test case. After .get_loc, use less memory space

jreback

you are changing the semantics of what is classified as a TypeError here. An invalid key should be a TypeError and not a KeyError.

jreback · 2019-10-30T12:08:02Z

pandas/core/indexes/period.py

@@ -704,36 +704,36 @@ def get_loc(self, key, method=None, tolerance=None):
        -------
        loc : int
        """
+        if is_integer(key):
+            ordinal = key
+        else:


why don't you just move what you have in _cast_to_period_object here. This is confusing as you are doing the same type of operation in multiple functions.

yah, i think we should require that the argument to get_loc be either a Period or a string/datetime that can be cast to Period, not allow ordinal at this point

@jbrockmendel
Yes, I really want to. But some code use ".get_loc" with numeric type and if exception should raise, raise KeyError not TypeError.

pandas/tests/indexes/period/test_indexing.py

jbrockmendel · 2019-10-31T01:09:34Z

@proost I think you need to clarify the goal of this PR. The referenced issue #29038 is about the behavior of PeriodEngine, but this doesn't touch that at all.

proost · 2019-10-31T01:46:52Z

@jbrockmendel
Purpose: Improvement function of PeriodIndex.get_loc
Reason: Inefficient PeriodIndex.get_loc. Because PeriodEngine.get_loc acceptable parameter type is narrow. So most case PeriodIndex.get_loc compute twice. Both PeriodEngine and Int64Index
So What i try to :
-> first one : Change PeriodEngine.get_loc to accept Period to cover all parameter types. Problem is PeriodEngine.get_loc can't accept "tolerance" and "method". So if "method" and "tolerance" are not none, also in this case, PeriodIndex.get_loc compute twice. first PeriodEngine fails and then Int64Index
-> second one: Don't use PeriodEngine.get_loc Using PeriodIndex.get_loc.

What we can get from this change:
Before: I thought there is improvement in computation time.
Now: Same speed. But use less memory.

Of course, This PR off track from original issue. In #29038, I narrowed the scope. It was my fault. Sorry not to upload adequate issue.

proost · 2019-11-20T05:58:09Z

@jbrockmendel
small improvement is prohibited? If prohibited, I will close this PR. If not, this PR for reducing memory usage.

jbrockmendel · 2019-11-21T01:17:42Z

Small improvements are encouraged! We just need to communicate clearly what the relevant improvement is. how much less memory does this use? You're no longer calling the _engine method at the beginning, which makes me concerned that performance will take a hit.

proost · 2019-11-21T05:01:42Z

@jbrockmendel
from not calling _engine.get_loc, period index use less memory after using get_loc.
From this change, we can save less memory space about 60% ~ 100% before.

first case:

pi = pd.period_range(start=2000,end=2019,freq="y") 
pi.get_loc('2000')
pi.memory_usage()

Before change, period index memory usage: 1600
After change, period index memory usage: 900

second case:

pi = pd.period_range(start=2000,end=2019,freq="m") 
pi.get_loc('2000')
pi.memory_usage()

Before change, period index memory usage: 24144
After change, period index memory usage: 13904

third case:

pi = pd.period_range(start=2010,end=2019,freq="d") 
pi.get_loc('2010')
pi.memory_usage()

Before change, period index memory usage: 380288
After change, period index memory usage: 216448

fourth case:

pi = pd.period_range(start=2010,end=2019,freq="h") 
pi.get_loc('2010')
pi.memory_usage()

Before change, period index memory usage: 6505104
After change, period index memory usage: 3883664

fifth case:

pi = pd.period_range(start=2010,end=2019,freq="s") 
pi.get_loc('2010')
pi.memory_usage()

Before change, period index memory usage: 4543948816
After change, period index memory usage: 4543948816

if you worry about time performance from this change, no need to worry.
command same above,

first case:

Before Change: 11.7 µs ± 58.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After Change: 10.2 µs ± 42.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

second case:

Before Change: 11.7 µs ± 2.17 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After Change: 8.78 µs ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

third case:

Before Change: 11 µs ± 384 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After Change: 9.35 µs ± 40.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

fourth case:

Before Change: 10.7 µs ± 330 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After Change: 9.25 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

fifth case:

Before Change: 28.8 µs ± 17.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
After Change: 26.8 µs ± 17.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Almost same or slightly better before. So this change of code doesn't exacerbate on time performance.

jreback

pls rebase on master as well

pandas/core/indexes/period.py

jreback · 2019-12-02T01:58:31Z

so i wonder after this if we can blow PeriodEngine away entirely. cc @jbrockmendel

@proost the memory savings are because we no longer need to instantiate the engine.

jbrockmendel · 2019-12-04T02:35:17Z

@proost if i understand correclty, all the timings you posted were with string inputs. Can you do the same comparisons for 0, pd.Period("2000"), pd.NaT, pd.Timedelta(days=1)

jbrockmendel · 2019-12-04T02:38:03Z

so i wonder after this if we can blow PeriodEngine away entirely

It looks like there is some non-trivial stuff in there for get_indexer, but we might be able to share more with DatetimeEngine

proost · 2019-12-09T04:41:11Z

@jbrockmendel
If

l = ['1970-01-01', '1970-01-02', '1970-01-03', '1970-01-01',
     '1970-01-05', '1970-01-06', '1970-01-07', '1970-01-08',
     '1970-01-09', '1970-01-10', '1970-01-11', '1970-01-12',
     '1970-01-13', '1970-01-14', '1970-01-15', '1970-01-16',
     '1970-01-17', '1970-01-18', '1970-01-19', '1970-01-20',
     '1970-01-21', '1970-01-22', '1970-01-23', '1970-01-24',
     '1970-01-25', '1970-01-26', '1970-01-27', '1970-01-28',
     '1970-01-29', '1970-01-30', '1970-01-31', np.datetime64('NaT')
]
pi = pd.PeriodIndex(l,freq="D")

case 1:

p = pd.Period(year=1970, month=1, day=1, freq="D")
pi.get_loc(p)

Before:
memory usage: 3072
time taken: 11.8 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After:
memory usage: 1792
time taken: 14 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

case 2:

pi.get_loc(0)

Before:
memory usage: 1536
time taken: 5.13 µs ± 25 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After:
memory usage: 1792
time taken: 11.2 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

case 3:

pi.get_loc(pd.NaT)

Before:
memory usage: 3072
time taken: 10.1 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After:
memory usage: 1792
time taken: 11.7 µs ± 51.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

case 4:
Because "Timedelta" Object can't be converted to Period. So can't work on ".get_loc"
But if you mean "Timestamp" object,

t = pd.Timestamp(year=1970, month=1, day=1)
pi.get_loc(t)

Before:
memory usage: 3072
time taken: 13.3 µs ± 533 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After:
memory Usage: 1792
time taken: 16 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Well, time performance becomes bad before. But i think this much can be acceptable.
except Integer case, all memory usage lower before. And Integer or float type is not expected input from user. Becasue PeriodIndex.get_loc parameter type is expected as "Period", "String", "datetime" or datetimelike object from user. So This is not a problem.

jbrockmendel · 2019-12-10T01:26:35Z

@proost thank you for doing these additional timings.

For the Timedelta case i actually did want to know how long it took to fail.

proost · 2019-12-10T02:11:52Z

@jbrockmendel
Ok. I understand. So this is measurement of time.

td =  pd.Timedelta(days=1)
pi.get_loc(td)

Before: 13.5 ns ± 0.021 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
After: 15.7 ns ± 0.0219 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

jreback · 2019-12-27T19:58:24Z

can you merge master and will look again

jreback

can you rebase

jreback · 2020-01-06T13:39:26Z

pandas/tests/indexes/common.py

@@ -339,7 +339,7 @@ def test_memory_usage(self, indices):

        # RangeIndex, IntervalIndex


can you update / change the comment to more reflect what is going on (i would remove the references to RI and II here)

indices = PeriodIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-10', '2000-01-11', '2000-01-12', '2000-01-13', '2000-01-14'], dtype='period[D]', freq='D') result = indices.memory_usage() indices.get_loc(indices[0]) result2 = indices.memory_usage()

"result" and "result2" same as 80

jbrockmendel · 2020-01-15T18:46:06Z

pandas/core/indexes/period.py

        except ValueError:
            # we cannot construct the Period
            # as we have an invalid type
            if is_list_like(key):
                raise TypeError(f"'{key}' is an invalid key")
            raise KeyError(key)

-        ordinal = key.ordinal if key is not NaT else key.value


i think this got affected by #31021. unless there a compelling reason to move this line up to L614, lets keep it here

jbrockmendel · 2020-01-15T18:49:15Z

pandas/core/indexes/period.py

-
-            except KeyError:
-                raise KeyError(key)
+            raise KeyError(key)


going directly to _int64index does simplify the code here which is nice, but i imagine involves a perf penalty for by-far-the-most-common case with method=tolerance=None. Am I wrong about that?

@jbrockmendel
Yes, you are right. I pointed out the wrong key. Like #31021, cleaning up codes is best choice.
Is it okay if i close PR and issue?

proost changed the title ~~BUG: PeriodEngine doesn't work (#29038)~~ BUG: PeriodEngine doesn't work Oct 26, 2019

proost force-pushed the periodengine-get_loc branch 5 times, most recently from f1bdfae to 04ab615 Compare October 26, 2019 14:27

proost force-pushed the periodengine-get_loc branch 4 times, most recently from 1294121 to b58e022 Compare October 29, 2019 08:27

proost force-pushed the periodengine-get_loc branch 2 times, most recently from 9ec1e25 to 55b8abb Compare October 29, 2019 10:19

jbrockmendel reviewed Oct 29, 2019

View reviewed changes

doc/source/whatsnew/v1.0.0.rst Outdated Show resolved Hide resolved

jbrockmendel reviewed Oct 29, 2019

View reviewed changes

pandas/core/indexes/period.py Outdated Show resolved Hide resolved

jbrockmendel reviewed Oct 29, 2019

View reviewed changes

proost force-pushed the periodengine-get_loc branch 4 times, most recently from 9d11e2a to 8cdf285 Compare October 30, 2019 06:55

jreback requested changes Oct 30, 2019

View reviewed changes

jreback added the Period Period data type label Oct 30, 2019

proost force-pushed the periodengine-get_loc branch from ccad304 to 55e55d5 Compare November 20, 2019 05:44

jreback requested changes Dec 2, 2019

View reviewed changes

pandas/core/indexes/period.py Outdated Show resolved Hide resolved

pandas/core/indexes/period.py Outdated Show resolved Hide resolved

pandas/core/indexes/period.py Outdated Show resolved Hide resolved

proost force-pushed the periodengine-get_loc branch from 55e55d5 to 1db092b Compare December 2, 2019 15:41

proost changed the title ~~BUG: PeriodEngine doesn't work~~ PERF: PeriodEngine doesn't work Dec 21, 2019

proost force-pushed the periodengine-get_loc branch 2 times, most recently from ef552b6 to 8b3f484 Compare December 28, 2019 13:34

proost changed the title ~~PERF: PeriodEngine doesn't work~~ PERF: PeriodEngine.get_loc accept narrow data type Dec 28, 2019

proost force-pushed the periodengine-get_loc branch from 8b3f484 to 30bb3a7 Compare December 30, 2019 03:30

jreback requested changes Jan 6, 2020

View reviewed changes

proost force-pushed the periodengine-get_loc branch 2 times, most recently from ee0a53f to b10e5ce Compare January 7, 2020 12:54

proost force-pushed the periodengine-get_loc branch 3 times, most recently from 2fc2f30 to 4b47e01 Compare January 15, 2020 16:17

PERF: PeriodEngine.get_loc accept narrow data type (pandas-dev#29234)

7f49b53

proost force-pushed the periodengine-get_loc branch from 397e0d7 to 7f49b53 Compare January 15, 2020 16:30

jbrockmendel reviewed Jan 15, 2020

View reviewed changes

proost closed this Jan 16, 2020

proost deleted the periodengine-get_loc branch January 17, 2020 00:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: PeriodEngine.get_loc accept narrow data type #29234

PERF: PeriodEngine.get_loc accept narrow data type #29234

proost commented Oct 26, 2019 •

edited

Loading

pep8speaks commented Oct 26, 2019 •

edited

Loading

proost commented Oct 26, 2019

jbrockmendel commented Oct 27, 2019

proost commented Oct 28, 2019

proost commented Oct 29, 2019

jbrockmendel Oct 29, 2019

proost Oct 30, 2019

jbrockmendel commented Oct 29, 2019

proost commented Oct 30, 2019 •

edited

Loading

jreback left a comment

jreback Oct 30, 2019

jbrockmendel Jan 12, 2020

proost Jan 14, 2020

jbrockmendel commented Oct 31, 2019

proost commented Oct 31, 2019 •

edited

Loading

proost commented Nov 20, 2019

jbrockmendel commented Nov 21, 2019

proost commented Nov 21, 2019 •

edited

Loading

jreback left a comment

jreback commented Dec 2, 2019

jbrockmendel commented Dec 4, 2019

jbrockmendel commented Dec 4, 2019

proost commented Dec 9, 2019

jbrockmendel commented Dec 10, 2019

proost commented Dec 10, 2019

jreback commented Dec 27, 2019

jreback left a comment

jreback Jan 6, 2020

proost Jan 7, 2020

jbrockmendel Jan 15, 2020

jbrockmendel Jan 15, 2020

proost Jan 16, 2020

jbrockmendel Jan 16, 2020

		@@ -339,7 +339,7 @@ def test_memory_usage(self, indices):

		# RangeIndex, IntervalIndex

PERF: PeriodEngine.get_loc accept narrow data type #29234

PERF: PeriodEngine.get_loc accept narrow data type #29234

Conversation

proost commented Oct 26, 2019 • edited Loading

pep8speaks commented Oct 26, 2019 • edited Loading

Comment last updated at 2020-01-15 16:30:18 UTC

proost commented Oct 26, 2019

jbrockmendel commented Oct 27, 2019

proost commented Oct 28, 2019

proost commented Oct 29, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jbrockmendel commented Oct 29, 2019

proost commented Oct 30, 2019 • edited Loading

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jbrockmendel commented Oct 31, 2019

proost commented Oct 31, 2019 • edited Loading

proost commented Nov 20, 2019

jbrockmendel commented Nov 21, 2019

proost commented Nov 21, 2019 • edited Loading

jreback left a comment

Choose a reason for hiding this comment

jreback commented Dec 2, 2019

jbrockmendel commented Dec 4, 2019

jbrockmendel commented Dec 4, 2019

proost commented Dec 9, 2019

jbrockmendel commented Dec 10, 2019

proost commented Dec 10, 2019

jreback commented Dec 27, 2019

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

proost commented Oct 26, 2019 •

edited

Loading

pep8speaks commented Oct 26, 2019 •

edited

Loading

proost commented Oct 30, 2019 •

edited

Loading

proost commented Oct 31, 2019 •

edited

Loading

proost commented Nov 21, 2019 •

edited

Loading