Skip to content

PERF: PeriodEngine.get_loc accept narrow data type #29234

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

proost
Copy link
Contributor

@proost proost commented Oct 26, 2019

xref #28628.
I want to refactor PeriodIndex.get_loc. Because PeriodIndex.get_loc compute twice Int64index.get_loc and PeriodEngine.get_loc. Problem is only PeriodEngine.get_loc accepts Period.ordinal.
I try to change PeriodIndex.get_loc in two ways.
First one, I try to remove PeriodEngine.get_loc in PeriodIndex.get_loc. so just compute .get_loc once. I change several ways But always stuck in test cases. So I give it up this way.
Second one, I try to change PeriodEngine.get_loc to accept Period not only Period.ordinal. This way is success in that work successfully and pass all test cases. However Problem is most cases users don't use Period using PeriodIndex.get_loc and PeriodEngine.get_loc can't accept "intolerance" argument. Not using Int64index.get_loc can't be possible.
But I send PR. Little improvement is also improvement.


improvement : memory usage with Period.get_loc

@pep8speaks
Copy link

pep8speaks commented Oct 26, 2019

Hello @proost! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-15 16:30:18 UTC

@proost proost changed the title BUG: PeriodEngine doesn't work (#29038) BUG: PeriodEngine doesn't work Oct 26, 2019
@proost proost force-pushed the periodengine-get_loc branch 5 times, most recently from f1bdfae to 04ab615 Compare October 26, 2019 14:27
@proost
Copy link
Contributor Author

proost commented Oct 26, 2019

@jreback
Here is what i intent to change

@jbrockmendel
Copy link
Member

Is there any simplficiation of PeriodIndex.get_loc that this makes possible?

@proost
Copy link
Contributor Author

proost commented Oct 28, 2019

@jbrockmendel

Yes. I'm working on it.

@proost proost force-pushed the periodengine-get_loc branch 4 times, most recently from 1294121 to b58e022 Compare October 29, 2019 08:27
@proost
Copy link
Contributor Author

proost commented Oct 29, 2019

@jbrockmendel
Could you take a look at it?

@proost proost force-pushed the periodengine-get_loc branch 2 times, most recently from 9ec1e25 to 55b8abb Compare October 29, 2019 10:19
@@ -332,7 +332,7 @@ def test_memory_usage(self, indices):
# RangeIndex, IntervalIndex
# don't have engines
if not isinstance(indices, (RangeIndex, IntervalIndex)):
assert result2 > result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it just PeriodIndex for which this inequality is no longer strict?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel
Yes. for PeriodIndex

@jbrockmendel
Copy link
Member

It looks like nothing in _libs.index is changed now. I thought the idea was that something was wrong there that would be fixed, and that would allow simplification in PeriodIndex.get_loc. Am I confused with something else?

@proost
Copy link
Contributor Author

proost commented Oct 30, 2019

@jbrockmendel

In terms of simplication, Yes. I was wrong.
How about memory Usage?
Before,

In [2]: pi = pd.period_range(start=2010,end=2019,freq="m")                      
In [3]: pi.memory_usage()                                                       
Out[3]: 872
In [4]: pi.get_loc(pi[0])                                                       
Out[4]: 0
In [5]: pi.memory_usage()                                                       
Out[5]: 11984

After

In [3]: pi = pd.period_range(start=2010,end=2019,freq="m")                                                                                      
In [4]: pi.memory_usage()                                                                                                                       
Out[4]: 872
In [5]: pi.get_loc(pi[0])                           
Out[5]: 0
In [6]: pi.memory_usage()                           
Out[6]: 6864

I don't intent to it but i found it running test case. After .get_loc, use less memory space

@proost proost force-pushed the periodengine-get_loc branch 4 times, most recently from 9d11e2a to 8cdf285 Compare October 30, 2019 06:55
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are changing the semantics of what is classified as a TypeError here. An invalid key should be a TypeError and not a KeyError.

@@ -704,36 +704,36 @@ def get_loc(self, key, method=None, tolerance=None):
-------
loc : int
"""
if is_integer(key):
ordinal = key
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't you just move what you have in _cast_to_period_object here. This is confusing as you are doing the same type of operation in multiple functions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yah, i think we should require that the argument to get_loc be either a Period or a string/datetime that can be cast to Period, not allow ordinal at this point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel
Yes, I really want to. But some code use ".get_loc" with numeric type and if exception should raise, raise KeyError not TypeError.

@jreback jreback added the Period Period data type label Oct 30, 2019
@jbrockmendel
Copy link
Member

@proost I think you need to clarify the goal of this PR. The referenced issue #29038 is about the behavior of PeriodEngine, but this doesn't touch that at all.

@proost
Copy link
Contributor Author

proost commented Oct 31, 2019

@jbrockmendel
Purpose: Improvement function of PeriodIndex.get_loc
Reason: Inefficient PeriodIndex.get_loc. Because PeriodEngine.get_loc acceptable parameter type is narrow. So most case PeriodIndex.get_loc compute twice. Both PeriodEngine and Int64Index
So What i try to :
-> first one : Change PeriodEngine.get_loc to accept Period to cover all parameter types. Problem is PeriodEngine.get_loc can't accept "tolerance" and "method". So if "method" and "tolerance" are not none, also in this case, PeriodIndex.get_loc compute twice. first PeriodEngine fails and then Int64Index
-> second one: Don't use PeriodEngine.get_loc Using PeriodIndex.get_loc.

What we can get from this change:
Before: I thought there is improvement in computation time.
Now: Same speed. But use less memory.

Of course, This PR off track from original issue. In #29038, I narrowed the scope. It was my fault. Sorry not to upload adequate issue.

@proost proost force-pushed the periodengine-get_loc branch from ccad304 to 55e55d5 Compare November 20, 2019 05:44
@proost
Copy link
Contributor Author

proost commented Nov 20, 2019

@jbrockmendel
small improvement is prohibited? If prohibited, I will close this PR. If not, this PR for reducing memory usage.

@jbrockmendel
Copy link
Member

Small improvements are encouraged! We just need to communicate clearly what the relevant improvement is. how much less memory does this use? You're no longer calling the _engine method at the beginning, which makes me concerned that performance will take a hit.

@proost
Copy link
Contributor Author

proost commented Nov 21, 2019

@jbrockmendel
from not calling _engine.get_loc, period index use less memory after using get_loc.
From this change, we can save less memory space about 60% ~ 100% before.

first case:

pi = pd.period_range(start=2000,end=2019,freq="y") 
pi.get_loc('2000')
pi.memory_usage()

Before change, period index memory usage: 1600
After change, period index memory usage: 900

second case:

pi = pd.period_range(start=2000,end=2019,freq="m") 
pi.get_loc('2000')
pi.memory_usage()

Before change, period index memory usage: 24144
After change, period index memory usage: 13904

third case:

pi = pd.period_range(start=2010,end=2019,freq="d") 
pi.get_loc('2010')
pi.memory_usage()

Before change, period index memory usage: 380288
After change, period index memory usage: 216448

fourth case:

pi = pd.period_range(start=2010,end=2019,freq="h") 
pi.get_loc('2010')
pi.memory_usage()

Before change, period index memory usage: 6505104
After change, period index memory usage: 3883664

fifth case:

pi = pd.period_range(start=2010,end=2019,freq="s") 
pi.get_loc('2010')
pi.memory_usage()

Before change, period index memory usage: 4543948816
After change, period index memory usage: 4543948816

if you worry about time performance from this change, no need to worry.
command same above,

first case:

Before Change: 11.7 µs ± 58.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After Change: 10.2 µs ± 42.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

second case:

Before Change: 11.7 µs ± 2.17 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After Change: 8.78 µs ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

third case:

Before Change: 11 µs ± 384 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After Change: 9.35 µs ± 40.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

fourth case:

Before Change: 10.7 µs ± 330 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After Change: 9.25 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

fifth case:

Before Change: 28.8 µs ± 17.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
After Change: 26.8 µs ± 17.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Almost same or slightly better before. So this change of code doesn't exacerbate on time performance.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls rebase on master as well

@jreback
Copy link
Contributor

jreback commented Dec 2, 2019

so i wonder after this if we can blow PeriodEngine away entirely. cc @jbrockmendel

@proost the memory savings are because we no longer need to instantiate the engine.

@proost proost force-pushed the periodengine-get_loc branch from 55e55d5 to 1db092b Compare December 2, 2019 15:41
@jbrockmendel
Copy link
Member

@proost if i understand correclty, all the timings you posted were with string inputs. Can you do the same comparisons for 0, pd.Period("2000"), pd.NaT, pd.Timedelta(days=1)

@jbrockmendel
Copy link
Member

so i wonder after this if we can blow PeriodEngine away entirely

It looks like there is some non-trivial stuff in there for get_indexer, but we might be able to share more with DatetimeEngine

@proost
Copy link
Contributor Author

proost commented Dec 9, 2019

@jbrockmendel
If

l = ['1970-01-01', '1970-01-02', '1970-01-03', '1970-01-01',
     '1970-01-05', '1970-01-06', '1970-01-07', '1970-01-08',
     '1970-01-09', '1970-01-10', '1970-01-11', '1970-01-12',
     '1970-01-13', '1970-01-14', '1970-01-15', '1970-01-16',
     '1970-01-17', '1970-01-18', '1970-01-19', '1970-01-20',
     '1970-01-21', '1970-01-22', '1970-01-23', '1970-01-24',
     '1970-01-25', '1970-01-26', '1970-01-27', '1970-01-28',
     '1970-01-29', '1970-01-30', '1970-01-31', np.datetime64('NaT')
]
pi = pd.PeriodIndex(l,freq="D")

case 1:

p = pd.Period(year=1970, month=1, day=1, freq="D")
pi.get_loc(p)

Before:
memory usage: 3072
time taken: 11.8 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After:
memory usage: 1792
time taken: 14 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

case 2:

pi.get_loc(0)

Before:
memory usage: 1536
time taken: 5.13 µs ± 25 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After:
memory usage: 1792
time taken: 11.2 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

case 3:

pi.get_loc(pd.NaT)

Before:
memory usage: 3072
time taken: 10.1 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After:
memory usage: 1792
time taken: 11.7 µs ± 51.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

case 4:
Because "Timedelta" Object can't be converted to Period. So can't work on ".get_loc"
But if you mean "Timestamp" object,

t = pd.Timestamp(year=1970, month=1, day=1)
pi.get_loc(t)

Before:
memory usage: 3072
time taken: 13.3 µs ± 533 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After:
memory Usage: 1792
time taken: 16 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Well, time performance becomes bad before. But i think this much can be acceptable.
except Integer case, all memory usage lower before. And Integer or float type is not expected input from user. Becasue PeriodIndex.get_loc parameter type is expected as "Period", "String", "datetime" or datetimelike object from user. So This is not a problem.

@jbrockmendel
Copy link
Member

@proost thank you for doing these additional timings.

For the Timedelta case i actually did want to know how long it took to fail.

@proost
Copy link
Contributor Author

proost commented Dec 10, 2019

@jbrockmendel
Ok. I understand. So this is measurement of time.

td =  pd.Timedelta(days=1)
pi.get_loc(td)

Before: 13.5 ns ± 0.021 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
After: 15.7 ns ± 0.0219 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

@proost proost changed the title BUG: PeriodEngine doesn't work PERF: PeriodEngine doesn't work Dec 21, 2019
@jreback
Copy link
Contributor

jreback commented Dec 27, 2019

can you merge master and will look again

@proost proost force-pushed the periodengine-get_loc branch 2 times, most recently from ef552b6 to 8b3f484 Compare December 28, 2019 13:34
@proost proost changed the title PERF: PeriodEngine doesn't work PERF: PeriodEngine.get_loc accept narrow data type Dec 28, 2019
@proost proost force-pushed the periodengine-get_loc branch from 8b3f484 to 30bb3a7 Compare December 30, 2019 03:30
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you rebase

@@ -339,7 +339,7 @@ def test_memory_usage(self, indices):

# RangeIndex, IntervalIndex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update / change the comment to more reflect what is going on (i would remove the references to RI and II here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indices = PeriodIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06',
             '2000-01-07', '2000-01-10', '2000-01-11', '2000-01-12',
             '2000-01-13', '2000-01-14'],
            dtype='period[D]', freq='D')
result = indices.memory_usage()
indices.get_loc(indices[0])
result2 = indices.memory_usage()

"result" and "result2" same as 80

@proost proost force-pushed the periodengine-get_loc branch 2 times, most recently from ee0a53f to b10e5ce Compare January 7, 2020 12:54
@proost proost force-pushed the periodengine-get_loc branch 3 times, most recently from 2fc2f30 to 4b47e01 Compare January 15, 2020 16:17
@proost proost force-pushed the periodengine-get_loc branch from 397e0d7 to 7f49b53 Compare January 15, 2020 16:30
except ValueError:
# we cannot construct the Period
# as we have an invalid type
if is_list_like(key):
raise TypeError(f"'{key}' is an invalid key")
raise KeyError(key)

ordinal = key.ordinal if key is not NaT else key.value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this got affected by #31021. unless there a compelling reason to move this line up to L614, lets keep it here


except KeyError:
raise KeyError(key)
raise KeyError(key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going directly to _int64index does simplify the code here which is nice, but i imagine involves a perf penalty for by-far-the-most-common case with method=tolerance=None. Am I wrong about that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel
Yes, you are right. I pointed out the wrong key. Like #31021, cleaning up codes is best choice.
Is it okay if i close PR and issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@proost proost closed this Jan 16, 2020
@proost proost deleted the periodengine-get_loc branch January 17, 2020 00:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Period Period data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

REF/PERF: PeriodEngine covers narrow type
4 participants