-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: PeriodEngine.get_loc accept narrow data type #29234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f1bdfae
to
04ab615
Compare
@jreback |
Is there any simplficiation of PeriodIndex.get_loc that this makes possible? |
Yes. I'm working on it. |
1294121
to
b58e022
Compare
@jbrockmendel |
9ec1e25
to
55b8abb
Compare
@@ -332,7 +332,7 @@ def test_memory_usage(self, indices): | |||
# RangeIndex, IntervalIndex | |||
# don't have engines | |||
if not isinstance(indices, (RangeIndex, IntervalIndex)): | |||
assert result2 > result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it just PeriodIndex for which this inequality is no longer strict?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel
Yes. for PeriodIndex
It looks like nothing in _libs.index is changed now. I thought the idea was that something was wrong there that would be fixed, and that would allow simplification in PeriodIndex.get_loc. Am I confused with something else? |
In terms of simplication, Yes. I was wrong.
After
I don't intent to it but i found it running test case. After |
9d11e2a
to
8cdf285
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are changing the semantics of what is classified as a TypeError here. An invalid key should be a TypeError and not a KeyError.
pandas/core/indexes/period.py
Outdated
@@ -704,36 +704,36 @@ def get_loc(self, key, method=None, tolerance=None): | |||
------- | |||
loc : int | |||
""" | |||
if is_integer(key): | |||
ordinal = key | |||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't you just move what you have in _cast_to_period_object here. This is confusing as you are doing the same type of operation in multiple functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yah, i think we should require that the argument to get_loc be either a Period or a string/datetime that can be cast to Period, not allow ordinal at this point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel
Yes, I really want to. But some code use ".get_loc" with numeric type and if exception should raise, raise KeyError not TypeError.
@jbrockmendel What we can get from this change: Of course, This PR off track from original issue. In #29038, I narrowed the scope. It was my fault. Sorry not to upload adequate issue. |
ccad304
to
55e55d5
Compare
@jbrockmendel |
Small improvements are encouraged! We just need to communicate clearly what the relevant improvement is. how much less memory does this use? You're no longer calling the _engine method at the beginning, which makes me concerned that performance will take a hit. |
@jbrockmendel first case:
Before change, period index memory usage: 1600 second case:
Before change, period index memory usage: 24144 third case:
Before change, period index memory usage: 380288 fourth case:
Before change, period index memory usage: 6505104 fifth case:
Before change, period index memory usage: 4543948816 if you worry about time performance from this change, no need to worry. first case:
second case:
third case:
fourth case:
fifth case:
Almost same or slightly better before. So this change of code doesn't exacerbate on time performance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls rebase on master as well
so i wonder after this if we can blow PeriodEngine away entirely. cc @jbrockmendel @proost the memory savings are because we no longer need to instantiate the engine. |
55e55d5
to
1db092b
Compare
@proost if i understand correclty, all the timings you posted were with string inputs. Can you do the same comparisons for 0, pd.Period("2000"), pd.NaT, pd.Timedelta(days=1) |
It looks like there is some non-trivial stuff in there for |
case 1:
Before: case 2:
Before: case 3:
Before: case 4:
Before: Well, time performance becomes bad before. But i think this much can be acceptable. |
@proost thank you for doing these additional timings. For the Timedelta case i actually did want to know how long it took to fail. |
@jbrockmendel
Before: 13.5 ns ± 0.021 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each) |
can you merge master and will look again |
ef552b6
to
8b3f484
Compare
8b3f484
to
30bb3a7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you rebase
@@ -339,7 +339,7 @@ def test_memory_usage(self, indices): | |||
|
|||
# RangeIndex, IntervalIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you update / change the comment to more reflect what is going on (i would remove the references to RI and II here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indices = PeriodIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06',
'2000-01-07', '2000-01-10', '2000-01-11', '2000-01-12',
'2000-01-13', '2000-01-14'],
dtype='period[D]', freq='D')
result = indices.memory_usage()
indices.get_loc(indices[0])
result2 = indices.memory_usage()
"result" and "result2" same as 80
ee0a53f
to
b10e5ce
Compare
2fc2f30
to
4b47e01
Compare
397e0d7
to
7f49b53
Compare
except ValueError: | ||
# we cannot construct the Period | ||
# as we have an invalid type | ||
if is_list_like(key): | ||
raise TypeError(f"'{key}' is an invalid key") | ||
raise KeyError(key) | ||
|
||
ordinal = key.ordinal if key is not NaT else key.value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this got affected by #31021. unless there a compelling reason to move this line up to L614, lets keep it here
|
||
except KeyError: | ||
raise KeyError(key) | ||
raise KeyError(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going directly to _int64index does simplify the code here which is nice, but i imagine involves a perf penalty for by-far-the-most-common case with method=tolerance=None. Am I wrong about that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel
Yes, you are right. I pointed out the wrong key. Like #31021, cleaning up codes is best choice.
Is it okay if i close PR and issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
xref #28628.
I want to refactor
PeriodIndex.get_loc
. BecausePeriodIndex.get_loc
compute twiceInt64index.get_loc
andPeriodEngine.get_loc
. Problem is onlyPeriodEngine.get_loc
acceptsPeriod.ordinal
.I try to change
PeriodIndex.get_loc
in two ways.First one, I try to remove
PeriodEngine.get_loc
inPeriodIndex.get_loc
. so just compute.get_loc
once. I change several ways But always stuck in test cases. So I give it up this way.Second one, I try to change
PeriodEngine.get_loc
to acceptPeriod
not onlyPeriod.ordinal
. This way is success in that work successfully and pass all test cases. However Problem is most cases users don't usePeriod
usingPeriodIndex.get_loc
andPeriodEngine.get_loc
can't accept "intolerance" argument. Not usingInt64index.get_loc
can't be possible.But I send PR. Little improvement is also improvement.
improvement : memory usage with Period.get_loc