Intermittent error fetching value from multi-indexed dataframe #39585
Comments
Yes, this was changed before 1.2 Beta is interpreted as column, which does not exist, hence the keyerror. |
Thanks @phofl. Unfortunately I'm still seeing the intermittent errors with this change. |
This is returned on master with the change |
I'm confused. When I run that code snippet locally, on Heroku, and in an online Python editor, I see the error occur in roughly 1 in 10 runs when each run is a new process. However, if I run the snippet in an infinite loop within one process, I do not see it error. |
I am not sure why this should ever fail when indexed with a tuple. I started it roughly 30 times and it did not fail. Could you provide the traceback of a failure? |
|
Not really sure how this happens, can you show breakdown before the failure? |
Yeh,
                                          b    c    d    e    f    g
date                             a
2020-06-03 15:59:59.999999+00:00 alpha  100  100    0  100    0    0
2020-06-03                       alpha  100  100    0  100    0    0
                                 beta   100  100    0  100    0    0
2020-06-04                       alpha  100  100    0  100    0    0
                                 beta   100  100    0  100    0    0
2020-06-05                       alpha  100  100    0  100    0    0
                                 beta   100  100    0  100    0    0 |
Thanks, could you do me one last favour and try |
|
This is weird, I am getting
again. You are on pandas 1.2.1, right? cc @jbrockmendel, any thoughts on how to handle this? |
|
@phofl i gotta run, but tentatively this looks weird:
|
Yep, can see the changing result now too. This is a groupby issue. I am getting
as breakdown in the slice(0,2) cases while
is returned in the other cases |
This is happening in pandas/pandas/_libs/hashtable_class_helper.pxi.in Line 1227 in ca4f204
The return value of Would need a bit of help to debug this further |
how do we get to that kh_get_pymap call? my intuition is that we shouldn't get there with slice objects |
This is happening in groupby and is not related to indexing. See my second-to-last comment. It produces different groups, which then causes the KeyError |
OK, but do you know what the call stack looks like that gets to this line? |
Yes, sorry, I should have thought about this earlier. I added raise ValueError right before kh_get_pymap is called to show the stack:
|
maybe use breakpoint() instead of ValueError to track down the call args? i still think it's really weird to get here with a slice |
Not sure if I understand you correctly, but we are getting there with
not with the indexing, so not related to the slice? |
I'm having trouble reproducing this on master, can you still get it? |
and in 1.1.4 I'm getting the failure in the line |
The error is raised there, yes, but the cause is that the groupby line returns
most of the time (14 out of 15 runs or so), but sometimes we get
which then raises a KeyError because the key is obviously not in there. The KeyError is just a consequential failure. I'll try to reproduce on master |
Of course 2 minutes later i get it on master. With
The So if I'm right so far (@phofl definitely worth double-checking) then we need to figure out why that get_loc is non-deterministic. |
GroupBy is non-deterministic, see my explanation above. This is because of the line I marked with the traceback. |
gotcha, im on the totally wrong track. thanks |
Wasn't explaining it very well either I think, sorry for this. I am not familiar enough with
to debug this on my own. The input values are the same but the output varies sometimes. |
@realead is probably the best person to ask about |
@phofl I think I can explain why We use the
That means in different runs of the interpreter, the resulting hashes can be different. Different hash values lead to different places in the hash map and thus different returned indexes. You can verify this theory by setting the environment variable Another observation, which supports my theory: the different behaviors were never observed in the same run of the interpreter, always in different runs. |
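A minimal sketch of how this theory could be checked from outside pandas. It assumes the variable meant here is CPython's `PYTHONHASHSEED`, the standard switch for hash randomization (the name is my assumption, since the comment above does not spell it out). The helper `hash_in_fresh_interpreter` is hypothetical, and it hashes a plain `str` rather than the actual index values from this issue:

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(value, seed=None):
    """Return hash(value) as computed by a brand-new Python process.

    seed=None leaves hash randomization on (PYTHONHASHSEED unset);
    an integer seed pins it, so every process hashes identically.
    """
    env = dict(os.environ)
    if seed is None:
        env.pop("PYTHONHASHSEED", None)
    else:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({value!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # Pinned seed: two separate processes agree on the hash.
    print(hash_in_fresh_interpreter("alpha", seed=0),
          hash_in_fresh_interpreter("alpha", seed=0))
    # Randomized (default): two separate processes normally disagree.
    print(hash_in_fresh_interpreter("alpha"),
          hash_in_fresh_interpreter("alpha"))
```

With a pinned seed the two values match on every run. Without one they normally differ between processes, which fits the observation that the failure only shows up across separate interpreter runs and never inside a single long-running loop.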
@realead thx for the explanation, not sure if I get everything, will have to look this up. One thing I would like to add: We are using
and if
then k has not been set. The first value we pass in is |
I have added
directly after the call of For the original example, I see the following trace in case of a successful run:
The behavior is as expected: the first call with a value always returns 16, meaning the key isn't in the map (after this the key gets added to the map), so a second call with the same value returns not 16 but the index where the value was put. Here are traces for a failed run:
First thing: the hash values (and thus indexes k) are different (reason is I'm not sure |
This is what I am getting in case of a failed run:
The 5 in the second row is the issue I think? Edit: I have added your print line also directly below the pymap call |
Ok, the first element (2020-06-03 15:59:59.999999+00:00) is of type The issue is, that while they have different hash-values, they are equal:
This is an issue because a hash map requires the following to hold: "a equals b => hash(a)==hash(b)", which is not the case here. Because the hash values differ, there is sometimes a collision (k=5) and sometimes no collision (k=16). However, I can see the error even if this collision doesn't happen, so it could be a red herring. |
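To make the broken invariant concrete, here is a toy sketch. It is not pandas' khash code. It assumes an open-addressing table that only compares keys with `==` when they land in the same bucket, and every name in it (`AlwaysEqualKey`, `ToyMap`, `count_groups`) is hypothetical:

```python
class AlwaysEqualKey:
    """Hypothetical key: equal to every other AlwaysEqualKey, with a
    caller-chosen hash, so "a == b" does NOT imply "hash(a) == hash(b)"."""

    def __init__(self, label, forced_hash):
        self.label = label
        self.forced_hash = forced_hash

    def __eq__(self, other):
        return isinstance(other, AlwaysEqualKey)

    def __hash__(self):
        return self.forced_hash


class ToyMap:
    """Toy open-addressing table with linear probing (illustration only)."""

    def __init__(self, n_buckets=16):
        self.n_buckets = n_buckets
        self.slots = [None] * n_buckets  # None marks an empty bucket

    def get_or_insert(self, key):
        """Return (bucket, inserted). Keys are compared with == only
        when the probe lands on an occupied bucket."""
        i = hash(key) % self.n_buckets
        while self.slots[i] is not None:
            if self.slots[i] == key:
                return i, False  # treated as already present
            i = (i + 1) % self.n_buckets
        self.slots[i] = key
        return i, True  # treated as a new key


def count_groups(keys):
    """Count how many distinct keys the toy table believes it saw."""
    table = ToyMap()
    return sum(inserted for _, inserted in map(table.get_or_insert, keys))


first = AlwaysEqualKey("timestamp", 5)
print(count_groups([first, AlwaysEqualKey("date", 7)]))   # different buckets
print(count_groups([first, AlwaysEqualKey("date", 21)]))  # 21 % 16 == 5, collision
```

The same two "equal" keys are counted as two groups when their hashes put them in different buckets and as one group when they collide. If the hash of one key changes between interpreter runs, the number of groups can change with it.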
Thx for the explanation. makes sense.
? I could not produce an error in this case |
Yes, I also see sometimes
without an error. |
luckily this behavior in |
Is the current state-of-play waiting for the offending behaviour of Is there a temporary workaround that could be used to avoid this error in the meantime? |
only thing that comes to mind is to not use |
@jbrockmendel @phofl - with the deprecation now enforced, are we good to close? I currently get |
Non-deterministic behaviour can also be seen in #57922, but with |
I have searched the [pandas] tag on StackOverflow for similar questions.
I have asked my usage related question on StackOverflow.
See this question and this question.
Question about pandas
I have a dataframe with a multiindex from which I am attempting to access a row. However, the lookup seemingly fails at random on around 1 in 10 runs. I see this behaviour both locally and on prod. The dataframe can be recreated with the following:
I do not know if it is my incorrect usage that is causing this behaviour or a bug.
I originally encountered the issue on pandas 1.1.4 (*) with the error:
and on 1.2.1 (**) I see the same intermittent errors but with the error message:
Version info
(*)
(**)