-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Enable read_csv interpret 'Infinity' as floating point value #10065 #28181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
We should be careful about changing these. The last time we changed things we got a decent about of pushback about unintended behavior changes. |
Yea I'd also be hesitant to do something like this. Code looks nice but given this isn't really standardized I would hate to start interpreting things that would really just be the string "Infinity" as a float value |
@WillAyd, when you say
Do you mean there is no specification (e.g. IEEE754) for this? De-facto, it is quite common to treat "Infinity" as infinite value
assert float("Infinity") == float("inf") The same thing why this pull-request #13274 was accepted.
In [1]: import numpy as np
In [2]: from io import StringIO
In [3]: data ='''
...: Infinity
...: +Infinity
...: -Infinity'''
In [4]: np.genfromtxt(StringIO(data))
Out[4]: array([ inf, inf, -inf])
public class Main{
public static void main(String []args){
System.out.println(Double.POSITIVE_INFINITY);
System.out.println(Double.NEGATIVE_INFINITY);
}
} https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Double.html#toString(double)
Ok, let it be string by default, as it is now. But what if I expect floating point values? pd.read_csv(fname, dtype=float) This throws a TypeError and ValueError. Could we make it work in this case? I realize, my fix might affect performance. Maybe, there is a better solution. In #10065 (comment) @jreback suggested
Should hash table be a local variable in _try_double function? Is it more expensive to allocate/deallocate it for each function call ? It would be nice, if someone could explain this solution in more detail. Finally, if pandas team decides to mark this issue |
Hmm OK - thanks for the clarifications. Can you run the benchmarks in asv_bench/benchmarks/io/csv.py and see if this has an impact? |
Sorry, I missed one file. Now it passes tests. I ran performance testing. (Before fixing formatting) asv continuous -E virtualenv --bench ^io master fix-#10065 In red color there is only this
Is this caused by my fix or some other commit? So, basically, my fix is the same as what @gfyoung proposed in #13274 (closed in commit da5fc17) only for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes look good to me
So 15% increase on that one test - do we think that manifests itself typically from an end user perspective in any way? If not lgtm as well, just want to be careful of that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. @WillAyd @TomAugspurger
Thanks @githeap - great change |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff