DataFrame.from_csv loses precision #2697
Comments
Interesting. Fixing this wouldn't be totally trivial. Will have to have a look next month unless someone else gets to it.
Well, the problem here is that if you have an integer above 2^53 and some other number in the column causes the entire column to be interpreted as float64, you are going to lose precision. The cutoff for reliable representation of integers in double-precision floating point is 2^53. I'm not sure this can actually be fixed. Marking for "someday", or potentially close.
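To make the 2^53 cutoff concrete, here is a minimal sketch (pure Python; a Python float is the same IEEE double-precision format as numpy's float64, so nothing pandas-specific is involved):

exact = 2**53        # 9007199254740992, the last integer guaranteed to survive a float64 round trip
too_big = 2**53 + 1  # 9007199254740993, not representable as a double

print(int(float(exact)) == exact)      # True
print(int(float(too_big)) == too_big)  # False: the value silently rounds to a neighbouring integer

The same silent rounding is what happens when a CSV column containing such an integer gets coerced to float64.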
Testing these, it seems the first two problems have been resolved; see an example. The last one, as pointed out, may be impossible to solve given the homogeneity of columns and the float64/int64 max-value discrepancies.
Not quite. The reason for the loss of precision in the first case is the parse_dates=True argument:

>>> from pandas import read_csv, DataFrame
>>> from pandas.compat import StringIO
>>> data = 'a\n135217135789158401\n1352171357E+5'
>>> df = read_csv(StringIO(data), index_col=0, parse_dates=True)
>>> print(df.index[0] == 135217135789158401)
False
>>> print(df.index[1] == 1352171357E+5)
False

You can see here that specifying parse_dates=False gives the exact values back:

>>> df = DataFrame.from_csv(StringIO(data), parse_dates=False)
>>> print(df.index[0] == 135217135789158401)
True
>>> print(df.index[1] == 1352171357E+5)
True

As for the second example, I can't reproduce that anymore:

>>> data = 'a\n99999999999\n123456789012345\n1234E+0'
>>> df = DataFrame.from_csv(StringIO(data), parse_dates=False)
>>> print(df.index[0] == 99999999999)
True
>>> print(df.index[1] == 123456789012345)
True
>>> print(df.index[2] == 1234)
True

I think the "enhancement" here would be to just clarify the documentation.
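For anyone who needs the exact values in the meantime, one possible workaround (a sketch only, using read_csv's documented converters option; the column name and data below are made up for illustration) is to parse the affected column with decimal.Decimal so it never passes through float64:

from decimal import Decimal
from io import StringIO
import pandas as pd

data = 'a,b\n135217135789158401,1\n1352171357E+5,2'

# Apply Decimal to column 'a' instead of letting the parser fall back to float64.
df = pd.read_csv(StringIO(data), converters={'a': Decimal})

print(df['a'][0] == 135217135789158401)        # True: the 18-digit integer is kept exactly
print(df['a'][1] == Decimal('1352171357E+5'))  # True: the scientific-notation entry is kept exactly

The column comes back as object dtype holding Decimal values, so arithmetic is slower, but no precision is lost.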
I think the issue is this: parse_dates=True tries to convert to dates; if it fails, it might be doing something that changes the data that is then passed through (IOW, it has a side effect).
Right, it is doing some conversion as expected, but I think clearer documentation there might help, just to avoid this confusion again.
Maybe I am not being clear: it is doing a conversion, which fails (I think), and it propagates that result rather than using the original data (this is a guess, FYI). So if you can detect a failure, then I think it's OK, as long as it's unambiguous that it failed.
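One rough way to check for such a side effect (a sketch only; the exact dtypes printed depend on the pandas version in use) is to read the same data with and without date parsing and compare what the index became:

from io import StringIO
import pandas as pd

data = 'a\n135217135789158401\n1352171357E+5'

with_parse = pd.read_csv(StringIO(data), index_col=0, parse_dates=True)
without_parse = pd.read_csv(StringIO(data), index_col=0, parse_dates=False)

# If the two dtypes differ, the failed date conversion left its mark on the data.
print(with_parse.index.dtype)
print(without_parse.index.dtype)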
@jreback: I still don't understand what you said. All I'm saying is that we should just update the documentation and close this issue.
Closes pandas-dev gh-2697. (cherry picked from commit e23bd24)
I've found a problem when importing a csv file with numbers: the precision of all the numbers in a column can be lost.
If any of the entries in a column are in scientific format, all of the entries are converted to that representation and lose precision. This only happens if a number in the column is 12 digits or more (in either representation, so either 1E+11 or 123456789012). Much larger numbers in a column with no scientific-notation entries don't trigger the problem.
The mixed type is an error in my data, but I thought I'd report the problem in pandas in case it affects legitimate data.
Happens in both 0.10.0 and 0.10.1.dev-6e2b6ea on OSX with numpy 1.6.2.
csv file:
Example of large number - column A is affected, column C isn't.
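The original csv attachment is not reproduced above, so here is a guessed minimal version of it (column names and values are illustrative assumptions, not the reporter's data):

from io import StringIO
import pandas as pd

# Column A mixes a >12-digit integer with a scientific-notation entry;
# column C holds the same large integer but only plain-format numbers.
csv_text = (
    "A,B,C\n"
    "135217135789158401,1,135217135789158401\n"
    "1E+11,2,100000000000\n"
)

df = pd.read_csv(StringIO(csv_text))

# The '1E+11' entry forces column A to float64, so the large integer
# no longer compares equal to the value written in the file.
print(df["A"].dtype, df["A"][0] == 135217135789158401)

# Column C parses as int64 and keeps the exact value.
print(df["C"].dtype, df["C"][0] == 135217135789158401)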
This may be related to Issue #2069