DataFrame.from_csv loses precision #2697

brendam · 2013-01-15T02:34:18Z

I've found a problem in importing a csv file with numbers that loses the precision of all the numbers in a column.

If any of the entries in a column are in scientific format, all of the entries are converted to that format and lose precision. It only happens if a number in the column is 12 digits or more (in either representation, so either 1E+11 or 123456789012). Much larger numbers in a column with no scientific-notation entries don't trigger the problem.

The mixed type is an error in my data, but I thought I'd report the problem in pandas in case it affects legitimate data.
Happens in both 10.0 and 0.10.1.dev-6e2b6ea on OSX with numpy 1.6.2.

csv file:

id, text
135217135789158401, 'testing lost precision from csv'
1352171357E+5, 'any item scientific format loses the precision on all other entries'

test = pandas.DataFrame.from_csv('test.csv')
print test.index[0] == 135217135789158401
print test.index[1] == 1352171357E+5

Example of large number - column A is affected, column C isn't.

id, A, B, C
1, 99999999999, 'a', 99999999999
2, 123456789012345, 'b', 123456789012345
3, 1234E+0, 'c', 1234

This may be related to Issue #2069


wesm · 2013-01-19T17:55:02Z

Interesting. Fixing this wouldn't be totally trivial. Will have to have a look next month unless someone else gets to it

wesm · 2013-04-07T02:45:35Z

Well the problem here is that if you have an integer above 2^53 and some other number in the column causes the entire column to be interpreted as float64, say, you are going to lose precision. The cutoff for reliable representation of integers in double-precision numbers is approx 2^53. I'm not sure if this can be fixed actually. Marking for "someday" or potentially close
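To make the 2^53 cutoff concrete, here is a minimal sketch in plain Python (no pandas involved). It only assumes that Python's built-in float is an IEEE 754 double, which is the same representation as float64; the variable names are mine, not anything from pandas.

```python
# Integers above 2**53 are not all exactly representable as float64.
LIMIT = 2 ** 53  # 9007199254740992

big_id = 135217135789158401  # the id from the report, well above 2**53

# Round-trip the id through a double.
round_tripped = int(float(big_id))

print(big_id > LIMIT)           # True
print(round_tripped == big_id)  # False: the trailing ...401 is lost
print(round_tripped)            # 135217135789158400

# The first integer that fails the round trip is 2**53 + 1.
print(int(float(LIMIT)) == LIMIT)          # True
print(int(float(LIMIT + 1)) == LIMIT + 1)  # False
```

So once anything forces the column to float64, an id like the one in the report cannot come back intact, regardless of how the parser is written.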

kokes · 2016-03-09T00:43:36Z

Testing these it seems the first two problems have been resolved, see an example.

The last one, as pointed out, may be impossible to solve given homogeneity of columns and float64/int64 max value discrepancies.
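As a rough illustration of the homogeneity point, here is a toy model of per-column type inference in plain Python. This is only a sketch under my own assumptions: `infer_column` is a hypothetical helper, not pandas' actual parser, and it simply picks one type for the whole column (int if every entry parses as an int, otherwise float).

```python
def infer_column(raw_values):
    """Toy inference: int if every entry parses as int, else float for all."""
    try:
        return [int(v) for v in raw_values]
    except ValueError:
        # One non-integer entry pushes the whole column to float.
        return [float(v) for v in raw_values]


# A single scientific-notation entry makes the whole column float,
# and the large id no longer compares equal to its original value.
mixed = infer_column(["135217135789158401", "1352171357E+5"])
print(mixed[0] == 135217135789158401)  # False
print(mixed[1] == 1352171357E+5)       # True

# Values below 2**53 survive the same conversion unchanged.
small = infer_column(["99999999999", "123456789012345", "1234E+0"])
print(small[1] == 123456789012345)     # True

# With no scientific-notation entry, the column stays integer.
ints_only = infer_column(["99999999999", "123456789012345", "1234"])
print(type(ints_only[0]).__name__)     # int
```

In this model the loss only shows up when a value above 2^53 shares a column with something that forces float, which is the case described as possibly impossible to solve.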

gfyoung · 2016-08-22T05:57:04Z

Not quite. The loss of precision in the first case happens because from_csv defaults to parse_dates=True, which causes those floats to be butchered by attempts to convert the values to dates during index creation. You can see for yourself by passing parse_dates=True to read_csv.

>>> from pandas import read_csv, DataFrame
>>> from pandas.compat import StringIO
>>> data = 'a\n135217135789158401\n1352171357E+5'
>>> df = read_csv(StringIO(data), index_col=0, parse_dates=True)
>>> print(df.index[0] == 135217135789158401)
False
>>> print(df.index[1] == 1352171357E+5)
False

You can see here that specifying parse_dates=False does the trick:

>>> df = DataFrame.from_csv(StringIO(data), parse_dates=False)
>>> print(df.index[0] == 135217135789158401)
True
>>> print(df.index[1] == 1352171357E+5)
True

As for the second example, I can't reproduce that anymore:

>>> data = 'a\n99999999999\n123456789012345\n1234E+0'
>>> df = DataFrame.from_csv(StringIO(data), parse_dates=False)
>>> print(df.index[0] == 99999999999)
True
>>> print(df.index[1] == 123456789012345)
True
>>> print(df.index[2] == 1234)
True

The parse_dates behaviour is not a bug but intended behaviour, as stated clearly in the documentation. While I do find it unusual that parse_dates=True by default in from_csv, that too is laid out clearly in the documentation.

I think the "enhancement" here would be to just clarify documentation in from_csv to match that of read_csv so that no confusion is to be had. @jreback what do you think?

jreback · 2016-08-22T13:22:57Z

I think the issue is this:

parse_dates=True tries to convert to dates - if it fails it might be doing something that changes the data that is then passed thru (IOW it has a side effect)

gfyoung · 2016-08-22T14:55:51Z

Right, it is doing some conversion as expected, but I think clearer documentation there might help just to avoid this confusion again.

jreback · 2016-08-22T14:57:25Z

maybe I am not clear

it is doing a conversion, which fails (I think) and propagates it rather than using the original data (this is a guess FYI)

so if u can detect a failure then I think ok if it's unambiguous (that it failed)
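One way to picture the "detect the failure and keep the original data" idea is the sketch below. It is plain Python with a hypothetical helper (`parse_or_keep`) and an assumed date format; it is not how pandas implements parse_dates, just an illustration of a conversion that has no side effect when it fails.

```python
from datetime import datetime


def parse_or_keep(values, fmt="%Y-%m-%d"):
    """Return parsed datetimes only if every value parses; otherwise
    return the input values untouched, so a failed attempt changes nothing."""
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, fmt))
        except ValueError:
            # Unambiguous failure: give back the original data.
            return list(values)
    return parsed


dates = parse_or_keep(["2013-01-15", "2016-08-22"])
ids = parse_or_keep(["135217135789158401", "1352171357E+5"])

print(dates[0].year)  # 2013
print(ids)            # ['135217135789158401', '1352171357E+5']
```

The point is only the all-or-nothing shape: the caller gets either a fully converted column or exactly what it passed in.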

gfyoung · 2016-08-23T01:10:11Z

@jreback : I still don't understand what you said. All I'm saying is that we should just update documentation and close this issue.


nehalecky mentioned this issue Jan 27, 2013

Series near-zero subtraction loss of precision #2760

Closed

gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 6, 2017

TST: Check lossiness of floats with parse_dates

ff98f38

Closes pandas-devgh-2697.

gfyoung modified the milestones: Someday, Next Major Release Nov 6, 2017

gfyoung added the Testing label and removed the Enhancement and IO Data labels Nov 6, 2017

gfyoung mentioned this issue Nov 6, 2017

TST: Check lossiness of floats with parse_dates #18136

Merged

jreback modified the milestones: Next Major Release, 0.21.1 Nov 6, 2017

jreback closed this as completed in #18136 Nov 6, 2017

jreback pushed a commit that referenced this issue Nov 6, 2017

TST: Check lossiness of floats with parse_dates (#18136)

e23bd24

Closes gh-2697.


TomAugspurger pushed a commit that referenced this issue Dec 11, 2017

TST: Check lossiness of floats with parse_dates (#18136)

c34d2b1

Closes gh-2697. (cherry picked from commit e23bd24)
