DataFrame.from_csv loses precision #2697

Closed
brendam opened this issue Jan 15, 2013 · 8 comments

Labels
IO CSV (read_csv, to_csv), Testing (pandas testing functions or related to the test suite)

Comments

@brendam
Contributor

brendam commented Jan 15, 2013

I've found a problem in importing a csv file with numbers that loses the precision of all the numbers in a column.

If any of the entries in a column are in scientific format, all of the entries are converted to that format and lose precision. This only happens if a number in the column has 12 digits or more (in either representation, so either 1E+11 or 123456789012). Much larger numbers in a column with no scientific-format entries don't trigger the problem.

The mixed type is an error in my data, but I thought I'd report the problem in pandas in case it affects legitimate data.
Happens in both 0.10.0 and 0.10.1.dev-6e2b6ea on OS X with numpy 1.6.2.

csv file:

id, text
135217135789158401, 'testing lost precision from csv'
1352171357E+5, 'any item scientific format loses the precision on all other entries'

code:

import pandas
test = pandas.DataFrame.from_csv('test.csv')
print test.index[0] == 135217135789158401
print test.index[1] == 1352171357E+5

Example with large numbers: column A is affected, column C isn't.

id, A, B, C
1, 99999999999, 'a', 99999999999
2, 123456789012345, 'b', 123456789012345
3, 1234E+0, 'c', 1234

This may be related to Issue #2069

@wesm
Member

wesm commented Jan 19, 2013

Interesting. Fixing this wouldn't be totally trivial. Will have to have a look next month unless someone else gets to it

@wesm
Member

wesm commented Apr 7, 2013

Well the problem here is that if you have an integer above 2^53 and some other number in the column causes the entire column to be interpreted as float64, say, you are going to lose precision. The cutoff for reliable representation of integers in double-precision numbers is approx 2^53. I'm not sure if this can be fixed actually. Marking for "someday" or potentially close
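To make the 2^53 cutoff concrete, here is a minimal sketch in plain Python (no pandas involved). It only demonstrates how a double-precision float rounds large integers, using the id from the original report; it says nothing about how the pandas parser reaches that conversion.

```python
# float64 has a 53-bit significand, so not every integer above 2**53 is representable.
LIMIT = 2 ** 53

# At the limit, adding 1 no longer changes the float value.
assert float(LIMIT) == float(LIMIT + 1)

# The id from the report lies between 2**56 and 2**57,
# where adjacent float64 values are 16 apart.
original = 135217135789158401
round_tripped = int(float(original))

print(original)
print(round_tripped)
print(original - round_tripped)
```

Once a column holding such an id is stored as float64, the trailing digit is gone and cannot be recovered afterwards.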

@kokes
Contributor

kokes commented Mar 9, 2016

Testing these it seems the first two problems have been resolved, see an example.

The last one, as pointed out, may be impossible to solve given homogeneity of columns and float64/int64 max value discrepancies.
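As a rough illustration of why column homogeneity matters, the sketch below uses a hypothetical helper, `infer_column`, that parses a whole column as int when every token allows it and falls back to float otherwise. This is an assumption made for illustration only, not the pandas parser's actual inference logic.

```python
def infer_column(tokens):
    """Hypothetical helper: int if every token parses as int, else float for all."""
    try:
        return [int(t) for t in tokens]
    except ValueError:
        # One scientific-notation token forces the whole column to float.
        return [float(t) for t in tokens]


# Column C from the report: all plain integers, so the column stays int.
col_c = infer_column(["99999999999", "123456789012345", "1234"])

# Column A: "1234E+0" forces float, but every value is below 2**53,
# so the floats still equal the original integers.
col_a = infer_column(["99999999999", "123456789012345", "1234E+0"])

# The id column: the first value is above 2**53, so it is rounded.
ids = infer_column(["135217135789158401", "1352171357E+5"])

print(col_c)
print(col_a)
print(int(ids[0]) == 135217135789158401)
```

Under this simplified model, a mixed column only loses information when one of its integers exceeds 2^53, which is the case for the id column but not for column A.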

@gfyoung
Member

gfyoung commented Aug 22, 2016

Not quite. The loss of precision in the first case occurs because from_csv defaults to parse_dates=True, which causes those floats to be butchered by attempts to convert the values to dates during index creation. You can see for yourself by passing parse_dates=True to read_csv.

>>> from pandas import read_csv, DataFrame
>>> from pandas.compat import StringIO
>>> data = 'a\n135217135789158401\n1352171357E+5'
>>> df = read_csv(StringIO(data), index_col=0, parse_dates=True)
>>> print(df.index[0] == 135217135789158401)
False
>>> print(df.index[1] == 1352171357E+5)
False

You can see here that specifying parse_dates=False does the trick:

>>> df = DataFrame.from_csv(StringIO(data), parse_dates=False)
>>> print(df.index[0] == 135217135789158401)
True
>>> print(df.index[1] == 1352171357E+5)
True

As for the second example, I can't reproduce that anymore:

>>> data = 'a\n99999999999\n123456789012345\n1234E+0'
>>> df = DataFrame.from_csv(StringIO(data), parse_dates=False)
>>> print(df.index[0] == 99999999999)
True
>>> print(df.index[1] == 123456789012345)
True
>>> print(df.index[2] == 1234)
True

The parse_dates behaviour is not a bug but intended behaviour, as stated clearly in the documentation here. While I do find it unusual that from_csv has parse_dates=True by default, that too is laid out clearly in the documentation here.

I think the "enhancement" here would be to just clarify the documentation of from_csv to match that of read_csv so that there is no confusion. @jreback what do you think?

@jreback
Contributor

jreback commented Aug 22, 2016

I think the issue is this:

parse_dates=True tries to convert to dates - if it fails it might be doing something that changes the data that is then passed thru (IOW it has a side effect)

@gfyoung
Member

gfyoung commented Aug 22, 2016

Right, it is doing some conversion as expected, but I think clearer documentation there might help just to avoid this confusion again.

@jreback
Contributor

jreback commented Aug 22, 2016

maybe I am not clear

it is doing a conversion, which fails (I think) and propagates it rather than using the original data (this is a guess FYI)

so if u can detect a failure then I think ok if it's unambiguous (that it failed)
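The idea of detecting an unambiguous failure and passing the original data through could be sketched roughly as follows. `parse_dates_or_keep` and its `fmt` parameter are hypothetical names invented for this illustration; this is not how pandas implements parse_dates.

```python
from datetime import datetime


def parse_dates_or_keep(values, fmt="%Y-%m-%d"):
    """Hypothetical helper: return parsed dates, or the untouched input on failure."""
    try:
        return [datetime.strptime(str(v), fmt) for v in values]
    except ValueError:
        # Unambiguous failure: hand back the original object,
        # never a partially converted copy.
        return values


ids = [135217135789158401, 135217135700000]
kept = parse_dates_or_keep(ids)
dates = parse_dates_or_keep(["2013-01-15", "2016-08-22"])

print(kept is ids)
print(dates[0])
```

The point of the design is that a failed conversion has no side effect: the caller gets back exactly the data it passed in.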

@gfyoung
Member

gfyoung commented Aug 23, 2016

@jreback : I still don't understand what you said. All I'm saying is that we should just update documentation and close this issue.

gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 6, 2017
gfyoung modified the milestones: Someday, Next Major Release Nov 6, 2017
gfyoung added the Testing label and removed the Enhancement and IO Data labels Nov 6, 2017
jreback modified the milestones: Next Major Release, 0.21.1 Nov 6, 2017
watercrossing pushed a commit to watercrossing/pandas that referenced this issue Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue Dec 8, 2017
TomAugspurger pushed a commit that referenced this issue Dec 11, 2017
5 participants