Suggestion: changing default float_format in DataFrame.to_csv() #16452


Open
Peque opened this issue May 23, 2017 · 24 comments
Labels
Enhancement IO CSV read_csv, to_csv

Comments

@Peque
Contributor

Peque commented May 23, 2017

Current behavior

If I read a CSV file, do nothing with it, and save it again, I would expect Pandas to keep the format the CSV had before. But that is not the case.

from io import StringIO
from pathlib import Path
from tempfile import NamedTemporaryFile

import pandas

input_csv = StringIO('''
01/01/17 23:00,1.05148,1.05153,1.05148,1.05153,4
01/01/17 23:01,1.05153,1.05153,1.05153,1.05153,4
01/01/17 23:02,1.05170,1.05175,1.05170,1.05175,4
01/01/17 23:03,1.05174,1.05175,1.05174,1.05175,4
01/01/17 23:08,1.05170,1.05170,1.05170,1.05170,4
01/01/17 23:11,1.05173,1.05174,1.05173,1.05174,4
01/01/17 23:13,1.05173,1.05173,1.05173,1.05173,4
01/01/17 23:14,1.05174,1.05174,1.05174,1.05174,4
01/01/17 23:16,1.05204,1.05238,1.05204,1.05238,4
''')

df = pandas.read_csv(input_csv, header=None)

with NamedTemporaryFile() as tmpfile:
    df.to_csv(tmpfile.name, index=False, header=None)
    print(Path(tmpfile.name).read_text())

Would give us this output:

01/01/17 23:00,1.05148,1.0515299999999999,1.05148,1.0515299999999999,4
01/01/17 23:01,1.0515299999999999,1.0515299999999999,1.0515299999999999,1.0515299999999999,4
01/01/17 23:02,1.0517,1.05175,1.0517,1.05175,4
01/01/17 23:03,1.05174,1.05175,1.05174,1.05175,4
01/01/17 23:08,1.0517,1.0517,1.0517,1.0517,4
01/01/17 23:11,1.0517299999999998,1.05174,1.0517299999999998,1.05174,4
01/01/17 23:13,1.0517299999999998,1.0517299999999998,1.0517299999999998,1.0517299999999998,4
01/01/17 23:14,1.05174,1.05174,1.05174,1.05174,4
01/01/17 23:16,1.0520399999999999,1.0523799999999999,1.0520399999999999,1.0523799999999999,4

What is wrong with that?

I would consider this to be unintuitive/undesirable behavior.

The written numbers have that representation because the original number cannot be represented precisely as a float. That is expected when working with floats. However, that means we are writing the last digit, which we know is not exact due to float-precision limitations anyway, to the CSV.

Proposed change

I think that last digit, knowing it is not precise anyway, should be rounded when writing to a CSV file.

Maybe by changing the default DataFrame.to_csv()'s float_format parameter from None to '%.16g'? (Or at least make .to_csv() use '%.16g' when no float_format is specified.)

Note that I propose rounding to the float's precision, which for a 64-bit float would mean that 1.0515299999999999 could be rounded to 1.05153, but 1.0515299999999992 could be rounded to 1.051529999999999, and 1.051529999999981 would not be rounded at all.

Example:

from io import StringIO
from pathlib import Path
from tempfile import NamedTemporaryFile

import pandas

input_csv = StringIO('''
1.05153,1.05175
1.0515299999999991,1.051529999999981
''')

df = pandas.read_csv(input_csv, header=None)

with NamedTemporaryFile() as tmpfile:
    df.to_csv(tmpfile.name, index=False, header=None, float_format='%.16g')
    print(Path(tmpfile.name).read_text())

Which gives:

1.05153,1.05175
1.051529999999999,1.051529999999981

Which also adds some errors, but keeps a cleaner output:

State   Loaded              Written             Error
Before  1.05153             1.0515299999999999  0.0000000000000001
After   1.05153             1.05153             0.0
Before  1.0515299999999991  1.0515299999999992  0.0000000000000001
After   1.0515299999999991  1.051529999999999   0.0000000000000001

Note that errors are similar, but the output "After" seems to be more consistent with the input (for all the cases where the float is not represented to the last imprecise digit).

Also, I think in most cases a CSV does not have floats represented to the last (imprecise) digit. See the precedents just below (other software outputting CSVs that would not use that last imprecise digit).

Precedents

Neither MATLAB nor R uses that last imprecise digit when converting to CSV (they round it).

@TomAugspurger
Contributor

Hmm I don't think we should change the default. If we just used %g we'd be potentially silently truncating the data.
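For illustration, a minimal sketch of the silent-truncation concern, using a value from the output above:

x = 1.0515299999999999          # value from the example output above
s = "%g" % x                    # '%g' keeps only 6 significant digits -> '1.05153'
print(s)
print(float(s) == x)            # False: reading the truncated text back gives a different float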

@TomAugspurger TomAugspurger added IO CSV read_csv, to_csv Numeric Operations Arithmetic, Comparison, and Logical operations labels May 23, 2017
@Peque
Contributor Author

Peque commented May 23, 2017

Agreed. Maybe using '%g' but automatically adjusting to the float precision as well? (depending on the float type)
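A rough sketch of what "adjusting to the float precision" could look like, using numpy's finfo to pick the number of significant digits (the helper name below is just for illustration):

import numpy as np

def auto_float_format(dtype):
    # np.finfo reports the decimal precision of the float type:
    # 15 digits for float64, 6 for float32.
    return "%%.%dg" % np.finfo(dtype).precision

print(auto_float_format(np.float64))   # '%.15g'
print(auto_float_format(np.float32))   # '%.6g'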

@TomAugspurger
Contributor

Just to make sure I fully understand, can you provide an example? Or let me know if this is what you were worried about. Given a file foo.csv.

A
0.1
0.2
0.3

then

In [1]: import pandas as pd
In [2]: pd.read_csv("foo.csv").to_csv("foo2.csv", index=False)

In [3]: !cat foo2.csv
A
0.1
0.2
0.3

That one doesn't have any rounding issues (but maybe with different numbers it would?)

@Peque
Contributor Author

Peque commented May 23, 2017

You are right, sorry. The DataFrame I had was actually being modified. 😓

@Peque Peque closed this as completed May 23, 2017
@Peque
Contributor Author

Peque commented Jun 6, 2017

@TomAugspurger Let me reopen this issue. I have now found an example that reproduces this without modifying the contents of the original DataFrame:

from io import StringIO
from pathlib import Path
from tempfile import NamedTemporaryFile

import pandas


input_csv = StringIO('''
01/01/17 23:00,1.05148,1.05153,1.05148,1.05153,4
01/01/17 23:01,1.05153,1.05153,1.05153,1.05153,4
01/01/17 23:02,1.05170,1.05175,1.05170,1.05175,4
01/01/17 23:03,1.05174,1.05175,1.05174,1.05175,4
01/01/17 23:08,1.05170,1.05170,1.05170,1.05170,4
01/01/17 23:11,1.05173,1.05174,1.05173,1.05174,4
01/01/17 23:13,1.05173,1.05173,1.05173,1.05173,4
01/01/17 23:14,1.05174,1.05174,1.05174,1.05174,4
01/01/17 23:16,1.05204,1.05238,1.05204,1.05238,4
''')

df = pandas.read_csv(input_csv, header=None)

with NamedTemporaryFile() as tmpfile:
    df.to_csv(tmpfile.name, index=False, header=None)
    print(Path(tmpfile.name).read_text())

That would give you this output:

01/01/17 23:00,1.05148,1.0515299999999999,1.05148,1.0515299999999999,4
01/01/17 23:01,1.0515299999999999,1.0515299999999999,1.0515299999999999,1.0515299999999999,4
01/01/17 23:02,1.0517,1.05175,1.0517,1.05175,4
01/01/17 23:03,1.05174,1.05175,1.05174,1.05175,4
01/01/17 23:08,1.0517,1.0517,1.0517,1.0517,4
01/01/17 23:11,1.0517299999999998,1.05174,1.0517299999999998,1.05174,4
01/01/17 23:13,1.0517299999999998,1.0517299999999998,1.0517299999999998,1.0517299999999998,4
01/01/17 23:14,1.05174,1.05174,1.05174,1.05174,4
01/01/17 23:16,1.0520399999999999,1.0523799999999999,1.0520399999999999,1.0523799999999999,4

@Peque Peque reopened this Jun 6, 2017
@TomAugspurger
Contributor

@Peque I think everything is operating as intended, but let me see if I understand your concern.

  1. DataFrame.repr prints out a dataframe with some precision (pd.options.display.precision, 6 by default).
In [21]: df
Out[21]:
                0        1        2        3        4  5
0  01/01/17 23:00  1.05148  1.05153  1.05148  1.05153  4
1  01/01/17 23:01  1.05153  1.05153  1.05153  1.05153  4
2  01/01/17 23:02  1.05170  1.05175  1.05170  1.05175  4
3  01/01/17 23:03  1.05174  1.05175  1.05174  1.05175  4
4  01/01/17 23:08  1.05170  1.05170  1.05170  1.05170  4
5  01/01/17 23:11  1.05173  1.05174  1.05173  1.05174  4
6  01/01/17 23:13  1.05173  1.05173  1.05173  1.05173  4
7  01/01/17 23:14  1.05174  1.05174  1.05174  1.05174  4
8  01/01/17 23:16  1.05204  1.05238  1.05204  1.05238  4
  2. Because of the floating-point representation, the actual value stored in the dataframe is slightly different:
In [22]: df.iloc[0, 2]
Out[22]: 1.0515299999999999
  3. You think that the value written by to_csv should be rounded to the same precision as pd.options.display.precision? (I may be wrong about this)

If I understand you correctly, then I think I disagree. The purpose of the string repr print(df) is primarily for human consumption, where super-high precision isn't desirable (by default). The purpose of most to_* methods, including to_csv is for a faithful representation of the data. We'd get a bunch of complaints from users if we started rounding their data before writing it to disk.
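To make the "faithful representation" point concrete, a small sketch using the value from the example output above:

x = 1.0515299999999999            # value stored after read_csv in the example above
print(float(repr(x)) == x)        # True: full-precision text round-trips exactly
print(float("%.6g" % x) == x)     # False: '1.05153' parses to a different float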

@Peque
Contributor Author

Peque commented Jun 6, 2017

@TomAugspurger Not exactly what I mean. 😇

I am not saying that numbers should be rounded to pd.options.display.precision, but maybe rounded to something near the numerical precision of the float type.

When we load 1.05153 from the CSV, it is represented in-memory as 1.0515299999999999, because I understand there is no other way to represent it in base 2. That is something to be expected when working with floats.

Now, when writing 1.0515299999999999 to a CSV I think it should be written as 1.05153 as it is a sane rounding for a float64 value. So, not rounding at precision 6, but rather at the highest possible precision, depending on the float size.

I understand that changing the defaults is a hard decision, but wanted to suggest it anyway. 😊

I also understand that print(df) is for human consumption, but I would argue that CSV is as well. Text-based representations are usually meant for human consumption/readability.

@TomAugspurger
Contributor

Now, when writing 1.0515299999999999 to a CSV I think it should be written as 1.05153 as it is a sane rounding for a float64 value.

So for

In [32]: print(pd.Series([1.05153, 1.0515299999999999]).to_csv(index=False))
1.05153
1.0515299999999999

It would be 1.05153 for both lines, correct?

I understand that changing the defaults is a hard decision, but wanted to suggest it anyway. 😊

I appreciate that. We're always willing to consider making API-breaking changes; the benefit just has to outweigh the cost. In this case, I don't think it does. I just worry about users who need that precision. This would be a very difficult bug to track down, whereas passing float_format='%g' isn't too onerous.

@jorisvandenbossche
Member

Now, when writing 1.0515299999999999 to a CSV I think it should be written as 1.05153 as it is a sane rounding for a float64 value. So, not rounding at precision 6, but rather at the highest possible precision, depending on the float size.

I don't think that is correct. Floats of that size can have a higher precision than 5 decimals (just not for every value):

In [25]: pd.Series([1.0515299999999998]).iloc[0]
Out[25]: 1.0515299999999999

In [26]: pd.Series([1.0515299999999991]).iloc[0]
Out[26]: 1.0515299999999992

In [27]: pd.Series([1.051529999999981]).iloc[0]
Out[27]: 1.051529999999981

So the three different values would be exactly the same if you rounded them before writing to CSV.

I agree the exploding decimal numbers when writing pandas objects to csv can be quite annoying (certainly because it differs from number to number, messing up any alignment you would have in the csv file). But that's just a consequence of how floats work, and if you don't like it we have options to change that (float_format).

@Peque
Contributor Author

Peque commented Jun 8, 2017

I understand why that could affect someone (if they are really interested in that very last digit, which is not precise anyway, as 1.0515299999999999 is 0.0000000000000001 away from the "real" value).

@jorisvandenbossche I'm not saying all those should give the same result. Maybe only the first would be represented as 1.05153, the second as ...99 and the third (it might be missing one 9) as 98. So losing only the very last digit, which is not 100% accurate anyway.

My suggestion is to do something like this only when outputting to a CSV, as that might be more like a "human", readable format in which the 16th digit might not be so important. Then, if someone really wants to have that digit too, use float_format. 😜

BTW, it seems R does not have this issue (so maybe what I am suggesting is not that crazy 😂):

df = read.table('input.csv', header=FALSE, sep=",")
write.csv(df, 'output.csv')
print("------------")
print(df)
print(typeof(df$V2))
print("------------")
$ Rscript read_write.R

The dataframe is loaded just fine, and columns are interpreted as "double" (float64). But when written back to the file, they keep the original look. I don't know how they implement it, but maybe they just do some rounding by default? In their documentation they say that "Real and complex numbers are written to the maximal possible precision", though.

PS: I don't want to be annoying; feel free to close this if you think you are just losing your time and this will not be changed anyway (I won't get offended), and I won't kill myself for having to use float_format every time either. 😉

@Peque
Contributor Author

Peque commented Jun 8, 2017

It seems MATLAB (Octave actually) also doesn't have this issue by default, just like R. You can try:

df = csvread('input.csv')
csvwrite('output_matlab.csv', df)

And see how the output keeps the original look as well.

@jorisvandenbossche
Member

I dug into it a little bit, and I think this is due to some default settings in R:

When printing:

> df = read.table('/home/joris/scipy/__test_float_precision.csv', header=FALSE, sep=",")
> df
              V1      V2      V3      V4      V5 V6
1 01/01/17 23:00 1.05148 1.05153 1.05148 1.05153  4
2 01/01/17 23:01 1.05153 1.05153 1.05153 1.05153  4
3 01/01/17 23:02 1.05170 1.05175 1.05170 1.05175  4
4 01/01/17 23:03 1.05174 1.05175 1.05174 1.05175  4
5 01/01/17 23:08 1.05170 1.05170 1.05170 1.05170  4
6 01/01/17 23:11 1.05173 1.05174 1.05173 1.05174  4
7 01/01/17 23:13 1.05173 1.05173 1.05173 1.05173  4
8 01/01/17 23:14 1.05174 1.05174 1.05174 1.05174  4
9 01/01/17 23:16 1.05204 1.05238 1.05204 1.05238  4
> options(digits = 17)
> df
              V1                 V2                 V3                 V4                 V5 V6
1 01/01/17 23:00 1.0514800000000000 1.0515300000000001 1.0514800000000000 1.0515300000000001  4
2 01/01/17 23:01 1.0515300000000001 1.0515300000000001 1.0515300000000001 1.0515300000000001  4
3 01/01/17 23:02 1.0517000000000001 1.0517500000000000 1.0517000000000001 1.0517500000000000  4
4 01/01/17 23:03 1.0517399999999999 1.0517500000000000 1.0517399999999999 1.0517500000000000  4
5 01/01/17 23:08 1.0517000000000001 1.0517000000000001 1.0517000000000001 1.0517000000000001  4
6 01/01/17 23:11 1.0517300000000001 1.0517399999999999 1.0517300000000001 1.0517399999999999  4
7 01/01/17 23:13 1.0517300000000001 1.0517300000000001 1.0517300000000001 1.0517300000000001  4
8 01/01/17 23:14 1.0517399999999999 1.0517399999999999 1.0517399999999999 1.0517399999999999  4
9 01/01/17 23:16 1.0520400000000001 1.0523800000000001 1.0520400000000001 1.0523800000000001  4

So for printing R does the same if you change the digits options.

For writing to csv, it does not seem to follow the digits option, from the write.csv docs:

In almost all cases the conversion of numeric quantities is governed by the option "scipen" (see options), but with the internal equivalent of digits = 15. For finer control, use format to make a character matrix/data frame, and call write.table on that.

So with digits=15, this is just not precise enough to see the floating point artefacts (as in the example above, I needed digits=17 to show it). For that reason, the result of write.csv looks better for your case.

Pandas uses the full precision when writing csv. So the question is more if we want a way to control this with an option (read_csv has a float_precision keyword), and if so, whether the default should be lower than the current full precision.

I agree the default of R to use a precision just below the full one makes sense, as this fixes the most common cases of lower precision values.
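For what it's worth, read_csv's existing float_precision keyword already helps with round-trips like the one in the original example; a hedged sketch (assuming the small offset seen above comes from the default C parser):

from io import StringIO
import pandas

csv_text = "1.05153,1.05175\n"

# The round-trip converter parses exactly what Python's float() would,
# so writing back out keeps the short form for this sample data.
df = pandas.read_csv(StringIO(csv_text), header=None,
                     float_precision="round_trip")
print(df.to_csv(index=False, header=False))   # 1.05153,1.05175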

@Peque
Contributor Author

Peque commented Jun 8, 2017

@jorisvandenbossche Exactly. What I am proposing is simply to change the default float_precision to something that could be more reasonable/intuitive for average/most-common use cases.

To back up my argument I mention how R and MATLAB (or Octave) do that.

Also, maybe it is a way to make things easier/nicer for newcomers (who might not even know what a float looks like in memory and might think there is a problem with Pandas). For those wanting to have extreme precision written to their CSVs, they probably already know about float representations and about the float_format option, so they can adjust it.

@IngvarLa

Not sure if this thread is active; anyway, here are my thoughts.
I am not a regular pandas user, but inherited some code that uses dataframes and uses the to_csv() method. With an update of our Linux OS, we also updated our python modules, and I saw this change:
in pandas 0.19.2 floating point numbers were written as str(num), which has 12 digits of precision; in pandas 0.22.0 they are written as repr(num), which has 17 digits of precision. There is a fair bit of noise in the last digit, enough that when using different hardware the last digit can vary. The str(num) is intended for human consumption, while repr(num) is the official representation, so it is reasonable that repr(num) is the default. Still, it would be nice if there was an option to write out the numbers with str(num) again. It makes it easier to compare output without having to use tolerances.
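For reference, something close to the old str()-style output (about 12 significant digits on Python 2) can be recovered today by passing float_format explicitly; a small sketch:

import pandas

df = pandas.DataFrame({"a": [1.0515299999999999]})
print(df.to_csv(index=False, float_format="%.12g"))
# a
# 1.05153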

@IngvarLa

I guess what I am really asking is for float_format="" to follow the Python formatting convention
(https://docs.python.org/3/library/string.html#format-specification-mini-language), where "" corresponds to str(). When I tried, I got "TypeError: not all arguments converted during string formatting".

@floer32

floer32 commented May 27, 2019

@IngvarLa FWIW the older %s/%(foo)s style formatting has the same features as the newer {} formatting, in terms of formatting floats. There's just a bit of a chore to 'translate' if you have one vs the other. Here's an example:

[ins] In [1]: "{0:.6g}".format(5.5657188485)                                                             
Out[1]: '5.56572'

[ins] In [2]: "%.6g" % 5.5657188485                                                                      
Out[2]: '5.56572'

So I've had the same thought that consistency would make sense (and just have it detect/support both, for compat), but there's a workaround.

@floer32

floer32 commented May 27, 2019

This could be seen as a tangent, but I think it is related because I'm getting at the same problem/potential solutions.

I have recently rediscovered Python stdlib's decimal.Decimal. It can be very useful. On a recent project, it proved simplest overall to use decimal.Decimal for our values. In fact, we subclass it to provide a certain handling of string-ifying. It's worked great with Pandas so far (curious if anyone else has hit edge cases).

There are some gotchas, such as it having some different behaviors for its "NaN". Off the top of my head, here are some to be aware of:

  • np.nan is np.nan => True
  • Decimal('NaN') is Decimal('NaN') => False
  • Decimal('NaN').is_nan() => True
  • np.isnan(Decimal('NaN')) => error
    • So of course: pd.Series([Decimal('NaN'), np.nan]).isnull() => False, True
  • If you had a buncha Decimal('NaN') to convert, you could apply/applymap
>>> df = pd.DataFrame({'a': [Decimal('NaN'), np.nan, 1/3], 'b': [Decimal('0.333333333333333333333333333333333333333333333333333333333333'), np.nan, None, ]})    

>>> df.isnull()      # pay attn to ['a'][0]                
Out[45]: 
       a      b
0  False  False
1   True   True
2  False   True

>>> df.applymap(lambda x: np.nan if hasattr(x, 'is_nan') and x.is_nan() else x ).isnull()['a'][0]
True
  • It's your decision when/how much to work in floats before/after Decimal ... Conversion from float to Decimal is lossless according to the docs. Going back to float, of course, has some loss, though you can use the excellent quantize features for a lot of control over this.
    • I've been converting to Decimal eagerly, have had Decimal all throughout process of manipulating DF, then ultimately it serializes without any special handling (probably because Pandas ends up treating as object->str()).
    • If you want to use float_format, both formatting syntaxes do work with Decimal, but I think you'd need to convert to float first, otherwise Pandas will treat Decimal in that object->str() way (which makes sense)
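Following up on the Decimal route, a minimal sketch of the round-trip behaviour described above (Decimal values live in an object column, and to_csv falls back to str(), so the original text is preserved):

from decimal import Decimal
import pandas

df = pandas.DataFrame({"price": [Decimal("1.05153"), Decimal("1.05175")]})
print(df.to_csv(index=False))
# price
# 1.05153
# 1.05175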

@janosh
Contributor

janosh commented Aug 7, 2019

How about making the default float format in df.to_csv() user-configurable in pd.options? There already seems to be a display.float_format option.
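For comparison, the existing display option only affects printing, not to_csv; an io-side analogue would be new (the option name below is invented purely for illustration and does not exist in pandas):

import pandas as pd

# This option exists today and only changes the repr/printing:
pd.set_option("display.float_format", "{:.6g}".format)

# A hypothetical analogue for writing might look like:
# pd.set_option("io.csv.float_format", "%.16g")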

@TomAugspurger
Contributor

TomAugspurger commented Aug 7, 2019 via email

@janosh
Contributor

janosh commented Aug 8, 2019

Saving a dataframe to CSV isn't so much a computation as a logging operation, I think.

@Peque
Contributor Author

Peque commented Aug 8, 2019

Also, this issue is about changing the default behavior, so having a user-configurable option in Pandas would not really solve it.

@TomAugspurger I updated the issue description to make it more clear and to include some of the comments in the discussion.

@naught101

+1 for "%.16g" as the default. That's a stupidly high precision for nearly any field, and if you really need that many digits, you should really be using numpy's float128 instead of built-in floats anyway.

Using g means that CSVs usually end up being smaller too.

@emeti

emeti commented Jun 5, 2020

I vote to keep the issue open and find a way to change the current default behaviour to better handle a very simple use case; this is definitely an issue for a simple use of the library, and it is an unexpected surprise.
Here is the use case, a simple workflow:

  1. read data from a csv file
  2. filter some rows (numerical values not touched!) or apply some data transformations
  3. write the data frame back to a csv file.

Steps 1-3 with the defaults cause the numerical values to change (the values are practically the same, differing only by negligible errors, but suddenly the CSV file contains tons of unnecessary digits that I did not have before). Yes, that happens often for my datasets, where I have, say, 3-digit-precision numbers. For me it is yet another pandas quirk I have to remember.

This is about the principle of least surprise out of the box: I don't want to see those data changes for a simple data filter step, nor should I necessarily have to look into the column formats for simple data operations.

I am wondering if there is a way to make pandas better and not confuse a simple user: maybe not changing the float_format default itself, but introducing a data frame property that keeps track of each numerical column's precision sniffed during read_csv and applies it during to_csv (detect the precision during read and use the same one during write)? That would be a significant change, I guess. The problem is that once read_csv reads the data into a data frame, the data frame loses memory of what the column precision and format was. Anyway, the resolution proposed by @Peque works with my data: +1 for the default of %.16g, or finding another way.
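For the record, a sketch of that read -> filter -> write workflow combining the two opt-in workarounds mentioned in this thread (round-trip parsing plus an explicit float_format), so a pure filter step does not change the digits on disk:

from io import StringIO
import pandas

raw = StringIO("a,b\n1.05153,4\n1.05175,0\n")
df = pandas.read_csv(raw, float_precision="round_trip")
filtered = df[df["b"] > 0]                      # numeric values untouched
print(filtered.to_csv(index=False, float_format="%.16g"))
# a,b
# 1.05153,4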

@Abdias-Wang

This could be fixed by adding the line:

df = df.round(5)

before you dump the df to the CSV file using .to_csv().
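Note the difference between the two approaches: round() changes the values held in memory, while float_format only changes the text that gets written; a quick sketch:

import pandas

df = pandas.DataFrame({"a": [1.0515299999999999]})
rounded = df.round(5)                                    # the stored value itself changes
text_only = df.to_csv(index=False, float_format="%.5f")  # df's values stay untouched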

Development

No branches or pull requests

10 participants