Suggestion: changing default float_format in DataFrame.to_csv() #16452
Hmm, I don't think we should change the default. If we just used '%.16g' by default, we could be silently dropping precision that some users rely on.
Agreed. Maybe using '%.16g' as the default is an option.
Just to make sure I fully understand, can you provide an example? Or let me know if this is what you were worried about. Given a file foo.csv with a single column A containing 0.1, 0.2 and 0.3:

```
In [1]: import pandas as pd

In [2]: pd.read_csv("foo.csv").to_csv("foo2.csv", index=False)

In [3]: !cat foo2.csv
A
0.1
0.2
0.3
```

That one doesn't have any rounding issues (but maybe with different numbers it would?)
@TomAugspurger Let me reopen this issue. I have now found an example that reproduces this without modifying the contents of the original DataFrame:

```python
from io import StringIO
from pathlib import Path
from tempfile import NamedTemporaryFile

import pandas

input_csv = StringIO('''
01/01/17 23:00,1.05148,1.05153,1.05148,1.05153,4
01/01/17 23:01,1.05153,1.05153,1.05153,1.05153,4
01/01/17 23:02,1.05170,1.05175,1.05170,1.05175,4
01/01/17 23:03,1.05174,1.05175,1.05174,1.05175,4
01/01/17 23:08,1.05170,1.05170,1.05170,1.05170,4
01/01/17 23:11,1.05173,1.05174,1.05173,1.05174,4
01/01/17 23:13,1.05173,1.05173,1.05173,1.05173,4
01/01/17 23:14,1.05174,1.05174,1.05174,1.05174,4
01/01/17 23:16,1.05204,1.05238,1.05204,1.05238,4
''')

df = pandas.read_csv(input_csv, header=None)

with NamedTemporaryFile() as tmpfile:
    df.to_csv(tmpfile.name, index=False, header=None)
    print(Path(tmpfile.name).read_text())
```

That would give you output in which values such as 1.05153 come back as 1.0515299999999999.
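The mechanics behind this can be sketched without pandas at all (a minimal illustration; the value matches the Out[22] shown below):

```python
x = float("1.05153")  # the CSV text parses to the nearest 64-bit float
print(repr(x))        # 1.05153 -> the shortest string that round-trips
print(f"{x:.17g}")    # 1.0515299999999999 -> the full 17-digit form
```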
@Peque I think everything is operating as intended, but let me see if I understand your concern.

```
In [22]: df.iloc[0, 2]
Out[22]: 1.0515299999999999
```

If I understand you correctly, then I think I disagree: the purpose of the string repr is to give a readable display, whereas to_csv is meant to write out the value that is actually stored.
@TomAugspurger Not exactly what I mean. 😇 I am not saying that numbers should be rounded to a fixed number of decimals. When we load a value like 1.05153, it becomes 1.0515299999999999 in memory, because it cannot be represented exactly as a 64-bit float. Now, when writing it back to a CSV, all of those digits are written out, even though the last one is known to be imprecise. I understand that changing the defaults is a hard decision, but wanted to suggest it anyway. 😊 I also understand that float_format is already there for anyone who wants explicit control.
So for this case:

```
In [32]: print(pd.Series([1.05153, 1.0515299999999999]).to_csv(index=False))
1.05153
1.0515299999999999
```

It would be 1.05153 for both lines, correct?

I appreciate that. We're always willing to consider making API-breaking changes; the benefit just has to outweigh the cost, and in this case I don't think it does. I just worry about users who need that precision: silently dropping it would be a very difficult bug to track down, whereas passing float_format='%.16g' yourself is easy.
I don't think that is correct. Floats of that size can have a higher precision than 5 decimals (just not every value):
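For illustration, a minimal sketch of that point (values taken from this thread):

```python
# 1.05153 and its 17-digit expansion are the very same 64-bit float:
print(1.05153 == 1.0515299999999999)  # True
# but a nearby value with more than 5 decimals is a distinct, representable float:
print(1.05153 == 1.051529999999999)   # False
```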
So the three different values would all become exactly the same if you rounded them before writing to csv. I agree the exploding decimal numbers when writing pandas objects to csv can be quite annoying (certainly because it differs from number to number, messing up any alignment you would have in the csv file). But that's just a consequence of how floats work, and if you don't like it we have options to change that (float_format).
I understand why that could affect someone (if they are really interested in that very last digit, which is not precise anyway, as @jorisvandenbossche pointed out). I'm not saying all those values should give the same result; maybe only the first would be represented as 1.05153, and the others would keep their extra digits.

My suggestion is to do something like this only when outputting to a CSV, as that might be more like a "human", readable format in which the 16th digit might not be so important. Then, if someone really wants to have that digit too, they can use float_format.

BTW, it seems R does not have this issue (so maybe what I am suggesting is not that crazy 😂):

```r
df = read.table('input.csv', header=FALSE, sep=",")
write.csv(df, 'output.csv')
print("------------")
print(df)
print(typeof(df$V2))
print("------------")
```

The dataframe is loaded just fine, and columns are interpreted as "double" (float64). But when written back to the file, the values keep their original look. I don't know how they implement it, though; maybe they just do some rounding by default? In their documentation they say that "Real and complex numbers are written to the maximal possible precision", though.

PS: Don't want to be annoying; feel free to close this if you think you are just losing your time and this will not be changed anyway (I won't get offended), and I won't kill myself for having to use float_format.
It seems MATLAB (Octave actually) also doesn't have this issue by default, just like R. You can try:

```matlab
df = csvread('input.csv')
csvwrite('output_matlab.csv', df)
```

And see how the output keeps the original look as well.
Dug into it a little bit, and I think this is due to some default settings in R. When printing, R rounds to the digits option (7 by default), so for printing R does the same if you change the digits option. For writing to csv, however, it does not seem to follow the digits option; judging from the write.csv behaviour, it effectively writes numbers with 15 significant digits.

So with digits=15, this is just not precise enough to see the floating point artefacts (as in the example above, I needed digits=17 to show it), which is why the result of write.csv looks clean. Pandas, on the other hand, uses the full precision when writing csv. So the question is more whether we want a way to control this (with a different default float_format, or an option).

I agree the default of R to use a precision just below the full one makes sense, as this fixes the most common cases of lower-precision values.
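A small Python sketch of the same digit-count effect (my illustration, not from the original comment):

```python
x = 1.05153          # stored as the nearest 64-bit float
print(f"{x:.15g}")   # 1.05153 -> the artefact is hidden at 15 significant digits
print(f"{x:.17g}")   # 1.0515299999999999 -> visible at 17
```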
@jorisvandenbossche Exactly. What I am proposing is simply to change the default float_format. To back up my argument I mention how R and MATLAB (or Octave) do that. Also, maybe it is a way to make things easier/nicer for newcomers (who might not even know what a float looks like in memory and might think there is a problem with Pandas). Those wanting extreme precision written to their CSVs probably already know about float representations and about the float_format parameter.
Not sure if this thread is still active; anyway, here are my thoughts.

I guess what I am really asking for is for float_format to follow the Python string-formatting convention, e.g. "{:.6g}" rather than the %-style "%.6g".
@IngvarLa FWIW the older %-style and the newer {}-style format specs produce the same output:

```
In [1]: "{0:.6g}".format(5.5657188485)
Out[1]: '5.56572'

In [2]: "%.6g" % 5.5657188485
Out[2]: '5.56572'
```

So I've had the same thought that consistency would make sense (and just have it detect/support both, for compat), but there's a workaround.
This could be seen as a tangent, but I think it is related because I'm getting at the same problem/potential solutions. I have recently rediscovered Python stdlib's decimal.Decimal. There are some gotchas, such as it having some different behaviors for its "NaN". Off the top of my head, here are some to be aware of:
```python
>>> from decimal import Decimal
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [Decimal('NaN'), np.nan, 1/3],
...                    'b': [Decimal('0.333333333333333333333333333333333333333333333333333333333333'), np.nan, None]})
>>> df.isnull()  # pay attention to ['a'][0]: Decimal('NaN') is not seen as null
       a      b
0  False  False
1   True   True
2  False   True
>>> df.applymap(lambda x: np.nan if hasattr(x, 'is_nan') and x.is_nan() else x).isnull()['a'][0]
True
```
How about making the default float format in df.to_csv() user-configurable in pd.options? There already seems to be a display.float_format option (https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html).
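As a quick aside, a sketch of what that option does today (it changes how frames are displayed, not what to_csv writes):

```python
import pandas as pd

pd.set_option("display.float_format", "{:.2f}".format)

s = pd.Series([1.0515299999999999])
print(s)                      # the display honors the option and shows 1.05
print(s.to_csv(index=False))  # the CSV text is not affected by the display option
```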
Typically we don't rely on options that change the actual output of a computation.
Saving a dataframe to CSV isn't so much a computation as a logging operation, I think.
Also, this issue is about changing the default behavior, so having a user-configurable option in pandas would not really solve it. @TomAugspurger I updated the issue description to make it clearer and to include some of the comments from the discussion.
+1 for '%.16g'. Using float_format by hand on every to_csv call is cumbersome.
I vote to keep the issue open and find a way to change the current default behaviour to better handle a very simple use case. This is definitely an issue for a simple use of the library; it is an unexpected surprise.

A simple read-filter-write round trip with the defaults changes the numerical values: numerically they are practically the same, with negligible errors, but suddenly I get a csv file with tons of unnecessary digits that I did not have before. Yes, that happens often for my datasets, where I have, say, 3-digit-precision numbers. For me it is yet another pandas quirk I have to remember. The principle of least surprise out of the box: I don't want to see those data changes for a simple data-filter step, nor necessarily look into the formats of columns for simple data operations.

I am wondering if there is a way to make pandas better and not confuse a simple user. Maybe not changing the float_format default itself, but introducing a data frame property for columns to keep track of the numerical columns' precision sniffed during read_csv and applied during to_csv (detect precision during read and use the same one during write)? That would be a significant change, I guess. The problem is that once read_csv reads the data into a data frame, the data frame loses memory of what the column precision and format was.

Anyway, the resolution proposed by @Peque works with my data: +1 for the default of '%.16g', or finding another way.
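A hypothetical sketch of that "sniff on read, reuse on write" idea (not a pandas feature; the regex-based detection is my own illustration):

```python
import re
from io import StringIO

import pandas as pd

text = "1.05148,1.05153\n1.05170,1.05175\n"

# detect the widest fractional part in the raw text ...
max_decimals = max(len(frac) for frac in re.findall(r"\.(\d+)", text))

# ... and reuse it when writing, so the round trip preserves the look
df = pd.read_csv(StringIO(text), header=None)
print(df.to_csv(index=False, header=False, float_format=f"%.{max_decimals}f"))
# 1.05148,1.05153
# 1.05170,1.05175
```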
This could be fixed by adding the line df = df.round(5) before you dump the df into a csv file using .to_csv.
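A sketch contrasting that with the float_format approach (file names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0515299999999999]})      # stand-in data

df.round(5).to_csv("out_rounded.csv", index=False)  # rounds the stored values
df.to_csv("out_formatted.csv", index=False,
          float_format="%.16g")                     # rounds only the written text
```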
Current behavior
If I read a CSV file, do nothing with it, and save it again, I would expect Pandas to keep the format the CSV had before. But that is not the case: the round trip gives us output in which values such as 1.05153 become 1.0515299999999999 (see the reproduction script earlier in this thread).
What is wrong with that?
I would consider this to be unintuitive/undesirable behavior.
The written numbers have that representation because the original number cannot be represented precisely as a float. That is expected when working with floats. However, it means we are writing the last digit, which we know is not exact due to float-precision limitations anyway, to the CSV.
Proposed change
I think that last digit, knowing it is not precise anyway, should be rounded when writing to a CSV file.

Maybe by changing the default DataFrame.to_csv()'s float_format parameter from None to '%.16g'? (Or at least make .to_csv() use '%.16g' when no float_format is specified.)

Note that I propose rounding to the float's precision, which for a 64-bit float would mean that 1.0515299999999999 could be rounded to 1.05153, but 1.0515299999999992 could be rounded to 1.051529999999999, and 1.051529999999981 would not be rounded at all.

Comparing the round-trip errors with and without '%.16g' shows errors of similar size in both cases, but the output with '%.16g' ("After") is more consistent with the input, for all the cases where the float is not represented up to the last unprecise digit.
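As an illustration of that rule (a sketch, not part of the original report):

```python
for value in (1.0515299999999999, 1.0515299999999992, 1.051529999999981):
    print("%.16g" % value)
# 1.05153
# 1.051529999999999
# 1.051529999999981
```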
Also, I think in most cases a CSV does not have floats represented up to the last (unprecise) digit. See the precedents just below (other software outputting CSVs that would not use that last unprecise digit).
Precedents
Both MATLAB and R avoid that last unprecise digit when converting to CSV (they round it).