float formatting issue #726

Closed
lodagro opened this issue Feb 1, 2012 · 8 comments

@lodagro
Contributor

lodagro commented Feb 1, 2012

see first value

In [1]: import pandas

In [2]: pandas.__version__
Out[2]: '0.7.0.dev-e3df4e2'

In [3]: df = pandas.DataFrame({'A': [746.03, 0.00, 5620.00, 1592.36]})

In [4]: df
Out[4]:
   A
0  746.
1  0.00
2  5620
3  1592

@adamklein
Contributor

This is the behavior as designed. Is it problematic for you? "746." is a valid python floating point representation.

In [1]: 746.
Out[1]: 746.0

@lodagro
Contributor Author

lodagro commented Feb 1, 2012

I noticed the behavior above when running read_clipboard on a DataFrame posted to the mailing list. It felt weird (probably because I'm too used to engineering float formatting).
If this is intended, no problem.

@lodagro lodagro closed this as completed Feb 1, 2012
@wesm wesm reopened this Feb 2, 2012
@wesm
Member

wesm commented Feb 2, 2012

I might have a further look at this. R for example is a bit cleverer about this kinda stuff:

> data.frame(a=c(746.03, 0.00, 5620.00, 1592.36))
        a
1  746.03
2    0.00
3 5620.00
4 1592.36
> data.frame(a=c(746.03, 0.00, 5620.00, 1592.36), b=rnorm(4))
        a          b
1  746.03 -1.3561377
2    0.00 -0.8483049
3 5620.00 -0.4424412
4 1592.36 -0.6585460

@wesm
Member

wesm commented Feb 7, 2012

OK I undertook a pretty major refactor of all the formatting code which is now quite a lot simpler and better. In the case above:

In [1]: df = pandas.DataFrame({'A': [746.03, 0.00, 5620.00, 1592.36]})

In [2]: df
Out[2]:
         A
0   746.03
1     0.00
2  5620.00
3  1592.36

I also changed the default number of decimal places (plus the first digit to the left of the decimal point) to 7, which is really just a suggestion, as in R. There are still a few kinks even though the test suite passes. @lodagro, can you give it a whirl?
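
For readers following along, here is a minimal sketch of the column-wise idea behind the output above: pick one number of decimals for the whole column, then right-align. It is not the pandas implementation; the function name `format_float_column` and the `digits` parameter are hypothetical, with `digits=7` standing in for the default mentioned in this comment.

```python
def format_float_column(values, digits=7):
    """Format floats with one shared number of decimals, right-aligned.

    Hypothetical helper for illustration only, not pandas code.
    """
    def decimals_needed(v):
        # "%g"-style formatting keeps at most `digits` significant digits
        # and drops trailing zeros, so it shows how many decimals matter.
        text = f"{v:.{digits}g}"
        if "e" in text or "." not in text:
            return 0
        return len(text.split(".")[1])

    # Every value in the column gets the largest number of decimals needed.
    ndec = max(decimals_needed(v) for v in values)
    cells = [f"{v:.{ndec}f}" for v in values]

    # Right-align so the decimal points line up.
    width = max(len(c) for c in cells)
    return [c.rjust(width) for c in cells]


for cell in format_float_column([746.03, 0.00, 5620.00, 1592.36]):
    print(cell)
#  746.03
#    0.00
# 5620.00
# 1592.36
```

Deciding the precision per column rather than per value is what avoids the mixed `746.` / `0.00` / `5620` rendering from the original report.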

@lodagro
Contributor Author

lodagro commented Feb 7, 2012

ok, i did some shaking -- looks very good.

  • While shaking I noticed that reset_index() returned an object dtype column instead of a float one (I added a comment on "DataFrame.delevel infer dtypes better" #440; details also below).
  • I wonder how the decision is made whether exponent notation will be used or not (not so important, I can rtfs :-) ).
  • Index does not use the float formatting; it never did.
  • pandas.set_eng_float_format() is broken. For the engineering float formatter I added plenty of unit tests validating that the EngFormatter class works correctly. But there is no unit test that validates that the EngFormatter is actually used when it is enabled with pandas.set_eng_float_format. There are tests that do repr(df) after set_eng_float_format, but they do not check whether the strings are as expected. OK, I know what to do here.
df1 = pandas.DataFrame(\
        [(t, (9.81 * t ** 2) /2) for t in np.arange(0.0, 10, np.sqrt(2)/2)],
        columns=['time', 'speed'])
print df1

        time     speed
0   0.000000    0.0000
1   0.707107    2.4525
2   1.414214    9.8100
3   2.121320   22.0725
4   2.828427   39.2400
5   3.535534   61.3125
6   4.242641   88.2900
7   4.949747  120.1725
8   5.656854  156.9600
9   6.363961  198.6525
10  7.071068  245.2500
11  7.778175  296.7525
12  8.485281  353.1600
13  9.192388  414.4725
14  9.899495  480.6900

---> a nice start


time = np.arange(0.0, 10, np.sqrt(2)/2)
s1 = pandas.Series((9.81 * time ** 2) /2,
                   index=pandas.Index(time, name='time'),
                   name='speed')
print s1
time
0.0                 0.0000
0.707106781187      2.4525
1.41421356237       9.8100
2.12132034356      22.0725
2.82842712475      39.2400
3.53553390593      61.3125
4.24264068712      88.2900
4.94974746831     120.1725
5.65685424949     156.9600
6.36396103068     198.6525
7.07106781187     245.2500
7.77817459305     296.7525
8.48528137424     353.1600
9.19238815543     414.4725
9.89949493661     480.6900
Name: speed

df2 = s1.reset_index()
print df2
         time     speed
0           0    0.0000
1   0.7071068    2.4525
2    1.414214    9.8100
3     2.12132   22.0725
4    2.828427   39.2400
5    3.535534   61.3125
6    4.242641   88.2900
7    4.949747  120.1725
8    5.656854  156.9600
9    6.363961  198.6525
10   7.071068  245.2500
11   7.778175  296.7525
12   8.485281  353.1600
13   9.192388  414.4725
14   9.899495  480.6900

The Index does not use the float formatting; it never did.
The df2 output was a surprise, but it is caused by reset_index(), which gives df2['time'] object dtype (I added this as a comment to #440).


df3 = pandas.DataFrame(\
        [(exp,
          np.pi * (10 ** exp),
          np.random.randint(-1000000, 1000000),
          np.random.randn() * (10 ** exp)) \
                for exp in range(0, 15)],
        columns=['exponent', 'pi*(10^exp)', 'rand int', 'floats'])
print df3

    exponent   pi*(10^exp)  rand int        floats
0          0  3.141593e+00     -2960 -7.215871e-01
1          1  3.141593e+01    444548  8.557070e+00
2          2  3.141593e+02   -984243 -2.774372e+01
3          3  3.141593e+03    661649 -3.249025e+02
4          4  3.141593e+04   -767947  8.474823e+03
5          5  3.141593e+05   -807672 -1.554962e+04
6          6  3.141593e+06   -842952 -3.450536e+05
7          7  3.141593e+07    811900 -3.746092e+05
8          8  3.141593e+08    -69090  1.773727e+08
9          9  3.141593e+09    394125  1.565224e+09
10        10  3.141593e+10   -229127 -1.030427e+10
11        11  3.141593e+11   -426117 -1.538240e+11
12        12  3.141593e+12   -630881 -1.526913e+12
13        13  3.141593e+13     24427  8.449833e+12
14        14  3.141593e+14   -197911  4.752335e+13

print df3.head(8)

   exponent      pi*(10^exp)  rand int         floats
0         0         3.141593     -2960      -0.721587
1         1        31.415927    444548       8.557070
2         2       314.159265   -984243     -27.743721
3         3      3141.592654    661649    -324.902472
4         4     31415.926536   -767947    8474.823490
5         5    314159.265359   -807672  -15549.620804
6         6   3141592.653590   -842952 -345053.577693
7         7  31415926.535898    811900 -374609.236675

How is the decision made whether to use exponent notation or not?


df4 = pandas.DataFrame({'A': [746.03, 0.00, 5620.00, 1592.36]})
print df4
         A
0   746.03
1     0.00
2  5620.00
3  1592.36


df5 = pandas.DataFrame({'A': [np.pi, np.sqrt(2), 12345.36, -1000, 1]})
print df5

              A
0      3.141593
1      1.414214
2  12345.360000
3  -1000.000000
4      1.000000

df6 = pandas.DataFrame({'A': [np.pi, np.sqrt(2), 12345.36, -1000, 1, 1e9]})
print df6

              A
0  3.141593e+00
1  1.414214e+00
2  1.234536e+04
3 -1.000000e+03
4  1.000000e+00
5  1.000000e+09

pandas.set_printoptions(precision=20)

print df6
0  3.1415926535897931160e+00
1  1.4142135623730951455e+00
2  1.2345360000000000582e+04
3 -1.0000000000000000000e+03
4  1.0000000000000000000e+00
5  1.0000000000000000000e+09

len('3.1415926535897931160e+00')
25

print df3
    exponent                pi*(10^exp)  rand int                     floats
0          0  3.1415926535897931160e+00     -2960 -7.2158713307242039470e-01
1          1  3.1415926535897931160e+01    444548  8.5570701721458934941e+00
2          2  3.1415926535897932581e+02   -984243 -2.7743721467131731373e+01
3          3  3.1415926535897929170e+03    661649 -3.2490247228284766834e+02
4          4  3.1415926535897931899e+04   -767947  8.4748234895174573467e+03
5          5  3.1415926535897928989e+05   -807672 -1.5549620804074344051e+04
6          6  3.1415926535897930153e+06   -842952 -3.4505357769299164647e+05
7          7  3.1415926535897932947e+07    811900 -3.7460923667523823678e+05
8          8  3.1415926535897928476e+08    -69090  1.7737270685067045689e+08
9          9  3.1415926535897932053e+09    394125  1.5652238372316517830e+09
10        10  3.1415926535897930145e+10   -229127 -1.0304274723894351959e+10
11        11  3.1415926535897930908e+11   -426117 -1.5382395723947625732e+11
12        12  3.1415926535897929688e+12   -630881 -1.5269131337086157227e+12
13        13  3.1415926535897929688e+13     24427  8.4498331051793642578e+12
14        14  3.1415926535897931250e+14   -197911  4.7523352185897289062e+13

@wesm
Member

wesm commented Feb 7, 2012

OK I'll take a look through these issues and fix the set_eng_float_format problem (really ought to have been a test!)

@lodagro
Contributor Author

lodagro commented Feb 7, 2012

aha, set_eng_float_format running fine again and tests already added, was just about to write one -- but already done.
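
As a rough sketch of the kind of test discussed here (one that checks the rendered strings after the formatter is enabled, rather than only calling repr), the snippet below uses plain Python. Everything in it is hypothetical: `eng_format`, `enable_eng_format`, `render` and `_options` are stand-ins for illustration and are not the pandas EngFormatter or its tests.

```python
import math

# Hypothetical stand-in for a library's global print options.
_options = {"float_format": None}


def eng_format(value, accuracy=3):
    """Format a float with an exponent that is a multiple of 3."""
    if value == 0:
        exponent = 0
    else:
        exponent = int(math.floor(math.log10(abs(value)) / 3.0)) * 3
    mantissa = value / (10.0 ** exponent)
    return "%.*fE%+03d" % (accuracy, mantissa, exponent)


def enable_eng_format(accuracy=3):
    """Install the engineering formatter as the global float format."""
    _options["float_format"] = lambda v: eng_format(v, accuracy)


def render(values):
    """Render floats using the globally configured formatter, if any."""
    fmt = _options["float_format"] or repr
    return [fmt(v) for v in values]


def test_eng_format_is_used():
    enable_eng_format(accuracy=1)
    try:
        # Assert on the exact strings, not merely that rendering runs.
        assert render([746.03, 5620.0]) == ["746.0E+00", "5.6E+03"]
    finally:
        _options["float_format"] = None  # restore the default


test_eng_format_is_used()
```

A test that only calls the rendering function would still pass if the option were silently ignored; comparing against the expected strings is what catches that.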

@wesm wesm closed this as completed Feb 7, 2012
@wesm
Member

wesm commented Feb 7, 2012

OK, I fixed the reset_index issue. I also have floats in Index formatting using the same formatter as everything else.

RE: how scientific notation is determined, it's roughly whenever values meet some arbitrary definition of "big". See for example R behavior:

> data.frame(a=c(pi * 1e3, pi * 1e6, pi * 1e9, pi * 1e12, pi * 1e14))
             a
1 3.141593e+03
2 3.141593e+06
3 3.141593e+09
4 3.141593e+12
5 3.141593e+14
> options(digits=10)
> data.frame(a=c(pi * 1e3, pi * 1e6, pi * 1e9, pi * 1e12, pi * 1e14))
                a
1 3.141592654e+03
2 3.141592654e+06
3 3.141592654e+09
4 3.141592654e+12
5 3.141592654e+14
> options(digits=15)
> data.frame(a=c(pi * 1e3, pi * 1e6, pi * 1e9, pi * 1e12, pi * 1e14))
                     a
1 3.14159265358979e+03
2 3.14159265358979e+06
3 3.14159265358979e+09
4 3.14159265358979e+12
5 3.14159265358979e+14
> options(digits=20)
> data.frame(a=c(pi * 1e3, pi * 1e6, pi * 1e9, pi * 1e12, pi * 1e14))
                          a
1 3.1415926535897929170e+03
2 3.1415926535897930153e+06
3 3.1415926535897932053e+09
4 3.1415926535897929688e+12
5 3.1415926535897931250e+14

I guess it makes sense that if one value must be formatted in scientific notation, the whole column should be. If the precision / # digits is sufficiently high, it's not clear to me that R has it wrong. I'm just going to leave it be for now.
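
A minimal sketch of that column-wide rule, under stated assumptions: the cutoff `big=1e8` is a made-up threshold chosen only to be consistent with the outputs earlier in this thread (a column reaching about 3.1e7 stayed in fixed notation, one containing 1e9 switched), and `format_column` is a hypothetical name, not a pandas function.

```python
import math


def format_column(values, digits=7, big=1e8):
    """Use scientific notation for the whole column if any value is "big".

    `big` is an assumed, arbitrary threshold for illustration.
    """
    use_sci = any(abs(v) >= big for v in values)
    if use_sci:
        # One digit before the decimal point, digits - 1 after it.
        return ["%.*e" % (digits - 1, v) for v in values]
    return ["%.*f" % (digits - 1, v) for v in values]


print(format_column([math.pi, 12345.36]))
# ['3.141593', '12345.360000']
print(format_column([math.pi, 12345.36, 1e9]))
# ['3.141593e+00', '1.234536e+04', '1.000000e+09']
```

A single large value flips every cell in the column, which matches the difference between df5 and df6 above.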
