ENH: Added DataFrame.round and associated tests #10568

roblevy · 2015-07-14T16:34:04Z

I've found myself doing a lot of DataFrame.to_latex because I'm using pandas to write an academic paper.

I'm constantly adjusting the number of decimal places displayed by doing np.round(df, 2), so I thought a flexible round, with a different number of decimals per column, should be part of the DataFrame API. (I'm surprised this functionality doesn't already exist.)

Here is an example:

In [9]: df = pd.DataFrame(np.random.random([10, 3]), columns=['a', 'b', 'c'])

In [10]: df
Out[10]: 
          a         b         c
0  0.761651  0.430963  0.440312
1  0.094071  0.242381  0.149731
2  0.620050  0.462600  0.194143
3  0.614627  0.692106  0.176523
4  0.215396  0.888180  0.380283
5  0.492990  0.200268  0.067020
6  0.804531  0.816366  0.065751
7  0.751224  0.037474  0.884083
8  0.994758  0.450143  0.808945
9  0.373180  0.537589  0.809112

In [11]: df.round(dict(b=2, c=4))
Out[11]: 
          a     b       c
0  0.761651  0.43  0.4403
1  0.094071  0.24  0.1497
2  0.620050  0.46  0.1941
3  0.614627  0.69  0.1765
4  0.215396  0.89  0.3803
5  0.492990  0.20  0.0670
6  0.804531  0.82  0.0658
7  0.751224  0.04  0.8841
8  0.994758  0.45  0.8089
9  0.373180  0.54  0.8091

You can also round by column number:

In [12]: df.round([1, 2, 3])
Out[12]: 
     a     b      c
0  0.8  0.43  0.440
1  0.1  0.24  0.150
2  0.6  0.46  0.194
3  0.6  0.69  0.177
4  0.2  0.89  0.380
5  0.5  0.20  0.067
6  0.8  0.82  0.066
7  0.8  0.04  0.884
8  1.0  0.45  0.809
9  0.4  0.54  0.809

and any columns which are not explicitly rounded are unaffected:

In [13]: df.round([1])
Out[13]: 
     a         b         c
0  0.8  0.430963  0.440312
1  0.1  0.242381  0.149731
2  0.6  0.462600  0.194143
3  0.6  0.692106  0.176523
4  0.2  0.888180  0.380283
5  0.5  0.200268  0.067020
6  0.8  0.816366  0.065751
7  0.8  0.037474  0.884083
8  1.0  0.450143  0.808945
9  0.4  0.537589  0.809112

Non-integer values raise a TypeError, as might be expected:

In [15]: df.round({'a':1.2})
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-6f51d3fd917d> in <module>()
----> 1 df.round({'a':1.2})

/home/rob/Dropbox/PhD/pandas/pandas/core/frame.py in round(self, places)
   1467 
   1468         if isinstance(places, dict):
-> 1469             new_cols = [col for col in _dict_round(self, places)]
   1470         else:
   1471             new_cols = [col for col in _list_round(self, places)]

/home/rob/Dropbox/PhD/pandas/pandas/core/frame.py in _dict_round(df, places)
   1455             for col in df:
   1456                 try:
-> 1457                     yield np.round(df[col], places[col])
   1458                 except KeyError:
   1459                     yield df[col]

/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.pyc in round_(a, decimals, out)
   2646     except AttributeError:
   2647         return _wrapit(a, 'round', decimals, out)
-> 2648     return round(decimals, out)
   2649 
   2650 

/home/rob/Dropbox/PhD/pandas/pandas/core/series.pyc in round(self, decimals, out)
   1209 
   1210         """
-> 1211         result = _values_from_object(self).round(decimals, out=out)
   1212         if out is None:
   1213             result = self._constructor(result,

TypeError: integer argument expected, got float

jreback · 2015-07-14T18:54:13Z

is there a reason you are not simply using the options display.float_format or display.precision?

jreback · 2015-07-14T19:30:21Z

In [1]: np.random.seed(1234)

In [2]: df = pd.DataFrame(np.random.random([10, 3]), columns=['a', 'b', 'c'])

In [4]: df
Out[4]: 
          a         b         c
0  0.191519  0.622109  0.437728
1  0.785359  0.779976  0.272593
2  0.276464  0.801872  0.958139
3  0.875933  0.357817  0.500995
4  0.683463  0.712702  0.370251
5  0.561196  0.503083  0.013768
6  0.772827  0.882641  0.364886
7  0.615396  0.075381  0.368824
8  0.933140  0.651378  0.397203
9  0.788730  0.316836  0.568099

In [7]: pd.set_option('float_format',lambda x: "%.3f" % x)

In [8]: df
Out[8]: 
      a     b     c
0 0.192 0.622 0.438
1 0.785 0.780 0.273
2 0.276 0.802 0.958
3 0.876 0.358 0.501
4 0.683 0.713 0.370
5 0.561 0.503 0.014
6 0.773 0.883 0.365
7 0.615 0.075 0.369
8 0.933 0.651 0.397
9 0.789 0.317 0.568

In [9]: pd.set_option('precision',3)

In [10]: df
Out[10]: 
      a     b     c
0 0.192 0.622 0.438
1 0.785 0.780 0.273
2 0.276 0.802 0.958
3 0.876 0.358 0.501
4 0.683 0.713 0.370
5 0.561 0.503 0.014
6 0.773 0.883 0.365
7 0.615 0.075 0.369
8 0.933 0.651 0.397
9 0.789 0.317 0.568

shoyer · 2015-07-14T20:01:59Z

I think this is a great idea and would be a useful API addition.

A few comments on the design:

We should support supplying a single scalar value, e.g. df.round(2).
This should also be a Series method (for scalar values). Thus the logic should perhaps live in pandas/core/generic.py.
df.round([1]) should be a ValueError -- if you are referencing columns by position, you should need to supply a list with the same length as the full dataframe. It's not obvious what the single element list refers to (e.g., the first or last column?), so we should resist the temptation to guess.

@jreback float_format and precision don't let you customize the precision by column. Also, this could be useful for other things that these options don't affect, e.g., rounding prior to exporting the data with to_csv.
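To make the to_csv point concrete, here is a minimal sketch. It assumes the dict-based `DataFrame.round` API as it was eventually merged; the column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical data; the column names and values are illustrative only.
df = pd.DataFrame({"a": [0.761651, 0.094071], "b": [0.430963, 0.242381]})

# Round each column to its own number of decimals. This changes the data
# itself, not just how it is displayed.
rounded = df.round({"a": 1, "b": 3})

# Export the rounded frame; the display options play no part here.
buf = io.StringIO()
rounded.to_csv(buf, index=False)
csv_text = buf.getvalue()
```

The display options (`display.float_format`, `display.precision`) are not consulted by `to_csv`, which is why rounding the data first is the relevant step here.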

jreback · 2015-07-14T20:05:28Z

I disagree entirely; this is very duplicative.
why don't we just override each method in numpy?

this is exactly what apply(np.round, axis=1) is for

shoyer · 2015-07-14T20:09:33Z

NumPy arrays do have a round method, and DataFrame currently wraps almost every NumPy method. So I think it would be entirely appropriate to add round. IMO it's more surprising that it's missing.

jreback · 2015-07-14T20:11:30Z

np.round ???

jreback · 2015-07-14T20:11:47Z

ufuncs are there for a reason

jorisvandenbossche · 2015-07-15T08:02:18Z

I also think this would be a nice addition.

First, I think it should be clear that rounding and precision display can be different things. There are enough cases where you don't want to just change how the output looks, but where you want 'real' rounding.

There is indeed np.round(df). But when you have heterogeneous dtypes in your dataframe (eg one column with strings), this already does not work.
Plus, the convenience of being able to specify different precisions per column would possibly be nice.
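The difference between display precision and "real" rounding can be sketched as follows. This assumes a recent pandas in which `DataFrame.round` exists and skips non-numeric columns; the frame itself is hypothetical.

```python
import pandas as pd

# Hypothetical frame mixing a string column with a float column.
df = pd.DataFrame({"name": ["x", "y"], "value": [1.23456, 2.34567]})

# A display option only affects how floats are printed; the stored
# values keep their full precision.
with pd.option_context("display.precision", 2):
    shown = repr(df)
unchanged = df["value"].tolist()

# DataFrame.round changes the values themselves and leaves the
# non-numeric column alone.
rounded = df.round(2)
```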

roblevy · 2015-07-15T12:18:26Z

@jreback display.float_format and display.precision both round every column to the same precision. This functionality allows each column to be rounded to a different precision. This is also why apply(np.round, axis=1) is not the same piece of functionality.

jreback · 2015-07-15T12:51:23Z

@roblevy I get why you want this functionality and I suppose a small expansion of the API is ok, but it has become a creeping API :)

My main concern is that methods should not have any real notion of selection (parameterisation is ok, e.g. .drop_duplicates though).

as @shoyer points out this should take a Series/dict only & certainly should be in pandas/core/generic.py

e.g.

df.round(Series([1,2],list('ab'))) or df.round(dict(a=1,b=2))

rather than a positional indicator
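A small sketch of the label-keyed forms suggested above, assuming the behaviour of `DataFrame.round` as it exists in released pandas (the data is made up). Both calls select columns by label, so columns that are not mentioned are left untouched.

```python
import pandas as pd

# Hypothetical data with three float columns.
df = pd.DataFrame(
    {"a": [0.1234, 5.6789], "b": [0.1234, 5.6789], "c": [0.1234, 5.6789]}
)

# Decimals keyed by column label, given as a dict or as an integer Series.
by_dict = df.round({"a": 1, "b": 2})
by_series = df.round(pd.Series([1, 2], index=["a", "b"]))
```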

roblevy · 2015-07-15T15:41:14Z

Glad to have this accepted in principle, @jreback . Can I make one last effort to convince you that a list input is a good idea?

If, as @shoyer points out, the number of elements in the list matches the number of columns in the DataFrame, then there's no ambiguity and I would "expect" this to just work. Columns which are not to be rounded can then have a None element associated with them.

Yay? Or still nay?

shoyer · 2015-07-15T15:44:46Z

I also don't see a strong need for list like input. A dictionary is more flexible and probably more readable in practice, too

jreback · 2015-07-15T15:53:33Z

yeah, list-likes just cause confusion. It should only accept dict-likes.

jreback · 2015-08-16T00:01:04Z

@roblevy can you update / fix so the tests are passing?

roblevy · 2015-08-16T21:42:51Z

Didn't realise this before, but np.round dispatches to the .round method of whatever you pass it, so I've updated the signature to match what np.round expects.

Thus, the signature is now:

df.round(decimals, out=None)

where out raises a NotImplementedError. We also have to allow decimals to be an integer (or np.integer) because doing np.round(df, 2) simply dispatches to df.round(2).
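The dispatch behaviour described here can be observed with a toy object. This is only a sketch: `Tracker` is a hypothetical class, and whether NumPy passes `decimals` and `out` positionally or by keyword may vary between NumPy versions, so the method accepts both.

```python
import numpy as np
import pandas as pd


class Tracker:
    """Hypothetical object that records how its round method is called."""

    def __init__(self):
        self.calls = []

    def round(self, decimals=0, out=None):
        self.calls.append((decimals, out))
        return "rounded by Tracker.round"


tracker = Tracker()
result = np.round(tracker, 2)  # NumPy hands the call to tracker.round

# The same mechanism means np.round on a DataFrame ends up in DataFrame.round.
df = pd.DataFrame({"a": [0.12345, 0.6789]})
same = np.round(df, 2).equals(df.round(2))
```

Because NumPy forwards `out` as well as `decimals`, a `round` method that does not tolerate an `out` argument would break `np.round(df, 2)`, which is the reason the signature keeps it.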

roblevy · 2015-08-17T18:12:23Z

Ok. @jreback looks like we're good to go.

jorisvandenbossche · 2015-08-18T23:16:29Z

doc/source/whatsnew/v0.17.0.txt

@@ -501,6 +501,17 @@ Other API Changes
 - Enable serialization of lists and dicts to strings in ExcelWriter (:issue:`8188`)
 - Allow passing `kwargs` to the interpolation methods (:issue:`10378`).
 - Serialize metadata properties of subclasses of pandas objects (:issue:`10553`).
+- Round DataFrame to variable number of decimal places (:issue:`10568`).
+=======


some leftover of rebase?

I would move this to enhancements.

jorisvandenbossche · 2015-08-18T23:21:24Z

@roblevy I don't think the out kwarg is needed here. We also don't provide this in eg DataFrame.mean to conform to numpy

shoyer · 2015-08-19T00:08:10Z

@jorisvandenbossche actually, we allow (and ignore) **kwargs on DataFrame.mean, specifically for this reason (numpy compat). Otherwise np.mean does not work on a DataFrame.

So I think this is probably the right approach here.

shoyer · 2015-08-19T00:08:32Z

pandas/core/generic.py

@@ -3137,6 +3137,78 @@ def clip_lower(self, threshold, axis=None):
        subset = self.ge(threshold, axis=axis) | isnull(self)
        return self.where(subset, threshold, axis=axis)

+    def round(self, decimals, out=None):


decimals should default to 0, like NumPy.

just add **kwargs for compat with numpy

Sadly numpy needs a positional argument for out with np.round. So I think we should stick with this signature, just removing it from the docstring

jorisvandenbossche · 2015-08-19T00:22:35Z

@shoyer ah yes, OK, but then it is just not explicitly needed (not in the docstring / no need to raise a NotImplementedError)

jreback · 2015-08-19T00:23:42Z

pandas/core/generic.py

+            columns not included in `decimals` will be left as is. Elements
+            of `decimals` which are not columns of the input will be
+            ignored.
+        out: None


remove the out parameter (we are just ignoring it)

jreback · 2015-08-19T00:26:08Z

pandas/tests/test_format.py

+        from distutils.version import LooseVersion
+        df = DataFrame(
+            {'col1': [1.123, 2.123, 3.123], 'col2': [1.234, 2.234, 3.234]})
+        # Round with an integer


put a blank line between tests

jreback · 2015-08-19T00:29:21Z

@roblevy don't be scared off :) as you just got a lot of comments.

jorisvandenbossche · 2015-08-19T22:45:38Z

doc/source/api.rst

@@ -809,6 +809,7 @@ Binary operator functions
   DataFrame.eq
   DataFrame.combine
   DataFrame.combine_first
+   DataFrame.round


I don't think this is the right place (it is not a binary operator). But not sure what the good place is ..

Maybe 'Computations / Descriptive Stats'

jorisvandenbossche · 2015-08-19T22:49:47Z

To be consistent, I would take the same approach as in other numpy-like methods, which is ignoring out instead of raising an error.

jreback · 2015-08-26T01:36:05Z

pandas/core/frame.py

+                except KeyError:
+                    yield df[col]
+
+        if isinstance(decimals, (dict, Series)):


this should be an int dtype series. I think you have to require >= 0. I suppose you could ignore nans as well. I am not sure what np.round would do with these cases, so pls add some tests for validation. If the errors are obtuse, then may need to catch and report a better message.

jreback · 2015-08-26T01:38:06Z

pls also add a section to the docs http://pandas-docs.github.io/pandas-docs-travis/options.html#number-formatting (or if you have a better idea where then pls report).

jreback · 2015-09-01T12:12:54Z

can you rebase / update according to comments

roblevy · 2015-09-01T16:21:36Z

I'm going to push back on handling nan here, because any Series with a nan value becomes a float anyway so numpy returns a sensible error that an integer argument is expected. That ok @jreback?

shoyer · 2015-09-01T16:24:48Z

You can let numpy handle the error processing, but please add a unit test to verify that an appropriate error is raised.
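A sketch of the kind of check being requested, written with plain try/except so it runs without a test framework. It assumes only that a `TypeError` is raised; the exact message, and whether pandas or NumPy raises it, depends on the versions installed. The helper `raises_type_error` is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0.123, 4.567], "b": [0.123, 4.567]})


def raises_type_error(decimals):
    """Return True if df.round(decimals) raises TypeError."""
    try:
        df.round(decimals)
    except TypeError:
        return True
    return False


# A float value for a column is rejected.
float_value_rejected = raises_type_error({"a": 1.2})

# A Series containing NaN is upcast to float, so it is rejected too.
nan_series = pd.Series([1, np.nan], index=["a", "b"])
nan_series_rejected = raises_type_error(nan_series)
```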


jreback · 2015-09-02T11:56:37Z

doc/source/options.rst

@@ -438,3 +438,5 @@ For instance:
   :suppress:

   pd.reset_option('^display\.')
+
+To round floats on a case-by-case basis, you can also use ``Series.round()`` and ``DataFrame.round()``.


you can use :meth:`DataFrame.round` for these

jreback · 2015-09-02T11:58:21Z

@roblevy ok, pls add the tests @shoyer suggests, minor doc comments. pls squash. ping when green.

roblevy · 2015-09-03T10:08:54Z

@jreback Good to go. I've added the tests as requested, and updated the docs as requested. Tests are passing!

ENH: Added DataFrame.round and associated tests

jreback · 2015-09-03T13:41:16Z

@roblevy thanks. nice change!

roblevy · 2015-09-03T13:45:36Z

YAAAY!!! Excellent. I feel very proud.

jreback added the Output-Formatting and API Design labels Jul 15, 2015

jreback added this to the 0.17.0 milestone Jul 15, 2015

roblevy force-pushed the variable-round branch from fcdda34 to 9e9dd48 Compare August 16, 2015 21:34

jorisvandenbossche reviewed Aug 18, 2015

shoyer reviewed Aug 19, 2015

jreback reviewed Aug 19, 2015

jorisvandenbossche reviewed Aug 19, 2015

jreback reviewed Aug 26, 2015

jreback reviewed Sep 2, 2015

roblevy force-pushed the variable-round branch 3 times, most recently from 9123348 to e81ac39 Compare September 2, 2015 13:00

ENH: Added DataFrame.round and associated tests

dc57e2e

roblevy force-pushed the variable-round branch from e81ac39 to dc57e2e Compare September 2, 2015 13:02

jreback added a commit that referenced this pull request Sep 3, 2015

Merge pull request #10568 from roblevy/variable-round

9aafd6d

ENH: Added DataFrame.round and associated tests

jreback merged commit 9aafd6d into pandas-dev:master Sep 3, 2015

roblevy deleted the variable-round branch September 3, 2015 13:44

jreback mentioned this pull request Sep 7, 2015

Period/DatetimeIndex may not be broadcasting correctly #5032

Closed
