ENH: Added DataFrame.round and associated tests #10568

Merged
jreback merged 1 commit into pandas-dev:master from the variable-round branch
Sep 3, 2015

Conversation

roblevy
Contributor

@roblevy roblevy commented Jul 14, 2015

I've found myself doing a lot of DataFrame.to_latex because I'm using pandas to write an academic paper.

I'm constantly messing about with the number of decimal places displayed by doing np.round(df, 2) so thought this flexible round, with different numbers of decimals per column, should be part of the DataFrame API (I'm surprised there isn't already such a piece of functionality.)

Here is an example:

In [9]: df = pd.DataFrame(np.random.random([10, 3]), columns=['a', 'b', 'c'])

In [10]: df
Out[10]: 
          a         b         c
0  0.761651  0.430963  0.440312
1  0.094071  0.242381  0.149731
2  0.620050  0.462600  0.194143
3  0.614627  0.692106  0.176523
4  0.215396  0.888180  0.380283
5  0.492990  0.200268  0.067020
6  0.804531  0.816366  0.065751
7  0.751224  0.037474  0.884083
8  0.994758  0.450143  0.808945
9  0.373180  0.537589  0.809112

In [11]: df.round(dict(b=2, c=4))
Out[11]: 
          a     b       c
0  0.761651  0.43  0.4403
1  0.094071  0.24  0.1497
2  0.620050  0.46  0.1941
3  0.614627  0.69  0.1765
4  0.215396  0.89  0.3803
5  0.492990  0.20  0.0670
6  0.804531  0.82  0.0658
7  0.751224  0.04  0.8841
8  0.994758  0.45  0.8089
9  0.373180  0.54  0.8091

You can also round by column number:

In [12]: df.round([1, 2, 3])
Out[12]: 
     a     b      c
0  0.8  0.43  0.440
1  0.1  0.24  0.150
2  0.6  0.46  0.194
3  0.6  0.69  0.177
4  0.2  0.89  0.380
5  0.5  0.20  0.067
6  0.8  0.82  0.066
7  0.8  0.04  0.884
8  1.0  0.45  0.809
9  0.4  0.54  0.809

and any columns which are not explicitly rounded are unaffected:

In [13]: df.round([1])
Out[13]: 
     a         b         c
0  0.8  0.430963  0.440312
1  0.1  0.242381  0.149731
2  0.6  0.462600  0.194143
3  0.6  0.692106  0.176523
4  0.2  0.888180  0.380283
5  0.5  0.200268  0.067020
6  0.8  0.816366  0.065751
7  0.8  0.037474  0.884083
8  1.0  0.450143  0.808945
9  0.4  0.537589  0.809112

Non-integer values raise a TypeError, as might be expected:

In [15]: df.round({'a':1.2})
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-6f51d3fd917d> in <module>()
----> 1 df.round({'a':1.2})

/home/rob/Dropbox/PhD/pandas/pandas/core/frame.py in round(self, places)
   1467 
   1468         if isinstance(places, dict):
-> 1469             new_cols = [col for col in _dict_round(self, places)]
   1470         else:
   1471             new_cols = [col for col in _list_round(self, places)]

/home/rob/Dropbox/PhD/pandas/pandas/core/frame.py in _dict_round(df, places)
   1455             for col in df:
   1456                 try:
-> 1457                     yield np.round(df[col], places[col])
   1458                 except KeyError:
   1459                     yield df[col]

/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.pyc in round_(a, decimals, out)
   2646     except AttributeError:
   2647         return _wrapit(a, 'round', decimals, out)
-> 2648     return round(decimals, out)
   2649 
   2650 

/home/rob/Dropbox/PhD/pandas/pandas/core/series.pyc in round(self, decimals, out)
   1209 
   1210         """
-> 1211         result = _values_from_object(self).round(decimals, out=out)
   1212         if out is None:
   1213             result = self._constructor(result,

TypeError: integer argument expected, got float

@jreback
Contributor

jreback commented Jul 14, 2015

is there a reason you are not simply using the options display.float_format or display.precision?

@jreback
Contributor

jreback commented Jul 14, 2015

In [1]: np.random.seed(1234)

In [2]: df = pd.DataFrame(np.random.random([10, 3]), columns=['a', 'b', 'c'])

In [4]: df
Out[4]: 
          a         b         c
0  0.191519  0.622109  0.437728
1  0.785359  0.779976  0.272593
2  0.276464  0.801872  0.958139
3  0.875933  0.357817  0.500995
4  0.683463  0.712702  0.370251
5  0.561196  0.503083  0.013768
6  0.772827  0.882641  0.364886
7  0.615396  0.075381  0.368824
8  0.933140  0.651378  0.397203
9  0.788730  0.316836  0.568099

In [7]: pd.set_option('float_format',lambda x: "%.3f" % x)

In [8]: df
Out[8]: 
      a     b     c
0 0.192 0.622 0.438
1 0.785 0.780 0.273
2 0.276 0.802 0.958
3 0.876 0.358 0.501
4 0.683 0.713 0.370
5 0.561 0.503 0.014
6 0.773 0.883 0.365
7 0.615 0.075 0.369
8 0.933 0.651 0.397
9 0.789 0.317 0.568

In [9]: pd.set_option('precision',3)

In [10]: df
Out[10]: 
      a     b     c
0 0.192 0.622 0.438
1 0.785 0.780 0.273
2 0.276 0.802 0.958
3 0.876 0.358 0.501
4 0.683 0.713 0.370
5 0.561 0.503 0.014
6 0.773 0.883 0.365
7 0.615 0.075 0.369
8 0.933 0.651 0.397
9 0.789 0.317 0.568

@shoyer
Member

shoyer commented Jul 14, 2015

I think this is a great idea and would be a useful API addition.

A few comments on the design:

  1. We should support supplying a single scalar value, e.g. df.round(2).
  2. This should also be a Series method (for scalar values). Thus the logic should perhaps live in pandas/core/generic.py.
  3. df.round([1]) should be a ValueError -- if you are referencing columns by position, you should need to supply a list with the same length as the full dataframe. It's not obvious what the single element list refers to (e.g., the first or last column?), so we should resist the temptation to guess.

@jreback float_format and precision don't let you customize the precision by column. Also, this could be useful for other things that these options don't affect, e.g., rounding prior to exporting the data with to_csv.
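
To illustrate the distinction drawn above, here is a minimal sketch, assuming a recent pandas release where `DataFrame.round` exists (the sample values are borrowed from the example at the top of the thread). The display option only changes how the frame is printed, while `round` changes the data that `to_csv` writes:

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [0.761651, 0.094071], "b": [0.430963, 0.242381]})

# The display option affects the printed repr only; to_csv still writes
# the full-precision values.
with pd.option_context("display.precision", 2):
    buf = io.StringIO()
    df.to_csv(buf, index=False)
unrounded_csv = buf.getvalue()

# Rounding the data itself, with a different precision per column,
# changes what is exported.
rounded_csv = df.round({"a": 1, "b": 3}).to_csv(index=False)

print(unrounded_csv)
print(rounded_csv)
```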

@jreback
Contributor

jreback commented Jul 14, 2015

I disagree entirely; this is very duplicative.
why don't we just override each method in numpy?

this is exactly what apply(np.round, axis=1) is for

@shoyer
Member

shoyer commented Jul 14, 2015

NumPy arrays do have a round method, and DataFrame currently wraps almost every NumPy method. So I think it would be entirely appropriate to add round. IMO it's more surprising that it's missing.

@jreback
Contributor

jreback commented Jul 14, 2015

np.round ???

@jreback
Contributor

jreback commented Jul 14, 2015

ufuncs are there for a reason

@jorisvandenbossche
Member

I also think this would be a nice addition.

First, I think it should be clear that rounding and precision display can be different things. There are enough cases where you don't want to just change how the output looks, but where you want 'real' rounding.

There is indeed np.round(df). But when you have heterogeneous dtypes in your dataframe (eg one column with strings), this already does not work.
Plus, the convenience of being able to specify different precisions per column would possibly be nice.
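
A small sketch of the mixed-dtype case described above, assuming the behaviour of current pandas releases (the frame and column names are made up for illustration). Columns that are not numeric, or that are not named in the dict, are expected to pass through unchanged:

```python
import pandas as pd

# Hypothetical frame with one string column and two float columns.
df = pd.DataFrame(
    {
        "name": ["x", "y"],
        "a": [0.761651, 0.094071],
        "b": [0.430963, 0.242381],
    }
)

# Per-column precision: "name" is not listed, so it is left as is.
per_column = df.round({"a": 1, "b": 2})

# A single scalar applies to every numeric column and skips the strings.
scalar = df.round(1)

print(per_column)
print(scalar)
```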

@roblevy
Contributor Author

roblevy commented Jul 15, 2015

@jreback display.float_format and display.precision both round every column to the same precision. This functionality allows each column to be rounded to a different precision. This is also why apply(np.round, axis=1) is not the same piece of functionality.

@jreback
Contributor

jreback commented Jul 15, 2015

@roblevy I get why you want this functionality and I suppose a small expansion of the API is ok, but it has become a creeping API :)

My main concern is that methods should not have any real notion of selection (parameterisation is ok, e.g. .drop_duplicates though).

as @shoyer points out this should take a Series/dict only & certainly should be in pandas/core/generic.py

e.g.

df.round(Series([1,2],list('ab'))) or df.round(dict(a=1,b=2))

rather than a positional indicator
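
As a rough illustration of the label-based input proposed here, the following sketch assumes a recent pandas release and uses made-up sample data. A Series indexed by column name and a dict should select the same columns, and columns that are not mentioned are left alone:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [0.761651, 0.094071],
        "b": [0.430963, 0.242381],
        "c": [0.440312, 0.149731],
    }
)

# Decimals keyed by column label, once as a dict and once as a Series.
by_dict = df.round({"a": 1, "b": 2})
by_series = df.round(pd.Series([1, 2], index=["a", "b"]))

print(by_series)
```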

@jreback jreback added the Output-Formatting (__repr__ of pandas objects, to_string) and API Design labels Jul 15, 2015
@jreback jreback added this to the 0.17.0 milestone Jul 15, 2015
@roblevy
Contributor Author

roblevy commented Jul 15, 2015

Glad to have this accepted in principle, @jreback . Can I make one last effort to convince you that a list input is a good idea?

If, as @shoyer points out, the number of elements in the list matches the number of columns in the DataFrame then there's no ambiguity and I would "expect" this to just work. Then columns which are not to be rounded can have a None element associated with them.

Yay? Or still nay?

@shoyer
Member

shoyer commented Jul 15, 2015

I also don't see a strong need for list like input. A dictionary is more flexible and probably more readable in practice, too

@jreback
Contributor

jreback commented Jul 15, 2015

yeah, list-like just causes confusion. It should only accept dict-like.

@jreback
Contributor

jreback commented Aug 16, 2015

@roblevy can you update / fix so the tests are passing?

@roblevy
Contributor Author

roblevy commented Aug 16, 2015

Didn't realise this before, but np.round dispatches to the .round method of whatever you pass it, so I've updated the signature to match what np.round expects.

Thus, the signature is now:

df.round(decimals, out=None)

where out raises a NotImplementedError. We also have to allow decimals to be an integer (or np.integer) because doing np.round(df, 2) simply dispatches to df.round(2).
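
The dispatch described above can be checked with a short sketch, assuming current NumPy and pandas releases (the sample values are illustrative). Calling `np.round` on a DataFrame is expected to hand off to the frame's own `round` method and return a DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0.761651, 0.094071], "b": [0.430963, 0.242381]})

# np.round looks for a .round method on its argument and calls it.
via_numpy = np.round(df, 2)
via_method = df.round(2)

print(type(via_numpy).__name__)
print(via_numpy.equals(via_method))
```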

@roblevy
Contributor Author

roblevy commented Aug 17, 2015

Ok. @jreback looks like we're good to go.

@@ -501,6 +501,17 @@ Other API Changes
- Enable serialization of lists and dicts to strings in ExcelWriter (:issue:`8188`)
- Allow passing `kwargs` to the interpolation methods (:issue:`10378`).
- Serialize metadata properties of subclasses of pandas objects (:issue:`10553`).
- Round DataFrame to variable number of decimal places (:issue:`10568`).
=======
Member

some leftover of rebase?

Contributor

I would move this to enhancements.

@jorisvandenbossche
Member

@roblevy I don't think the out kwarg is needed here. We also don't provide this in eg DataFrame.mean to conform to numpy

@shoyer
Member

shoyer commented Aug 19, 2015

@jorisvandenbossche actually, we allow (and ignore) **kwargs on DataFrame.mean, specifically for this reason (numpy compat). Otherwise np.mean does not work on a DataFrame.

So I think this is probably the right approach here.
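
To make the compatibility argument concrete, here is a toy sketch. `Box` is a hypothetical class, not part of pandas or NumPy, and the sketch assumes that `np.round` forwards `decimals` and `out` to the object's `round` method, as it does in the NumPy versions I am aware of. Accepting and ignoring `out` is what lets the NumPy function work on the object:

```python
import numpy as np


class Box:
    """Hypothetical container used only to illustrate the dispatch."""

    def __init__(self, values):
        self.values = list(values)

    def round(self, decimals=0, out=None):
        # `out` is accepted purely so that np.round(box, n) can call this
        # method; it is ignored rather than honoured.
        return Box(round(v, decimals) for v in self.values)


box = Box([1.234, 5.678])
result = np.round(box, 2)

print(type(result).__name__, result.values)
```

If `round` did not accept `out`, the NumPy call would not be able to use this method as written, which is the motivation given above for keeping the parameter in the signature.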

@@ -3137,6 +3137,78 @@ def clip_lower(self, threshold, axis=None):
subset = self.ge(threshold, axis=axis) | isnull(self)
return self.where(subset, threshold, axis=axis)

def round(self, decimals, out=None):
Member

decimals should default to 0, like NumPy.

Contributor

just add **kwargs for compat with numpy

Member

Sadly numpy needs a positional argument for out with np.round. So I think we should stick with this signature, just removing it from the docstring

@jorisvandenbossche
Member

@shoyer ah yes, OK, but then it is just not explicitly needed (not in the docstring/not raise a NotImplementedError)

columns not included in `decimals` will be left as is. Elements
of `decimals` which are not columns of the input will be
ignored.
out: None
Contributor

remove the out parameter (we are just ignoring it)

from distutils.version import LooseVersion
df = DataFrame(
{'col1': [1.123, 2.123, 3.123], 'col2': [1.234, 2.234, 3.234]})
# Round with an integer
Contributor

put a blank line between tests

@jreback
Contributor

jreback commented Aug 19, 2015

@roblevy don't be scared off :) as you just got a lot of comments.

@@ -809,6 +809,7 @@ Binary operator functions
DataFrame.eq
DataFrame.combine
DataFrame.combine_first
DataFrame.round
Member

I don't think this is the right place (it is not a binary operator). But not sure what the good place is ..

Maybe 'Computations / Descriptive Stats'

@jorisvandenbossche
Member

To be consistent, I would take the same approach as in other numpy-like methods, which is ignoring out instead of raising an error.

except KeyError:
yield df[col]

if isinstance(decimals, (dict, Series)):
Contributor

this should be an int dtype series. I think you have to require >= 0. I suppose you could ignore nans as well. I am not sure what np.round would do with these cases, so pls add some tests for validation. If the errors are obtuse, then may need to catch and report a better message.

@jreback
Contributor

jreback commented Aug 26, 2015

pls also add a section to the docs http://pandas-docs.github.io/pandas-docs-travis/options.html#number-formatting (or if you have a better idea where then pls report).

@jreback
Contributor

jreback commented Sep 1, 2015

can you rebase / update according to comments

@roblevy
Contributor Author

roblevy commented Sep 1, 2015

I'm going to push back on handling nan here, because any Series with a nan value becomes a float anyway so numpy returns a sensible error that an integer argument is expected. That ok @jreback?

@shoyer
Member

shoyer commented Sep 1, 2015

You can let numpy handle the error processing, but please add a unit test to verify that an appropriate error is raised.
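
A sketch of the kind of check being asked for, assuming a recent pandas release; the exact error message, and whether NumPy or pandas raises it, may differ between versions. A decimals Series containing NaN is upcast to float, so rounding with it is expected to fail with a TypeError:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0.761651], "b": [0.430963]})

# A NaN forces the whole Series to a float dtype.
decimals = pd.Series([1, np.nan], index=["a", "b"])
print(decimals.dtype)

try:
    df.round(decimals)
    raised = None
except TypeError as exc:
    # Non-integer decimals are rejected.
    raised = exc

print(type(raised).__name__)
```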

@@ -438,3 +438,5 @@ For instance:
:suppress:

pd.reset_option('^display\.')

To round floats on a case-by-case basis, you can also use ``Series.round()`` and ``DataFrame.round()``.
Contributor

you can use :meth:`DataFrame.round` for these

@jreback
Contributor

jreback commented Sep 2, 2015

@roblevy ok, pls add the tests @shoyer suggests, minor doc comments. pls squash. ping when green.

@roblevy roblevy force-pushed the variable-round branch 3 times, most recently from 9123348 to e81ac39 Compare September 2, 2015 13:00
@roblevy
Contributor Author

roblevy commented Sep 3, 2015

@jreback Good to go. I've added the tests as requested, and updated the docs as requested. Tests are passing!

jreback added a commit that referenced this pull request Sep 3, 2015
ENH: Added DataFrame.round and associated tests
@jreback jreback merged commit 9aafd6d into pandas-dev:master Sep 3, 2015
@jreback
Contributor

jreback commented Sep 3, 2015

@roblevy thanks. nice change!

@roblevy roblevy deleted the variable-round branch September 3, 2015 13:44
@roblevy
Contributor Author

roblevy commented Sep 3, 2015

YAAAY!!! Excellent. I feel very proud.
