BUG: DataFrame.to_string() - custom formatter ***not*** called for all values in column #45177

Open
3 tasks done
DanielGoldfarb opened this issue Jan 3, 2022 · 5 comments
Labels
Bug Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action Output-Formatting __repr__ of pandas objects, to_string

Comments

@DanielGoldfarb

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
import datetime

print('==============================')
print('pd.__version__=',pd.__version__)

mylist = ['abcd',123,None,456.78,float('nan'),'wxyz',1234567,' ',datetime.datetime.now(),-321,None,3.1415926536]

df = pd.DataFrame(dict(MyColumn=mylist))

print('==============================')
print(df)
print('==============================')

def myformatter(value):
    print('type(value)=',type(value),' value=',value)
    #return f'{value:<10}'
    return '%-10.10s' % value

f = {'MyColumn': myformatter}

s = df.to_string(formatters=f,float_format=myformatter,na_rep='na_rep',
                 index=False,justify='left')

print('==============================')
print(s)
print('==============================')

Issue Description

Per Pandas documentation for DataFrame.to_string, the formatters parameter is a

list, tuple, or dict of one-parameter functions ... Formatter functions to apply to columns’ elements by position or name.

note: "apply to columns’ elements" (it does not say "apply to only some elements")

The "Reproducible example" code demonstrates that the formatter is not called for all elements in the column. See output below.

Specifically, it appears that custom formatters are not called for the following types: float, NoneType, and NaN. This is not a big deal for float, because the float_format parameter allows one to install a formatter function for floats as well. However, it prevents the user from formatting None and NaN values with their own formatting function.

In the reproducible example given, the user has chosen to format the column as left-justified. Because None and NaN values are not passed to the custom formatter, the user is unable to accomplish this.

Here is the output from the above reproducible example:

==============================
pd.__version__= 1.3.5
==============================
                      MyColumn
0                         abcd
1                          123
2                         None
3                       456.78
4                          NaN
5                         wxyz
6                      1234567
7
8   2022-01-03 14:37:24.075486
9                         -321
10                        None
11                    3.141593
==============================
type(value)= <class 'str'>  value= abcd
type(value)= <class 'int'>  value= 123
type(value)= <class 'float'>  value= 456.78
type(value)= <class 'str'>  value= wxyz
type(value)= <class 'int'>  value= 1234567
type(value)= <class 'str'>  value=
type(value)= <class 'datetime.datetime'>  value= 2022-01-03 14:37:24.075486
type(value)= <class 'int'>  value= -321
type(value)= <class 'float'>  value= 3.1415926536
==============================
MyColumn
abcd
123
      None
456.78
    na_rep
wxyz
1234567

2022-01-03
-321
      None
3.14159265
==============================

Expected Behavior

Custom formatter functions should be called for all elements in the specified column.

Since installation of custom formatters is already split between two parameters (formatters and float_format), it may be reasonable to pass NoneType to the formatters function and NaN to the float_format formatter. Alternatively, there should be a way to pass every element to one single formatter function for the column, regardless of the element's type.
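
Until such an option exists, one possible workaround is to format the column up front and only then call to_string. The sketch below is an assumption about how that might look, not something proposed in the report: it relies on Series.map calling the function for every element (its na_action defaults to None), and it swaps datetime.now() for a fixed timestamp so the result is repeatable. The `seen` list is a hypothetical helper that records what the formatter receives.

```python
import datetime
import pandas as pd

mylist = ['abcd', 123, None, 456.78, float('nan'), 'wxyz', 1234567, ' ',
          datetime.datetime(2022, 1, 3, 14, 37, 24), -321, None, 3.1415926536]
df = pd.DataFrame(dict(MyColumn=mylist))

seen = []  # hypothetical helper: every value handed to the formatter

def myformatter(value):
    seen.append(value)
    # Left-justify and truncate/pad to exactly 10 characters.
    return '%-10.10s' % value

# Pre-format the column: map() calls myformatter on None and NaN as well.
formatted = df['MyColumn'].map(myformatter)

# The column now holds only strings, so to_string has nothing left to dispatch on.
s = formatted.to_frame().to_string(index=False, justify='left')
print(s)
```

Because every cell is already a 10-character string, the padding chosen by the formatter is what determines alignment, rather than the float or NA handling inside to_string.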

Installed Versions

INSTALLED VERSIONS

commit : 66e3805
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.60.1-microsoft-standard-WSL2
Version : #1 SMP Wed Aug 25 23:20:18 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.5
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 21.3.1
setuptools : 45.2.0
Cython : None
pytest : 6.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 8.0.0.dev
pandas_datareader: 0.10.0
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

@DanielGoldfarb DanielGoldfarb added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 3, 2022
@DanielGoldfarb
Author

DanielGoldfarb commented Jan 3, 2022

Please note: I am happy to make this change myself if someone can point me to the appropriate file/method to change. (Went looking through the code myself but it was not entirely clear to me where the change should go).

I understand there may be issues of compatibility with existing user code. If this is a concern, there are alternative ways to solve this problem. For example, parameter na_rep, which presently accepts only strings, could also accept a formatting function to which None and NaN values are passed. Or, alternatively, a new boolean parameter (that defaults to False) could be introduced that, if True, would result in passing all elements (including float, None, and NaN) to the custom formatting function(s) specified in formatters.

@phofl
Member

phofl commented Jan 3, 2022

This looks like a deliberate decision. Values are formatted like this:

  • if is float -> use float formatter
  • if is na -> use na rep
  • else: use format function

This happens around

def _format(x):
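
The three rules listed in this comment can be sketched in plain Python. This is a simplified illustration, not the pandas source: `render_column` is a hypothetical name, and the special case that prints None as the literal string "None" is inferred from the output shown in the report, where None rows are not replaced by na_rep.

```python
import math

def render_column(values, formatter, float_format, na_rep):
    """Hypothetical helper mimicking the dispatch described above."""
    out = []
    for v in values:
        if isinstance(v, float) and not math.isnan(v):
            # Rule 1: non-NaN floats go to the float formatter.
            out.append(float_format(v))
        elif v is None:
            # Observed in the report: None is printed as "None".
            out.append('None')
        elif isinstance(v, float):
            # Rule 2: NaN is replaced by na_rep; no formatter is called.
            out.append(na_rep)
        else:
            # Rule 3: everything else goes to the column formatter.
            out.append(formatter(v))
    return out

calls = []  # records which values actually reach the column formatter

def column_formatter(v):
    calls.append(v)
    return str(v)

rendered = render_column(['a', 1, None, 2.5, float('nan')],
                         column_formatter, lambda x: 'F', 'NA')
print(rendered)
print(calls)
```

In this sketch only 'a' and 1 reach `column_formatter`, which mirrors the complaint in the report: None and NaN never arrive at the user's function.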

@DanielGoldfarb
Author

@phofl
Patrick, Thanks for pointing out the code. The behavior does have a flavor of something that was done intentionally. I'm just asking myself: if a user deliberately supplies a formatter, do we want the behavior to be such that we let the user format some things but not others?

At any rate, now that I know where to look I will play with the code and run a few tests and see what changes I may think of. Thanks for the quick response.

All the best. --Daniel

@phofl
Member

phofl commented Jan 4, 2022

I can kind of see both sides here. We would probably need a deprecation cycle if we changed this, so that users can adjust their code.

@attack68
Contributor

attack68 commented Jan 4, 2022

For what it's worth, the formatting functions available in Styler behave a bit differently, and there is an outstanding PR for Styler.to_string which might help your case.

Their logic is as follows:

  1. A _default_formatter is constructed at module level, which:

    • if input is float, complex return string rendered with precision and thousands args.
    • if input is int return with zero precision with the thousands arg.
    • otherwise return the input
  2. Is a formatter supplied:

    • Yes, in string format; convert to callable;
    • Yes, as callable;
    • No; apply the _default_formatter;
  3. Is the escape arg used:

    • Yes; escape the input (if string) before passing to formatter in 1)
    • No; maintain the same formatter as in 1).
  4. Is the decimal and thousands str repr args used:

    • Yes; wrap the formatter from 2) so that correct replacements are made.
    • No; maintain the formatter from 2).
  5. Is the hyperlinks arg used:

    • Yes; wrap the formatter with a URL detection and replacement scheme before passing to the formatter in 3).
    • No; maintain the formatter from 3)
  6. Is the na_rep arg used:

    • Yes: if the value is na ignore the formatter and print na_rep
    • No: Use the formatter passed from 4) on the value.

This gives a very flexible function that can be used to control many aspects of the display.

In your particular example, if you wanted to deal with formatting NaN, you could either use na_rep, or leave it unset and code the handling of NaN into your custom formatter function.
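
As a rough illustration of the layering described in this comment, here is a plain-Python sketch of steps 1 and 6 (a default formatter, then an optional na_rep wrapper), plus the alternative of handling missing values inside the custom formatter. All names here (`default_formatter`, `with_na_rep`, `is_na`, `left10`) are hypothetical and this is not the Styler implementation; complex numbers, escaping, decimal replacement, and hyperlinks are left out.

```python
import math

def is_na(x):
    # Treat None and float NaN as missing (a simplification).
    return x is None or (isinstance(x, float) and math.isnan(x))

def default_formatter(x, precision=6, thousands=False):
    sep = ',' if thousands else ''
    if isinstance(x, float):
        return f'{x:{sep}.{precision}f}'
    if isinstance(x, int) and not isinstance(x, bool):
        return f'{x:{sep}d}'
    return x  # anything else is returned unchanged

def with_na_rep(formatter, na_rep=None):
    # Step 6: when na_rep is given, missing values bypass the formatter.
    if na_rep is None:
        return formatter
    return lambda x: na_rep if is_na(x) else formatter(x)

def left10(value):
    # Alternative: no na_rep, missing values handled by the formatter itself.
    if is_na(value):
        return '%-10s' % 'missing'
    return '%-10.10s' % value

fmt = with_na_rep(lambda x: default_formatter(x, precision=2, thousands=True),
                  na_rep='-')
print([fmt(v) for v in [1234567.891, 1234567, 'abc', None, float('nan')]])
print([left10(v) for v in ['abcd', None, float('nan')]])
```

The point of the wrapper design is that each option only decorates the formatter produced by the previous step, so a user-supplied callable sees every value unless na_rep is explicitly requested.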

@jbrockmendel jbrockmendel added Output-Formatting __repr__ of pandas objects, to_string and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 10, 2022
@mroeschke mroeschke added Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action labels Jan 13, 2022