Skip to content

BUG: on .to_string(index=False) #25000

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

charlesdong1991
Copy link
Member

@pep8speaks
Copy link

pep8speaks commented Jan 29, 2019

Hello @charlesdong1991! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-12 19:48:17 UTC

@TomAugspurger TomAugspurger added this to the 0.24.1 milestone Jan 29, 2019
@TomAugspurger
Copy link
Contributor

Thanks. Can you add a release note to 0.24.1 under fixed regressions?

@TomAugspurger TomAugspurger added the Output-Formatting __repr__ of pandas objects, to_string label Jan 29, 2019
@codecov
Copy link

codecov bot commented Jan 29, 2019

Codecov Report

Merging #25000 into master will decrease coverage by 49.49%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25000      +/-   ##
==========================================
- Coverage   92.38%   42.88%   -49.5%     
==========================================
  Files         166      166              
  Lines       52400    52402       +2     
==========================================
- Hits        48409    22473   -25936     
- Misses       3991    29929   +25938
Flag Coverage Δ
#multiple ?
#single 42.88% <66.66%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/formats/format.py 50.76% <66.66%> (-47.23%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.15%) ⬇️
... and 124 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6449bc1...9871550. Read the comment docs.

@codecov
Copy link

codecov bot commented Jan 29, 2019

Codecov Report

Merging #25000 into master will increase coverage by 0.02%.
The diff coverage is 94.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25000      +/-   ##
==========================================
+ Coverage   91.86%   91.89%   +0.02%     
==========================================
  Files         179      175       -4     
  Lines       50707    52508    +1801     
==========================================
+ Hits        46583    48250    +1667     
- Misses       4124     4258     +134
Flag Coverage Δ
#multiple 90.44% <94.73%> (-0.01%) ⬇️
#single 40.72% <15.78%> (-0.47%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 93.54% <ø> (-0.66%) ⬇️
pandas/io/formats/format.py 97.81% <94.73%> (-0.1%) ⬇️
pandas/io/gbq.py 75% <0%> (-25%) ⬇️
pandas/plotting/_misc.py 38.46% <0%> (-21.04%) ⬇️
pandas/io/gcs.py 80% <0%> (-20%) ⬇️
pandas/io/clipboard/clipboards.py 18.51% <0%> (-16.27%) ⬇️
pandas/io/clipboard/__init__.py 39.21% <0%> (-15.14%) ⬇️
pandas/compat/__init__.py 77.5% <0%> (-14.5%) ⬇️
pandas/io/s3.py 89.47% <0%> (-10.53%) ⬇️
pandas/io/excel/_xlrd.py 93.93% <0%> (-6.07%) ⬇️
... and 121 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 634577e...5a6b867. Read the comment docs.

@charlesdong1991
Copy link
Member Author

added!

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice that this affects DataFrame.to_string as well.

0.23.4

In [7]: pd.DataFrame({"A": [1]}).to_string(index=False)
Out[7]: 'A\n1'

0.24.0

In [2]: pd.DataFrame({"A": [1]}).to_string(index=False)
Out[2]: ' A\n 1'

can you add a test and fix for that as well?

@@ -22,6 +22,7 @@ Fixed Regressions

- Bug in :meth:`DataFrame.itertuples` with ``records`` orient raising an ``AttributeError`` when the ``DataFrame`` contained more than 255 columns (:issue:`24939`)
- Bug in :meth:`DataFrame.itertuples` orient converting integer column names to strings prepended with an underscore (:issue:`24940`)
- Bug in :meth:`to_string` with ``index`` set to ``False``, leading space is added. (:issue:`24980`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Bug in :meth:`to_string` with ``index`` set to ``False``, leading space is added. (:issue:`24980`)
- Bug in :meth:`Series.to_string` adding a leading space when ``index=False`` (:issue:`24980`)

return format_array(values_to_format, None,
float_format=self.float_format,
na_rep=self.na_rep,
leading_space=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just leading_space=self.index then you don't need to duplicate the call?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... not sure about this... self.index by default is True, not None, right?

Copy link
Contributor

@TomAugspurger TomAugspurger Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do leading_space=bool(self.index) if you want to not allow passing None.

@charlesdong1991
Copy link
Member Author

hmm... i am doubting if this solution is correct... I just find out that for numeric value, this is still incorrect:
pd.Series(1).to_string(index=False), the output still contain leading space
but if it's a mixed array, then leading space goes away, e.g. pd.Series([1, "a"]).to_string(index=False).
The same behaviour remains in DataFrame, only works if column value is not numeric...
e.g. pd.DataFrame({"A": ["a", "b"], "B": [1, "dd"]}).to_string(index=False) works fine
but pd.DataFrame({"A": ["a", "b"], "B": [1, 2]}).to_string(index=False) has leading space...
any thoughts? @TomAugspurger

@TomAugspurger
Copy link
Contributor

Indeed, the changes here are for GenericArrayFormater. We would maybe need something similar for the other formatters...

I'm re-reading #22505 now...

@TomAugspurger
Copy link
Contributor

@gshiba can you confirm that Series.to_string is supposed to have a leading space, even when index=False?

@TomAugspurger
Copy link
Contributor

This was apparently to fix an alignment bug with a mixture of positive / negative values

# 0.23.4
In [3]: print(pd.Series([0, -1]).to_string(index=False))
0
-1
# 0.24.0
In [2]: print(pd.Series([0, -1]).to_string(index=False))
 0
-1

I suppose that's fine, but the change to non-negative series surprises

In [3]: print(pd.Series([0, 1]).to_string(index=False))
 0
 1

Can we apply the leading space just when necessary (any values less than 0?)

@charlesdong1991
Copy link
Member Author

charlesdong1991 commented Jan 29, 2019

@TomAugspurger btw, i tried to fix it while keeping the original as much as possible, probably it's not very efficient, but looks working... look forward to your feedback and suggestion!
FYI, i haven't updated whatsnew since i am not sure if my solution is correct way to go, will update later once this is settled.
now, the output looks like:

# 0.24.0
In [1]: print(pd.Series([0, -1]).to_string(index=False))
 0
-1

In [2]: print(pd.Series([0, 1]).to_string(index=False))
0
1

@gshiba
Copy link
Contributor

gshiba commented Jan 29, 2019

It seems there has been many PRs and Issues for this that it's hard to summarize succinctly, but here goes my understanding.

(My investigation only concerned numerical columns -- integers specifically; I did not look at string columns)

Old history

The leading space originates from the % d in IntArrayFormatter.get_result (106fe99; this was in year 2012. it is now in IntArrayFormatter._format_strings).

So as far as I can tell, the following behavior is intended:

  • there is a leading space for positive values in Int columns
  • there is no leading space (but a negative sign instead) for negative values in Int columns

And thus, when

  • an Int column is the first column in a DataFrame (or we have a Integer Series) and
  • it only contains zero or positive values and
  • index=False,

it looks like there's a redundant white space.

More recent history

In an attempt to fix the redundant white space, PR #11942 was merged, which in turn broke alignment with column headers, depending on the max width of values vs width of column name, etc (as described in Issue #13032)

I had two PRs going:

PR #22505 was merged in 30b942a (and released in 0.24.0), and alignment looks good now, but the leading white space is still there.

Going forward

I understand leading white spaces are unexpected when only positive values are in a Series.

If this is going to be fixed, care should be taken not to break alignment among the values, and also with the column header which could be wider/shorter than the values for DataFrames.

(I hope this doesn't add to the confusion!)

@charlesdong1991
Copy link
Member Author

very detailed and thorough explanation, awesome!!! @gshiba

if self.index:
leading_space = None
else:
leading_space = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this really the idiom? These are both False values, so this is extra confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this just be leading_space = self.index?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

somehow it looks like there is a slight difference between False and None, and if default for leading_space is None which means it's implicit, then a leading_space is added.

if leading_space is False:
     # False specifically, so that the default is
     # to include a space if we get here.
     tpl = u'{v}'
else:
     tpl = u' {v}'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure who added this. It is just ripe for errors. Are there really 3 cases or just 2? If there are 3, then the api for leading_space should be a string that is checked and much more readable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should have 2 cases... leading_space=None basically is equivalent to True IIUC

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leading_space=None preserves the 0.23.4 and earlier behavior "add a leading space when there are any floating point and no missing values".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then let's make this more explicit and use a str argname then like leading_space='compat' or somesuch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@charlesdong1991 can you fix this up

@charlesdong1991 charlesdong1991 changed the title Bug: on .to_string(index=False) BUG: on .to_string(index=False) Jan 31, 2019
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking back at the PR, the relevant change seems to be right at https://github.com/pandas-dev/pandas/pull/22505/files#diff-425b4da47d01dc33d86c5c697e196b70L291.

So, how about ... something like

result = self.adj.adjoin(3, fmt_values)

if all(x[0].startswith(' ') for x in fmt_values):
    result = result..replace('\n ', '').strip()

There's an edge-case there on a series where all the strings start with a ' '...

'\\begin{tabular}{lll}\n\\toprule\n name & mask & weapon
\\\\\n\\midrule\n Raphael & red & sai \\\\\n Donatello &
purple & bo staff \\\\\n\\bottomrule\n\\end{tabular}\n'
'\\begin{tabular}{lll}\n\\toprule\n name & mask & weapon
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These test changes are concerning. Is this reverting an API change between 0.23.4 and 0.24.0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this is the leading space that is wrongly added due to this bug

if self.index:
leading_space = None
else:
leading_space = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leading_space=None preserves the 0.23.4 and earlier behavior "add a leading space when there are any floating point and no missing values".

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 31, 2019

Though that edge case is there in 0.23.4...

>>> print(pd.Series([' a', ' b']).to_string(index=False))
a
 b

We could exclude that erroneous strip in the case of all strings starting with a ' ' with self.series.str.startswith(' ').all(), with suitable type-checking before doing the self.series.str so that doesn't raise.

@gshiba
Copy link
Contributor

gshiba commented Jan 31, 2019

Applying strip may have unintended consequences.

Current behavior

In [1]: import pandas as pd

In [2]: print_ = lambda s: print('\n'.join([f'^{l}$' for l in s.split('\n')]))

In [3]: print_(pd.Series(['a', ' ']).to_string(index=False))
^ a$
^  $

In [4]: pd.__version__
Out[4]: '0.24.0'

If strip is used, I think we'd end up with something like:

In [X]: print_(pd.Series(['a', ' ']).to_string(index=False))
^a$

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Feb 1, 2019 via email

@TomAugspurger
Copy link
Contributor

I don't see a straightforward solution to this, that isn't going to regress in other ways.

It seems to me like we need to push down some of the column formatting to the value formatting. When deciding the values in the array

[10, -10]

we need to know that the array contains negative values, to decide how the first value, 10, should be formatted.

Going to push to 0.24.2.

@charlesdong1991 feel free to keep working on this if you want, and let us know if you get stuck (though I won't be able to take a look for a week or two).

@TomAugspurger TomAugspurger modified the milestones: 0.24.1, 0.24.2 Feb 1, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is looking like too much is changing for a backport, but hard to tell

@@ -22,7 +22,7 @@ Fixed Regressions

- Bug in :meth:`DataFrame.itertuples` with ``records`` orient raising an ``AttributeError`` when the ``DataFrame`` contained more than 255 columns (:issue:`24939`)
- Bug in :meth:`DataFrame.itertuples` orient converting integer column names to strings prepended with an underscore (:issue:`24940`)
- Fixed regression in :class:`Index.intersection` incorrectly sorting the values by default (:issue:`24959`).
- Bug in :meth:`to_string` with ``index`` set to ``False``, leading space is added. (:issue:`24980`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.24.2

if self.index:
leading_space = None
else:
leading_space = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then let's make this more explicit and use a str argname then like leading_space='compat' or somesuch

@charlesdong1991
Copy link
Member Author

charlesdong1991 commented Feb 11, 2019

@TomAugspurger I just added more edge cases in pytest as you mentioned above with my current fix, result looks as expected... e.g.

>>> print(pd.Series([' a', ' b']).to_string(index=False))
 a
 b
>>> print(pd.Series(['10', '-10']).to_string(index=False))
 10
-10
>>> print(pd.Series(['1', '.1']).to_string(index=False))
 1
.1

have to say this fix is not very elegant... but if you agree on this fix, I could make leading_space more explicit as @jreback suggested.

@charlesdong1991
Copy link
Member Author

charlesdong1991 commented Feb 15, 2019

Hi, sorry to bother you @TomAugspurger @jreback master seemed broken last a couple days, so the test didn't pass... today when i am trying to merge the master to get latest changes, and resubmit the PR, i still get an error which seems unrelated to my PR, could you give any hint where I should change in my PR to fix this issue? thank you very much!

Traceback (most recent call last):
  File "C:\Miniconda\lib\site-packages\pkg_resources\__init__.py", line 357, in get_provider
    module = sys.modules[moduleOrReq]
KeyError: 'numpy'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "setup.py", line 738, in <module>
    ext_modules=maybe_cythonize(extensions, compiler_directives=directives),
  File "setup.py", line 481, in maybe_cythonize
    numpy_incl = pkg_resources.resource_filename('numpy', 'core/include')
  File "C:\Miniconda\lib\site-packages\pkg_resources\__init__.py", line 1142, in resource_filename
    return get_provider(package_or_requirement).get_resource_filename(
  File "C:\Miniconda\lib\site-packages\pkg_resources\__init__.py", line 359, in get_provider
    __import__(moduleOrReq)
ModuleNotFoundError: No module named 'numpy'
Obtaining file:///D:/a/1/s
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "C:\Miniconda\lib\site-packages\pkg_resources\__init__.py", line 357, in get_provider
        module = sys.modules[moduleOrReq]
    KeyError: 'numpy'

if self.index:
leading_space = None
else:
leading_space = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@charlesdong1991 can you fix this up

@jreback jreback modified the milestones: 0.24.2, 0.25.0 Feb 16, 2019
@@ -56,7 +56,7 @@ Bug Fixes
- Bug in reading a HDF5 table-format ``DataFrame`` created in Python 2, in Python 3 (:issue:`24925`)
- Bug in reading a JSON with ``orient='table'`` generated by :meth:`DataFrame.to_json` with ``index=False`` (:issue:`25170`)
- Bug where float indexes could have misaligned values when printing (:issue:`25061`)
-
- Bug in :meth:`Series.to_string` adding a leading space when ``index=False`` (:issue:`24980`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.25.0

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@charlesdong1991
Copy link
Member Author

I changed the default value for leading_space to 'compat'as you suggested. Not sure if it's what you mean... feel free to leave your comments and advice. thanks @jreback

@jreback
Copy link
Contributor

jreback commented Mar 20, 2019

if you can merge master i can have a look

@charlesdong1991
Copy link
Member Author

@jreback sorry for my late reply. I have been with other things recently. I just fixed conflicts and merge the latest master. pls take a look if you have time

@charlesdong1991
Copy link
Member Author

Hi, @jreback i fixed the conflicts and merged to master, pls take a look if you get some time, thanks very much!

@jreback
Copy link
Contributor

jreback commented Jun 8, 2019

can you merge master and will look again.

@charlesdong1991
Copy link
Member Author

Hi, @jreback i merged to master and found this type validation failure, do you have idea why/how it is caused?

@WillAyd
Copy link
Member

WillAyd commented Jun 12, 2019

Hmm most likely caused by the merging of #26802 - thought it was green when I merged but apparently not. Let me take a look

@WillAyd
Copy link
Member

WillAyd commented Jun 27, 2019

@charlesdong1991 merging master again should resolve prior issue

@WillAyd
Copy link
Member

WillAyd commented Jul 15, 2019

@charlesdong1991 I think this is stale so closing for now but ping if you want to pick back up

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug on .to_string(index=False)
7 participants