-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: line terminator
and '\n to \r\n' problem in Windows(Issue #20353)
#21406
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -147,7 +147,7 @@ MultiIndex | |||
I/O | |||
^^^ | |||
|
|||
- | |||
- (Only for Windows) Bug in :meth:`DataFrame.to_csv`, in which all `\n`s are converted to `\r\n` (:issue:`20353`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move the Windows portion as follows:
Bug in :meth:`DataFrame.to_csv`, in which all `\n`s are converted to `\r\n` on Windows (:issue:`20353`)
b'int,str_crlf\n' \ | ||
b'1,abc\n' \ | ||
b'2,"d\r\nef"\n' \ | ||
b'3,"g\r\nh\r\n\r\ni"\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use parentheses instead of slashes for multi-line strings.
pandas/io/common.py
Outdated
@@ -394,13 +395,13 @@ def _get_handle(path_or_buf, mode, encoding=None, compression=None, | |||
elif is_path: | |||
if compat.PY2: | |||
# Python 2 | |||
f = open(path_or_buf, mode) | |||
f = io.open(path_or_buf, mode, newline='\n') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to use io.open
for this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I referenced this stackoverflow article, but it was not appropreate as all CI tests failed in py2 environment.
I will fix it.
Hello @deflatSOCO! Thanks for updating the PR.
Comment last updated on October 19, 2018 at 05:43 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #21406 +/- ##
==========================================
- Coverage 92.19% 92.19% -0.01%
==========================================
Files 169 169
Lines 50950 50956 +6
==========================================
+ Hits 46975 46980 +5
- Misses 3975 3976 +1
Continue to review full report at Codecov.
|
pandas/io/common.py
Outdated
f = open(path_or_buf, mode) | ||
elif encoding: | ||
# Python 3 and encoding | ||
f = open(path_or_buf, mode, encoding=encoding) | ||
f = open(path_or_buf, mode, encoding=encoding, newline='\n') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deflatSOCO Is there a specific reason for not making newline a function parameter? I would suggest that you make newline a _get_handle
parameter and default it to None then let open
handle it from there.
Then update formats.csvs.CSVFormatter.save
to pass self.line_terminator to the function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bibenb1 Thanks for feedback.
Python document says,
When writing output to the stream, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '' or '\n', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.
formats.csvs.CSVFormatter.save
applies self.line_terminator to csv.writer
(in line157) before calling get_handle
(in line163), so it should not be passed in open()
again.
From this thing, I thought newline value should ALWAYS be '' or '\n', to avoid different behavior between Windows and others, so I did not make it a function parameter.
Willing to be convinced otherwise but I don't think the default should change to |
Right, and I'm good with fixing the original bug ( |
@chris-b1 : Ah, I see what you mean. That is indeed correct based on the tests. @deflatSOCO : Could you change it so that the default remains |
@chris-b1 @gfyoung Please let me convince. If so, the expected output undet the Windows environment should be like this. Is that right?
|
@chris-b1 : Thoughts? IMO @deflatSOCO has a point here. |
I know the current docs say contrary, but my vote is that we continue defaulting to the platform default line terminator ( It preserves backwards compat - so if someone was relying on the In [1]: import csv
In [3]: with open('tmp.csv', 'wt', newline='') as f:
...: writer = csv.writer(f)
...: writer.writerow(['a', 'b'])
...: writer.writerow([1, 2])
...: writer.writerow([3, 4])
In [4]: open('tmp.csv', 'rb').read()
Out[4]: b'a,b\r\n1,2\r\n3,4\r\n' (again fully in support of merging this with bugfix for case with explicit ending) |
can u rebase |
Fair enough. @deflatSOCO : Can you update? (If you can't, I can pick this up) |
Marking for |
* re-defined testcases that suits conversations in PR pandas-dev#21406 * changed default value of line_terminator to os.linesep * changed API document of DataFrame.to_csv * changed "newline" value of "open()" from '\n' to '' * Updated whatsnew document related pages: * Issue pandas-dev#20353 * PR pandas-dev#21406
* Updates: * Updated expected values for some tests about 'to_csv()' method, to deal with new default value of 'line_terminator' arg. * Related Issue: * Issue pandas-dev#20353 * PR pandas-dev#21406
pandas/tests/frame/test_to_csv.py
Outdated
expected = ('"A","B"' + os.linesep + | ||
'1,"foo"' + os.linesep + | ||
'2,"bar"' + os.linesep + | ||
'3,"baz"' + os.linesep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can make a testing utility (i.e. function) for this os.linesep
"magic"
with open(path, mode='rb') as f: | ||
assert f.read() == expected | ||
|
||
df.to_csv(path, line_terminator='\n') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you do separate ensure_cleans() for each case & put a comment for each case
b'2,"d\nef"' + os_linesep + | ||
b'3,"g\nh\n\ni"' + os_linesep | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make separate ensure_cleans here
df.to_csv(sys.stdout, encoding='ascii') | ||
output = sys.stdout.getvalue() | ||
assert output == expected_ascii | ||
assert not sys.stdout.closed | ||
|
||
@pytest.mark.xfail( | ||
os.name == 'nt', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use compat.is_windows_platform()
pandas/util/testing.py
Outdated
|
||
|
||
def convert_rows_list_to_csv_str(rows_list): | ||
csv_str = '' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a mini-doc-string
pandas/util/testing.py
Outdated
csv_str = '' | ||
for row in rows_list: | ||
csv_str += row + os.linesep | ||
return csv_str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use
return os.linesep.join(rows_list)
?
* Requested in PR pandas-dev#21406
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -89,6 +89,8 @@ Other Enhancements | |||
|
|||
Backwards incompatible API changes | |||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |||
- When file path is given by string format in :func:`DataFrame.to_csv`, it disables universal newline support. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear form this whether disabling universal newline support is the previous or new behavior. I think I would phrase it as
:func:`DataFrame.to_csv` now uses :func:`os.linesep` rather than ``'\n'`` for the default line terminator.
pandas/io/common.py
Outdated
@@ -409,13 +409,15 @@ def _get_handle(path_or_buf, mode, encoding=None, compression=None, | |||
elif is_path: | |||
if compat.PY2: | |||
# Python 2 | |||
if mode == 'w': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this change necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for late reply.
Document of csv.writer()
in python2 says:
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
Document of open()
in python2 says:
Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written.
From these two, I thought binary mode should always be activated in python2 when writing.
@deflatSOCO : Could you address the merge conflicts here? @TomAugspurger @jreback : Can you have a look again at this? |
@TomAugspurger @jreback : Can you have a look again at this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks ok to me. i would just request an exapanded whatsnew note. maybe a small sub-section on when a user should care and what they need to do about it. e.g. i write csv's on windows, do I need to do anything?what if I explicity pass line_terminator
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -189,6 +189,7 @@ Other Enhancements | |||
|
|||
Backwards incompatible API changes | |||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |||
- :func:`DataFrame.to_csv` now uses :func:`os.linesep` rather than ``'\n'`` for the default line terminator.(:issue:`20353`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you elabortate on what if anything a user needs to do (e.g. when should they care about this)
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -731,6 +732,7 @@ MultiIndex | |||
I/O | |||
^^^ | |||
|
|||
- Bug in :meth:`DataFrame.to_csv`, in which all `\n`s are converted to `\r\n` on Windows (:issue:`20353`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a repeat?
PR pandas-dev#21406 * Expanded whatsnew documents about the change of to_csv * Resolved duplication
doc/source/whatsnew/v0.24.0.txt
Outdated
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
:func:`DataFrame.to_csv` now uses :func:`os.linesep` rather than ``'\n'`` | ||
for the default line terminator(:issue:`20353`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space after terminator
doc/source/whatsnew/v0.24.0.txt
Outdated
|
||
.. code-block:: ipython | ||
|
||
In [1]: import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can skip the pd import
doc/source/whatsnew/v0.24.0.txt
Outdated
|
||
In [5]: with open("test.csv", mode='rb') as f: | ||
...: print(f.read()) | ||
b'string_with_lf,string_with_crlf\r\nabc,abc\r\n"d\r\nef","d\r\r\nef"\r\n"g\r\nh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can make this simpler, just show [3] , [4]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I make data
smaller, and keep [5] instead of [4]?
With [4], I can't show the terminator of each line in CSV, which is the main point of this change. That's why I want to keep [5].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the point is we want a small example that shows the most common case where a user would want to know what is changing. Too long and you lose the reader, too short and you don't know if the change applies to youl.
doc/source/whatsnew/v0.24.0.txt
Outdated
|
||
In [1]: import pandas as pd | ||
|
||
In [2]: data = pd.DataFrame({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
doc/source/whatsnew/v0.24.0.txt
Outdated
...: data.to_csv(f,index=False,line_terminator='\n') | ||
|
||
In [7]: pd.read_csv("test2.csv") | ||
Out[7]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for writing all this, but just show a single, most important change
PR pandas-dev#21406 - Applied jreback's feedback in PR pandas-dev#21406
@deflatSOCO can you update the whatsnew as discussed above, and rebase on master. |
doc/source/whatsnew/v0.24.0.txt
Outdated
b'string_with_lf,string_with_crlf\n"a\nbc","a\r\nbc"\n' | ||
|
||
|
||
- On windows, the value of ``os.linesep`` is ``'\r\n'``, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
windows --> Windows
doc/source/whatsnew/v0.24.0.txt
Outdated
|
||
|
||
- As default value of ``line_terminator`` changes, just passing file object with ``newline='\n'`` does not set ``'\n'`` to line terminator. | ||
Pass ``line_terminator='\n'`` explicitly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can reword this a little. Say something like:
For files objects, specifying newline
is not sufficient to set the line terminator. You must pass in the line terminator explicitly, even in this case.
just a couple of minor comments / doc-string things. @gfyoung ping when happy. |
@gfyoung you ok with this, can you rebase this? or @deflatSOCO |
@jreback : Can do, given that no reply how has come since the latest comment in > 1 week. |
Alas, there were too many merge conflicts with |
* Use OS line terminator if none is provided * Enforce line terminator selection if one is Originally authored by @deflatSOCO, but reapplied by @gfyoung due to enormous merge conflicts. Closes pandas-devgh-20353.
@jreback : Addressed all comments, rebased, and all is green! PTAL. |
thanks @deflatSOCO and @gfyoung |
* Use OS line terminator if none is provided * Enforce line terminator selection if one is Originally authored by @deflatSOCO, but reapplied by @gfyoung due to enormous merge conflicts. Closes pandas-devgh-20353.
git diff upstream/master -u -- "*.py" | flake8 --diff