BUG: DataFrame.to_csv not using correct line terminator on Windows within open block #38551
Comments
Does it work when creating the file handle with |
[Edited so the answer is more clear higher-up in the answers]
I literally read that section of the documentation page a bunch of times because I was misspelling path_or_buf... but I missed that comment somehow. Generally, shouldn't it work without having to do that? I read through many of the previous comments, but I believe I missed the ones about newline=''. |
I'd argue we should close this, given that the documentation shows a way to address it. It feels like to_csv should work correctly without newline='' on the open file handle, but if that's complicated or impossible, maybe we should close? |
one option would be to make a PR that tests whether the handle has the |
I like that idea. I can commit to trying to figure out how to do that, but what happens next? A PR is a pull request, right? Meaning someone has already written the code? But how do we turn this reported bug into a request for a feature /change? (I'm sorry I'm not more familiar with how this all works). |
No one has written code for that yet - you are very welcome to fork pandas, make the necessary changes, and then create a pull request. I assume this also affects I assume that a good place for this feature would be at the top of
|
I realize I miscommunicated-- I know no one has written this code. I thought a PR (pull request, right?) meant that someone has already written the code. Which I know is not the case. My question is what do we do with this open bug while I get around to writing some warning code? If anything-- do we just leave it open? Thanks, btw, for your notes on where to make code changes-- I'm (perhaps obviously) not familiar with the inner workings of pandas, and that gives me a good start. I'll give it a try. Two comments about your comments:
|
FWIW,
and the linked footnote: https://docs.python.org/3/library/csv.html#id3 This is probably why pandas also has this behavior. |
I would just leave it open so that you can then reference this issue in your pull request. I think it is worth adding a warning, as multiple people have had this issue before you.
Yes, testing if 'r' is not in mode is probably better.
You are probably right that |
Finally got a working pandas development environment-- once I figured out the Dockerfile stuff, it's pretty cool with VS Code! Question for @twoertwein: what were you thinking getattr(path_or_buf, '') should do? I get an error: object has no attribute '' I'm sure I'm missing something? |
oh, I'm sorry, I meant: |
Got it. I figured out last night that that's probably what you meant. I first need to determine whether path_or_buf / filepath_or_buffer is a string (in which case it's not a previously opened file handle). If it's not a string, I can use getattr(path_or_buf,"newline") to check whether the caller has already passed '' or None as the newline value, in which case the warning isn't required. Also, I was thinking of adding the warning further up the call stack-- perhaps just in core/generic.py in to_csv. I'll get it working in the next few days now that I can debug / walk through the code. |
Update. I've run into a bit of a problem, and would ask for help if anyone has thoughts. I'll keep digging. It doesn't seem, per this Stack Exchange conversation, that the newline parameter passed when a file is opened is actually available as an attribute anywhere. I thought it was-- I thought the file object's newlines attribute was this, but it doesn't seem to be. PEP-0278 says "A file object that has been opened in universal newline mode gets a new attribute "newlines" which reflects the newline convention used in the file. The value for this attribute is one of None (no newline read yet), "\r", "\n", "\r\n" or a tuple containing all the newline types seen." This only seems to apply to files opened for reading, though. Files opened for writing always seem to have a value of None (in the code below, and every other combination I've tried, it's None)
LMK if anyone has thoughts? |
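The original snippet for this comment did not survive in the thread. As a rough stand-in, the sketch below opens a file for writing with several newline values and records the handle's newlines attribute after a write. The file name probe.csv and the observed dict are illustrative assumptions, not taken from the original comment.

```python
import os
import tempfile

observed = {}
with tempfile.TemporaryDirectory() as tmp:
    # "probe.csv" is an arbitrary scratch file used only for this check.
    path = os.path.join(tmp, "probe.csv")
    for newline in (None, "", "\n", "\r\n"):
        with open(path, "w", newline=newline) as handle:
            handle.write("a,b\n")
            # newlines reports line endings seen while *reading*; a
            # write-only handle has nothing to report.
            observed[newline] = handle.newlines

print(observed)
```

In CPython every entry comes back as None, which matches the observation above: the value passed as newline= cannot be recovered from a handle opened for writing.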
sorry, I didn't know that. I assume the only option would be to warn the user when using |
A bit more information. As @asishm said above, csv.writer recommends the same thing (newline=''), which is maybe not surprising since pandas uses csv.writer under the covers (see line 238 in io/formats/csvs.py). Agree, @twoertwein we could warn the user whenever a person on Windows calls to_csv() with a file object not opened in binary mode, but that seems like it sort of punishes everyone using the to_csv() method with a warning, literally every time they use it, for those not reading the documentation and fixing it once. I suggest that I edit my post above that starts with "Yes it does. That resolves it." to provide a bit more clarification on how to resolve (although the post above and below are already good/clear), and that maybe we close or table this issue. If anything, my conclusion now is that it should be corrected in csv.writer, not in pandas, but it'd be good for people searching for this issue to be able to quickly find the solution higher-up in the post. Thoughts? |
@cheitzig Feel free to close it. One option might be to have a warning in |
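For readers wondering what such a warning could look like, here is a minimal sketch. The helper warn_if_text_handle and its os_name parameter are hypothetical and are not part of pandas; the sketch only assumes that a text-mode file handle is an io.TextIOWrapper and that Windows is identified by os.name == "nt".

```python
import io
import os
import warnings


def warn_if_text_handle(path_or_buf, os_name=os.name):
    """Hypothetical check, not part of pandas. Returns True if it warned."""
    # Paths are opened by the library itself, so there is nothing to check.
    if isinstance(path_or_buf, (str, os.PathLike)):
        return False
    # Only text wrappers around a byte stream perform newline translation.
    if os_name == "nt" and isinstance(path_or_buf, io.TextIOWrapper):
        warnings.warn(
            "Text-mode handle passed on Windows; open it with newline='' "
            "to avoid '\\r\\r\\n' line endings.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

Because the newline setting cannot be read back from the handle, this check would also fire for handles that were correctly opened with newline='', which is the drawback raised earlier in the thread.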
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
Problem description
Pandas currently seems inconsistent in how it writes data to csv files using the to_csv method. I've looked at previous issues #20353 and #25048. Both are closed, but this issue seems to happen even right now in production Python 3
Above is a snippet of code, similar to #20353, that reproduces the issue. The result is that file1.csv has line endings that are \r\r\n and file2.csv has line endings that are as expected-- \r\n
Expected Output
In both cases, I'd expect:
,Name,Grade\r\n
0,Charlie,A\r\n
1,Rich,B\r\n
2,Katie,A\r\n
3,Tommy,B\r\n
Instead, in file1.csv, we see each row with a line ending of \r\r\n
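The mechanism behind the doubled \r can be sketched without pandas. The thread notes that pandas writes through csv.writer, so the example below uses csv.writer directly as a stand-in. Passing newline="\r\n" is an assumption made here to mimic the Windows text-mode default on any platform, and write_rows is an illustrative helper, not code from the report.

```python
import csv
import os
import tempfile

rows = [["", "Name", "Grade"], [0, "Charlie", "A"], [1, "Rich", "B"]]


def write_rows(path, newline):
    # csv.writer terminates each row with "\r\n" by default.
    with open(path, "w", newline=newline) as handle:
        csv.writer(handle).writerows(rows)
    # Read back in binary mode so no newline translation hides the bytes.
    with open(path, "rb") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp:
    # newline="\r\n" translates every "\n" written into "\r\n", so the
    # writer's own "\r\n" becomes "\r\r\n".
    translated = write_rows(os.path.join(tmp, "file1.csv"), newline="\r\n")
    # newline="" disables translation, as the csv documentation recommends.
    untouched = write_rows(os.path.join(tmp, "file2.csv"), newline="")

print(translated)
print(untouched)
```

The first result ends each row with \r\r\n and the second with \r\n, which mirrors the difference reported between file1.csv and file2.csv.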
Output of
pd.show_versions()
INSTALLED VERSIONS
commit : b5958ee
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 23 Stepping 7, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.1.5
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.7.5
pip : 20.3.3
setuptools : 50.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None