[Windows bug?] when appending to an already-open file to_csv() treats first data row in dataframe as the header #21577
Not sure I'd consider this a bug - why wouldn't you either close the file you are writing before re-opening it in a different mode, or alternatively just keep a reference to the same handle across all of the writing operations?
I'm not sure why I'd want to close and re-open the file before I'm done writing to it - I'd like to write the metadata and the data block in one shot, then close the file and move on to writing the data for the next station. Closing and then re-opening is just two extra lines of code that I don't think I should have to include. Regardless, the bug (from my point of view) is that to_csv() is ignoring the actual headers in the dataframe and instead is treating the first row of data as the header. I'd be VERY surprised if this is the intended operation. But, I'm a newbie, so I could be wrong :)
I think it is strange to open your file multiple times, especially in different modes, which is why I suggested one option to keep writing to the same file handle. If you remove pandas from the equation altogether and just write arbitrary content, does your code behave the way you expect? The fact that you are getting different behavior on different platforms makes me believe this has more to do with the OS's underlying IO operations w.r.t. Python and not so much with pandas.
Will:

Regarding the OS dependency, I may have been wrong about that. On my Windows 10 machine, older versions do not have this problem, but the updated set of packages does. Specifically, when I tried it on my Windows 10 machine with pandas v0.22.0, the problem did not appear. I then updated everything (using conda update --all). pandas was updated to v0.23.1 and the problem did appear. When I get a chance, I'll update everything on my Mint VM as well and will let you know what I find.

Regarding opening and closing, I think it's odd as well. I just want to open the file once, do all of the write operations, and then close the file. However, I have to close and then re-open to get the pandas output to work. When I re-open the file I have to use append mode to avoid over-writing the header information. By file handle I assume that the file in file = open(filename,'w') is the handle, correct? If so, then I am using the same handle for all write operations (or am trying to anyway).

Regarding removing pandas from the equation, I'm not sure how I'd go about that, honestly. My data are all stored in dataframes and I would like to keep them in dataframes so I can take advantage of the wide variety of pandas tools. As for arbitrary writes, the write operations that make the first three lines of the output file work as expected; it's just when I get to outputting the dataframe that I have problems. I suppose I could concatenate each row of the dataframe into a csv text string and then use a for loop to export each line individually to the file - is this what you had in mind? If so, I suspect that it would work fine because the file.write(string) operations worked just fine.
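For reference, a rough sketch of that kind of pandas-free write loop (the column names and values here are taken from the output shown further down the thread; this is purely an illustration, not code from the report):

filename = 'sample_output1.txt'
rows = [
    ('2018-01-01 00:00:00', 20.1, 10),
    ('2018-01-01 00:03:00', 20.2, 20),
    ('2018-01-01 00:06:00', 20.3, 30),
]
with open(filename, 'w') as f:
    # metadata lines, same as in the snippets elsewhere in this thread
    f.write('StationName:station1\n')
    f.write('version:3\n')
    f.write('height:10\n\n')
    # write the csv block by hand, bypassing pandas entirely
    f.write('index,timestamp,temp,wind\n')
    for i, (ts, temp, wind) in enumerate(rows):
        f.write('{},{},{},{}\n'.format(i, ts, temp, wind))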
It is strange that this changed between pandas versions - it indicates there might be something off with how we treat the file handle in the following:

filename = 'sample_output1.txt'
file = open(filename,'w')
file.write('StationName:station1'+chr(10))
file.write('version:3'+chr(10))
file.write('height:10'+chr(10)+chr(10))
file.flush()
df.to_csv(file,mode='a',header=True,index=True,index_label='index')
file = open(filename,'r')
print(file.read())
file.close()
# `stdout`
# StationName:station1
# version:3
# height:10
#
#
# index,timestamp,temp,wind
# 0,2018-01-01 00:00:00,20.1,10
# 1,2018-01-01 00:03:00,20.2,20
# 2,2018-01-01 00:06:00,20.3,30
Easiest would still be to again do this all with one file handle:

In [6]: with open('sample_output1.txt', 'w') as outfile:
   ...:     outfile.write('StationName:station1'+chr(10))
   ...:     outfile.write('version:3'+chr(10))
   ...:     outfile.write('height:10'+chr(10)+chr(10))
   ...:     df.to_csv(outfile,header=True,index=True,index_label='index')

I don't have a platform set up like yours to test, but I'd still question whether this really is a pandas issue at all or again just goes back to opening multiple file handles in different modes without properly closing any files.
Will:

I also tried removing mode='a' from the to_csv() call and it made no difference - I still get the incorrect behavior. The mode argument in the to_csv() call doesn't seem to have any effect on the output. It seems as if the mode of the file object takes precedence over the to_csv() write mode. I'd guess that this is the intended operation.

Regarding your with statement, I couldn't get it to work. I got an error, AttributeError: '_io.TextIOWrapper' object has no attribute 'open', and I'm not sure how to modify the with statement to get it to run.

I'm also not entirely clear on what you mean by "opening multiple file handles in different modes". The following code only has one open statement, and the problem can be seen if you open the file in a text editor. I don't see how it's functionally any different from the intent of yours - yours intends to make a file object (outfile) and then use it to perform four write operations. The only reason I re-open the file with the 'r' mode is so I can read it and print the result.

file = open(filename,'w')
file.write('StationName:station1'+chr(10))
file.write('version:3'+chr(10))
file.write('height:10'+chr(10)+chr(10))
df.to_csv(file,header=True,index=True,index_label='index')

Thanks,
Chris
I think I was running into a related issue with writing to an already opened file. I made a pull request with a unit test to prevent a future regression. |
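A rough sketch of what such a regression test could look like (an illustration based on the description above, not the actual test from the pull request; pytest's built-in tmp_path fixture is assumed):

import pandas as pd


def test_to_csv_write_to_open_file(tmp_path):
    # write a plain line to an open handle, then hand the same handle to
    # to_csv; on a fixed pandas version the CSV header should survive
    df = pd.DataFrame({"a": ["x", "y", "z"]})
    expected = "manual header\na\nx\ny\nz\n"
    path = tmp_path / "test.txt"
    with open(path, "w") as f:
        f.write("manual header\n")
        df.to_csv(f, header=True, index=False)
    with open(path, "r") as f:
        assert f.read() == expected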
For me on Win10, WinPython 3.6, pandas 0.23.4, the file.flush() is necessary and works like a charm. I'll try to remember to come back and offer non-proprietary sample data. Scrubbed code:
|
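A minimal sketch of that flush-based variant, assuming the same sample df and metadata lines used elsewhere in this thread (not the commenter's actual scrubbed code):

filename = 'sample_output1.txt'
file = open(filename, 'w')
file.write('StationName:station1\n')
file.write('version:3\n')
file.write('height:10\n\n')
file.flush()  # flush the buffered metadata before handing the handle to pandas
df.to_csv(file, header=True, index=True, index_label='index')
file.close()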
TL;DR version
to_csv() ignores the header row and instead treats the first line of data as the header when it is handed a file object that was opened in write mode and already written to.
This problem does appear on Win7, Win10, and a Mint VM when pandas 0.23.1 is installed.
This problem does not appear on Win10 or the Mint VM when pandas 0.22.0 is installed.
All testing I've done has been with either Spyder or Jupyter Notebook installed as part of Anaconda. The choice of Spyder vs. Jupyter does not seem to matter, and neither does the version of Spyder (3.2.6 vs. 3.2.8) or of Jupyter (5.4.0 vs. 5.5.0).
Task background
I'm trying to output a dataframe containing weather data (temperature and wind speed) to a txt file. The first few lines of the file are metadata about the station (name, number, height, etc.). After the metadata, I want to store the time-series weather data. I want to write all information (station metadata and the data block) for a particular station, close the file, and move on to outputting the file for the next station.
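For concreteness, a minimal sketch of the kind of dataframe involved (column names and values are taken from the expected output below; this is an illustration, not the actual station data):

import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2018-01-01 00:00:00',
                                 '2018-01-01 00:03:00',
                                 '2018-01-01 00:06:00']),
    'temp': [20.1, 20.2, 20.3],
    'wind': [10, 20, 30],
})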
Problem Description:
pd.to_csv() is ignoring the header information in the dataframe and is instead treating the first line of data as the header.
The problem occurs when using the following steps to write to the file: open the file in write mode, write the metadata lines with file.write(), and then call df.to_csv() on the same, still-open handle.
Code Sample
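(Reconstructed from the snippet quoted earlier in the thread; whether the original sample included the file.flush() call shown there is unclear, so it is omitted here.)

filename = 'sample_output1.txt'
file = open(filename,'w')
file.write('StationName:station1'+chr(10))
file.write('version:3'+chr(10))
file.write('height:10'+chr(10)+chr(10))
df.to_csv(file,mode='a',header=True,index=True,index_label='index')
file.close()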
This gives output in which the dataframe's column headers are missing and the first row of data sits where the header line should be.
Expected Output
An example of the correct output:
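(Reconstructed here to match the output shown in the comments above:)

StationName:station1
version:3
height:10

index,timestamp,temp,wind
0,2018-01-01 00:00:00,20.1,10
1,2018-01-01 00:03:00,20.2,20
2,2018-01-01 00:06:00,20.3,30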
Side Note:
The output file is formatted correctly if I add an extraneous close/open just before calling to_csv() as follows:
The code below produces the expected output:
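(A reconstruction based on the description above; re-opening in append mode follows the earlier comment about needing 'a' to avoid over-writing the header lines.)

filename = 'sample_output1.txt'
file = open(filename,'w')
file.write('StationName:station1'+chr(10))
file.write('version:3'+chr(10))
file.write('height:10'+chr(10)+chr(10))
file.close()                   # the "extraneous" close ...
file = open(filename,'a')      # ... and re-open in append mode
df.to_csv(file,header=True,index=True,index_label='index')
file.close()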
Dependency Information
As noted above, the problem shows up on Win7, Win10, and Mint when pandas 0.23.1 is installed but does not show up on Win10 or Mint when pandas 0.22.0 is installed.
To further test this, I did a fresh install of Anaconda Navigator (which ships with pandas 0.22.0) on a Mint VM and then ran conda update pandas. The updates performed are as follows (everything other than pandas was updated only as part of the environment solution):
Before conda update pandas, pandas worked as expected for both sample cases (with and without the extraneous file.close()/file=open()).
After updating, pandas began ignoring the dataframe header and treating the first line of data as the header information.
Prior to update, pd.show_versions() returned this:
pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
After the update, pd.show_versions() returned this:
pandas: 0.23.1
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None