[Windows bug?] when appending to an already-open file to_csv() treats first data row in dataframe as the header #21577


Closed
jeremymatt opened this issue Jun 21, 2018 · 10 comments · Fixed by #21696
Labels: IO Data (IO issues that don't fit into a more specific label)
Milestone: 0.24.0

Comments

@jeremymatt

jeremymatt commented Jun 21, 2018

TL;DR version

to_csv() ignores the header row and instead treats the first line of data as the header when using a file object in write mode to:

  1. write a few lines of text to the file, and then
  2. use pd.to_csv() to append a dataframe to the file

The problem appears on Win7, Win10, and a Linux Mint VM when pandas 0.23.1 is installed.
It does not appear on Win10 or the Mint VM when pandas 0.22.0 is installed.

All testing I've done has been in either Spyder or Jupyter Notebook, installed as part of Anaconda. Whether I use Spyder or Jupyter does not seem to matter, and neither does the version of Spyder (3.2.6 vs. 3.2.8) or Jupyter (5.4.0 vs. 5.5.0).

Task background

I'm trying to output a dataframe containing weather data (temperature and wind speed) to a txt file. The first few lines of the file are metadata about the station (name, number, height, etc.). After the metadata, I want to store the time-series weather data. I want to write all of the information (station metadata and the data block) for a particular station, close the file, and move on to outputting the file for the next station.

Problem Description:

pd.to_csv() is ignoring the header information in the dataframe and is instead treating the first line of data as the header.

The problem occurs when using the following steps to write to file:

  • Open the file (mode='w')
  • Write the three lines of header information
  • Call to_csv() on the dataframe

Code Sample

import pandas as pd

#make sample dataframe
temp = pd.Series([20.1,20.2,20.3])
wind = pd.Series([10,20,30])
timestamp = pd.date_range(start='1/1/2018',periods = 3,freq='3min')
df = pd.DataFrame({'timestamp':timestamp,'temp':temp,'wind':wind})

#TEST1: Produces incorrect output
filename = 'sample_output1.txt'
file = open(filename,'w')
file.write('StationName:station1'+chr(10))
file.write('version:3'+chr(10))
file.write('height:10'+chr(10)+chr(10))
df.to_csv(file,mode='a',header=True,index=True,index_label='index')
file = open(filename,'r')
print(file.read())
file.close()

Gives the following output:

StationName:station1
version:3
height:10

00,20.1,10
1,2018-01-01 00:03:00,20.2,20
2,2018-01-01 00:06:00,20.3,30

Expected Output

An example of the correct output:

StationName:station1
version:3
height:10

index,timestamp,temp,wind
0,2018-01-01 00:00:00,20.1,10
1,2018-01-01 00:03:00,20.2,20
2,2018-01-01 00:06:00,20.3,30

Side Note:

The output file is formatted correctly if I add an extraneous close/open just before calling to_csv() as follows:

  • Open the file (mode='w')
  • Write the three lines of header information
  • close file
  • open file (mode='a')
  • Call to_csv() on the dataframe

The code below produces the expected output:

#TEST2: Produces correct output
filename = 'sample_output2.txt'
file = open(filename,'w')
file.write('StationName:station1'+chr(10))
file.write('version:3'+chr(10))
file.write('height:10'+chr(10)+chr(10))

#Note the extraneous close/open of the file
file.close()
file = open(filename,'a')

df.to_csv(file,mode='a',header=True,index=True,index_label='index')
file.close()
file = open(filename,'r')
print(file.read())
file.close()

Dependency Information

As noted above, the problem shows up on Win7, Win10, and Mint when pandas 0.23.1 is installed but does not show up on Win10 or Mint when pandas 0.22.0 is installed.

To further test this, I did a fresh install of Anaconda Navigator (which ships with pandas 0.22.0) on a Mint VM and then ran conda update pandas. The updates performed were as follows (everything other than pandas was updated as part of the environment solution):

    anaconda:        5.1.0-py36_2          --> custom-py36hbbc8b67_0
    ca-certificates: 2017.08.26-h1d4fec5_0 --> 2018.03.07-0         
    certifi:         2018.1.18-py36_0      --> 2018.4.16-py36_0     
    openssl:         1.0.2n-hb7f436b_0     --> 1.0.2o-h20670df_0    
    pandas:          0.22.0-py36hf484d3e_0 --> 0.23.1-py36h637b7d7_0

Before running conda update pandas, pandas worked as expected for both sample cases (with and without the extraneous file.close()/file=open()).
After updating, pandas began ignoring the dataframe header and treating the first line of data as the header information.

Prior to the update, pd.show_versions() returned this:

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

After the update, pd.show_versions() returned this:

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.1
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Jun 21, 2018

Not sure I'd consider this a bug - why wouldn't you either close the file you are writing before re-opening it in a different mode, or alternately just keep a reference to the same handle across all of the writing operations?

@jeremymatt
Author

I'm not sure why I'd want to close and re-open the file before I'm done writing to it - I'd like to write the metadata and the data block in one shot and then close the file and move on to writing the data for the next station. Closing and then re-opening is just two extra lines of code that I don't think I should have to include.

Regardless, the bug (from my point of view) is that to_csv() is ignoring the actual headers in the dataframe and instead is treating the first row of data as the header. I'd be VERY surprised if this is the intended operation. But, I'm a newbie, so I could be wrong :)

@WillAyd
Member

WillAyd commented Jun 21, 2018

I think it is strange to open your file multiple times, especially in different modes, which is why I suggested keeping a reference to the same file handle as one option.

If you remove pandas from the equation altogether and just write arbitrary content, does your code behave the way you expect? The fact that you are getting different behavior on different platforms makes me believe this has more to do with the OS's underlying IO operations w.r.t. Python and not so much with pandas.
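
A pandas-free version of that check might look something like this sketch (illustrative only; it mirrors the original code sample but writes the data block by hand):

filename = 'sample_output_plain.txt'
file = open(filename,'w')
file.write('StationName:station1'+chr(10))
file.write('version:3'+chr(10))
file.write('height:10'+chr(10)+chr(10))
#write the "data block" by hand instead of via to_csv()
file.write('index,timestamp,temp,wind'+chr(10))
file.write('0,2018-01-01 00:00:00,20.1,10'+chr(10))
file.close()

file = open(filename,'r')
print(file.read())
file.close()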

@gfyoung added the IO Data label on Jun 21, 2018
@jeremymatt
Author

Will:
Thanks for getting back to me. I apologize if I don't have the vocabulary to answer your questions well - as I said, I'm a newbie to Python and am not a computer science person. I've also edited the main post to more clearly describe what I see as the problem and to provide additional information.

Regarding the OS dependency, I may have been wrong about that. On my Windows 10 machine, older versions do not have this problem, but the updated set of packages does. Specifically, when I tried it on my Windows 10 machine with pandas v0.22.0, the problem did not appear. I then updated everything (using conda update --all); pandas was updated to v0.23.1 and the problem did appear. When I get a chance, I'll update all on my Mint VM and will let you know what I find.

Regarding opening and closing, I think it's odd as well. I just want to open the file once, do all of the write operations, and then close the file. However, I have to close and then re-open to get the pandas output to work. When I re-open the file I have to use the append mode to avoid over-writing the header information.

By file handle I assume that the file in file = open(filename,'w') is the handle, correct? If so, then I am using the same handle for all write operations (or am trying to anyway).

Regarding removing pandas from the equation, I'm not sure how I'd go about that honestly. My data are all stored in dataframes and I would like to keep it in dataframes so I can take advantage of the wide variety of pandas tools.

As for arbitrary writes, the write operations that produce the first three lines of the output file work as expected; it's just when I get to outputting the dataframe that I have problems. I suppose I could concatenate each row of the dataframe into a csv text string and then use a for loop to export each line individually to the file - is this what you had in mind? If so, I suspect that it would work fine because the file.write(string) operations worked just fine.
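
A rough sketch of that row-by-row idea (it assumes the df and file objects from the code sample above, and the column order is hard-coded purely for illustration):

#write the header line by hand, then one comma-joined string per dataframe row
file.write('index,timestamp,temp,wind'+chr(10))
for idx, row in df.iterrows():
    line = ','.join([str(idx), str(row['timestamp']), str(row['temp']), str(row['wind'])])
    file.write(line+chr(10))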

@chris-b1
Contributor

It is strange that this changed between pandas versions - it indicates something might be off with how we handle the file object in to_csv. In any case, adding a flush to your file fixes the issue - maybe something we should be doing?

filename = 'sample_output1.txt'
file = open(filename,'w')
file.write('StationName:station1'+chr(10))
file.write('version:3'+chr(10))
file.write('height:10'+chr(10)+chr(10))
file.flush()

df.to_csv(file,mode='a',header=True,index=True,index_label='index')
file = open(filename,'r')
print(file.read())
file.close()
# `stdout`
# StationName:station1
# version:3
# height:10
#
#
# index,timestamp,temp,wind
# 0,2018-01-01 00:00:00,20.1,10
# 1,2018-01-01 00:03:00,20.2,20
# 2,2018-01-01 00:06:00,20.3,30

@WillAyd
Member

WillAyd commented Jun 24, 2018

Easiest would still be to again do this all with one file handle:

In [6]: with open('sample_output1.txt', 'w') as outfile:
   ...:     outfile.write('StationName:station1'+chr(10))
   ...:     outfile.write('version:3'+chr(10))
   ...:     outfile.write('height:10'+chr(10)+chr(10))
   ...:     df.to_csv(outfile,header=True,index=True,index_label='index')

I don't have a platform setup like yours to test, but I'd still question whether this really is a pandas issue at all, or whether it again just goes back to opening multiple file handles in different modes without properly closing any files.

@jeremymatt
Author

Will:
Whoops, good point - I forgot file.close() after df.to_csv(). It doesn't change the behavior though - the header values are still ignored by to_csv().

I also tried removing mode='a' from the to_csv() call and it made no difference - I still get the incorrect behavior. The mode argument in the to_csv() call doesn't seem to have any effect on the output. It seems as if the mode of the file object takes precedence over the to_csv() write mode. I'd guess that this is the intended behavior.
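
For reference, the case where mode='a' does apply is when to_csv() is given the path rather than an open handle; a minimal sketch, assuming the same df and metadata lines as in the original sample:

filename = 'sample_output3.txt'
with open(filename,'w') as f:
    f.write('StationName:station1'+chr(10))
    f.write('version:3'+chr(10))
    f.write('height:10'+chr(10)+chr(10))
#the handle is closed when the with-block exits, so mode='a' appends after the metadata
df.to_csv(filename,mode='a',header=True,index=True,index_label='index')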

Regarding your with statement, I couldn't get it to work. I got an error AttributeError: '_io.TextIOWrapper' object has no attribute 'open' and I'm not sure how to modify the with statement to get it to run.

I'm also not entirely clear on what you mean by "opening multiple file handles in different modes". The following code only has one open statement, and the problem can be seen if you open the resulting file in a text editor. I don't see how it's functionally any different from the intent of yours - yours makes a file object (outfile) and then uses it to perform four write operations. The only reason I re-open the file in 'r' mode is so I can read it and print the result.

file = open(filename,'w')
file.write('StationName:station1'+chr(10))
file.write('version:3'+chr(10))
file.write('height:10'+chr(10)+chr(10))
df.to_csv(file,header=True,index=True,index_label='index')

Thanks,
-J

@jeremymatt
Author

Chris:
Thanks for the flush() suggestion - that worked and is a bit more elegant than close/open.

@jdrudolph
Contributor

I think I was running into a related issue with writing to an already opened file.
Everything works in 0.23.0, is broken in 0.23.1 and works again in master.

I made a pull request with a unit test to prevent a future regression.

@jreback added this to the 0.24.0 milestone on Jul 2, 2018
@graphedge

graphedge commented Feb 15, 2019

For me on Win10, WinPython 3.6, pandas 0.23.4, the file.flush() is necessary and works like a charm. I'll try to remember to come back and offer non-proprietary sample data.

scrubbed code:

def writepathsfile(pathsdf=pathsdf,writepath=outputpath,writefile=outputfilename):
    headerlist=['header0','header1','header2']
    wholepath=writepath + writefile + ".path"
    if os.path.exists(wholepath):
        os.remove(wholepath)
    else:
        print("writepathsfile: The file does not exist")
    with open(wholepath, 'w', newline='',encoding='utf-8') as file:
        file.writelines("%s\r\n" % item for item in headerlist)
        file.flush() #without this the first row of the df is treated as header
        pathsdf.loc[:,['col1','col2','col3']].to_csv(
            wholepath, mode='a', index=False,header=False)
