to_csv regression in 0.23.1 #21471


Closed

francois-a opened this issue Jun 14, 2018 · 12 comments
Labels
IO CSV (read_csv, to_csv) · Regression (functionality that used to work in a prior pandas version)
Milestone
0.23.2

Comments

@francois-a

Writing to gzip no longer works with 0.23.1:

with gzip.open('test.txt.gz', 'wt') as f:
    pd.DataFrame([0,1],index=['a','b'], columns=['c']).to_csv(f, sep='\t')

produces corrupted output. This works fine in 0.23.0.

Presumably this is related to #21241 and #21118.

@WillAyd added the Needs Info (clarification about behavior needed to assess the issue) label Jun 14, 2018
@WillAyd
Member

WillAyd commented Jun 14, 2018

Please provide a reproducible example:

http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@Liam3851
Contributor

@WillAyd Francois's example is reproducible for me on Windows 7 using master. The output file test.txt.gz is empty instead of containing data.

If I let pandas do the compression it appears to work fine:

df = pd.DataFrame([0,1],index=['a','b'], columns=['c'])
df.to_csv('C:/temp/test.txt.gz', sep='\t', compression='gzip')

@saidie

saidie commented Jun 14, 2018

Hi,
I also encountered a to_csv problem on 0.23.1, although my case is different from the others:

import sys
import pandas as pd
df = pd.DataFrame([0,1])
df.to_csv(sys.stdout)

This code writes the DataFrame to a file named <stdout>, whereas it is expected to be printed to stdout.

@littleK0i

I also have a problem with "to_csv" specifically on 0.23.1.

Looks like _get_handle() returns f as a file descriptor number (int) instead of a buffer.

            # GH 17778 handles zip compression for byte strings separately.
            buf = f.getvalue()
            if path_or_buf:
                f, handles = _get_handle(path_or_buf, self.mode,
                                         encoding=encoding,
                                         compression=self.compression)
                f.write(buf)
                f.close()

Error text:

  File "/Users/wr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 1745, in to_csv
    formatter.save()
  File "/Users/wr/anaconda3/lib/python3.6/site-packages/pandas/io/formats/csvs.py", line 168, in save
    f.write(buf)
AttributeError: 'int' object has no attribute 'write'

@WillAyd
Member

WillAyd commented Jun 14, 2018

@Liam3851 thanks - I misread the original post so I see the point now.

@saidie and @wildraid please do not add distinct issues to this one. If you feel you have a different issue, please open it separately.

@WillAyd added the Regression and IO CSV labels and removed the Needs Info label Jun 14, 2018
@littleK0i

littleK0i commented Jun 14, 2018

@WillAyd, I did some quick research.

It seems that all file-like objects which cannot be converted to string file paths are affected. Gzip wrappers, stdout, file descriptors - all of these problems have the same origin.

Example with FD:

import pandas
import os

with os.fdopen(3, 'w') as f:
    print(f)
    pandas.DataFrame([0, 1]).to_csv(f)

Output:

<_io.TextIOWrapper name=3 mode='w' encoding='UTF-8'>
Traceback (most recent call last):
  File "gg.py", line 6, in <module>
    pandas.DataFrame([0, 1]).to_csv(f)
  File "/Users/wr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 1745, in to_csv
    formatter.save()
  File "/Users/wr/anaconda3/lib/python3.6/site-packages/pandas/io/formats/csvs.py", line 166, in save
    f.write(buf)
AttributeError: 'int' object has no attribute 'write'

I guess the integer comes from the name attribute of the TextIOWrapper. For stdout it will be <stdout>, etc.
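
A quick way to check that guess (a standalone sketch using os.dup so it doesn't rely on fd 3 being open):

import os
import sys

# The name attribute is a string like '<stdout>' for the wrapped standard
# streams, but a plain int for handles created from a raw file descriptor.
print(repr(sys.stdout.name))       # '<stdout>'
with os.fdopen(os.dup(1), 'w') as f:
    print(repr(f.name))            # an int, e.g. 3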

@minggli
Contributor

minggli commented Jun 14, 2018

I think the issue is caused by #21249 in response to #21227

The correct usage when passing a file handle and expecting compression should be as in francois's case, i.e. passing a gzip file handle or another compressed-archive file handle.

@jeffzi
Contributor

jeffzi commented Jun 14, 2018

Writing to TemporaryFile fails as well. The file remains empty:

import tempfile
import pandas as pd

df = pd.DataFrame([0, 1], index=['a', 'b'], columns=['c'])
with tempfile.TemporaryFile() as f:
    df.to_csv(f)
    f.seek(0)
    print(f.read())

@TomAugspurger added this to the 0.23.2 milestone Jun 14, 2018
@ernoc

ernoc commented Jun 14, 2018

Hi, here are some additional examples of the changes in the behaviour of to_csv.

A common use case is to write a file header once and then write many dataframes' data to that file. Our implementation looks like this:

df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': [1.0, 2.0, 3.0],
})
df2 = ... 

with open('/tmp/no_headers.csv', 'w') as f:
    f.write('col1,col2\n')
    df.to_csv(f, index=False, header=False)
    ...
    df2.to_csv(f, ...
    df3.to_csv(f, ...

This works in 0.23.0 but in 0.23.1 it produces a file that looks like this:

col1,col2
0
3,3.0

What happened here is that pandas has opened a second handle to the same file path in write mode, and our f.write line was flushed last, overwriting some of what pandas wrote.
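
Here is a pandas-free sketch of the same mechanism (assuming a writable /tmp and default block buffering); whichever handle's buffer hits the file last wins:

# Two 'w' handles on the same path both start at offset 0. The outer write sits
# in its buffer; the inner handle (standing in for the one pandas 0.23.1 opens
# behind our back) writes and closes first, then the outer buffer is flushed on
# close and overwrites the start of the file.
with open('/tmp/demo.csv', 'w') as outer:
    outer.write('col1,col2\n')                  # buffered, not yet on disk
    with open('/tmp/demo.csv', 'w') as inner:
        inner.write('1,1.0\n2,2.0\n3,3.0\n')    # flushed when inner closes

with open('/tmp/demo.csv') as f:
    print(f.read())    # header line followed by the mangled tail of the data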

Flushing alone would not help because now pandas will overwrite our data:

with open('/tmp/no_headers.csv', 'w') as f:
    f.write('col1,col2\n')
    f.flush()
    df.to_csv(f, index=False, header=False)

produces:

1,1.0
2,2.0
3,3.0

One workaround is both flushing manually AND giving pandas an append mode:

with open('/tmp/no_headers.csv', 'w') as f:
    f.write('col1,col2\n')
    f.flush()
    df.to_csv(f, index=False, header=False, mode='a')

IMO this is not expected behaviour: if we give pandas an open file handle, we don't expect pandas to find out what the original path was, and open it again on a second file handle.

This is the bit of code where the re-opening is decided: https://github.com/pandas-dev/pandas/blob/master/pandas/io/formats/csvs.py#L139. This gives the <stdout> behaviour pointed out by @saidie: the data is written to a StringIO first, and at the end the file is opened again by path and the contents of the StringIO are written to it.

@jorisvandenbossche
Member

Thanks all for the reports!
There is a PR now that tries to fix this: #21478. Trying it out or reviewing it is certainly welcome.

@minggli
Contributor

minggli commented Jun 14, 2018

Hello, I've raised a PR to remedy this issue; testing and review are welcome. For the reports from @francois-a and @saidie, and the other reproducible examples here, this patch should fix it.

For now, a workaround would be to use a file path or a StringIO.
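
For example, for the original gzip case something like this should work on 0.23.1 (a sketch; the output filenames are just placeholders):

import gzip
from io import StringIO

import pandas as pd

df = pd.DataFrame([0, 1], index=['a', 'b'], columns=['c'])

# Render the CSV into a StringIO and do the compression yourself...
buf = StringIO()
df.to_csv(buf, sep='\t')
with gzip.open('test.txt.gz', 'wt') as f:
    f.write(buf.getvalue())

# ...or pass a plain file path and let pandas handle the compression.
df.to_csv('test2.txt.gz', sep='\t', compression='gzip')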

@WillAyd
Member

WillAyd commented Jun 19, 2018

Closed via #21478
