to_csv with lists of strings and unicode encoding produces wrong output #10813

Closed
tdszyman opened this issue Aug 13, 2015 · 11 comments
Labels
IO CSV read_csv, to_csv Output-Formatting __repr__ of pandas objects, to_string

Comments

@tdszyman

If I have a dataframe with cells containing lists of strings (or unicode strings), then these lists are broken when I use to_csv() with the encoding parameter set. The error does not occur if the encoding is not set.

Here is an example (using pandas version 0.16.2):

import pandas as pd

df = pd.DataFrame.from_records(
    [('Mary S.', ['Detroit, MI', 'New York, NY']),
     ('John U.', [u'Atlanta, GA', u'Paris, France'])],
    columns=['name', 'residences'])
df.to_csv('ascii.csv')                   # no encoding argument
df.to_csv('utf8.csv', encoding='utf-8')  # explicit encoding

The ascii-encoded CSV file is fine. (contents of 'ascii.csv' below)

,name,residences
0,Mary S.,"['Detroit, MI', 'New York, NY']"
1,John U.,"[u'Atlanta, GA', u'Paris, France']"

But the unicode CSV file fails to quote the strings within the lists. (contents of 'utf8.csv' below)

,name,residences
0,Mary S.,"[Detroit, MI, New York, NY]"
1,John U.,"[Atlanta, GA, Paris, France]"

This results in the data being impossible to recover. For example, if I load this file using read_csv(), the relevant cells are treated as strings, and cannot be accurately recast as lists.
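To make the recovery problem concrete, here is a minimal sketch in Python 3 (the report above was made on Python 2). It takes the two cell strings shown in the file listings and tries to turn them back into lists; the variable names are illustrative only and no pandas call is involved.

```python
import ast

# Cell text as it appears in ascii.csv (inner strings keep their quotes).
quoted_cell = "['Detroit, MI', 'New York, NY']"
# Cell text as it appears in utf8.csv (inner quotes are gone).
unquoted_cell = "[Detroit, MI, New York, NY]"

# The quoted form is a valid Python literal and parses back to a list.
recovered = ast.literal_eval(quoted_cell)
print(recovered)

# The unquoted form is not a valid literal, so parsing fails.
try:
    ast.literal_eval(unquoted_cell)
    parse_failed = False
except (ValueError, SyntaxError):
    parse_failed = True
print(parse_failed)

# Splitting on ", " by hand cannot tell item separators from commas
# inside an item, so it yields four pieces instead of two.
naive = unquoted_cell.strip("[]").split(", ")
print(naive)
```

The hand-rolled split returns four fragments because the commas inside "Detroit, MI" look exactly like the separators between items.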

The behavior is the same using encoding='utf-16' but I didn't check any other encodings.

@rtkaleta
Contributor

Hi,

This is still an issue in Pandas v0.18.1:

>>> import pandas as pd
>>> pd.__version__
u'0.18.1'
>>> data = [{'names': ['foo', 'bar']}, {'names': ['baz', 'qux']}]
>>> df = pd.DataFrame(data)
>>> df.to_csv(path_or_buf='temp.csv', encoding='utf-8')

Result:

$ cat temp.csv
,names
0,"[foo, bar]"
1,"[baz, qux]"

An even weirder quirk: with encoding='ascii' (i.e. explicitly setting the encoding to its apparent default), the result is also broken:

>>> import pandas as pd
>>> data = [{'names': ['foo', 'bar']}, {'names': ['baz', 'qux']}]
>>> df = pd.DataFrame(data)
>>> df.to_csv(path_or_buf='temp.csv', encoding='ascii')

Result:

$ cat temp.csv
,names
0,"[foo, bar]"
1,"[baz, qux]"

Note that this seems to affect only columns containing lists of strings. If the column contains a list of, e.g., dictionaries, the data is written to CSV correctly:

>>> import pandas as pd
>>> data = [{'names': [{'foo': 1}, {'bar': 2}]}, {'names': [{'baz': 3}, {'qux': 4}]}]
>>> df = pd.DataFrame(data)
>>> df.to_csv(path_or_buf='temp.csv', encoding='utf-8')

Result:

$ cat temp.csv
,names
0,"[{u'foo': 1}, {u'bar': 2}]"
1,"[{u'baz': 3}, {u'qux': 4}]"

@jreback jreback added Output-Formatting __repr__ of pandas objects, to_string IO CSV read_csv, to_csv labels Sep 15, 2016
@jreback
Contributor

jreback commented Sep 15, 2016

I suppose. Embedded lists of non-scalars are not first-class citizens of pandas at all, nor are they generally losslessly convertible to/from CSV. JSON is a better format for this. If a community-supported PR is pushed, that would be OK.
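One way to read the JSON suggestion is to serialize each list cell to a JSON string before it goes into the CSV and decode it after reading. The sketch below is written for Python 3 and uses only the standard-library csv and json modules, so it does not depend on any pandas version; the row data is copied from the original report and the variable names are illustrative.

```python
import csv
import io
import json

rows = [
    ("Mary S.", ["Detroit, MI", "New York, NY"]),
    ("John U.", ["Atlanta, GA", "Paris, France"]),
]

# Write: encode each list as a JSON string so the cell is unambiguous.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "residences"])
for name, residences in rows:
    writer.writerow([name, json.dumps(residences)])

# Read: decode the JSON string back into a list.
buf.seek(0)
reader = csv.reader(buf)
header = next(reader)
restored = [(name, json.loads(cell)) for name, cell in reader]

print(restored == rows)
```

With a DataFrame, the same idea would presumably be to apply json.dumps to the list column before to_csv and json.loads after read_csv; that variant is an assumption and is not shown or tested here.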

@jorisvandenbossche jorisvandenbossche added this to the Someday milestone Sep 15, 2016
@TomAugspurger
Contributor

I think this has been fixed, but not by #17821. It would be nice to ensure we have a regression test in place.

@rtkaleta
Contributor

@TomAugspurger Thanks for picking this up, and sorry it took me so long to respond. Looks like this is now fixed for writing string arrays using the ascii encoding but still broken for utf-8 encoded values. See #18013.

@rtkaleta
Contributor

rtkaleta commented Oct 28, 2017

I'll have a stab at a fix... It stems from the fact that pandas' own UnicodeWriter calls pprint_thing without quote_strings=True so then:

>>> from pandas.io.formats.printing import pprint_thing
>>> pprint_thing([u'foo', u'bar'])
u'[foo, bar]'

instead of the more intuitive:

>>> pprint_thing([u'foo', u'bar'], quote_strings=True)
u"[u'foo', u'bar']"

Why do we have our own UnicodeWriter here instead of unicodecsv.writer?
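The unquoted form is lossy, not just less readable: two different lists can format to the same text. The sketch below (Python 3) uses a hypothetical helper, unquoted_format, that only imitates the unquoted output shown in this comment; it is not pandas' pprint_thing and makes no claim about how that function is implemented.

```python
def unquoted_format(items):
    # Hypothetical stand-in that joins items without quoting them,
    # imitating output such as u'[foo, bar]'. NOT pandas' pprint_thing.
    return "[" + ", ".join(items) + "]"


a = ["Detroit, MI", "New York"]
b = ["Detroit", "MI, New York"]

# Two different lists...
print(a != b)
# ...collapse to the same unquoted text,
print(unquoted_format(a) == unquoted_format(b))
# while the quoted repr keeps them apart.
print(repr(a) != repr(b))
```

Since the mapping from list to text is not one-to-one without quotes, no reader can reliably invert it, which is why quoting matters for round-tripping.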

@jreback jreback modified the milestones: Someday, Next Major Release Nov 7, 2017
@rtkaleta
Contributor

rtkaleta commented Nov 7, 2017

@jreback This should not have been closed; it got closed because I mentioned it in #18013. Please reopen and I'll have a stab at the fix, thanks.

@TomAugspurger TomAugspurger reopened this Nov 8, 2017
@Rajjae

Rajjae commented Jan 22, 2019

Hello,
This is still an issue in Pandas v0.23.4. It is fixed when using the ascii encoding, but still broken when using the utf-8 encoding.

@aausch

aausch commented Jul 23, 2019

Ping: any chance this is getting fixed?

@TomAugspurger
Contributor

@jschendel did #25864 fix this?

@jschendel
Member

I've only been able to reproduce this issue on Python 2; the output has looked fine to me on Python 3 even using some fairly old versions (e.g. 0.20.x). So this issue might not be relevant anymore in the sense that we no longer support Python 2, if it's indeed the case that this is a Python 2 specific issue.

@TomAugspurger
Contributor

I also can't reproduce on python 3. If anyone can, let us know and we'll reopen.
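For anyone who wants to check their own setup, a round-trip script along these lines may help. It is a sketch for Python 3 with pandas installed: it writes the data from the original report with encoding='utf-8', reads it back, and checks whether the list column can be recovered with ast.literal_eval. The file name and the round_trips flag are illustrative, and this is not a test taken from the pandas test suite.

```python
import ast
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "name": ["Mary S.", "John U."],
    "residences": [["Detroit, MI", "New York, NY"],
                   ["Atlanta, GA", "Paris, France"]],
})

# Write with an explicit encoding, then read the file back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "utf8.csv")
    df.to_csv(path, encoding="utf-8", index=False)
    back = pd.read_csv(path, encoding="utf-8")

# The cells come back as strings; try to parse them into lists again.
try:
    restored = [ast.literal_eval(cell) for cell in back["residences"]]
    round_trips = restored == df["residences"].tolist()
except (ValueError, SyntaxError):
    round_trips = False

print(pd.__version__, round_trips)
```

If round_trips comes out False on a Python 3 installation, that would be a reproduction worth reporting along with the printed pandas version.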

@TomAugspurger TomAugspurger modified the milestones: Contributions Welcome, No action Jul 30, 2019
8 participants