to_csv with lists of strings and unicode encoding produces wrong output #10813

tdszyman · 2015-08-13T11:50:25Z

If I have a dataframe with cells containing lists of strings (or unicode strings), then these lists are broken when I use to_csv() with the encoding parameter set. The error does not occur if the encoding is not set.

Here is an example (using pandas version 0.16.2):

import pandas as pd

df = pd.DataFrame.from_records(
    [('Mary S.', ['Detroit, MI', 'New York, NY']),
     ('John U.', [u'Atlanta, GA', u'Paris, France'])],
    columns=['name', 'residences'])
df.to_csv('ascii.csv')
df.to_csv('utf8.csv', encoding='utf-8')

The ascii-encoded CSV file is fine. (contents of 'ascii.csv' below)

,name,residences
0,Mary S.,"['Detroit, MI', 'New York, NY']"
1,John U.,"[u'Atlanta, GA', u'Paris, France']"

But the unicode CSV file fails to quote the strings within the lists. (contents of 'utf8.csv' below)

,name,residences
0,Mary S.,"[Detroit, MI, New York, NY]"
1,John U.,"[Atlanta, GA, Paris, France]"

This results in the data being impossible to recover. For example, if I load this file using read_csv(), the relevant cells are treated as strings, and cannot be accurately recast as lists.
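To illustrate why the unquoted form cannot be recovered, here is a minimal sketch using only the standard library (run on Python 3). The helper name `parse_cell` is hypothetical, and the cell strings are copied from the two CSV files above. The quoted cells parse back into lists with `ast.literal_eval`, while the unquoted cell neither parses nor splits unambiguously.

```python
import ast

# Cell contents as they appear in ascii.csv (quoted) and utf8.csv (unquoted).
quoted = "['Detroit, MI', 'New York, NY']"
quoted_py2 = "[u'Atlanta, GA', u'Paris, France']"
unquoted = "[Detroit, MI, New York, NY]"


def parse_cell(cell):
    """Hypothetical helper: return the list literal in `cell`, or None."""
    try:
        return ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return None


recovered = parse_cell(quoted)
recovered_py2 = parse_cell(quoted_py2)
lost = parse_cell(unquoted)

# Splitting on ", " cannot tell item separators from commas inside items.
naive = unquoted.strip("[]").split(", ")

print(recovered)      # two items
print(recovered_py2)  # two items; the u prefix is accepted on Python 3
print(lost)           # None: not a valid literal
print(naive)          # four fragments instead of two residences
```

The naive split yields four fragments for what were originally two residences, which is the sense in which the data is lost.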

The behavior is the same using encoding='utf-16' but I didn't check any other encodings.

rtkaleta · 2016-09-15T10:21:13Z

Hi,

This is still an issue in Pandas v0.18.1:

>>> import pandas as pd
>>> pd.__version__
u'0.18.1'
>>> data = [{'names': ['foo', 'bar']}, {'names': ['baz', 'qux']}]
>>> df = pd.DataFrame(data)
>>> df.to_csv(path_or_buf='temp.csv', encoding='utf-8')

Result:

$ cat temp.csv
,names
0,"[foo, bar]"
1,"[baz, qux]"

An even weirder quirk: with encoding='ascii', i.e. when we explicitly set the encoding to its apparent default, the result is also broken:

>>> import pandas as pd
>>> data = [{'names': ['foo', 'bar']}, {'names': ['baz', 'qux']}]
>>> df = pd.DataFrame(data)
>>> df.to_csv(path_or_buf='temp.csv', encoding='ascii')

Result:

$ cat temp.csv
,names
0,"[foo, bar]"
1,"[baz, qux]"

Note that this seems to affect only columns containing lists of strings. If the column contains a list of, e.g., dictionaries, the data is written to CSV correctly:

>>> import pandas as pd
>>> data = [{'names': [{'foo': 1}, {'bar': 2}]}, {'names': [{'baz': 3}, {'qux': 4}]}]
>>> df = pd.DataFrame(data)
>>> df.to_csv(path_or_buf='temp.csv', encoding='utf-8')

Result:

$ cat temp.csv
,names
0,"[{u'foo': 1}, {u'bar': 2}]"
1,"[{u'baz': 3}, {u'qux': 4}]"

jreback · 2016-09-15T10:27:38Z

I suppose. Embedded lists of non-scalars are not first-class citizens of pandas at all, nor are they generally losslessly convertible to/from CSV. JSON is a better format for this. If a community-supported PR is pushed, that would be OK.
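One way to act on the JSON suggestion while still producing a CSV file is to serialize each list cell as a JSON string before writing and decode it after reading. The sketch below is only an illustration of that idea, not a pandas feature. It uses the standard-library `csv` and `json` modules so it runs without pandas, and the sample rows are taken from the original report. With a DataFrame, the analogous step would be to map `json.dumps` over the list column before calling `to_csv`.

```python
import csv
import io
import json

rows = [
    ("Mary S.", ["Detroit, MI", "New York, NY"]),
    ("John U.", ["Atlanta, GA", "Paris, France"]),
]

# Write: encode each list cell as a JSON string; csv handles the quoting.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "residences"])
for name, residences in rows:
    writer.writerow([name, json.dumps(residences)])

# Read: decode the JSON string back into a list.
buf.seek(0)
reader = csv.reader(buf)
header = next(reader)
restored = [(name, json.loads(cell)) for name, cell in reader]

print(restored == rows)
```

Because JSON always quotes strings, the round trip does not depend on how the CSV writer chooses to render Python objects.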

TomAugspurger · 2017-10-10T15:43:11Z

I think this has been fixed, but not by #17821. It would be nice to ensure we have a regression test in place.

rtkaleta · 2017-10-28T18:08:25Z

@TomAugspurger Thanks for picking this up, and sorry it took me so long to respond. Looks like this is now fixed for writing string arrays using the ascii encoding but still broken for utf-8 encoded values. See #18013.

rtkaleta · 2017-10-28T18:13:20Z

I'll have a stab at a fix... It stems from the fact that pandas' own UnicodeWriter calls pprint_thing without quote_strings=True so then:

>>> from pandas.io.formats.printing import pprint_thing
>>> pprint_thing([u'foo', u'bar'])
u'[foo, bar]'

instead of the more intuitive:

>>> pprint_thing([u'foo', u'bar'], quote_strings=True)
u"[u'foo', u'bar']"

Why do we have our own UnicodeWriter here instead of unicodecsv.writer?
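The effect of the missing quote_strings=True can be sketched without pandas. The function `format_list` below is a hypothetical, heavily simplified stand-in written for illustration; it is not pandas' actual `pprint_thing` implementation. It only shows the difference between joining the `str()` of each item and joining the `repr()` of each item (output shown is for Python 3, so there is no u prefix).

```python
import ast


def format_list(items, quote_strings=False):
    """Hypothetical simplification: render a list with or without quoting."""
    fmt = repr if quote_strings else str
    return "[" + ", ".join(fmt(item) for item in items) + "]"


unquoted = format_list(["foo", "bar"])
quoted = format_list(["foo", "bar"], quote_strings=True)

print(unquoted)  # [foo, bar]
print(quoted)    # ['foo', 'bar']

# Only the quoted rendering is a valid Python literal that can be read back.
round_tripped = ast.literal_eval(quoted)
```

Only the quoted rendering survives a round trip, which matches the observation that the ascii.csv output was recoverable and the utf8.csv output was not.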

rtkaleta · 2017-11-07T19:08:56Z

@jreback This should not have been closed; it got closed because I mentioned it in #18013. Please reopen and I'll have a stab at the fix, thanks.

Rajjae · 2019-01-22T15:57:00Z

Hello,
This is still an issue in Pandas v0.23.4. It is fixed when using the ascii encoding, but still broken when using the utf-8 encoding.

aausch · 2019-07-23T15:15:14Z

Ping: any chance this is getting fixed?

TomAugspurger · 2019-07-23T15:19:17Z

@jschendel did #25864 fix this?

jschendel · 2019-07-23T20:36:28Z

I've only been able to reproduce this issue on Python 2; the output has looked fine to me on Python 3 even using some fairly old versions (e.g. 0.20.x). So this issue might not be relevant anymore in the sense that we no longer support Python 2, if it's indeed the case that this is a Python 2 specific issue.

TomAugspurger · 2019-07-30T03:22:49Z

I also can't reproduce on python 3. If anyone can, let us know and we'll reopen.

jreback added the Output-Formatting (__repr__ of pandas objects, to_string) and IO CSV (read_csv, to_csv) labels Sep 15, 2016

jorisvandenbossche added this to the Someday milestone Sep 15, 2016

jreback mentioned this issue Oct 10, 2017: BUG: Fix default encoding for CSVFormatter.save #17821 (merged)

rtkaleta mentioned this issue Oct 28, 2017: Fix df.to_csv() for string arrays when encoded in utf-8 #18013 (merged)

jreback modified the milestones: Someday, Next Major Release Nov 7, 2017

jreback added Difficulty Intermediate labels Nov 7, 2017

jreback closed this as completed in #18013 Nov 7, 2017

TomAugspurger reopened this Nov 8, 2017

jschendel mentioned this issue Mar 24, 2019: CLN: Remove unicode u string prefix #25864 (merged)

TomAugspurger closed this as completed Jul 30, 2019

TomAugspurger modified the milestones: Contributions Welcome, No action Jul 30, 2019
