to_csv with lists of strings and unicode encoding produces wrong output #10813
Hi, this is still an issue in pandas:
>>> import pandas as pd
>>> pd.__version__
u'0.18.1'
>>> data = [{'names': ['foo', 'bar']}, {'names': ['baz', 'qux']}]
>>> df = pd.DataFrame(data)
>>> df.to_csv(path_or_buf='temp.csv', encoding='utf-8')
Result:
$ cat temp.csv
,names
0,"[foo, bar]"
1,"[baz, qux]"
An even weirder quirk is that even when
>>> import pandas as pd
>>> data = [{'names': ['foo', 'bar']}, {'names': ['baz', 'qux']}]
>>> df = pd.DataFrame(data)
>>> df.to_csv(path_or_buf='temp.csv', encoding='ascii')
Result:
$ cat temp.csv
,names
0,"[foo, bar]"
1,"[baz, qux]"
Note that this seems to affect only columns containing lists of strings. If the column contains a list of, e.g., dictionaries, the data is written down to
>>> import pandas as pd
>>> data = [{'names': [{'foo': 1}, {'bar': 2}]}, {'names': [{'baz': 3}, {'qux': 4}]}]
>>> df = pd.DataFrame(data)
>>> df.to_csv(path_or_buf='temp.csv', encoding='utf-8')
Result:
$ cat temp.csv
,names
0,"[{u'foo': 1}, {u'bar': 2}]"
1,"[{u'baz': 3}, {u'qux': 4}]"
I suppose. Embedded lists of non-scalars are not first-class citizens of pandas at all, nor are they generally losslessly convertible to/from CSV. JSON is a better format for this. If a community-supported PR is pushed, that would be OK.
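As a rough illustration of the JSON suggestion above, the sketch below round-trips a column of string lists through JSON text. It assumes Python 3 and a reasonably recent pandas; the `io.StringIO` wrapper and `orient="records"` are choices made for this example, not something prescribed in the thread.

```python
import io

import pandas as pd

# Same data as in the report: each cell holds a list of strings.
df = pd.DataFrame([{"names": ["foo", "bar"]}, {"names": ["baz", "qux"]}])

# JSON has a native array type, so the lists are written as arrays
# instead of being flattened into a display string.
json_text = df.to_json(orient="records")

# Read the JSON text back; the cells come back as Python lists.
restored = pd.read_json(io.StringIO(json_text), orient="records")
names = restored["names"].tolist()
print(names)
```

Because the list structure is part of the format itself, no quoting convention has to survive the trip, which is the weak point of the CSV route.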
I think this has been fixed, but not by #17821. It would be nice to ensure we have a regression test in place.
@TomAugspurger Thanks for picking this up, and sorry it took me so long to respond. Looks like this is now fixed for writing string arrays using the
I'll have a stab at a fix... It stems from the fact that pandas' own
>>> from pandas.io.formats.printing import pprint_thing
>>> pprint_thing([u'foo', u'bar'])
u'[foo, bar]'
instead of the more intuitive:
>>> pprint_thing([u'foo', u'bar'], quote_strings=True)
u"[u'foo', u'bar']"
Why do we have our own
Hello,
Ping: any chance this is getting fixed?
@jschendel did #25864 fix this?
I've only been able to reproduce this issue on Python 2; the output has looked fine to me on Python 3, even with some fairly old versions (e.g. 0.20.x). So this issue might no longer be relevant, given that we no longer support Python 2, if it is indeed a Python 2-specific issue.
I also can't reproduce on Python 3. If anyone can, let us know and we'll reopen.
If I have a dataframe with cells containing lists of strings (or unicode strings), then these lists are broken when I use to_csv() with the encoding parameter set. The error does not occur if the encoding is not set.
Here is an example (using pandas version 0.16.2):
The ascii-encoded CSV file is fine. (contents of 'ascii.csv' below)
But the unicode CSV file fails to quote the strings within the lists. (contents of 'utf8.csv' below)
This results in the data being impossible to recover. For example, if I load this file using read_csv(), the relevant cells are treated as strings, and cannot be accurately recast as lists.
The behavior is the same using encoding='utf-16', but I didn't check any other encodings.
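To make the recoverability point concrete, here is a minimal sketch, assuming Python 3 and a recent pandas (where the thread reports the output looks fine). It writes the frame to CSV text, reads it back, and uses `ast.literal_eval` as a converter to rebuild the lists. The use of `literal_eval` is an assumption of this example, not something proposed in the thread; it only works when the inner strings keep their quotes.

```python
import ast
import io

import pandas as pd

df = pd.DataFrame([{"names": ["foo", "bar"]}, {"names": ["baz", "qux"]}])

# With no path given, to_csv returns the CSV text as a string.
csv_text = df.to_csv()

# Without a converter, read_csv hands each cell back as a plain string.
as_text = pd.read_csv(io.StringIO(csv_text), index_col=0)

# With quoted inner strings, literal_eval can rebuild the original lists.
restored = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    converters={"names": ast.literal_eval},
)
recovered = restored["names"].tolist()

# The unquoted form from the report cannot be parsed back this way.
try:
    ast.literal_eval("[foo, bar]")
    broken_is_recoverable = True
except (ValueError, SyntaxError):
    broken_is_recoverable = False

print(recovered, broken_is_recoverable)
```

Even in the working case the cells are strings until a converter is applied, so the round trip depends on the writer's quoting rather than on anything CSV itself guarantees.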