Fix df.to_csv() for string arrays when encoded in utf-8 #18013

Merged: jreback merged 3 commits into pandas-dev:master on Nov 7, 2017

rtkaleta (Contributor) commented Oct 28, 2017

It looks like df.to_csv() now works correctly for string arrays when using the ascii encoding, but it is still broken when using utf-8.

rtkaleta changed the title from "to_csv now working for string arrays using ascii, still broken for utf-8" to "Fix df.to_csv() for string arrays when encoded in utf-8" on Oct 28, 2017
gfyoung added the "IO CSV" (read_csv, to_csv) and "Output-Formatting" (__repr__ of pandas objects, to_string) labels on Oct 30, 2017
str_array = [{'names': ['foo', 'bar']}, {'names': ['baz', 'qux']}]
df = pd.DataFrame(str_array)
expected_ascii = '''\
,names

So if you split this into two test functions, you can xfail the non-working one (to at least get things passing).
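The split suggested above might look roughly like the sketch below. It is not the test that was merged: to stay dependency-free it uses the standard library's `unittest.expectedFailure` where the review refers to pytest's xfail marker, and `write_ascii`, `write_utf8`, `ROWS`, `EXPECTED` and `run_suite` are hypothetical stand-ins rather than pandas code. `write_utf8` only imitates the unquoted list rendering reported in this PR.

```python
import csv
import io
import unittest

# Hypothetical stand-ins for the two encodings' code paths; not pandas code.
ROWS = [["", "names"], [0, ["foo", "bar"]], [1, ["baz", "qux"]]]
EXPECTED = ',names\n0,"[\'foo\', \'bar\']"\n1,"[\'baz\', \'qux\']"\n'


def write_ascii(rows):
    """Working path: let the stdlib csv writer stringify each cell."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def write_utf8(rows):
    """Broken path (imitation): list cells lose the quotes around items."""
    def render(cell):
        if isinstance(cell, list):
            return "[" + ", ".join(cell) + "]"
        return cell

    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows([[render(cell) for cell in row] for row in rows])
    return buf.getvalue()


class TestToCsvStringArray(unittest.TestCase):
    def test_ascii(self):
        # The working case is asserted normally.
        self.assertEqual(write_ascii(ROWS), EXPECTED)

    @unittest.expectedFailure
    def test_utf8(self):
        # Known-broken case: recorded as an expected failure, so the
        # suite stays green until the underlying bug is fixed.
        self.assertEqual(write_utf8(ROWS), EXPECTED)


def run_suite():
    """Run both tests quietly and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(TestToCsvStringArray)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)


if __name__ == "__main__":
    result = run_suite()
    print(result.testsRun, len(result.expectedFailures))
```

Keeping the broken case as its own function means the passing assertion still guards the ascii path, while the expected failure documents the outstanding utf-8 problem.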

rtkaleta (Contributor, Author) commented Nov 5, 2017

@jreback It seems the current behaviour stems from the fact that pandas' own UnicodeWriter calls pprint_thing without quote_strings=True, so we get:

>>> from pandas.io.formats.printing import pprint_thing
>>> pprint_thing([u'foo', u'bar'])
u'[foo, bar]'

instead of the more intuitive (at least to me):

>>> pprint_thing([u'foo', u'bar'], quote_strings=True)
u"[u'foo', u'bar']"

A couple of questions come to mind:

  1. Can you recall why we have our own UnicodeWriter here instead of e.g. unicodecsv.writer?
  2. Would it be better to expose the quote_strings (or similar) parameter to the to_csv caller, or is the behaviour I expect so intuitive that we should change things under the hood?
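To make the difference between the two renderings concrete, here is a small self-contained sketch. `render_sequence` is a hypothetical, heavily simplified imitation written for illustration; it is not pandas' `pprint_thing`, and it runs on Python 3, so the `u` prefixes shown in the session above do not appear.

```python
def render_sequence(seq, quote_strings=False):
    """Hypothetical imitation of the two list renderings discussed above."""
    def fmt(item):
        # With quote_strings, string items keep their repr (with quotes);
        # otherwise every item is rendered with plain str().
        if quote_strings and isinstance(item, str):
            return repr(item)
        return str(item)

    return "[" + ", ".join(fmt(item) for item in seq) + "]"


unquoted = render_sequence(["foo", "bar"])
quoted = render_sequence(["foo", "bar"], quote_strings=True)

print(unquoted)  # [foo, bar]
print(quoted)    # ['foo', 'bar']
```

The quoted form matches what `str(["foo", "bar"])` produces, which is also what the standard library's `csv` writer emits for a list-valued cell.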

codecov (bot) commented Nov 5, 2017

Codecov Report

Merging #18013 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #18013      +/-   ##
==========================================
- Coverage   91.24%   91.24%   -0.01%     
==========================================
  Files         163      163              
  Lines       50176    50124      -52     
==========================================
- Hits        45785    45734      -51     
+ Misses       4391     4390       -1

Flag         Coverage Δ
#multiple    89.05% <ø> (ø) ⬆️
#single      40.32% <ø> (-0.02%) ⬇️

Impacted Files                  Coverage Δ
pandas/io/gbq.py                25% <0%> (-58.34%) ⬇️
pandas/tseries/frequencies.py   96% <0%> (-0.11%) ⬇️
pandas/core/frame.py            97.75% <0%> (-0.1%) ⬇️
pandas/io/excel.py              80.39% <0%> (-0.01%) ⬇️
pandas/io/stata.py              93.7% <0%> (-0.01%) ⬇️
pandas/tseries/offsets.py       97.15% <0%> (-0.01%) ⬇️
pandas/io/sas/sas_xport.py      90.27% <0%> (ø) ⬆️
pandas/core/reshape/merge.py    94.26% <0%> (ø) ⬆️
pandas/tslib.py                 100% <0%> (ø) ⬆️
pandas/plotting/_core.py        82.45% <0%> (ø) ⬆️
... and 17 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 39a6b8f...86a3a1f.

jreback (Contributor) commented Nov 6, 2017

will have a look

jreback added this to the 0.22.0 milestone on Nov 7, 2017
jreback merged commit a2d0eed into pandas-dev:master on Nov 7, 2017

jreback (Contributor) commented Nov 7, 2017

Thanks @rtkaleta!

Would love for you to take a stab at fixing the xfailed unicode case!

watercrossing pushed a commit to watercrossing/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017