QST: CSV: export, import and new lines #36062

Closed
1 of 2 tasks
kuraga opened this issue Sep 2, 2020 · 10 comments
Labels
Needs Triage Issue that has not been reviewed by a pandas team member Usage Question

Comments

@kuraga

kuraga commented Sep 2, 2020

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.


Question about pandas

import pandas as pd

df = pd.DataFrame([['a\rb']], columns=['A'])
df.to_csv('df.csv', index=False, sep=';', encoding='utf-8')
df_imported = pd.read_csv('df.csv', sep=';', encoding='utf-8')
(len(df), len(df_imported))  # => (1, 2)

On Linux, the \r character is not escaped by df.to_csv, but pd.read_csv seems to treat it as a line break, so the single row comes back as two.

Is this a bug or a feature? How can I avoid it?

@kuraga kuraga added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Sep 2, 2020
@MarcoGorelli
Member

How about

df = pd.DataFrame([ [r'a\rb'] ], columns=['A'])
df.to_csv('df.csv', index=False, sep=';', encoding='utf-8')
df_imported = pd.read_csv('df.csv', sep=';', encoding='utf-8')
(len(df), len(df_imported)) # => (1, 1)

?

@kuraga
Author

kuraga commented Sep 2, 2020

@MarcoGorelli, that's a synthetic example. The \r comes from real-world data. Excel, AFAIR.

Fact is:

the \r character is not escaped by df.to_csv

@MarcoGorelli
Member

Ah, I see. How about

In [5]: df = pd.DataFrame([ ['a\rb'] ], columns=['A']) 
   ...: df.to_csv('df.csv', index=False, sep=';', encoding='utf-8', escapechar='\r') 
   ...: df_imported = pd.read_csv('df.csv', sep=';', encoding='utf-8') 
   ...: (len(df), len(df_imported))                                                     
Out[5]: (1, 1)

then?

@kuraga
Author

kuraga commented Sep 2, 2020

@MarcoGorelli , ok, it works.

But why escapechar?!

escapechar - str, default None
    String of length 1. Character used to escape sep and quotechar when appropriate.

@MarcoGorelli
Member

Sorry, ignore me, my answer doesn't make sense. I think @gfyoung would know about this

@asishm
Contributor

asishm commented Sep 2, 2020

Add quoting=1 to the to_csv call. I ran into this before in our production jobs, where the ingested files are Excel files generated on Windows but the processing runs in a Debian-based Docker container.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([ ['a\rb'] ], columns=['A'])

In [3]: df.to_csv('df.csv', index=False, sep=';', encoding='utf-8', quoting=1)

In [4]: df_imported = pd.read_csv('df.csv', sep=';', encoding='utf-8')

In [5]: (len(df), len(df_imported))
Out[5]: (1, 1)

In [6]: df
Out[6]:
      A
0  a\rb

In [7]: df_imported
Out[7]:
      A
0  a\rb

@MarcoGorelli
Member

Thanks @asishm ! For completeness, 1 corresponds to csv.QUOTE_ALL.
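The same workaround can be written with the named constant instead of the bare 1. This is only a sketch: it uses in-memory io.StringIO buffers rather than df.csv on disk, which is my own substitution to keep it self-contained.

```python
import csv
import io

import pandas as pd

df = pd.DataFrame([["a\rb"]], columns=["A"])

# Write with every field quoted; csv.QUOTE_ALL is the named constant behind quoting=1.
buf = io.StringIO()
df.to_csv(buf, index=False, sep=";", quoting=csv.QUOTE_ALL)

# Read the same text back; the \r sits inside quotes, so it stays part of the field.
df_imported = pd.read_csv(io.StringIO(buf.getvalue()), sep=";")

print(len(df), len(df_imported))
print(repr(df_imported.loc[0, "A"]))
```

Quoting every field makes the file a bit larger, but it keeps the reader from seeing a bare \r outside quotes.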

@kuraga is this OK to close?

@kuraga
Author

kuraga commented Sep 2, 2020

@MarcoGorelli, @asishm, well... First of all, thanks!

quoting defaults to csv.QUOTE_MINIMAL, so I assumed the behavior would follow the csv module:

csv.QUOTE_MINIMAL
Instructs writer objects to only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator.
Dialect.lineterminator
The string used to terminate lines produced by the writer. It defaults to '\r\n'.

And csv.writer does quote a field containing \r with quoting=csv.QUOTE_MINIMAL. Why doesn't pandas?
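A small standard-library sketch of the difference described above. It rests on an assumption that is not confirmed anywhere in this thread: that the line terminator is what matters. The pandas documentation says to_csv defaults its line terminator to os.linesep, which is '\n' on Linux, so '\r' would no longer be one of the "characters in lineterminator" that QUOTE_MINIMAL looks for. The helper name write_row is made up for illustration.

```python
import csv
import io


def write_row(lineterminator):
    """Write one row containing a bare \\r using QUOTE_MINIMAL and the given terminator."""
    buf = io.StringIO(newline="")  # newline="" disables any newline translation
    writer = csv.writer(
        buf,
        delimiter=";",
        quoting=csv.QUOTE_MINIMAL,
        lineterminator=lineterminator,
    )
    writer.writerow(["a\rb"])
    return buf.getvalue()


# csv module default terminator: \r is part of it, so the field gets quoted.
default_style = write_row("\r\n")
# Assumed Linux-style terminator: \r is not part of it, so the field is written bare.
linux_style = write_row("\n")

print(repr(default_style))  # '"a\rb"\r\n'
print(repr(linux_style))    # 'a\rb\n'
```

If that assumption holds, it would also explain why the problem shows up on Linux in particular.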

@asishm
Contributor

asishm commented Sep 2, 2020

You are correct. What I mentioned was a workaround (should've been explicit there). I'd like the default behavior to work as csv.writer does with QUOTE_MINIMAL. I also forgot that I had raised this issue at that time!

xref #10018, #22678

This can probably be closed as a duplicate of those. Thoughts, @kuraga?

@kuraga
Author

kuraga commented Sep 2, 2020

That covers it exhaustively now, thanks!

@kuraga kuraga closed this as completed Sep 2, 2020