read_csv parse issue with newline in quoted items combined with skiprows #10911


Closed
cstegemann opened this issue Aug 27, 2015 · 5 comments
Labels
IO CSV read_csv, to_csv Usage Question

@cstegemann

I don't know whether this is known or intended behaviour, but when I read certain rows from a large file that uses "~" (tilde) as the quotechar and pass skiprows at the same time, the parser misreads the data as follows.
Note: I have wrapped the quoted values in "" in the output below even though pandas does not print them; without that, the markup would get mangled. Sorry...

>>> pd.read_csv(StringIO.StringIO('a,b,c\r~a\n b~,~e\n d~,~f\n f~\r1,2,~12\n 13\n 14~'), quotechar="~", skiprows=range(1,2) )
     a                  b        c
   "b~"           "e\n d"  "f\n f"
1    2     "12\n 13\n 14"     NaN

whereas the output I want in this artificial case is:

      a      b                 c
0     1      2     "12\n 13\n 14"

It seems that when skipping rows, the parser ignores the custom quotechar, which from my point of view is undesirable in this case.
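The distinction at play is between physical lines and logical CSV records. The following is only a minimal sketch, and it uses the standard-library `csv` module instead of pandas (an assumption made so the snippet is self-contained). It counts both for the sample string above; the variable names are illustrative.

```python
import csv
import io

# Sample from the report: "~" is the quote character, records end in "\r",
# and the quoted fields contain "\n".
data = 'a,b,c\r~a\n b~,~e\n d~,~f\n f~\r1,2,~12\n 13\n 14~'

# Physical lines: every "\r" or "\n" counts, whether quoted or not.
physical_lines = data.splitlines()

# Logical records: newline='' leaves the line endings untranslated, and the
# csv reader keeps newlines that fall inside a quoted field.
records = list(csv.reader(io.StringIO(data, newline=''), quotechar='~'))

print(len(physical_lines))  # 8
print(len(records))         # 3
print(records[1])           # ['a\n b', 'e\n d', 'f\n f']
```

A skip that counts physical lines drops only the fragment `~a`, while a skip that counts logical records drops the whole second record, which is the behaviour the desired output assumes.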

EDIT: It might well be that in the quoted texts newlines are not always \n but sometimes also \r.

EDIT2 (31.8.):
The lineterminator fix fails as far as I can see with the following example:

>>> a = StringIO.StringIO('Text,url\r~example\r sentence\r one~,url1\r~example\n sentence\n two~,url2')
>>> pd.read_csv(a, quotechar="~", skiprows=range(1,2), lineterminator='\r' )
                            Text        url
0                       sentence        NaN
1                         "one~"       url1
2     "example\n sentence\n two"       url2

The problem is that the CSV has a "text" column whose content is HTML-formatted text blocks. There is no telling which kind of newline the creators of the HTML originally used, and the text blocks come from different sources.
I might also add that the parser respects the quoting perfectly when skiprows is not used.
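One way to skip whole records when the newline style is mixed is to parse first and filter afterwards. This is a hedged sketch built on the standard-library `csv` module, not on pandas internals; `read_skipping_records` is a hypothetical helper name introduced for illustration.

```python
import csv
import io


def read_skipping_records(text, skip, quotechar='~'):
    """Parse CSV text and drop the logical records whose index is in `skip`.

    Hypothetical helper: indices refer to parsed records (header = 0),
    not to physical lines.
    """
    # newline='' keeps "\r" and "\n" untranslated while still splitting
    # the input on either of them.
    buffer = io.StringIO(text, newline='')
    reader = csv.reader(buffer, quotechar=quotechar)
    skip = set(skip)
    return [row for index, row in enumerate(reader) if index not in skip]


mixed = ('Text,url\r~example\r sentence\r one~,url1\r'
         '~example\n sentence\n two~,url2')
rows = read_skipping_records(mixed, skip=range(1, 2))
print(rows)
```

The first element of `rows` is the header, so the result could be handed to pandas with `pd.DataFrame(rows[1:], columns=rows[0])`. The whole input is parsed before filtering, which may be too slow for very large files.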

versioninfo:

python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 7

pandas: 0.16.2
nose: None
Cython: None
numpy: 1.9.2
scipy: 0.16.0
statsmodels: None
IPython: 4.0.0
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.0
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
@jreback
Contributor

jreback commented Aug 27, 2015

This looks like what you want. quotechar doesn't apply to line endings (which are normally a different character, e.g. \n).

In [2]: pd.read_csv(StringIO('a,b,c\r~a\n b~,~e\n d~,~f\n f~\r1,2,~12\n 13\n 14~'), quotechar='~', skiprows=range(1,2), lineterminator='\r' )
Out[2]: 
   a  b             c
0  1  2  12\n 13\n 14

@jreback jreback added IO CSV read_csv, to_csv Usage Question labels Aug 27, 2015
@cstegemann
Author

That would work for this example, but what if the quoted texts in the csv, coming from many different sources, use not only \n but sometimes also \r as a newline?

@jreback
Contributor

jreback commented Aug 28, 2015

Well, if you show an example, that would help.

@cstegemann
Author

I added an example to the original post

@selasley
Contributor

selasley commented Apr 4, 2016

A combination of universal newline mode and the Python parsing engine seems to work:

# create a text file with mixed newlines
with open('testunl.txt', 'wb') as tstunl:
    tstunl.write('a,b,c\r~a\n b~,~e\n d~,~f\n f~\r1,2,~12\n 13\n 14~')

with open('testunl.txt', 'U') as tstunl:
    print(pd.read_csv(tstunl, quotechar="~", skiprows=range(1,2), engine='python'))

   a  b             c
0  1  2  12\n 13\n 14

with open('testunl.txt', 'wb') as tstunl:
    tstunl.write('Text,url\r~example\r sentence\r one~,url1\r~example\n sentence\n two~,url2')

with open('testunl.txt', 'U') as tstunl:
    print(pd.read_csv(tstunl, quotechar="~", skiprows=range(1,2), engine='python'))

                       Text   url
0  example\n sentence\n two  url2
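The `'U'` mode flag in the snippets above is a Python 2 idiom; it was removed in Python 3.11. In Python 3, opening a file in text mode without a `newline` argument already applies universal newline translation. The sketch below shows what that translation does to the second example. It uses the standard-library `csv` module in place of pandas (an assumption made so the snippet is self-contained) and a temporary file instead of `testunl.txt`.

```python
import csv
import os
import tempfile

raw = (b'Text,url\r~example\r sentence\r one~,url1\r'
       b'~example\n sentence\n two~,url2')

# Write the mixed-newline bytes to a temporary file.
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Default text mode (newline=None): "\r" and "\r\n" are both
    # translated to "\n" before the parser sees them.
    with open(path, 'r', encoding='ascii') as handle:
        rows = list(csv.reader(handle, quotechar='~'))
finally:
    os.remove(path)

print(rows[1])  # ['example\n sentence\n one', 'url1']
```

Note the side effect: the `\r` characters inside the quoted text come back as `\n`. If the original newline characters need to be preserved, universal newline translation is not a lossless workaround.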

gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 22, 2016
Patches bug in C engine CSV parser in
which quotation marks were not being
respected in skipped rows.

Closes pandas-devgh-10911.
Closes pandas-devgh-12775.
@jreback jreback added this to the 0.18.1 milestone Apr 22, 2016