ignore newline while reading files? #4917

Closed
jseabold opened this issue Sep 21, 2013 · 10 comments

@jseabold
Contributor

Apparently this [1] is a thing. Is there a way to tell read_table or read_fwf to read a fixed number of variables per record? I was hoping something like this might work. I'm not positive I've got the regex right yet, but the idea is to match any run of consecutive spaces, optionally followed by a newline character. I don't know if it's worth supporting, but it seems like maybe it should be.

import pandas as pd

url = "http://cameron.econ.ucdavis.edu/mmabook/patr7079.asc"
names = ["CUSIP", "ARDSSIC", "SCISECT", "LOGK", "SUMPAT", "LOGR70",
         "LOGR71", "LOGR72", "LOGR73", "", "LOGR74", "LOGR75", "LOGR76",
         "LOGR77", "LOGR78", "LOGR79", "PAT70", "PAT71", "PAT72", "PAT73",
         "PAT74", "PAT75", "PAT76", "PAT77", "PAT78", "PAT79"]
dta = pd.read_table(url, sep=r" *(?=\n)?", names=names)

[1] http://cameron.econ.ucdavis.edu/mmabook/patr7079.asc

@jreback
Contributor

jreback commented Sep 21, 2013

shouldn't you use read_fwf?

see also: #4488
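For reference, a minimal sketch of the read_fwf suggestion on toy inline data (the widths, column subset, and values here are made up for illustration; the real file has 26 columns):

```python
import io

import pandas as pd

# Toy fixed-width sample standing in for patr7079.asc (hypothetical
# widths and values; only the first four variables are shown).
data = io.StringIO(
    "  800  3  1  1.25\n"
    "  801  5  0  2.50\n"
)
df = pd.read_fwf(data, widths=[5, 3, 3, 6],
                 names=["CUSIP", "ARDSSIC", "SCISECT", "LOGK"])
print(df)
```

With explicit `widths` there is nothing to detect; the issue linked above is about inferring those widths automatically.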

@jseabold
Contributor Author

Yep, that's what I want. Funny timing. So I guess this is in the pipeline. Good to hear. Closing.

I suppose, though, that it doesn't have to be fixed width...

@jreback
Contributor

jreback commented Sep 21, 2013

@jseabold you can do read_fwf now!

the issue above is about detecting the widths properly

@jseabold
Contributor Author

I was hoping to avoid counting / scraping the format too.

@jreback
Contributor

jreback commented Sep 21, 2013

@jseabold yep....and \s+ doesn't work? (as there are fields that don't have a marker)

@jseabold
Contributor Author

The "rows" are split across line terminators.

@jreback
Contributor

jreback commented Sep 21, 2013

why don't you read them in as normal, then just separate out after, e.g. df.iloc[::2] and df.iloc[1::2] (assuming 2 lines per row)?
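The even/odd slicing idea can be sketched like this (toy data assumed, with each logical record split over two physical lines):

```python
import io

import pandas as pd

# Toy stand-in for the file: each record spans two lines, and the
# second line of each pair carries fewer fields.
raw = io.StringIO(
    "1 2 3\n"
    "4 5\n"
    "6 7 8\n"
    "9 10\n"
)
df = pd.read_table(raw, sep=r"\s+", header=None, names=list("abc"))
first = df.iloc[::2].reset_index(drop=True)    # lines 1, 3, ...
second = df.iloc[1::2].reset_index(drop=True)  # lines 2, 4, ...
wide = pd.concat([first, second.add_prefix("r2_")], axis=1)
print(wide)
```

The `r2_` prefix is just a placeholder; with the real file you would assign the actual variable names after stitching the halves together.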

@jseabold
Contributor Author

That's what I ended up doing, but it'd be nice to have this in one command, which is what that other issue is proposing, AFAICT. E.g., in Stata it's

infile varlist using url

@jseabold
Contributor Author

It's also messy because the lines are not split into the same number of variables across rows. It's mainly just a pain, and I think it could be easier.

@jreback
Contributor

jreback commented Sep 21, 2013

you could read in with normal seps,
then unstack and chop at the boundaries

weird way to format the file
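A sketch of that unstack-and-chop approach, which also handles an uneven number of fields per line (toy data and a hypothetical 3 variables per record assumed):

```python
import io

import pandas as pd

# Toy stand-in: 9 values for 3 records of 3 variables, but spread
# unevenly across physical lines, as in the file being discussed.
raw = io.StringIO(
    "1 2 3 4\n"
    "5 6\n"
    "7 8 9\n"
)
df = pd.read_table(raw, sep=r"\s+", header=None, names=list("abcd"))
vals = df.to_numpy().ravel()
vals = vals[~pd.isna(vals)]      # drop the NaN padding from short lines
records = vals.reshape(-1, 3)    # chop at the known record width (3 here)
tidy = pd.DataFrame(records, columns=["x", "y", "z"])
print(tidy)
```

Because the values are flattened before being chopped, it doesn't matter how the original lines were broken, only that the total count is a multiple of the record width.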
