read_fwf 'infer' where first hundred lines differ from other lines #15138
Comments
I think it would be reasonable to pass down a new parameter, maybe PR's would be welcome!
I am on it :)
Ran into the same issue. Had to sort the file to have larger numbers on top. I'd suggest, at least for numeric columns, not inferring the left boundary for the first column but going with 0. I'd rather see a very conservative approach where nothing gets stripped or truncated initially; we can trim strings later on if needed. Thanks!
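The conservative approach suggested here can be sketched in a few lines. This is only an illustration: `widen_first_column` is a hypothetical helper, not part of pandas, and the spans and sample row are made up. It forces the first inferred span to start at offset 0 so that wider values in later rows are not cut off on the left.

```python
def widen_first_column(colspecs):
    """Return a copy of `colspecs` whose first (start, end) span starts at 0."""
    if not colspecs:
        return []
    _, first_end = colspecs[0]
    return [(0, first_end)] + list(colspecs[1:])


# Hypothetical spans, as they might be inferred from rows with short numbers.
inferred = [(1, 3), (6, 9)]
safe = widen_first_column(inferred)

row = "115   230"
print([row[a:b].strip() for a, b in inferred])  # ['15', '230']
print([row[a:b].strip() for a, b in safe])      # ['115', '230']
```

Spans adjusted this way could then be passed to pd.read_fwf explicitly through its colspecs argument instead of relying on inference.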
@yakovkeselman contributions welcome! This is pretty straightforward to actually implement.
I just submitted a PR to address this issue. I left out the
Code Sample, a copy-pastable example if possible
Problem description
I have a long fixed-width file (>100k lines) whose head and tail are shown above. I want to read this file with pandas, and I figure pd.read_fwf is the way to do this. The issue comes up because it reads the first hundred lines, which start with ' 1', and concludes "let's start reading at [2]", whereas the last hundred lines start with 115, so it skips the initial 11 and starts the line with 5, and I lose data.
A couple of approaches to solving this issue come to mind, though I'm sure there are others:
infer_from_all
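To make the failure mode concrete, here is a simplified, standalone sketch of boundary inference from a sample of leading rows. It is not pandas' actual implementation: `infer_colspecs` and `parse` are hypothetical helpers, and the data is generated for illustration. A character position is treated as part of a column if any sampled row has a non-blank character there.

```python
def infer_colspecs(lines, nrows=100):
    """Guess (start, end) column spans from the first `nrows` lines."""
    sample = lines[:nrows]
    width = max(len(line) for line in sample)
    occupied = [False] * width
    for line in sample:
        for i, ch in enumerate(line):
            if ch != " ":
                occupied[i] = True

    specs, start = [], None
    # The trailing False closes a span that runs to the end of the line.
    for i, used in enumerate(occupied + [False]):
        if used and start is None:
            start = i
        elif not used and start is not None:
            specs.append((start, i))
            start = None
    return specs


def parse(line, specs):
    """Slice one line according to the given spans."""
    return [line[a:b].strip() for a, b in specs]


# Hypothetical data: right-aligned row numbers 0..115 plus a value column.
lines = ["{:>3} {:>5}".format(n, n * 2) for n in range(116)]

head_specs = infer_colspecs(lines)                   # first 100 rows only
all_specs = infer_colspecs(lines, nrows=len(lines))  # every row

print(head_specs, parse(lines[-1], head_specs))  # [(1, 3), (6, 9)] ['15', '230']
print(all_specs, parse(lines[-1], all_specs))    # [(0, 3), (6, 9)] ['115', '230']
```

Sampling only the first hundred rows never sees a three-digit number in the first column, so the inferred span starts one character too late and the last row loses its leading digit. Sampling every row, or supplying the spans explicitly, avoids the truncation at the cost of an extra pass over the file.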
Output of pd.show_versions()
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-107-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None