Skip to content

read_fwf colspecs autodetection could be more flexible #5056

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ghost opened this issue Sep 30, 2013 · 25 comments
Closed

read_fwf colspecs autodetection could be more flexible #5056

ghost opened this issue Sep 30, 2013 · 25 comments

Comments

@ghost
Copy link

ghost commented Sep 30, 2013

See #4955 (comment)

id foo 
1  a
1  a
1  a
123a

This works with the snippet in the orig #4488 but fails with the merged #4955 due to the last line.
It's a common example of a fwf file.

@jtratner
Copy link
Contributor

So what should happen in this case? Are you thinking that detection should be entirely based on first line?

id foo 
123a
123a
123a
123a

@jtratner
Copy link
Contributor

Also, as an aside, this is probably a clearer example (when it's at the end, seems like it's a special footer)

id foo 
1  a
123a
1  a
1  a

@ghost
Copy link
Author

ghost commented Sep 30, 2013

Not at all. #4488 handles the first case by looking for a pattern which is common to a majority of records,
#4955 fails if there's a single record in the whole dataset that occupies the whole column.

In your example, the first row is indeed the only hint as to the width of the columns,
and it may be interpreted as a single column with a header label of "id foo".

@jreback
Copy link
Contributor

jreback commented Sep 30, 2013

cc @alefnula

I think that if detection is 'infer' and "doesn't work", then should raise.

not sure if you can make an accurate determination in all cases though?

@ghost
Copy link
Author

ghost commented Sep 30, 2013

True, can't make it work for all cases. I think the first example is more common then yours.
For the case where there is a full single-character column seperating the fields, as required by #4955,
#4488 will (almost) always work.

@jreback
Copy link
Contributor

jreback commented Sep 30, 2013

cc @alefnula

I think you can do a percentage computation on the mask that you are computing. If they are say the same in say 90% of the lines then you can go with that (or even higher per). I think that would allow for some (but not most) no-delimeter lines

@alefnula
Copy link
Contributor

Yes, this is a problem. The current implementation cannot detect the columns properly if they are not separated by the filler. But the suggested implementation in issue #4488 didn't work for any case I tested :( For the example you just wrote here it returns [8]. And even for simple cases like:

1 2 3
4 5 6
7 8 9

It returns [2, 3]...

But this is not the primary reason I haven't tried to fix it. The much bigger problem with the splitting approach is that it cannot handle the cases where columns are strings which can contain the filler, for example:

id  Firstname and Lastname  Age
1   Douglas Adams           44
1   Arthur Conan Doyle      55

The approach I used supports these cases.

So basically these are the reason why I chose to use the bitmask instead of splitting and counting. (and also it's more memory and time efficient - but this is not that important if we read only 100 lines of data). The only drawback is that it cannot handle the columns that are glued together, but that is clearly stated in the documentation.

Maybe an approach where the outliers will be discarded can be implemented. But I'll have to find a way to do it efficiently...

@jtratner
Copy link
Contributor

Ultimately, you can still pass colspecs directly in ambiguous cases.

@alefnula
Copy link
Contributor

Also I don't think this is a bug. The behavior of inferring columns is clearly documented, and the function behaves as the documentation states it does...
But, as everything, it can be improved :)

@jreback
Copy link
Contributor

jreback commented Sep 30, 2013

@alefnula can you squash the mas column wise and compare how many equal as per of total rows? (not sure if that is really useful.....) just a thought

@ghost
Copy link
Author

ghost commented Sep 30, 2013

Not true. There may be a bug in in #4488 (or not) but the underlying approach should handle strings
such as those just fine.
The issue is that it has a free parameter (the threshold) which has no "right" value, unless you know the distribution of all datasets out there (we don't).
It seems to me that #4955 is a private case of the threshold scheme in #4488, but with a threshold of 0%,
and I'm pretty sure whatever the "best" value is, it's not that.

Documented behaviour is great, but ultimately it's preferable to support more cases rather
then document the fact that you don't, no?

@ghost
Copy link
Author

ghost commented Sep 30, 2013

Removed bug label. It's a common case that isn't handled, which is a shame but not a bug.

@alefnula
Copy link
Contributor

@jreback I dont actually think that this problem can be solved in a way that will satisfy everyone :( since it's too ambiguous. The provided example:

id foo 
1  a
1  a
123a

Can be split in two different (both valid) ways:

 id  foo      id  foo
  1    a       1    a
  1    a  or   1    a
123    a      12   3a

And here is an even worse case.

c  a  foo 
12 1  5
23 1  6
4561237

If we try to implement any kind of heuristic that will determine where to split, there will be a case when we will get it wrong. The user will, maybe, get the correct number of columns but the data will be incorrect. So he will get a false conclusion (especially if he has a large data set that he cannot skim through) that we correctly split the columns, but the data will not be correct.
In the current implementation, all these ambiguous cases will produce wrong number of column in the beginning, so the user will immediately know that something is not right...
My opinion is that it's better to get a completely wrong number of columns in the first place, then to get the right number of columns, but have incorrect data in them because of the ambiguous cases.

@jreback
Copy link
Contributor

jreback commented Sep 30, 2013

@alefnula agreed....false positive is bad here. I was just wondering if you can detect the false positive and raise a helpful error message?

@alefnula
Copy link
Contributor

@jreback In the current implementation you wont get false positives. Either the number column will be correct and you will get valid data or the number of columns wont be correct...
The thing that I may be able to implement is to "suspect" that some lines are ambiguous and issue a warning:

Line 44 is suspicious. If the inferred columns are incorrect, check the line 44 or provide a correct colspecs argument.

@ghost
Copy link
Author

ghost commented Sep 30, 2013

In all the examples you provided #4488 would do the right thing.

@alefnula
Copy link
Contributor

@y-p The problem is that there is no right thing :)

c  a  foo 
12 1  5
23 1  6
4561237

Which of the following is right?

c     a foo 
12    1   5
23    1   6
45 6123   7

c   a foo 
12  1   5
23  1   6
456 1 237

c   a foo 
12  1   5
23  1   6
45 61 237

c    a foo 
12   1   5
23   1   6
456 12  37

c    a foo 
12   1   5
23   1   6
45 612  37

Also #4488 does not work for any case I tried :( Maybe you pasted the wrong version of the code there?

@ghost
Copy link
Author

ghost commented Sep 30, 2013

None of them, the right one is:

c   a   foo 
12  1   5
23  1   6
456 123 7

Both #4488 and your private case of it take the rightmost "filler" column as the edge of each column.
Honestly, most of the fwf datasets I've encountered would not work with #4955, that's the only
reason I'm pushing for this.

How about adding a detect_threshold kwarg and setting the defval to match the behaviour of
#4955? a conservative default is fine by me, just don't be too conservative to be useful.

@alefnula
Copy link
Contributor

Oh, now I see what you aim for :) Correct me if I'm wrong, but your point is that most of the fwf files are aligned left. If this is the case, we had a completely different experience with fwf files, since most of the ones I saw were aligned right so I would probably interpret this one as:

c   a foo 
12  1   5
23  1   6
45 61 237

Maybe a aligned=left|right|None kwarg? In case of None I'll do the thing I'm doing now, and in the case of left or rigth I can be a little bit "smarter".

@ghost
Copy link
Author

ghost commented Sep 30, 2013

No, it has nothing to do with alignment. Your code is a fine default, I'll add in a kwarg
threshold when I get a chance, preserving this default behaviour. np.

Correction: an aligned kwarg makes sense, but still won't handle the cases I'm concerned with,
unless there's a threshold argument.

@jtratner
Copy link
Contributor

@y-p if you don't have the column headers, you can't know which one is
correct and all of the examples could be valid. Only if the column headers
are present (and you can assume that they are both left-aligned and don't
have spaces in their name) can you unambiguously decide where to split.
(especially if the data for a column has filler in it but is supposed to
stay together)

@ghost
Copy link
Author

ghost commented Oct 1, 2013

@jtratner the example @alefnula gave aligns some of the column headings differently
from the data, that's misleading. Remove the bogus headings and assume
All columns are aligned the same way (valid assumption, in my experience), Is there
still ambiguity there?

@jtratner
Copy link
Contributor

jtratner commented Oct 1, 2013

I think you're right that the alignment should be consistent, but the
example still seems dependent upon the assumption that the first row are
column headings, which isn't always true with fwf that I've seen.

@ghost
Copy link
Author

ghost commented Oct 1, 2013

The algo I had in mind doesn't isn't dependent on col headings in any way.
We've had plenty of discussion on this at this point, I suggest we leave it
for now until I or someone else provides actual code to review.
@alefnula's code is a fine enhancement so we're ahead already. In the future,
we can do more.

pushed to 0.14, it's not a blocker. (or someday)

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 26, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@datapythonista datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018
@jbrockmendel jbrockmendel added IO Fixed Width read_fwf and removed IO Data IO issues that don't fit into a more specific label labels Dec 11, 2019
@jbrockmendel jbrockmendel removed the IO CSV read_csv, to_csv label Dec 20, 2019
@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022
@mroeschke
Copy link
Member

Since there hasn't been much activity for this feature since and there appears to be significant complexity for determining the correct behavior I am going to close for now, but can reopen if there's more interest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants