Auto mapping for csv #24


Closed
wants to merge 3 commits into from

philschmid
Collaborator

What does this PR do?

This PR adds automatic column assignment when a CSV without a header is provided, e.g. in batch transform. Currently, it checks whether the first line contains inputs, question, or context, since these are the only headers we can have. If none of these is present, it assumes that the CSV has no header.
If the CSV has no header and one column, that column is parsed as inputs; if it has two columns, the first column is the question and the second is the context.
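
For illustration, the check described above could be sketched roughly as follows. This is not the code from this PR: the names `KNOWN_HEADERS` and `assign_columns` are hypothetical, a fixed comma delimiter is assumed, and raising an error for more than two columns is an assumption, since the description does not cover that case.

```python
import csv
import io

# Header names the description lists as possible (assumed to be exhaustive).
KNOWN_HEADERS = {"inputs", "question", "context"}


def assign_columns(payload: str, delimiter: str = ","):
    """Parse CSV text and return a list of dicts keyed by column name.

    Hypothetical helper: if the first row contains a known header name it is
    treated as the header; otherwise column names are assigned by position.
    """
    # Skip completely empty rows produced by blank lines.
    rows = [row for row in csv.reader(io.StringIO(payload), delimiter=delimiter) if row]
    if not rows:
        return []

    first = [cell.strip() for cell in rows[0]]
    if any(cell in KNOWN_HEADERS for cell in first):
        header, data = first, rows[1:]
    elif len(first) == 1:
        header, data = ["inputs"], rows
    elif len(first) == 2:
        header, data = ["question", "context"], rows
    else:
        # Not covered by the PR description; rejecting is an assumption.
        raise ValueError("cannot infer column names for more than two columns")

    return [dict(zip(header, row)) for row in data]
```

Note that the positional fallback only looks at the number of columns, so it depends entirely on the delimiter being detected correctly.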

cc @vdantu

@philschmid philschmid requested a review from vdantu August 4, 2021 06:18
@vdantu
Contributor

vdantu commented Aug 5, 2021

I am missing some context here. Are we saying that customers would provide CSV input to batch-transform job which would not have headers?

@philschmid
Collaborator Author

> I am missing some context here. Are we saying that customers would provide CSV input to batch-transform job which would not have headers?

No, @la-cruche told me that when you use CSV with batch transform and splitlines, it sends only the line, without the header, every time. That's why I added this, to support CSV in batch transform too.

@vdantu
Contributor

vdantu commented Aug 11, 2021

> > I am missing some context here. Are we saying that customers would provide CSV input to batch-transform job which would not have headers?
>
> No, @la-cruche told me that when you use CSV with batch transform and splitlines, it sends only the line, without the header, every time. That's why I added this, to support CSV in batch transform too.

So this was tested with SM Batch Transform?

@philschmid
Collaborator Author

Hey, I tested it on a simple example and it worked, but then I ran a few more tests on real data and found some hard-to-fix constraints, which mean that we should not support this.

Explanation:

When using batch_transform, SageMaker sends exactly one row as input to the endpoint, e.g.

united thx off the response, finally got through the 45 min wait and talked to someone.

The issue with this is that csv.Sniffer would identify the , as the delimiter, split this row into two columns, and create an input for question-answering instead of text-classification. There is no way to tell whether the input has one column or two columns from a single row. For example, a question-answering input:

the flight was delayed 45 minutes, How long was the flight delayed?
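
A small sketch of the ambiguity, using the two rows above. It uses `csv.reader` with a fixed comma delimiter rather than `csv.Sniffer`, so it only shows what happens once the comma has been picked as the delimiter; `count_columns` is a hypothetical helper, not part of this PR.

```python
import csv
import io


def count_columns(line: str, delimiter: str = ",") -> int:
    """Return how many columns a single CSV line splits into."""
    return len(next(csv.reader(io.StringIO(line), delimiter=delimiter)))


# Intended as ONE column (text-classification input).
tweet = "united thx off the response, finally got through the 45 min wait and talked to someone."
# Intended as TWO columns (question-answering input).
qa_row = "the flight was delayed 45 minutes, How long was the flight delayed?"

print(count_columns(tweet))   # 2
print(count_columns(qa_row))  # 2
```

Both rows come out as two columns, so the column count of a single row cannot distinguish the two tasks.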

If someone used MultiRecord instead, the sniffer would identify the delimiter correctly, but for me these are too many constraints and ifs for it to work properly.

I would rather stick with jsonlines as the batch transform format, since textual data in CSV is not easy to handle, especially without headers, where you don't have a chance to identify columns.
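
To illustrate why jsonlines avoids the problem: every line carries its own keys, so a comma inside the text is never mistaken for a column separator. A minimal sketch using only the standard library; `parse_jsonlines` is a hypothetical helper, and the `inputs`, `question`, and `context` keys are taken from the header names mentioned in the PR description.

```python
import json


def parse_jsonlines(payload: str):
    """Parse a JSON Lines payload into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


# One JSON object per line; the keys say explicitly which task each record fits.
payload = "\n".join(
    [
        json.dumps({"inputs": "united thx off the response, finally got through the 45 min wait and talked to someone."}),
        json.dumps({"question": "How long was the flight delayed?", "context": "the flight was delayed 45 minutes"}),
    ]
)

records = parse_jsonlines(payload)
print(sorted(records[0]))  # ['inputs']
print(sorted(records[1]))  # ['context', 'question']
```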

@philschmid
Collaborator Author

Closing in favor of #25.

@philschmid philschmid closed this Aug 16, 2021