
Bodyparser branch rewrite #72


Merged
fhessel merged 11 commits into fhessel:master from jackjansen:bodyparser on Mar 24, 2020

Conversation

@jackjansen (Contributor)

Parsers for both multipart and form-encoded POST handlers are implemented.

The API is geared towards handling file uploads and other large data items, and requires the fields to be handled sequentially.

An example program that allows uploading, downloading, and (for text files) editing is supplied.

fhessel and others added 3 commits December 6, 2018 20:05
@fhessel fhessel self-requested a review February 21, 2020 14:40
@fhessel (Owner) commented Feb 28, 2020

Sorry for the delay.

I made some minor style-related changes (mostly using only spaces for indentation and explicit if-blocks).

Currently, I'm having some trouble with the URL-encoded parser during a POST that edits a file. It stops parsing after a few hundred characters, the file is truncated, and I get this in the log:

[HTTPS:I:     87460] New connection. Socket FID=59
[HTTPS:I:     89271] Request: POST /edit?filename=test.txt (FID=59)
[HTTPS:W:     89298] Callback function did not parse full request body

I'll have a look at it and try to fix it. Other than that, the parsers seem to work quite well.

@fhessel (Owner) commented Feb 28, 2020

The problem is this block. Adding a log statement makes it visible:

size_t readchars = _request->readChars(bodyBuffer, bodyLength);
HTTPS_LOGI("Got %u chars, wanted %u chars.", readchars, bodyLength);

[HTTPS:I:     59535] Request: POST /edit?filename=test.txt (FID=57)
[HTTPS:I:     59541] Got 512 chars, wanted 7612 chars.
[HTTPS:W:     59561] Callback function did not parse full request body

As the internal buffer of the HTTPRequest is limited to 512 bytes, you won't get more there without reading again. So if the URL-encoded content of the text field is bigger than that, the remaining body will never be parsed.
Furthermore, relying on knowing the size will break if you don't have the final request length, e.g., if Transfer-Encoding: chunked is used.

@jackjansen (Contributor, Author)

You're right (of course:-).

I have the feeling that worrying about chunked content may be a bit much (I cannot imagine any implementation in its right mind using chunked encoding together with a URL-encoded form... or am I oversimplifying the world, and might this happen because lower layers would select chunked encoding in some implementations?).

So a simple while loop around the readChars should do the trick.
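
Roughly like this (an untested sketch, still assuming the body length is known as in the snippet above; the chunk size is arbitrary and the hand-off to the parser is only indicated as a comment):

// Sketch: drain the body in small chunks instead of one oversized read.
char chunk[512];
size_t remaining = bodyLength;
while (remaining > 0) {
  size_t toRead = remaining < sizeof(chunk) ? remaining : sizeof(chunk);
  size_t readchars = _request->readChars(chunk, toRead);
  if (readchars == 0) {
    break;  // connection closed or nothing more to read
  }
  // ... hand 'readchars' bytes of 'chunk' to the form parser here ...
  remaining -= readchars;
}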

I can add and test, but it may be a week or two before I get around to it, so feel free to do it yourself if you think it's a good idea.

@fhessel (Owner) commented Feb 29, 2020

I wouldn't expect desktop browsers to use chunked encoding either, but other clients may. Imagine using a Python script to send data that you generate on the fly; then the HTTP client lib would have to buffer everything or switch to chunked transfer encoding.

But chunked transfer isn't currently implemented anyway. My main concern is only not to rely on Content-Length being present, as that cannot be guaranteed.

> So a simple while loop around the readChars should do the trick.

Tried that already, but then you run into the problem that there is no 1:1 mapping between input and output bytes: a chunk boundary can split an escape sequence, so you might read %2 and 0 instead of %20, and urlDecode() isn't able to cope with that.

I started by writing a second variant of urlDecode which a) works on byte buffers and b) can tell the caller that not all bytes have been processed, quite similar to what I've suggested in the issue linked above. It seems we don't get around that. I think from there it's not much more work, so I can do that.
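
Roughly along these lines (just a sketch to illustrate the idea; the name, signature, and std::string output are illustrative and not necessarily what ends up in the code):

#include <string>
#include <cstdlib>

// Sketch: decode as much of the input as possible and report how many bytes
// were consumed, so the caller can carry a trailing partial escape (e.g. "%2")
// over into the next chunk.
size_t urlDecodePartial(const char *in, size_t inLen, std::string &out) {
  size_t i = 0;
  while (i < inLen) {
    if (in[i] == '%') {
      if (i + 2 >= inLen) break;            // incomplete %XX escape, wait for more data
      char hex[3] = {in[i + 1], in[i + 2], 0};
      out += (char) strtol(hex, nullptr, 16);
      i += 3;
    } else if (in[i] == '+') {
      out += ' ';
      i += 1;
    } else {
      out += in[i];
      i += 1;
    }
  }
  return i;  // bytes actually processed; inLen - i bytes must be kept for later
}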

And you're right with your comment about buffering; that indeed gets quite excessive. But I think that's a problem of the lower layers (HTTPConnection etc.) and has to be resolved there.

I wouldn't put that into this PR; however, there are some very low-hanging fruits for performance improvement (e.g., this not being at least a memcpy). I'm currently trying to combine hardware-based CI with GitHub Actions. Once that's ready, it'll be much easier to work on the code without breaking things.

@jackjansen (Contributor, Author)

> Tried that already, but then you run into the problem that there is no 1:1 mapping between input and output bytes: a chunk boundary can split an escape sequence, so you might read %2 and 0 instead of %20, and urlDecode() isn't able to cope with that.

I had a similar issue with the boundary detection in the multipart reader (you may have a CR LF at the end of the buffer, but you shouldn't consume them until you're sure they're not part of a boundary marker). There I solved it (I hope:-) by never consuming up to the end of the buffer, at least not until I know that all data has been read. A similar scheme could be used here, by having the consumer read at most len()-3 bytes from the buffer until we know that end-of-buffer coincides with end-of-data.
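
For the URL-encoded case, that would look roughly like this (sketch only; bodyComplete, bufferFill, decoded, and the urlDecodePartial helper from the sketch above are illustrative names):

// Sketch: never give the decoder the last two bytes of the buffer while more
// data may still follow, so a split "%XX" escape can never be half-consumed.
const size_t HOLD_BACK = 2;              // longest incomplete escape: "%X"
size_t usable;
if (bodyComplete) {
  usable = bufferFill;                   // end of data: decode everything that's left
} else if (bufferFill > HOLD_BACK) {
  usable = bufferFill - HOLD_BACK;
} else {
  usable = 0;                            // too little data, read more first
}
size_t consumed = urlDecodePartial(buffer, usable, decoded);
// move the (bufferFill - consumed) leftover bytes to the front of the buffer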

@jackjansen (Contributor, Author)

I've implemented reading the whole body into a malloc()ed buffer (either knowing the length or not knowing it). This isn't the most efficient implementation, but I didn't want to go all-out and implement the full buffering scheme. And the inefficiency is only an issue if someone builds a website/webserver that transmits large volumes of data through a URL-encoded POST, and you shouldn't do that:-)
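
In outline it works roughly like this (a simplified sketch; getContentLength() and requestComplete() are assumed to behave as in the library's other examples, and error handling is left out):

// Sketch: read the whole URL-encoded body into one heap buffer, growing it
// while the total length is unknown.
size_t capacity = _request->getContentLength();
if (capacity == 0) {
  capacity = 512;                        // length unknown: start small and grow
}
char *body = (char *) malloc(capacity);
size_t used = 0;
while (!_request->requestComplete()) {
  if (used == capacity) {
    capacity *= 2;
    body = (char *) realloc(body, capacity);   // note: the realloc result needs a NULL check
  }
  size_t got = _request->readChars(body + used, capacity - used);
  if (got == 0) {
    break;                               // connection closed
  }
  used += got;
}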

@fhessel (Owner) commented Mar 19, 2020

Sorry again for the delay; I'll give it a review ASAP. I just wanted to have a rudimentary test script in place first, as it makes checking PRs a lot easier.

@fhessel (Owner) left a comment

Finally got to it. Files are not truncated any more, and from testing with the example, I'd say it works.

Buffering has to be reworked again, but that's a problem of neither the body parser alone nor this PR specifically. Also, I'm not very happy with the use of malloc, but I see that it's easier to expand the buffers this way.

From checking the code, I only found some issues around memory allocation, which might be a problem if the ESP is low on memory.

Just a note: if you accept the suggestions or add a commit, the new checking actions will be triggered and will then most likely fail, as the scripts are missing in your PR. If you don't want to rebase, you can ignore that; merging should be possible either way.

@fhessel (Owner) commented Mar 23, 2020

If I didn't miss anything, only the comment regarding handling a NULL return value from malloc and realloc in HTTPMultipartBodyParser.cpp remains. Once that is fixed, the PR is ready to merge.
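
For reference, the usual pattern looks roughly like this (a generic sketch, not the exact change in the PR; the log macro and return value are only indicative):

// Sketch: keep the old pointer until realloc succeeds; otherwise free it,
// log the failure, and abort instead of dereferencing NULL.
char *grown = (char *) realloc(buffer, newCapacity);
if (grown == NULL) {
  free(buffer);
  buffer = NULL;
  HTTPS_LOGE("Out of memory while growing the body buffer");
  return false;   // or otherwise abort handling the request
}
buffer = grown;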

@jackjansen (Contributor, Author)

Overlooked that one, sorry. Fixed.

@fhessel fhessel merged commit be34405 into fhessel:master Mar 24, 2020
@jackjansen jackjansen deleted the bodyparser branch August 4, 2024 13:04