Check qos, heartbeat, max channel are unsigned shorts #640

Closed
drobert opened this issue Mar 31, 2020 · 16 comments · Fixed by #641

Comments
@drobert

drobert commented Mar 31, 2020

In all versions of this client, including the current version in master, the basic QoS "prefetch count" property is declared as type int. However, the AMQP 0.9.1 reference specifies prefetch count to be of type short, which it further clarifies to be a 16-bit integer.

This most naturally maps to the Java type short rather than int.

As a result, it's relatively simple to specify values that are out of bounds, such as 100000 (100k). These values have their most significant bits truncated, so the server receives something else entirely: only the 16 least significant bits survive, which for 100000 come out to 34464. (See this SO post for context.)

Possible fixes for this would include one or all of:
a) explicitly failing if the value supplied is > Short.MAX_VALUE
b) changing the allowed type in the client from int to short
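To make the truncation described above concrete, here is a minimal standalone sketch (illustrative only, not the client's actual code):

```java
public class PrefetchTruncationDemo {
    public static void main(String[] args) {
        int requested = 100_000;
        // Writing an int into a 16-bit AMQP short keeps only the low 16 bits.
        short truncated = (short) requested;
        int seenByServer = truncated & 0xFFFF; // the broker reads the field as unsigned
        System.out.println(seenByServer);      // prints 34464, not 100000
    }
}
```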

@michaelklishin
Contributor

I'm not sure what the big idea would be behind setting a prefetch of 100K. You might as well not have any limit; the practical outcome would be the same. This is the first time I remember this being brought up in 10-11 years of maintaining RabbitMQ clients.

A PR that implements either option (@acogoluegnes do you have any preferences?) or even both via two overloads would be considered.

@michaelklishin michaelklishin changed the title from "prefetch count as int rather than short" to "Consider changing QoS prefetch field type to short" Mar 31, 2020
@drobert
Author

drobert commented Mar 31, 2020

Hmm, correction: I'm hearing reports that a Python client supports 65535. It's technically unclear to me whether the truncation takes place in the JVM or in a wire-protocol translation layer. If the AMQP 'short' is an unsigned 16-bit integer, then a Java 'int' makes sense, but I think it would still require some safeguarding on the Java client side to disallow unsupported values. (Further, if 32k-64k is supported by the protocol and the Java client isn't sending these values properly, that's perhaps a separate issue?)

As for 'why', I can give a sample use case: implementing something similar to Kafka Connect that drains messages to S3, optimizing for larger file sizes. For example: take 100,000 messages, stream them to S3, then 'ack' the messages. The late 'ack' is important to ensure messages are not lost. A smaller prefetch would cap the number of messages available to be written at 32k.
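Roughly, that pattern could look like the following sketch against the Java client's Channel API (the queue name, batch handling, and uploadBatchToS3 helper are illustrative assumptions, not part of any real integration):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.Delivery;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeoutException;

public class BatchToS3Consumer {

    static final int BATCH_SIZE = 100_000;

    public static void main(String[] args) throws IOException, TimeoutException {
        Connection connection = new ConnectionFactory().newConnection();
        Channel channel = connection.createChannel();
        // The prefetch this issue is about; today an out-of-range value is silently truncated.
        channel.basicQos(BATCH_SIZE);

        List<Delivery> buffer = new ArrayList<>(); // single consumer, so a plain list suffices
        channel.basicConsume("events", false, (consumerTag, delivery) -> {
            buffer.add(delivery);
            if (buffer.size() >= BATCH_SIZE) {
                uploadBatchToS3(buffer); // e.g. stream the batch as a multipart upload
                // multiple = true acknowledges everything up to and including this delivery tag
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), true);
                buffer.clear();
            }
        }, consumerTag -> { });
    }

    static void uploadBatchToS3(List<Delivery> batch) {
        // placeholder for the S3 upload described above
    }
}
```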

@michaelklishin
Contributor

michaelklishin commented Mar 31, 2020

A traffic capture would provide a lot of relevant information.

I'm not buying your example. You seem to assume that an infinite prefetch implies consuming with automatic acknowledgments. They are orthogonal settings. Since there is no notification of when the prefetch is hit, in the end, it is the consumer that will do all the counting of deliveries received to date in your example. A prefetch of 100K is as good as an unlimited prefetch for nearly all intents and purposes.

@drobert
Author

drobert commented Mar 31, 2020

I'm not buying your example. ... A prefetch of 100K is as good as an unlimited prefetch for nearly all intents and purposes.

Perhaps I'm not following the assertion that '100k is as good as unlimited'. Consider the flow mentioned:
1. consume 100k messages
2. write them to S3
3. acknowledge and accept the next 100k

Whether 100k is as good as unlimited would depend a lot on how slow step 2 is versus step 1, wouldn't it? If it takes a second or so to consume 100k messages from RabbitMQ but 3 seconds to finish a multipart upload to S3, the client will eventually fall behind and be susceptible to the "unbounded buffer problem" (as referenced in the RabbitMQ docs).

(Update) Is there a backpressure mechanism available here other than prefetch count?

@michaelklishin
Contributor

michaelklishin commented Mar 31, 2020

There is no other mechanism. For a lot of workloads, 100K consumed and unprocessed deliveries are as good as 200K or 1M: a heap spike and application process termination one way or another.

Very large batches are the only remotely practical scenario. Most workloads run into operational trouble with values well under 100K; the fact that this is the first such discussion I recall in years serves as decent proof. Usually the discussion goes the other way: values between 100 and the low single-digit thousands are adopted after a developer without any operational metrics decides to run without a limit and the reality of a limited heap size sinks in.

@michaelklishin
Contributor

Now, the above argument may be a good indication that there should be a limit of, say, 65K or even lower as a safety measure in the client. I just mentioned it because a "wontfix" is not out of the question for a 13-year-old behavior in any library.

@drobert
Author

drobert commented Mar 31, 2020

For a lot of workloads, 100K consumed and unprocessed deliveries are as good as 200K or 1M: a heap spike and application process termination one way or another.

I'm not entirely sure what the purpose of this part of the discussion is. Kudos, I suppose, to my development team for writing such streamlined software. However, the message count and heap size are a function of one another: 100k messages at 1 KB each is 100 MB of heap, which is not a particularly big problem. On the JVM in particular, this is not even an 'interesting number'. We're running a service with 2 GB of heap; 100k messages should be laughably small.

Further, I gave a fairly reasonable use case of how a large prefetch value is applicable, and I opened this ticket because we have this exact situation. Whether others have run into this or not I can't say (at least one other person has, given the existing Stack Overflow question).

However, the fact remains that there is an actual bug: I supply an out-of-range prefetch value and the client silently negotiates the truncated result (16960 in my case). I picked 100000 out of the blue because it illustrates the problem; any value affected by the 16-bit truncation is susceptible.

The rest of my commentary is simply that I'm unclear on the exact location of the bug and on what the technically supported range is. I pointed out the Python client supporting 65535 as a possible indication that there may be more than one bug on the Java side; I don't know for certain right now. I do know that I can supply a value the client library purports to handle but in fact mishandles.

@lukebakken
Contributor

As mentioned here, int is used because unsigned short doesn't exist in the JVM. A check that the prefetch value is in the range 0-65535 would suffice. @acogoluegnes, what do you think of that?

For now, just use a value in that range.
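For example, a minimal application-side workaround (assuming channel is an open com.rabbitmq.client.Channel):

```java
int desired = 100_000;
// Clamp to the unsigned 16-bit range until the client validates the value itself.
channel.basicQos(Math.min(Math.max(desired, 0), 65_535));
```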

@michaelklishin
Contributor

The purpose of the discussion above is to point out that such high values are not common, and this behavior is 13 years old. If only two people bring something up in 13 years, one response for the maintainers to consider is "do not change anything", as an API behavior change can do more harm than good.

If all we can do is support a narrower range, and the vast majority of users do not approach the new upper bound (65535?), then I'd say we should do this. Either option would be fine with me.

@acogoluegnes
Contributor

@lukebakken got it right: there's no unsigned short in Java, so we have to use an int. We can introduce a check to avoid the unexpected truncation. This can break some application code, but that may be better than silently doing something incorrect.

@acogoluegnes
Contributor

The check could take place when writing the frame; that is a centralized place.

Or it could be done in the different places where the problem can show up, that is, when setting QoS and also when setting the heartbeat and the maximum number of channels (tune-ok).
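As an illustration of the second approach, here is a hypothetical sketch of one shared check reused wherever an unsigned-short field is set (the names below are made up for this sketch, not taken from the actual fix):

```java
// Hypothetical helper; the real fix may be structured differently.
final class UnsignedShorts {

    private static final int MAX_UNSIGNED_SHORT = 65_535;

    private UnsignedShorts() { }

    static int checkUnsignedShort(String name, int value) {
        if (value < 0 || value > MAX_UNSIGNED_SHORT) {
            throw new IllegalArgumentException(
                name + " must be between 0 and " + MAX_UNSIGNED_SHORT + ", got " + value);
        }
        return value;
    }
}

// Possible call sites (illustrative):
// channel.basicQos(UnsignedShorts.checkUnsignedShort("prefetch count", requestedPrefetch));
// connectionFactory.setRequestedHeartbeat(UnsignedShorts.checkUnsignedShort("heartbeat", heartbeatSeconds));
// connectionFactory.setRequestedChannelMax(UnsignedShorts.checkUnsignedShort("channel max", channelMax));
```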

@michaelklishin
Contributor

I think it should be done by Channel instances when the value is set.

acogoluegnes added a commit that referenced this issue Apr 1, 2020
To avoid truncation and subtle bugs.

Fixes #640
@michaelklishin michaelklishin added this to the 6.0.0 milestone Apr 1, 2020
@michaelklishin
Contributor

@acogoluegnes if this can go into 5.9.0 (since it's a small change), feel free to backport and change the milestone.

@acogoluegnes
Contributor

I was thinking of cutting 5.8.1 for this; it's a bug fix after all.

@acogoluegnes acogoluegnes modified the milestones: 6.0.0, 5.8.1 Apr 1, 2020
@acogoluegnes acogoluegnes self-assigned this Apr 1, 2020
@michaelklishin
Contributor

@acogoluegnes this may attract some angst from the Church of Strict SemVer crowd, but it's fine with me.

@acogoluegnes
Contributor

Right, I'll make this explicit in the changelog. This should not affect many people, and it's the price to pay for correctness, I'd say.

acogoluegnes added a commit that referenced this issue Apr 1, 2020
To avoid truncation and subtle bugs.

Fixes #640

(cherry picked from commit 733788e)

Conflicts:
	src/test/java/com/rabbitmq/client/test/ConnectionFactoryTest.java
acogoluegnes added a commit that referenced this issue Apr 1, 2020
To avoid truncation and subtle bugs.

Fixes #640

(cherry picked from commit 733788e)

Conflicts:
	src/test/java/com/rabbitmq/client/test/ConnectionFactoryTest.java
(cherry picked from commit 14d9c1f)
acogoluegnes added a commit that referenced this issue Apr 1, 2020
To avoid truncation and subtle bugs.

Fixes #640

(cherry picked from commit 733788e)

Conflicts:
	src/main/java/com/rabbitmq/client/impl/ChannelN.java
	src/test/java/com/rabbitmq/client/test/ClientTests.java
	src/test/java/com/rabbitmq/client/test/ConnectionFactoryTest.java
acogoluegnes added a commit that referenced this issue Apr 1, 2020
To avoid truncation and subtle bugs.

Fixes #640

(cherry picked from commit 733788e)

Conflicts:
	src/main/java/com/rabbitmq/client/impl/ChannelN.java
	src/test/java/com/rabbitmq/client/test/ClientTests.java
	src/test/java/com/rabbitmq/client/test/ConnectionFactoryTest.java
(cherry picked from commit c244435)
@acogoluegnes acogoluegnes changed the title from "Consider changing QoS prefetch field type to short" to "Check qos, heartbeat, max channel are unsigned shorts" Apr 2, 2020
@acogoluegnes acogoluegnes modified the milestones: 5.8.1, 6.0.0 Apr 2, 2020
acogoluegnes added a commit that referenced this issue Feb 22, 2021
Difference between 6.x and 5.x.

References #640, #642, #672