Check qos, heartbeat, max channel are unsigned shorts #640

Closed
drobert opened this issue Mar 31, 2020 · 16 comments · Fixed by #641

Comments
@drobert

drobert commented Mar 31, 2020

In all versions of this client, including the current version in master, the basic QoS "prefetch count" property is declared as type int. However, the AMQP 0.9.1 reference specifies prefetch count to be of type short, which it further clarifies to be a 16-bit integer.

This most naturally maps to the Java type short rather than int.

As a result, it's relatively simple to specify values that are out of bounds, such as 100000 (100k). These values have their most significant bits truncated, so the server receives something else entirely: only the 16 least significant bits survive, which for 100000 come out to 34464. (See this SO post for context.)

Possible fixes for this would include one or all of:
a) explicitly failing if the value supplied is > Short.MAX_VALUE
b) changing the allowed type in the client from int to short
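To make the truncation described above concrete, here is a minimal standalone sketch (illustrative only, not the client's actual code):

```java
public class PrefetchTruncationDemo {
    public static void main(String[] args) {
        int requested = 100_000;
        // Writing an int into a 16-bit AMQP short keeps only the low 16 bits.
        short truncated = (short) requested;
        int seenByServer = truncated & 0xFFFF; // the broker reads the field as unsigned
        System.out.println(seenByServer);      // prints 34464, not 100000
    }
}
```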

@michaelklishin
Contributor

I'm not sure what the big idea would be behind setting a prefetch of 100K. You might as well not have any limit; the practical outcome would be the same. This is the first time I remember this being brought up in 10-11 years of maintaining RabbitMQ clients.

A PR that implements either option (@acogoluegnes do you have any preferences?) or even both via two overloads would be considered.

@michaelklishin michaelklishin changed the title from "prefetch count as int rather than short" to "Consider changing QoS prefetch field type to short" Mar 31, 2020
@drobert
Author

drobert commented Mar 31, 2020

Hmm, correction: I'm hearing reports that a Python client supports 65535. It's technically unclear to me whether the truncation takes place in the JVM or in a wire-protocol translation layer. If the AMQP 'short' is an unsigned 16-bit integer, then a Java 'int' makes sense, but I think it would still require some safeguarding on the Java client side to disallow unsupported values. (Further, if 32k-64k is supported by the protocol and the Java client isn't sending these values properly, that's perhaps a separate issue?)

As for 'why', I can give a sample use case: implementing something similar to Kafka Connect that drains messages to S3, optimizing for larger file sizes. For example: take 100,000 messages, stream them to S3, then 'ack' the messages. The late 'ack' is important to ensure messages are not lost. A smaller prefetch would cap the number of messages available to be written at 32k.
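Roughly, that pattern could look like the following sketch against the Java client's Channel API (the queue name, batch handling, and uploadBatchToS3 helper are illustrative assumptions, not part of any real integration):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.Delivery;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeoutException;

public class BatchToS3Consumer {

    static final int BATCH_SIZE = 100_000;

    public static void main(String[] args) throws IOException, TimeoutException {
        Connection connection = new ConnectionFactory().newConnection();
        Channel channel = connection.createChannel();
        // The prefetch this issue is about; today an out-of-range value is silently truncated.
        channel.basicQos(BATCH_SIZE);

        List<Delivery> buffer = new ArrayList<>(); // single consumer, so a plain list suffices
        channel.basicConsume("events", false, (consumerTag, delivery) -> {
            buffer.add(delivery);
            if (buffer.size() >= BATCH_SIZE) {
                uploadBatchToS3(buffer); // e.g. stream the batch as a multipart upload
                // multiple = true acknowledges everything up to and including this delivery tag
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), true);
                buffer.clear();
            }
        }, consumerTag -> { });
    }

    static void uploadBatchToS3(List<Delivery> batch) {
        // placeholder for the S3 upload described above
    }
}
```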

@michaelklishin
Contributor

michaelklishin commented Mar 31, 2020

A traffic capture would provide a lot of relevant information.

I'm not buying your example. You seem to assume that an infinite prefetch implies consuming with automatic acknowledgments. They are orthogonal settings. Since there is no notification of when the prefetch is hit, in the end, it is the consumer that will do all the counting of deliveries received to date in your example. A prefetch of 100K is as good as an unlimited prefetch for nearly all intents and purposes.

@drobert
Author

drobert commented Mar 31, 2020

I'm not buying your example. ... A prefetch of 100K is as good as an unlimited prefetch for nearly all intents and purposes.

Perhaps I'm not following the assertion that '100k is as good as unlimited'. Consider the flow mentioned:
1. consume 100k messages
2. write them to S3
3. acknowledge and accept the next 100k

Whether 100k is as good as unlimited would depend a lot on how slow step 2 is versus step 1, wouldn't it? If it takes a second or so to consume 100k messages from RabbitMQ but 3 seconds to finish a multipart upload to S3, the client will eventually fall behind and be susceptible to the "unbounded buffer problem" (as referenced in the RabbitMQ docs).

(Update) Is there a backpressure mechanism available here other than prefetch count?

@michaelklishin
Contributor

michaelklishin commented Mar 31, 2020

There is no other mechanism. For a lot of workloads, 100K consumed and unprocessed deliveries are as good as 200K or 1M: a heap spike and application process termination one way or another.

Very large batches are the only remotely practical scenario. Most workloads run into operational trouble with values well under 100K; the fact that this is the first such discussion I recall in years serves as decent proof. Usually the discussion goes the other way: values between 100 and the low single-digit thousands are adopted after a developer without any operational metrics decides to run without a limit and the reality of a limited heap size sinks in.

@michaelklishin
Contributor

Now, the above argument may be a good indication that there should be a limit of, say, 65K or even lower as a safety measure in the client. I just mentioned it because a "wontfix" is not out of the question for a 13-year-old behavior in any library.

@drobert
Author

drobert commented Mar 31, 2020

For a lot of workloads, 100K consumed and unprocessed deliveries are as good as 200K or 1M: a heap spike and application process termination one way or another.

I'm not entirely sure what the purpose of this part of the discussion is. Kudos, I suppose, to my development team for writing such streamlined software. However, the message count and heap size are a function of one another: 100k messages at 1 KB each is 100 MB of heap, which is not a particularly big problem. On the JVM in particular, this is not even an 'interesting number'. We're running a service with 2 GB of heap; 100k messages should be laughably small.

Further, I gave a fairly reasonable use case of how a large prefetch value is applicable, and I opened this ticket because we have this exact situation. Whether others have run into this or not I can't say (at least one other person has, given the existing Stack Overflow question).

However, the fact remains that there is an actual bug: I supply an out-of-range prefetch value and the client silently negotiates the truncated result (16960 in my case). I picked 100000 out of the blue because it illustrates the problem; any value affected by the 16-bit truncation is susceptible.

The rest of my commentary is simply that I'm unclear on the exact location of the bug and on what the technically supported range is. I pointed out the Python client supporting 65535 as a possible indication that there may be more than one bug on the Java side; I don't know for certain right now. I do know that I can supply a value the client library purports to handle but in fact mishandles.

@lukebakken
Contributor

As mentioned here, int is used because unsigned short doesn't exist in the JVM. A check that the prefetch value is in the range 0-65535 would suffice. @acogoluegnes, what do you think of that?

For now, just use a value in that range.
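For example, a minimal application-side workaround (assuming channel is an open com.rabbitmq.client.Channel):

```java
int desired = 100_000;
// Clamp to the unsigned 16-bit range until the client validates the value itself.
channel.basicQos(Math.min(Math.max(desired, 0), 65_535));
```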

@michaelklishin
Contributor

The purpose of the discussion above is to point out that such high values are not common, and this behavior is 13 years old. If only two people bring something up in 13 years, one response for the maintainers to consider is "do not change anything", as an API behavior change can do more harm than good.

If all we can do is support a narrower range, and the vast majority of users do not approach the new upper bound (65535?), then I'd say we should do this. Either option would be fine with me.

@acogoluegnes
Contributor

@lukebakken got it right: there's no unsigned short in Java, so we have to use an int. We can introduce a check to avoid the unexpected truncation. This can break some application code, but that may be better than silently doing something incorrect.

@acogoluegnes
Contributor

The check could take place when writing the frame; that is a centralized place.

Or it could be done in the different places where the problem can show up, that is, when setting QoS and also when setting the heartbeat and the maximum number of channels (tune-ok).
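As an illustration of the second approach, here is a hypothetical sketch of one shared check reused wherever an unsigned-short field is set (the names below are made up for this sketch, not taken from the actual fix):

```java
// Hypothetical helper; the real fix may be structured differently.
final class UnsignedShorts {

    private static final int MAX_UNSIGNED_SHORT = 65_535;

    private UnsignedShorts() { }

    static int checkUnsignedShort(String name, int value) {
        if (value < 0 || value > MAX_UNSIGNED_SHORT) {
            throw new IllegalArgumentException(
                name + " must be between 0 and " + MAX_UNSIGNED_SHORT + ", got " + value);
        }
        return value;
    }
}

// Possible call sites (illustrative):
// channel.basicQos(UnsignedShorts.checkUnsignedShort("prefetch count", requestedPrefetch));
// connectionFactory.setRequestedHeartbeat(UnsignedShorts.checkUnsignedShort("heartbeat", heartbeatSeconds));
// connectionFactory.setRequestedChannelMax(UnsignedShorts.checkUnsignedShort("channel max", channelMax));
```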

@michaelklishin
Contributor

I think it should be done by Channel instances when the value is set.

acogoluegnes added a commit that referenced this issue Apr 1, 2020
To avoid truncation and subtle bugs.

Fixes #640
@michaelklishin michaelklishin added this to the 6.0.0 milestone Apr 1, 2020
@michaelklishin
Contributor

@acogoluegnes if this can go into 5.9.0 (since it's a small change), feel free to backport and change the milestone.

@acogoluegnes
Contributor

I was thinking of cutting 5.8.1 for this; it's a bug fix after all.

@acogoluegnes acogoluegnes modified the milestones: 6.0.0, 5.8.1 Apr 1, 2020
@acogoluegnes acogoluegnes self-assigned this Apr 1, 2020
@michaelklishin
Contributor

@acogoluegnes this may attract some angst from the Church of Strict SemVer crowd, but it's fine with me.

@acogoluegnes
Contributor

Right, I'll make this explicit in the changelog. This should not affect many people, and it's the price to pay for correctness, I'd say.

acogoluegnes added a commit that referenced this issue Apr 1, 2020
To avoid truncation and subtle bugs.

Fixes #640

(cherry picked from commit 733788e)

Conflicts:
	src/test/java/com/rabbitmq/client/test/ConnectionFactoryTest.java
acogoluegnes added a commit that referenced this issue Apr 1, 2020
To avoid truncation and subtle bugs.

Fixes #640

(cherry picked from commit 733788e)

Conflicts:
	src/test/java/com/rabbitmq/client/test/ConnectionFactoryTest.java
(cherry picked from commit 14d9c1f)
acogoluegnes added a commit that referenced this issue Apr 1, 2020
To avoid truncation and subtle bugs.

Fixes #640

(cherry picked from commit 733788e)

Conflicts:
	src/main/java/com/rabbitmq/client/impl/ChannelN.java
	src/test/java/com/rabbitmq/client/test/ClientTests.java
	src/test/java/com/rabbitmq/client/test/ConnectionFactoryTest.java
acogoluegnes added a commit that referenced this issue Apr 1, 2020
To avoid truncation and subtle bugs.

Fixes #640

(cherry picked from commit 733788e)

Conflicts:
	src/main/java/com/rabbitmq/client/impl/ChannelN.java
	src/test/java/com/rabbitmq/client/test/ClientTests.java
	src/test/java/com/rabbitmq/client/test/ConnectionFactoryTest.java
(cherry picked from commit c244435)
@acogoluegnes acogoluegnes changed the title from "Consider changing QoS prefetch field type to short" to "Check qos, heartbeat, max channel are unsigned shorts" Apr 2, 2020
@acogoluegnes acogoluegnes modified the milestones: 5.8.1, 6.0.0 Apr 2, 2020
acogoluegnes added a commit that referenced this issue Feb 22, 2021
Difference between 6.x and 5.x.

References #640, #642, #672