Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consumer Update Delivered Last Sequence Behavior [v2.10.17] #5844

Open
casix opened this issue Aug 28, 2024 · 7 comments
Open

Consumer Update Delivered Last Sequence Behavior [v2.10.17] #5844

casix opened this issue Aug 28, 2024 · 7 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@casix
Copy link

casix commented Aug 28, 2024

Observed behavior

Consumers increase the delivered last sequence, even though the stream they are connected to hasn't received any messages

Expected behavior

Each consumer only increases its own delivered last sequence when it reads a message from its stream.

Server and client version

In the cloud:

  • Helm version 1.2.1 with Applications version 2.10.17

In the leafnodes:

  • windows client version 2.10.18

Host environment

Cloud Environment:

  • Running in an EKS cluster.
  • Helm version: 1.2.1.
  • Application version: 2.10.17.
  • JetStream (in-disk mode) is enabled with a specific JetStream domain.
  • All leaf nodes connect to the same account using different usernames.

IoT Environment:

  • Running on Windows.
  • NATS version: 2.10.18.
  • No accounts.
  • Two local users.
  • Unique JetStream domain in each iot device.
  • Connects to the cloud server as a leaf node

Stream and Consumer Setup:
Each IoT device has the following NATS resources:

  • In the cloud, we create a stream.
  • In the leaf node (its own JetStream domain), we have:
    • A workqueue stream mirroring the cloud stream. We specify a unique delivery prefix.
    • A pull consumer that consumes from the local stream.

Specific Example (for IoT device “Alice”):

  • In the cloud:
    • stream 'down_alice' is subscribed to the subject 'down.alice'.
  • In the IoT device “Alice” (within the JetStream domain called 'alice'):
    • A workqueue stream named 'down' mirrors the stream 'down_alice' with delivery prefix 'deliver.cloud.alice'.
    • A pull consumer named 'down_consumer' operates on the local stream 'down'.

Steps to reproduce

Set up an environment as described, and send a message to down_alice with a client connected directly to the cloud. Then, check the delivered last sequence in all consumers.

Side effects

If we have Alice and Bob IoT devices, and we send 10 messages to Alice, both Bob's and Alice's consumers will have their delivered last sequence set to 10. If I now send a message to Bob, the message is lost. Both streams, 'down_bob' and 'down' (the local one), increase their own last sequence, but Bob's consumer does not change anything.

If I continue sending messages to Bob, the behavior remains the same until the sequence number of a message is 11. At that point, because it is greater than the consumer's delivered last sequence, the consumer reads it. However, Alice's consumer also increases its own last delivered sequence.

I suspect that the messages lost is because the 'down' queue is a workqueue, and since the consumer's last delivered sequence is higher than the sequence number of the message, the consumer assumes the message has already been read.

The stream with the highest sequence number will work (the messages will be read by the consumer), while the other consumers will discard all messages until their stream has the highest sequence.

If you need more information, feel free to ask.

Thanks!

@casix casix added the defect Suspected defect such as a bug or regression label Aug 28, 2024
@casix
Copy link
Author

casix commented Aug 29, 2024

I set two the leaf nodes servers to trace mode and analyzed the logs.

What I observed is this:

(I tried to clean up the logs. I've left some lines that I think aren't important, but I'm not sure. I deleted lines that I believe are not relevant (I hope I didn't delete any important ones))

  • When I send a message to the stream with the highest sequence:

    • Logs from the server that receives the message:
      [8924] 2024/08/29 11:11:14.101905 [TRC] 54.171.110.45:443 - lid_ws:27 - <<- [LMSG deliver.cloud.9f4849d5-45ff-403f-a955-3e4042685f72.M.PfgeANUG $JS.ACK.dev-rc_devices_down_9f4849d5-45ff-403f-a955-3e4042685f72.i3UZezSD.1.4064.4061.1724922674167386096.0@dev-rc.devices.down.9f4849d5-45ff-403f-a955-3e4042685f72 4]
      [8924] 2024/08/29 11:11:14.101905 [TRC] 54.171.110.45:443 - lid_ws:27 - <<- MSG_PAYLOAD: ["hola"]
      [8924] 2024/08/29 11:11:14.102555 [TRC] 127.0.0.1:51140 - cid:36 - ->> [MSG dev-rc.devices.down.9f4849d5-45ff-403f-a955-3e4042685f72 3 $JS.ACK.down.down_consumer.1.4064.5.1724922674167386096.0 4]
      [8924] 2024/08/29 11:11:14.104112 [TRC] 127.0.0.1:51140 - cid:36 - <<- [PUB $JS.ACK.down.down_consumer.1.4064.5.1724922674167386096.0 5]
      [8924] 2024/08/29 11:11:14.104656 [TRC] 127.0.0.1:51140 - cid:36 - <<- MSG_PAYLOAD: ["+TERM"]
      [8924] 2024/08/29 11:11:14.104656 [TRC] 54.171.110.45:443 - lid_ws:27 - ->> [LMSG $JS.ACK.down.down_consumer.1.4064.5.1724922674167386096.0 5]
      [8924] 2024/08/29 11:11:14.104656 [TRC] 127.0.0.1:51140 - cid:36 - <<- [PUB $JS.API.CONSUMER.MSG.NEXT.down.down_consumer _INBOX.Wpoi6ePfT40yDp1cirdnpL 35]
      [8924] 2024/08/29 11:11:14.104656 [TRC] 127.0.0.1:51140 - cid:36 - <<- MSG_PAYLOAD: ["{\"batch\": 1, \"expires\": 4999900000}"]
      [8924] 2024/08/29 11:11:14.104656 [DBG] 54.171.110.45:443 - lid_ws:27 - Not permitted to deliver to "$JS.API.CONSUMER.MSG.NEXT.down.down_consumer"

    • Logs from the other server:
      [4672] 2024/08/29 11:11:14.228083 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- [LMSG $JS.ACK.down.down_consumer.1.4064.5.1724922674167386096.0 5]
      [4672] 2024/08/29 11:11:14.228083 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- MSG_PAYLOAD: ["+TERM"]

  • When I send a message to the stream that does not have the highest sequence:

    • Logs from the server that receives the message:
      [4672] 2024/08/29 11:12:04.711266 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- [LMSG deliver.cloud.9a333ee0-b3d3-4393-bf13-320a63dd1b50.M.PXFU256b $JS.ACK.dev-rc_devices_down_9a333ee0-b3d3-4393-bf13-320a63dd1b50.LD5rWYj2.1.4064.4064.1724922724697142996.0@dev-rc.devices.down.9a333ee0-b3d3-4393-bf13-320a63dd1b50 4]
      [4672] 2024/08/29 11:12:04.711266 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- MSG_PAYLOAD: ["hola"]
      [4672] 2024/08/29 11:12:04.810368 [TRC] 52.30.103.187:443 - lid_ws:21 - ->> [HMSG deliver.9a333ee0-b3d3-4393-bf13-320a63dd1b50.cloud.S.zX2vOagH 75 75]
      [4672] 2024/08/29 11:12:04.810368 [TRC] 52.30.103.187:443 - lid_ws:21 - ->> [HMSG deliver.9a333ee0-b3d3-4393-bf13-320a63dd1b50.cloud.S.EhQlL4ZA 75 75]
      [4672] 2024/08/29 11:12:04.810368 [TRC] 52.30.103.187:443 - lid_ws:21 - ->> [HMSG deliver.9a333ee0-b3d3-4393-bf13-320a63dd1b50.cloud.S.WjICBx4N 75 75]
      [4672] 2024/08/29 11:12:04.810368 [TRC] 52.30.103.187:443 - lid_ws:21 - ->> [HMSG deliver.9a333ee0-b3d3-4393-bf13-320a63dd1b50.cloud.S.YsTdnk8j 75 75]
      [4672] 2024/08/29 11:12:04.856976 [TRC] 127.0.0.1:51085 - cid:26 - ->> [HMSG _INBOX.prq3VgHHPEiLyRURkcw7i1 2 81 81]
      [4672] 2024/08/29 11:12:04.918818 [TRC] 127.0.0.1:51085 - cid:26 - <<- [PUB $JS.API.CONSUMER.MSG.NEXT.sessions.events_sessions_consumer _INBOX.prq3VgHHPEiLyRURkcw7i1 35]
      [4672] 2024/08/29 11:12:04.918818 [TRC] 127.0.0.1:51085 - cid:26 - <<- MSG_PAYLOAD: ["{\"batch\": 1, \"expires\": 4999900000}"]
      [4672] 2024/08/29 11:12:04.918818 [DBG] 52.30.103.187:443 - lid_ws:21 - Not permitted to deliver to "$JS.API.CONSUMER.MSG.NEXT.sessions.events_sessions_consumer"
      [4672] 2024/08/29 11:12:05.711545 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- [HMSG deliver.cloud.9a333ee0-b3d3-4393-bf13-320a63dd1b50.M.PXFU256b 81 81]
      [4672] 2024/08/29 11:12:05.711636 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- MSG_PAYLOAD: ["NATS/1.0 100 Idle Heartbeat\r\nNats-Last-Consumer: 4064\r\nNats-Last-Stream: 4064\r\n\r\n"]

    • Logs from the other server:
      [8924] 2024/08/29 11:12:04.130224 [TRC] 54.171.110.45:443 - lid_ws:27 - <<- [HMSG deliver.cloud.9f4849d5-45ff-403f-a955-3e4042685f72.M.PfgeANUG 81 81]
      [8924] 2024/08/29 11:12:04.130224 [TRC] 54.171.110.45:443 - lid_ws:27 - <<- MSG_PAYLOAD: ["NATS/1.0 100 Idle Heartbeat\r\nNats-Last-Consumer: 4061\r\nNats-Last-Stream: 4064\r\n\r\n"]
      [8924] 2024/08/29 11:12:04.130389 [TRC] 54.171.110.45:443 - lid_ws:27 - ->> [HMSG deliver.9f4849d5-45ff-403f-a955-3e4042685f72.cloud.S.uGuKHIiG 75 75]
      [8924] 2024/08/29 11:12:04.782781 [TRC] 127.0.0.1:51140 - cid:36 - ->> [HMSG _INBOX.Wpoi6ePfT40yDp1cirdnpL 3 81 81]
      [8924] 2024/08/29 11:12:04.845309 [TRC] 54.171.110.45:443 - lid_ws:27 - ->> [HMSG deliver.9f4849d5-45ff-403f-a955-3e4042685f72.cloud.S.qRFCuY6K 75 75]
      [8924] 2024/08/29 11:12:04.845309 [TRC] 54.171.110.45:443 - lid_ws:27 - ->> [HMSG deliver.9f4849d5-45ff-403f-a955-3e4042685f72.cloud.S.Yh26ypG9 75 75]
      [8924] 2024/08/29 11:12:04.845309 [TRC] 127.0.0.1:51140 - cid:36 - <<- [PUB $JS.API.CONSUMER.MSG.NEXT.down.down_consumer _INBOX.Wpoi6ePfT40yDp1cirdnpL 35]
      [8924] 2024/08/29 11:12:04.845309 [TRC] 127.0.0.1:51140 - cid:36 - <<- MSG_PAYLOAD: ["{\"batch\": 1, \"expires\": 4999900000}"]
      [8924] 2024/08/29 11:12:04.845309 [DBG] 54.171.110.45:443 - lid_ws:27 - Not permitted to deliver to "$JS.API.CONSUMER.MSG.NEXT.down.down_consumer"

I don't have the expertise to fully analyze this, but maybe it helps.

What seems most suspicious to me is this message:
[4672] 2024/08/29 11:11:14.228083 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- [LMSG $JS.ACK.down.down_consumer.1.4064.5.1724922674167386096.0 5]
[4672] 2024/08/29 11:11:14.228083 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- MSG_PAYLOAD: ["+TERM"]

This ack message is received by both servers, but I understand that only the server where the consumer receives the message should receive it. However, I could be wrong...

@casix
Copy link
Author

casix commented Aug 29, 2024

I made a test using different consumer names on each leaf node. I also had to change the way we consume messages from our Python program to a CLI command. Now, when I send a message to the stream with the highest sequence, the logs are:

  • On the server that receives the request:
    [8924] 2024/08/29 13:07:40.204543 [TRC] 127.0.0.1:57693 - cid:50 - ->> [MSG dev-rc.devices.down.9f4849d5-45ff-403f-a955-3e4042685f72 2 $JS.ACK.down.down_consumer_9f4849d5-45ff-403f-a955-3e4042685f72.1.4069.3.1724929631952185043.0 4]
    [8924] 2024/08/29 13:07:40.204543 [TRC] 127.0.0.1:57693 - cid:50 - <-> [DELSUB 2]
    [8924] 2024/08/29 13:07:40.204543 [TRC] 54.171.110.45:443 - lid_ws:27 - ->> [LS- _INBOX.T3dKjp9BZd0iB8ioQ8TbLO.NKVZp7uj]
    [8924] 2024/08/29 13:07:40.205055 [TRC] 127.0.0.1:57693 - cid:50 - <<- [PUB $JS.ACK.down.down_consumer_9f4849d5-45ff-403f-a955-3e4042685f72.1.4069.3.1724929631952185043.0 0]
    [8924] 2024/08/29 13:07:40.205055 [TRC] 127.0.0.1:57693 - cid:50 - <<- MSG_PAYLOAD: [""]
    [8924] 2024/08/29 13:07:40.205055 [TRC] 127.0.0.1:57693 - cid:50 - <<- [PING]
  • On the other server
    nothing relevant

Another important point is that we are mostly sure this behavior appeared when we changed the consumers from push to pull.

@ripienaar
Copy link
Contributor

What's happening here appears to be 2 domains with identical stream and consumer names.

The ack messages like $JS.ACK.STREAM.CONSUMER.1.2.2.1725279000278321015.0 are not domain scoped and so the leaf nodes with the same stream and consumer name both get the acks.

Because we don't track what are valid outstanding ACKs and just accept any ACK this causes actual ACKs to happen.

@ripienaar
Copy link
Contributor

Best option today is to make unique stream names in each domain.

@ripienaar
Copy link
Contributor

We do have a domain aware ACK format, but at least in my setup and yours that's not used - asking around to find out how that works.

@casix
Copy link
Author

casix commented Sep 2, 2024

Best option today is to make unique stream names in each domain.

Make unique consumers is valid too? or is better to make unique streams?

@ripienaar
Copy link
Contributor

Unique consumer would also avoid it yeah

@wallyqs wallyqs changed the title Consumer Update Delivered Last Sequence Behavior Consumer Update Delivered Last Sequence Behavior [v2.10.17] Sep 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect Suspected defect such as a bug or regression
Projects
None yet
Development

No branches or pull requests

2 participants