Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Freeswitch packet relay metrics #2292

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from
Draft

Conversation

jchavanton
Copy link
Contributor

@jchavanton jchavanton commented Oct 27, 2023

As I am now using the elastic jitter buffer, I wanted precise metrics on the time packets are spending inside FS.

I made some modifications to have such reporting.

This is the artificial jitter I am using
sudo tc qdisc add dev ethx root netem delay 475ms 600ms

Example with a test with high jitter with/without the elastic jitter buffer activated.

With the elastic jitter buffer activated

{
  "call_id_in": "j6mk7sg4g7kppjr4koii",
  "call_id_out": "a5045e74-c3c5-4fe7-825e-49d8185cc852",
  "report": {
    "in": {
      "ssrc": "0x8A0F2483",
      "remote_socket": "172.21.61.69:11836",
      "local_socket": "172.21.61.69:33034",
      "codec": "opus",
      "count": 1045,
      "plc": 36                        << very little PLC was required
    },
    "out": {
      "ssrc": "0xB93EA6DF",
      "remote_socket": "172.21.50.119:27362",
      "local_socket": "172.21.61.69:19990",
      "codec": "PCMU",
      "count": 1081,
      "max": 1100,
      "avg": 429.21               << average time spend in the jitter buffer in ms, it seems to make perfect sense
    }
  }
}

Without the elastic jitter buffer activated

{
  "call_id_in": "hh95nco8uo20428poa9q",
  "call_id_out": "45e7d02a-bd72-4ee3-9dad-cec79fe5247c",
  "report": {
    "in": {
      "ssrc": "0xC81DB53F",
      "remote_socket": "172.21.61.69:11810",
      "local_socket": "172.21.61.69:29424",
      "codec": "opus",
      "count": 1654,
      "plc": 496      << a lot of plc was generated, the audio was terrible
    },
    "out": {
      "ssrc": "0xB93EA5D4",
      "remote_socket": "172.21.50.81:26902",
      "local_socket": "172.21.61.69:18580",
      "codec": "PCMU",
      "count": 2150,
      "max": 479,                  
      "avg": 19.00         << average size in ms, this is not looking good 
    }
  }
}

load tests

I did several load tests to measure the impact of the extra work load and nothing significant on a very small instance for example.

With packet stats

Without packet stats

I understand that some refactoring and modifications will be needed to consider merging this ...

but I wanted to get some feedback to see if you guys see the value of such feature.

@signalwire-ci
Copy link

signalwire-ci bot commented Oct 27, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Oct 27, 2023

@@ -57,6 +57,7 @@ typedef struct {
char body[SWITCH_RTP_MAX_BUF_LEN+4+sizeof(char *)];
switch_rtp_hdr_ext_t *ext;
char *ebody;
switch_time_t received_ts;
Copy link
Contributor

@dragos-oancea dragos-oancea Oct 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be here, nor in the rtp_msg_t . If you want to to add the received timestamp to the JB you will have to create a different struct with both switch_rtp_packet_t and switch_time_t in it.
Expose it in switch_rtp.h and use it for switch_jb_put_packet() and switch_jb_get_packet() (at least) .
eg:
typedef struct {
switch_rtp_packet_t pkt;
switch_time_t rcvd_time;
} switch_rtp_incoming_pkt_t;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, I wanted to measure from the input to the output socket to be sure that the measurements are taken as close as possible to the input and output.

Do I understand that adding this at the packet level is to intrusive and measuring only at the jitter buffer in/out would be preferable ?

This approach would probably require a bit more code modification since some functions parameters will
have to be modified to support this new struct.

Thanks for the feedback i will take some time to consider and look at it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you'll need a rcvd_time in the switch_jb_node_t too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the sizeof() of those structs (rtp_msg_t or switch_rtp_packet_t) - better to keep them as they are in master branch now.

@@ -170,6 +170,8 @@ struct switch_core_session {
switch_buffer_t *text_line_buffer;
switch_mutex_t *text_mutex;
const char *external_id;

packet_stats_t stats;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be here. We have RTP stats on the session through media_handle -> RTP engines.
See switch_rtp_stats_t stats in the rtp_session.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, will look at it

@@ -87,6 +87,7 @@ typedef struct switch_frame_geometry {
payload_map_t *pmap;
switch_image_t *img;
struct switch_frame_geometry geometry;
switch_time_t received_ts;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a timestamp on the frame. Do we really need another ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I will look at that other timestamp to see if it is timestamped at the extraction of the frame from the packet or at the reception of the packet like the one I am using, even both should be the same most of the time.

@jchavanton
Copy link
Contributor Author

jchavanton commented Nov 2, 2023

Looking that the packet relay metrics, I found a problem with the EJB.

2023-11-01 22:20:04.423912 73.87% [INFO] switch_cpp.cpp:1392 DIRECTION[outbound]REPORT[{"call_id_in": "2vhualhn155j6c6stl07",  "call_id_out": "bac93dce-96a3-468e-9225-a3f9337acfb3", "report": {"in" : { "ssrc": "0x537EC1F3", "remote_socket": "172.31.61.245:13114", "local_socket": "172.31.61.245:34366", "codec": "opus", "count": 93322, "plc": 77165},"out" : { "ssrc": "0xD95425B9", "remote_socket": "172.31.50.26:35840", "local_socket": "172.31.61.245:15564", "codec": "PCMU", "count": 170487, "max": 1824759, "avg": 464684.78 }}}]

2023-11-01 22:20:04.443912 73.87% [INFO] switch_cpp.cpp:1392 DIRECTION[inbound]REPORT[{"call_id_in": "bac93dce-96a3-468e-9225-a3f9337acfb3",  "call_id_out": "2vhualhn155j6c6stl07", "report": {"in" : { "ssrc": "0x0E8C1492", "remote_socket": "172.31.50.26:15564", "local_socket": "172.31.61.245:35840", "codec": "PCMU", "count": 187168, "plc": 0},"out" : { "ssrc": "0x7638CC79", "remote_socket": "172.31.61.245:34366", "local_socket": "172.31.61.245:13114", "codec": "opus", "count": 187168, "max": 20, "avg": 9.74 }}}]

2023-11-01 22:20:04.443912 73.87% [INFO] switch_cpp.cpp:1392 [http-kamailio] :{"in_call_id": "2vhualhn155j6c6stl07", "out_call_id": "bac93dce-96a3-468e-9225-a3f9337acfb3", "jb":{"size_max_ms":780,"size_est_ms":0,"acceleration_ms":53120,"expand_ms":53060,"fast_acceleration_ms":0,"forced_acceleration_ms":0,"jitter_max_ms":49,"jitter_est_ms":0,"reset":7,"reset_too_big":0}}
in_call_id: 2vhualhn155j6c6stl07 
out_call_id: bac93dce-96a3-468e-9225-a3f9337acfb3

based on packet stat report :
duration: 62.38 minutes (187168/50/60) 
plc: 25 minutes (77165/50/60)
average one way delay : 7.74 minutes (464684/1000/60)
max one way delay : 30.41 minutes (1824759/1000/60)

from  Web >> "count": 170487, "max": 1824759, "avg": 464684.78,  "duration": 60 minutes
to    web << "count": 187168, "max": 20, "avg": 9.74              

I will search the root, cause, maybe packet sequence id roll-over is not well handled ...

However, a lot of load tests were conducted with random call duration, not sure why this would not been triggered ...

@jchavanton
Copy link
Contributor Author

I am load testing while forcing a lot of elasticity

// exacerbate the expand / accelerate behavior 
 if (jb->jitter.stats.fake_burst > 0 || ((uint16_t) rand()) < 500) {
         status = SWITCH_STATUS_NOTFOUND; 
         if (jb->jitter.stats.fake_burst > 0) {
                 jb->jitter.stats.fake_burst--;
         } else {
                 jb->jitter.stats.fake_burst = rand() % 25;
         }
 }
 // pretend that the expected packet is not in the JB
 jb->jitter.stats.estimate_ms = 120;
 

I can not believe I will not be able to reproduce on calls that can last up to 50 minutes.

@signalwire-ci
Copy link

signalwire-ci bot commented Nov 6, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Nov 6, 2023

@dragos-oancea
Copy link
Contributor

could you do unit-tests upon this work ?

@jchavanton
Copy link
Contributor Author

jchavanton commented Nov 6, 2023

could you do unit-tests upon this work ?

sure, what are we using for unit tests ?

(edit, I just saw the unit test task, I will look at it)

@signalwire-ci
Copy link

signalwire-ci bot commented Nov 9, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Nov 9, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Nov 13, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Nov 13, 2023

@jchavanton
Copy link
Contributor Author

jchavanton commented Nov 14, 2023

I think I found the problem, thanks to this report.

{
  "rp": {
    "call_id_in": "g0b9v5v17e9gc0lcdplr",
    "call_id_out": "9c2b2026-b65a-4569-8812-d6ffaee6e73f",
    "report": {
      "in": {
        "ssrc": "0x3E506D97",
        "remote_socket": "172.31.61.245:13230",
        "local_socket": "172.31.61.245:34534",
        "codec": "opus",
        "count": 3766,
        "plc": 33420
      },
      "out": {
        "ssrc": "0xF16505E3",
        "remote_socket": "172.31.51.185:34256",
        "local_socket": "172.31.61.245:14556",
        "codec": "PCMU",
        "count": 37186,
        "max": 1807499,
        "avg": 1009222.50
      }
    }
  },
  "jb": {
    "in_call_id": "g0b9v5v17e9gc0lcdplr",
    "out_call_id": "9c2b2026-b65a-4569-8812-d6ffaee6e73f",
    "jb": {
      "size_max_ms": 220,
      "size_est_ms": 0,
      "acceleration_ms": 9580,
      "expand_ms": 660520,
      "fast_acceleration_ms": 0,
      "forced_acceleration_ms": 0,
      "jitter_max_ms": 29,
      "jitter_est_ms": 14,
      "reset": 7,
      "reset_too_big": 0
    }
  }
}

seems obvious that one of the value here or something can be wrong .

if (jb->jitter.stats.buffer_size_ms < (3 * jb->jitter.stats.estimate_ms)) {
   ... expand ...

I will add some logging and protection here

  • negative jitter
  • jitter too high to make sense
  • negative buffer size

But I think I see it, very large burst or long lasting PLC can create that !

Note to myself : double check that PLC packets are not wrongly computed in terms of delay.

@signalwire-ci
Copy link

signalwire-ci bot commented Nov 14, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Nov 14, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Nov 14, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Nov 14, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Nov 15, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Nov 15, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Nov 16, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Nov 16, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Nov 16, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Nov 16, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Dec 18, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Dec 18, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Dec 19, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Dec 19, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Dec 21, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Dec 21, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Dec 21, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Dec 21, 2023

@signalwire-ci
Copy link

signalwire-ci bot commented Jan 4, 2024

@signalwire-ci
Copy link

signalwire-ci bot commented Jan 4, 2024

@signalwire-ci
Copy link

signalwire-ci bot commented Jan 4, 2024

@signalwire-ci
Copy link

signalwire-ci bot commented Jan 4, 2024

- add buffering skip counter
- jitter buffer reset when too many missed frames
- jb disable when facing too many consecutive missing frames
- fix check_jb_size while expanding
- disable non elactic skip due to buffering
- another check size roll-over issue
- reset when expanding above the maximum size
- buffer size management
- proper handling of seq roll-over and fix seq decrement
- force accelerate when above the max size
@signalwire-ci
Copy link

signalwire-ci bot commented Jan 11, 2024

@signalwire-ci
Copy link

signalwire-ci bot commented Jan 11, 2024

@signalwire-ci
Copy link

signalwire-ci bot commented Feb 14, 2024

@signalwire-ci
Copy link

signalwire-ci bot commented Feb 14, 2024

@signalwire-ci
Copy link

signalwire-ci bot commented Feb 14, 2024

@signalwire-ci
Copy link

signalwire-ci bot commented Feb 14, 2024

@signalwire-ci
Copy link

signalwire-ci bot commented Feb 14, 2024

@signalwire-ci
Copy link

signalwire-ci bot commented Feb 14, 2024

@signalwire-ci
Copy link

signalwire-ci bot commented Feb 14, 2024

@signalwire-ci
Copy link

signalwire-ci bot commented Feb 14, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants