STUN, TURN and ICE….Oh Boy!!

STUN_Picture1

For my first blog post at CoolHarbour, I wanted to write something technical, but I was struggling to find a topic meaningful enough to share with everyone out there. Then, out of the blue, I found myself debugging an issue for one of our customers that had me so hooked that I feel the urge to share my findings with you all. By the way, I hope you like the STUNning picture of TURNing star trails and ICE-covered hills at night. Hmmmm…..Anyhow!!

Background

At CoolHarbour, we had recently installed a Lync 2013 environment for one of our customers, purely to integrate with the Avaya Scopia Video Gateway for Microsoft Lync. It’s a pretty cool product, which allows Lync/Skype for Business (SFB) clients to join Scopia meeting rooms natively from their client without the need for the Scopia desktop software. The Lync environment we set up especially for this had a SIP domain of scopialync.com (not really, but for the purposes of this blog let’s just say it was). Our solution, with dummy IP addresses, looks like the one below.

STUN_Figure 1

Figure 1

 

Pretty cool, right? Well..I think it is anyways. What was really cool about this setup was that, with open federation as well as federation with Skype for Desktop, even consumers with a Skype for Desktop client could make a video call into the Virtual Meeting Room (VMR). So all was good and, despite the usual back and forth trying to decipher the configuration guides (let’s face it, the configuration guides for most major enterprise products can be a little difficult to chew sometimes), we managed to get the integration working over TLS and woohoo…job done…all is good. Wrong!!!

 

The Problem

Well, the problems started early on when we tried to do a test with a rather generous third party who offered to federate directly with us (man, trying to find an open federated partner is harder than you think). It turned out that whenever this customer made a video call to the Virtual Meeting Room (VMR), the call connected (with no audio or video) but after 10 seconds (exactly 10 seconds each time) it would fail with an error on their Lync client saying, “The call could not be completed because of Network issues”.

Now this was particularly tricky to debug, as all we had were the error logs on the Scopia gateway complaining of ICE errors, and there is no actual client logged in on our Lync environment (the Scopia gateway registers as the VMR user in Lync, pushes presence data and takes the call when it’s made).

There was no way around it: I was going to have to roll my sleeves up and get to grips with STUN, TURN and ICE if we were ever going to solve this. The problem was, what the hell are STUN, TURN and ICE, and how do you decipher the bloomin’ logs? To be clear, I am certified in Lync, having passed the 70-336/7 exams and attended the training course to boot. However, I felt that the course glossed over the section on STUN, TURN and ICE a little, maybe because it’s something people find hard to explain, let alone understand in the first place. It just so happens that the guys at Microsoft have some excellent videos on these protocols as part of their TechEd series. My favourite was a rather old one from Bryan Nyce called “Lync Deep Dive: Edge Media Connectivity with ICE”, though the one from Thomas Binder called “ICE – Edge Media Connectivity in Lync 2013” was equally helpful. I strongly recommend you set aside the time to go through these, as I doubt anyone could do this topic justice like these gentlemen can.

 

The Basics

You have to understand that STUN, TURN and ICE are not one protocol but three protocols that work together to establish the audio and video streams in Lync/SFB. To be honest, it’s better to think of ICE as the magic wrapper that uses STUN and TURN to make the connection happen.

Why do we even need these three complex proprietary protocols in the first place? Why can’t we just connect and talk like a good software client should? Well, again, it’s not that complex once you understand what it’s doing, and it’s not proprietary to Microsoft either. These are in fact all IETF standard protocols used by all vendors who have products that relay media over the internet.

The problem these protocols try and solve is this. In the world of the internet, when we try and send media packets between clients, we have to navigate the jungle of home routers which NAT the machines behind them and corporate firewalls that tell you to get lost the moment you try and access a client sitting behind their corporate network. Now Lync and SFB are designed to allow collaboration not just between people within the office but remote workers within your organisation or even your federated partners whom you wish to work closely with. This is where the fun begins and this is exactly the problem these protocols are there to solve.

ICE

This stands for Interactive Connectivity Establishment. All Lync/SFB clients are ICE clients and use ICE to try and establish connectivity with another ICE client. In the Lync/SFB world many endpoints and servers can be viewed as ICE clients, and you will get a feel for that once you watch the videos I mentioned above. The Edge server is not an ICE client: it doesn’t terminate any media sessions and only does STUN and TURN.

STUN

The new name for this acronym is Session Traversal Utilities for NAT. This protocol allows a client behind a NAT router or a firewall to discover the public IP addresses and ports available to it, so that connectivity can be established. Discover is the key word here: once found, these IP addresses and ports are sent as potential candidates to the other party.
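To make the "discover" step a little more concrete, here is a minimal sketch of what a plain STUN exchange looks like on the wire. It follows the generic IETF layout from RFC 5389 (a 20-byte header with a fixed magic cookie, and an XOR-MAPPED-ADDRESS attribute in the reply). Treat it as an illustration only: Lync/SFB uses Microsoft's own profile of these protocols, so the exact message and attribute types on a real Edge server may differ, and the function names here are my own.

```python
import socket
import struct

MAGIC_COOKIE = 0x2112A442      # fixed value defined by RFC 5389
BINDING_REQUEST = 0x0001       # "tell me what address you see me coming from"
XOR_MAPPED_ADDRESS = 0x0020    # attribute carrying the discovered address


def build_binding_request(transaction_id: bytes) -> bytes:
    """Return a 20-byte STUN Binding Request with no attributes."""
    if len(transaction_id) != 12:
        raise ValueError("transaction ID must be 12 bytes")
    # message type (2 bytes), body length (2 bytes), magic cookie (4 bytes), then the ID
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def parse_xor_mapped_address(message: bytes):
    """Return (ip, port) from the first IPv4 XOR-MAPPED-ADDRESS attribute, or None."""
    body_length = struct.unpack("!H", message[2:4])[0]
    offset, end = 20, 20 + body_length
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", message[offset:offset + 4])
        value = message[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == 0x01:
            # Port and address are XORed with the magic cookie to survive NAT rewriting
            port = struct.unpack("!H", value[2:4])[0] ^ (MAGIC_COOKIE >> 16)
            raw_ip = struct.unpack("!I", value[4:8])[0] ^ MAGIC_COOKIE
            return socket.inet_ntoa(struct.pack("!I", raw_ip)), port
        offset += 4 + attr_len + (-attr_len % 4)   # attributes are padded to 4 bytes
    return None
```

The address that comes back is the client's address as seen from outside the NAT, which is exactly what ends up in the candidate list sent to the other party.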

TURN

This stands for Traversal Using Relays around NAT. This is when a server in the chain volunteers as the go-between for you, the external client, and the person you are calling, who is behind a company network. The Edge server acts as a TURN server by offering a few UDP and TCP ports on its public interface and promising to pass any media packets you send to these ports over to the client using its private interface. It makes the same offer to the internal client trying to send packets out of the network.

 

The Call Scenario

In our problem scenario we had our third party Lync endpoint making a video call to our VMR. The third party’s client was sitting in their corporate network and our VMR was equally sitting behind our own corporate infrastructure. The call had to use both parties’ Edge servers as intermediaries. A quick diagram of the call, with dummy reference IP addresses, is below.

STUN_Figure 2

Figure 2.

This diagram, with its firewalls and infrastructure components, should be familiar to all you Lync/SFB savvy folks out there, so I will refrain from going into any detail about these. For the purposes of this blog, the test customer’s SIP domain was clientdomain.com. I have deliberately removed the Front End Servers in both environments: even though they are in the signalling flow, they are not necessary for our debugging scenario.

I must admit, I didn’t find a solution to my problem straight away. We did a lot of digging in Lync server logs, Scopia gateway logs, firewall rules and traces, and the old tried and tested Wireshark traces. It’s worth mentioning at this point that all the firewall rules were correctly set up in our implementation (come on, I took care of that), as that is where people mostly turn when things don’t work in Lync/SFB. Anyways, after plenty of head scratching I decided to take the so-called deep dive into the world of ICE connectivity. After watching the videos, I asked the customer to send us their end user’s UCCAPI logs.

If you turn on logging in Lync (Google it, it’s pretty straightforward), the logs start appearing on the client machine under ‘C:\Users\<user login ID>\AppData\Local\Microsoft\Office\15.0\Lync\Tracing’. The file you need for debugging is the one with the .UccApilog extension. From here, Snooper is your friend. It is part of the Lync 2013 debugging tools, which can be found at https://www.microsoft.com/en-gb/download/details.aspx?id=35453. Once installed, you can find the Snooper tool under ‘C:\Program Files\Microsoft Lync Server 2013\Debugging Tools’. This tool is pretty awesome. The more you play with it, the easier it gets to understand the message flows in various call scenarios.
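Snooper is the right tool for reading these logs, but if you just want a quick look at the ICE candidates without opening it, something like the sketch below can pull them out. It rests on two assumptions of mine that you should check against your own files: that the log decodes as plain text, and that the SDP lines appear in it verbatim, each starting with a=candidate:. The function names are made up for this example.

```python
from pathlib import Path


def find_uccapi_logs(tracing_dir):
    """Return the .UccApilog files in a tracing folder, newest first."""
    logs = [p for p in Path(tracing_dir).iterdir()
            if p.is_file() and p.suffix.lower() == ".uccapilog"]
    return sorted(logs, key=lambda p: p.stat().st_mtime, reverse=True)


def candidate_lines(log_path):
    """Yield every SDP 'a=candidate:' line found in the log, stripped of whitespace."""
    # errors="replace" keeps the scan going if the file holds odd bytes
    with open(log_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith("a=candidate:"):
                yield line
```

On a real machine, tracing_dir would be the Tracing folder mentioned above, and you would loop over candidate_lines(find_uccapi_logs(tracing_dir)[0]) to print what the client offered.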

Back to the issue at hand. One thing that you will learn about establishing a call in Lync/SFB is that there are 4 phases that a call goes through before it is established completely.

Candidate Discovery: Where the clients discover their available public IP addresses for connectivity. These include both STUN and TURN addresses courtesy of the Edge server.

Candidate Exchange: Where the clients send each other the list of addresses on which they can be reached. This happens both ways.

Connectivity Checks: This is where both clients try all these addresses in parallel (not one by one), trying to establish a connection. At this point the Lync/SFB client is like…heck, I will take any of these routes and start working with whichever one responds first.

Candidate Promotion: This is the final stage, which usually happens around 10 seconds after the call is up and running. This is where the clients, after evaluating all their options, have a change of heart, realise that there is a better and quicker path to each other, and change the route they use accordingly.
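The "try everything in parallel and take whichever answers first" idea from the Connectivity Checks phase can be pictured with a small toy simulation. To be clear, nothing here talks to a network or reflects how the Lync client is actually written: the candidate names, delays and reachable flags are invented, and asyncio.sleep simply stands in for network latency.

```python
import asyncio


async def check(candidate, delay, reachable):
    """Pretend to probe one candidate; 'delay' stands in for network latency."""
    await asyncio.sleep(delay)
    if not reachable:
        raise ConnectionError(candidate)
    return candidate


async def first_working(probes):
    """Start every probe at once and return the first candidate that answers."""
    pending = {asyncio.create_task(p) for p in probes}
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                if task.exception() is None:
                    return task.result()
        return None                      # every candidate failed
    finally:
        for task in pending:             # stop probing once a path is found
            task.cancel()
```

For example, asyncio.run(first_working([check("host", 0.01, False), check("relay-udp", 0.05, True), check("relay-tcp", 0.2, True)])) gives back "relay-udp": the host path fails quickly, and the UDP relay answers before the TCP one.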

Another thing I want to point out here is that Lync/SFB likes to use UDP as opposed to TCP when it comes to establishing a media connection. Don’t get me wrong, it can use TCP and does indeed offer TCP ports as possible routing options to establish the call. However, given the lower overhead of UDP traffic, it is the preference for media between clients in Lync/SFB.

 

Snooping Around

So what did the UCCAPI log tell me? Turns out it told me exactly what was going wrong. Let’s start with a trace for the first 2 phases, Candidate Discovery and Exchange. Our customer’s test user is simon.pegg@clientdomain.com. Let’s see what that looks like in a snooper message trace.

STUN_Figure 3

Figure 3.

I have put red boxes around the relevant bits. If you start from the top, you will see that I clicked the Messages tab and then put invite in the search box. This brings up all SIP signalling messages which have the word Invite in them. It turns out the top line in the signalling trace is the call INVITE from our customer Simon Pegg to our Scopia Gateway virtual meeting room user. If you click on that line, the whole SIP message is displayed for review in the bottom section in white, again highlighted in a red box.

Now, at the time of sending this INVITE, the customer’s test user client has already done its candidate discovery and is now attempting to perform a candidate exchange. The INVITE for the call includes the calling party’s candidate data, which can be seen as part of the SIP trace.

If you scroll down the SIP trace you will come across a candidate list which is not the one used by Lync but by Office Communications Server 2007. You can tell it’s for OCS 2007 based on the Content-Disposition text of ms-proxy-2007fallback. This is done for backwards compatibility and should be ignored when testing against Lync/SfB environments. The screenshot below shows you what you should be looking for.

STUN_Figure 4

Figure 4.

Keep scrolling down till you get to another section like the previous one. This second section is the one for Lync / SfB.

STUN_Figure 5

Figure 5.

If you read the exchange in this section you will notice that 2 sets of candidates are provided. One is for audio and the other is for video. Use the one that you are interested in debugging. A sample of the video candidate list is below.

STUN_Figure 6

Figure 6.

The scopialync environment responds with a 200 OK and its own list of candidates.

STUN_Figure 7

Figure 7.

At this point, candidate exchange is complete, and both clients are busy working through all the candidates in each other’s lists in order to establish an audio and video RTP stream. Alas, this is where our test fails, since after 10 seconds we receive a BYE from the scopialync environment which says game over, better luck next time. I’m sorry to say that I won’t be able to show a candidate promotion as we never get that far in our tests, but trust me, it happens after 10 to 15 seconds.

So what went wrong? Well, the clue was actually not in the candidate list sent out by our scopialync environment but in the one provided by the customer. Let me take some time to explain the list and how to decipher it. To save you having to refer to the diagram above, the table below has the relevant IP addresses you need to keep track of.

 

IP Address Table

Figure 8.

STUN_Figure 8

Figure 9.

A:              This is the number for this candidate pair. All ICE candidates are sent in pairs. This just tells us that both the lines belong to the same pair.

B:              Again, a number distinguishing the first and second candidate in a pair.

C:              This tells us that this candidate pair is for UDP. UDP is the preferred protocol for media and its preference is indicated in the next line.

D:             This large number indicates the priority of this candidate pair. The higher the number, the more Lync/SfB will give preference to it.

E:              The IP address of the ICE candidate which happens to be the actual client PC IP address. Yup, Lync/SfB is saying, if you can reach me on any of these ports, I will be waiting and ready to send and receive.

F:              These are ports on which Lync/SfB is listening and receiving. Why two ports? Well for UDP there are two ports specified because the first line is for RTP and the second line is for RTCP.

G:             This entry of ‘typ host’ reinforces that this candidate pair is for an actual client endpoint. A host … direct. Sort of ties in with the point I made for ‘E’.
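If it helps to see fields A to G as data, below is a small sketch that splits one candidate line into those parts. The sample line and its 10.0.0.5 address are invented for illustration, not taken from the customer trace. The priority breakdown follows the standard ICE formula from RFC 5245 (type preference, local preference, and 256 minus the component number); I am assuming the numbers in your trace follow that formula, so check the result against a real line before relying on it.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    foundation: str    # 'A': ties the RTP and RTCP lines of one pair together
    component: int     # 'B': 1 = RTP, 2 = RTCP
    transport: str     # 'C': UDP, TCP-PASS or TCP-ACT
    priority: int      # 'D': higher wins
    address: str       # 'E'
    port: int          # 'F'
    kind: str          # 'G': host, srflx (STUN) or relay (TURN), per RFC 5245


def parse_candidate(line: str) -> Candidate:
    """Parse the fixed leading fields of an SDP a=candidate line."""
    text = line.strip()
    if text.startswith("a=candidate:"):
        text = text[len("a=candidate:"):]
    fields = text.split()
    if len(fields) < 8 or fields[6] != "typ":
        raise ValueError(f"not a candidate line: {line!r}")
    return Candidate(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     fields[4], int(fields[5]), fields[7])


def split_priority(priority: int):
    """Split an RFC 5245 priority into (type pref, local pref, 256 - component)."""
    return priority >> 24, (priority >> 8) & 0xFFFF, priority & 0xFF


# Hypothetical host candidate, in the same shape as the figure above
sample = parse_candidate("a=candidate:1 1 UDP 2130706431 10.0.0.5 50000 typ host")
```

With that sample, split_priority(sample.priority) comes out as (126, 65535, 255), where 126 is the type preference RFC 5245 recommends for host candidates and 255 is 256 minus component 1.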

 

Ok .. so what about the rest…what do they mean. I’m getting there…

STUN_Figure 9

Figure 10.

A:              This indicates a TCP Passive pair for RTP and RTCP. Passive means it can receive on this pair

B:              This is the IP address of our Edge server so in a way at this point the Edge server is offering itself up as a TURN candidate

C:              The ports on which the Edge will be listening for media traffic. Notice that both the ports are the same (unlike for UDP). This is because both RTP and RTCP can be multiplexed over TCP.

D:             So what type of candidate pair is this? Well, it’s a ‘typ relay raddr’, which indicates this pair is a TURN pair. In your head, associate relay with TURN.

E:              This is the IP address which traffic will be forwarded to by the Edge server. Notice that it is the client IP address again. Also it’s a way of telling the other party that the packets received will appear to be originating from host client pc.

F:              Finally the relay port is the port to which the client will be listening to receive the data.

 

 

So with the above two pairs described, the rest is starting to look quite similar, apart from the bits below which you need to be aware of.

STUN_Figure 10

Figure 11.

A:              TCP-ACT indicates that with this candidate pair the client is able to send RTP and RTCP traffic

B:              This entry indicates that the candidate pair is a STUN pair

 

With all that said, I bet you are asking just one question…what was the problem? Well, you may have already spotted it. If you take a good look at the candidate pairs, a few things become obvious. Candidate pair 1 and 5 are out of the running as they point to the host PC directly, and since that is sitting behind a corporate firewall there is no chance of ever reaching it directly. Our hopes are with the candidate pair 2-5 which are TURN relay pairs via the Edge server. But wait a minute, that IP address looks suspect. Remember I told you the customer’s Edge server was NATed? So the question is, why am I seeing the non-NATed IP address of the Edge server’s external interface?
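The check I did by eye here can be written down as a few lines of code: look at every relay candidate and flag any whose advertised address sits in a private (RFC 1918) range, since nobody on the internet can reach that. This is only a sketch of the idea; the sample addresses in the usage note are made up, and a relay address that is public but simply wrong would not be caught.

```python
import ipaddress

# The three private IPv4 ranges from RFC 1918
PRIVATE_RANGES = [ipaddress.ip_network(n)
                  for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def suspicious_relays(candidate_lines):
    """Return relay candidates whose advertised address is in a private range."""
    flagged = []
    for line in candidate_lines:
        fields = line.strip().split()
        # Expected shape: a=candidate:<id> <comp> <transport> <prio> <ip> <port> typ <type> ...
        if len(fields) < 8 or fields[6] != "typ" or fields[7] != "relay":
            continue
        address = ipaddress.ip_address(fields[4])
        if any(address in network for network in PRIVATE_RANGES):
            flagged.append(line)
    return flagged
```

Fed a relay line such as "a=candidate:2 1 TCP-PASS 6556159 192.168.50.10 51000 typ relay raddr 10.0.0.5 rport 50002", it returns that line, while host candidates and relay candidates on public addresses are left alone.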

 

The Pay Off

Turns out our test customer was oblivious to the fact that they had messed up their Edge server configuration. If you have a Lync 2013 environment with the external Edge network interface sitting behind a NAT firewall, you need to make the following adjustment. Use Topology Builder and edit the Edge pool configuration. Scroll down to the External Access settings and click the check box for “A/V Edge Service is NAT enabled”.

STUN_Figure 11

Figure 12.

Then scroll down and enter the NAT public IP address in the section below.

STUN_Figure 12

Figure 13.

Save and publish the topology and that should take care of your problem. Do give it time to replicate to the Edge, and then the changes should kick in.

 

Final Thoughts

I guess in our case, as frustrating as it was to debug this issue, we found ourselves scoring a few brownie points because eventually when the dust settled we could take credit for 2 things.

1)     We proved that the issue was not with the Lync environment we had set up for Scopia and that it was working as designed.

2)     We were able to show our debugging skills to a point that allowed us to identify issues with other customer environments, and believe me, this was appreciated by the other party as well.

My only regret is that I didn’t find those awesome TechEd videos sooner. Could have been relaxing much earlier with a lot less stress. But hey…it was a pretty good learning experience which is what I ultimately wanted to share with everyone.

2 comments:

  1. Fabien

    Best article. Ever.

    Thank you sir.

    1. Wajahat Khan

      Thanks Fabien. it is an old blog post now but I am glad it is still of use to people out there.
