Chris Kranky

Amazon Web Services for WebRTC? Y/N

Chris Koehncke

Many folks looking to implement communication services ask whether they can host them in a PaaS environment like Amazon Web Services. The question evokes a lot of emotional opinions on both the yes and the no side. Often these emotions are based on anecdotal evidence, and ultimately “it depends” is the final answer.

I’m a partner in a media company, and our engineers set about trying to answer that question with some specific testing unique to our application (your own situation may vary). I’m gonna give you the non-technical, easy-to-consume answers we found.

Amazon Web Services (AWS) does not explicitly promise any QoS. They do offer hosted Flash services, which have time-sensitive elements (matching audio with a visual, for example), but this is a separate service and not part of the generic AWS offering. Why doesn’t AWS offer a communication-sensitive QoS? Maybe because they don’t see the market as large enough (read this in 2 years, perhaps, and that statement may be false).

There are two elements to consider in a communications exchange: media and signaling. Signaling, where a human is waiting, can take its merry time. A 1/4 second is eons in computer time; however, you and I don’t notice these brief delays (probably ’cause we’re too busy texting on our phones). We will, however, notice media delays of more than 250 msec (1/4 sec for those not on the metric system).

In our testing, AWS was more than acceptable for signaling in human communications applications (think voice, video, chat signaling of any sort). Hence, if you’re trying to run an IP PBX or node.js signaling element at AWS (or their competitors) you’ll be fine. Have a nice day.

Media is a different story, particularly if the media has to hairpin thru AWS. For a WebRTC P2P service there is no impact, since the media flows directly between the two endpoints. But what about when you want a WebRTC TURN service, where media (voice or video) has to go up to the server and back down to the other side, or some mixing application?

Our application test was purely for a voice application where the media would indeed hairpin thru the server. We set up a test rig using iperf, an open source TCP/UDP bandwidth tool. Is it the best? No, but it had the right price (free) and close enough is close enough. We fired up remote drones on other machines on the public Internet and started hammering away.
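For the curious, here is a rough sketch of what one G.711-shaped iperf stream could look like. The flags (-c, -u, -b, -l, -P, -t) are standard iperf client options, but the host name, the 20 ms packetization, and the packet sizing are assumptions for illustration, not the rig described above. The script only builds the command line; it does not run anything.

```python
# Sketch only: builds (does not run) an iperf client command whose UDP
# traffic is shaped roughly like G.711 voice. All sizing is an assumption.

G711_BITRATE_BPS = 64_000   # G.711 codec rate
PACKET_INTERVAL_MS = 20     # assumed: 20 ms of audio per packet
RTP_HEADER_BYTES = 12       # fixed RTP header, no extensions


def g711_datagram_bytes():
    """UDP payload size for one packet: audio bytes plus the RTP header."""
    audio_bytes = G711_BITRATE_BPS * PACKET_INTERVAL_MS // 8000  # 160 bytes
    return audio_bytes + RTP_HEADER_BYTES


def iperf_voice_command(host, streams, seconds):
    """Return the argument list for an iperf UDP client run."""
    size = g711_datagram_bytes()
    packets_per_second = 1000 // PACKET_INTERVAL_MS
    bits_per_second = size * 8 * packets_per_second
    return [
        "iperf",
        "-c", host,                   # client mode, target server
        "-u",                         # UDP rather than TCP
        "-b", str(bits_per_second),   # target rate in bits/sec
        "-l", str(size),              # datagram payload length in bytes
        "-P", str(streams),           # parallel streams from this drone
        "-t", str(seconds),           # test duration in seconds
    ]


if __name__ == "__main__":
    # "media-test.example.com" is a hypothetical host name.
    print(" ".join(iperf_voice_command("media-test.example.com", 50, 60)))
```

Raising the stream count step by step across several drones while watching the loss and jitter figures iperf reports is one way to hunt for a breaking point.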

What we were looking for, in simple terms, was a breaking point where the AWS servers would consistently have a media failure (meaning they started either losing or delaying packets such that a human would notice, metric or not).

I pause because our drone servers were located on the public Internet (though in a data center with GigE connections), and frankly the public Internet misbehaves quite often on its own. Like a thunderstorm, problems can appear and disappear with no warning and no explanation afterwards. While Amazon has numerous Internet peering points, they buy on price, not quality, thus the providers don’t necessarily give them the best quality of service. The short of it is, if we had failures, we couldn’t necessarily blame the cloud provider (AWS).

After much testing, we concluded that for our audio mixing needs Amazon worked just fine, if you selected one of the larger computing instances. We figured the reason was that in a larger instance, our application would take effective control of an Ethernet port of the “box” and not be subject to other tenants running on the same hardware (and potentially trying to use a lot of network time). We also consistently found, in round-the-clock hammering, that at ~400 simultaneous voice connections (G.711) we’d start to see consistently unacceptable performance.
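To put that number in network terms, here is a back-of-envelope sketch. G.711 runs at 64 kbit/s; the 20 ms packetization, the plain RTP/UDP/IPv4 headers, and the idea that each connection means one stream up and one stream back down are assumptions, so treat the result as a ballpark rather than a measurement.

```python
# Back-of-envelope load of N G.711 connections hairpinning through a server.
# Assumptions: 20 ms packets, RTP + UDP + IPv4 headers, no Ethernet framing.

CODEC_BPS = 64_000              # G.711 codec rate
PACKET_MS = 20                  # assumed packetization interval
OVERHEAD_BYTES = 12 + 8 + 20    # RTP + UDP + IPv4 header bytes


def stream_load(packet_ms=PACKET_MS):
    """Return (packets per second, bits per second) for one voice stream."""
    payload_bytes = CODEC_BPS * packet_ms // 8000
    pps = 1000 // packet_ms
    bps = (payload_bytes + OVERHEAD_BYTES) * 8 * pps
    return pps, bps


def server_load(connections):
    """Assumes each connection sends one stream up and gets one back down."""
    pps, bps = stream_load()
    return {
        "packets_per_second_each_way": connections * pps,
        "mbit_per_second_each_way": connections * bps / 1_000_000,
    }


if __name__ == "__main__":
    print(server_load(400))
```

Under those assumptions, 400 connections work out to about 32 Mbit/s and 20,000 packets per second in each direction.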

That’s fine, we were happy. What we wanted was an empirical number and we had one. 400 simultaneous audio connections on an AWS large computing instance.

Our existing application was running on dedicated hardware but with peaks and valleys in demand. A single server wasn’t even breaking a sweat with over 2,000 concurrent sessions, so going to cloud meant we’d lose about 80% of capacity per server (and thus need more instances, more if you’re on the metric system, or is it less, I can’t remember). Our desire, though, was to move to the cloud so we could shut down servers that weren’t needed in the off hours and spin up as demand increased. So elastic. I need to work on getting a more exciting life.
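The arithmetic behind that trade-off is simple enough to sketch. The two capacities are the figures quoted above (2,000 is treated as a floor, since the dedicated server handled "over" that); the headroom parameter and the peak load in the example are hypothetical knobs, not numbers from our testing.

```python
import math

DEDICATED_SESSIONS = 2000   # per dedicated server ("over 2,000", so a floor)
CLOUD_SESSIONS = 400        # per large AWS instance, from the test above


def capacity_lost(dedicated, cloud):
    """Fraction of per-server capacity given up by moving to the cloud."""
    return (dedicated - cloud) / dedicated


def instances_needed(peak_sessions, per_instance, headroom=0.0):
    """Instances required at peak; headroom is a hypothetical safety margin
    expressed as the fraction of each instance deliberately left unused."""
    usable = per_instance * (1 - headroom)
    return math.ceil(peak_sessions / usable)


if __name__ == "__main__":
    lost = capacity_lost(DEDICATED_SESSIONS, CLOUD_SESSIONS)
    print(f"capacity lost per server: {lost:.0%}")
    # Hypothetical peak of 2,000 sessions, with and without 25% headroom.
    print(instances_needed(2000, CLOUD_SESSIONS))
    print(instances_needed(2000, CLOUD_SESSIONS, headroom=0.25))
```

So matching one dedicated server at a 2,000-session peak takes at least five instances, and seven if you keep a quarter of each instance in reserve. The elastic upside only pays off if the valleys are deep enough to shut most of them down.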

Unfortunately, during our testing, we found that AWS instances would simply die, stop running or otherwise fail to respond (taking a little snooze). Our experience isn’t unique; in fact, Netflix has carefully explained what they face in running at AWS. Netflix built Chaos Monkey, which deliberately kills instances at random, to make sure an instance can disappear without you, the viewer, noticing. Thus we found that no single AWS instance is designed for hardened service (collectively it can be, but you need Netflix engineers working on it).

The summary we had (and again – your own mileage will vary) was:

On the $$$ front, the competition has driven down the costs for rack space and bare metal servers (not to mention server prices have stayed about the same but have more horsepower). AWS also meters bandwidth (which isn’t friendly for a media application). Thus we did not see using AWS as a no-brainer economic decision; in fact, it would probably cost more.

So for our real-time media application, a cloud service provider likely wasn’t an option. Having said that, we do use AWS for numerous other services, as I’m sure you do too.

As a start-up, using AWS to quickly get going makes absolute sense, and for early deployment it’s probably fine. The issue with media problems, though, is they’re often hard to replicate and track down (as they’re often temporary). Paying customers don’t want to be your debug tool.

Now that hits the very topic of QoS metrics. Operations people love this sort of stuff and love to monitor it as well. My question to them (who mostly don’t seem to like me very much) is “so what are you going to do about it?” They often give me that “I have a special place for you in my cellar” look. Unless something is consistently misbehaving, trying to get gremlins out is at best educated guessing.

My answer (and they eventually go along with this) is also guessing (but has served me well). Always put more horsepower in the servers than you think you need, have more Internet paths than you’d imagine, serve it all from different data centers, and eliminate as many components as possible between you and the end customer (reduce the miles as well).

Now if you’re running your little WebRTC app in the cloud and all is well, good for you. However, since I depend on customers paying me (I haven’t figured out Internet economics yet), I couldn’t take the risk with AWS. Time will probably change my opinion (who wants to run servers?).