Chris Kranky

Amazon Web Services for WebRTC? Y/N

Chris Koehncke

Many folks looking to implement communication services ask whether they can host them in a PaaS environment like Amazon Web Services. The question itself evokes a lot of emotional opinions on both the yes and no sides. Often these emotions are based on anecdotal evidence, and ultimately “it depends” is the final answer.

As a partner in a media company, our engineers set about answering that question with some specific testing unique to our application (your own situation may vary). I’m gonna give you the non-technical, easy-to-consume answers we found.

Amazon Web Services (AWS) does not explicitly promise any QoS; however, they do offer hosted Flash services, which have time-sensitive elements (matching audio with video, for example). That is a separate service, though, and not part of the generic AWS offering. Why doesn’t AWS offer communications-grade QoS? Maybe because they don’t see the market as large enough (read this in two years and that statement may be false).

There are two elements to consider in a communications exchange: media and signaling. Signaling, where a human is waiting, can take its merry time. A 1/4 second is eons in computer time; however, you and I don’t notice these brief delays (probably because we’re too busy texting on our phones). We will, however, notice media delays of more than 250 msec (1/4 second for those not on the metric system).
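That 250 msec media budget is easy to reason about as simple addition over the legs of the path. A minimal sketch (the per-leg delays below are made-up illustrative numbers, not measurements):

```python
# Illustrative only: 250 ms is the rough one-way media delay humans start
# to notice; the per-leg delays used below are invented example values.
MEDIA_BUDGET_MS = 250

def within_budget(leg_delays_ms):
    """Return (total, ok) for a media path made of per-leg one-way delays."""
    total = sum(leg_delays_ms)
    return total, total <= MEDIA_BUDGET_MS

# Example path: client -> ISP -> cloud region -> ISP -> client
total, ok = within_budget([20, 40, 60, 40, 20])
print(total, ok)  # 180 True: under budget, nobody notices
```

Signaling has no such ceiling, which is why it tolerates cloud jitter so much better than media does.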

In our testing, AWS was more than acceptable for signaling in human communications applications (think voice, video, or chat signaling of any sort). Hence, if you’re trying to run an IP PBX or a node.js signaling element at AWS (or its competitors), you’ll be fine. Have a nice day.

Media is a different story, particularly if the media has to hairpin through AWS. For a WebRTC P2P service, there is no impact on the service. But what about a WebRTC TURN service, where media (voice or video) has to go up to the server and back down to the other side, or some mixing application?

Our application test was purely for a voice application where the media would indeed hairpin through the server. We set up a test rig using iperf, an open-source TCP/UDP bandwidth tool. Is it the best? No, but it had the right price (free), and close enough is close enough. We fired up remote drones on other machines on the public Internet and started hammering away.
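To give a flavor of the rig: each drone ran iperf in UDP client mode at roughly one G.711 stream’s worth of bandwidth. A sketch of building that invocation (the hostname is a placeholder; flags are from iperf 2.x: `-u` for UDP, `-b` for target bandwidth, `-t` for duration, `-i` for report interval):

```python
# Sketch of the kind of iperf command a drone would run; not executed here.
def iperf_udp_cmd(server, bandwidth="87k", seconds=60, interval=1):
    """Build an iperf 2.x client command simulating one G.711-sized UDP stream."""
    return ["iperf", "-c", server, "-u",
            "-b", bandwidth,        # target bitrate, ~one G.711 stream w/ overhead
            "-t", str(seconds),     # test duration
            "-i", str(interval)]    # per-second loss/jitter reports

print(" ".join(iperf_udp_cmd("ec2-host.example.com")))
```

iperf’s UDP mode reports packet loss and jitter per interval, which is exactly the “would a human notice” signal we were hunting for.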

What we were looking for, in simple terms, was a breaking point where the AWS servers would consistently have a media failure (meaning they started losing or delaying packets enough that a human would notice, metric or not).

I pause because our drone servers were located on the public Internet (though in a data center with GigE connections), and frankly the public Internet misbehaves quite often on its own; like a thunderstorm, problems can appear and disappear with no warning and no explanation afterwards. While Amazon has numerous Internet peering points, they buy on price, not quality, so the providers don’t necessarily give them the best quality of service. The short of it is: if we had failures, we couldn’t necessarily blame the cloud provider (AWS).

After much testing, we concluded that for our audio mixing needs, Amazon worked just fine, provided you selected one of the larger computing instances. Our theory was that in a larger instance, our application took effective control of an Ethernet port on the “box” and wasn’t subject to other tenants running on the same hardware (and potentially trying to use a lot of network time). We also consistently found, in round-the-clock hammering, that at roughly 400 simultaneous voice connections (G.711) we’d start to see consistently unacceptable performance.
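The ~400 figure is plausible from back-of-envelope bandwidth math alone. A sketch, assuming standard 20 msec G.711 packetization and counting every hairpinned stream once in and once out of the server’s NIC (the overhead numbers are textbook RTP/UDP/IP values, not our measurements):

```python
# Back-of-envelope: why ~400 hairpinned G.711 calls stresses one instance.
PAYLOAD_BYTES = 160     # G.711: 64 kbps * 20 ms packetization
OVERHEAD_BYTES = 40     # RTP (12) + UDP (8) + IPv4 (20) headers
PACKETS_PER_SEC = 50    # 1000 ms / 20 ms

def g711_stream_kbps():
    """One direction of one G.711 stream, including IP-level overhead."""
    return (PAYLOAD_BYTES + OVERHEAD_BYTES) * 8 * PACKETS_PER_SEC / 1000

def hairpin_mbps(calls):
    # Each two-party call hairpins two streams through the server,
    # and each stream crosses the NIC inbound and again outbound.
    return calls * 2 * 2 * g711_stream_kbps() / 1000

print(g711_stream_kbps())  # 80.0 kbps per stream direction
print(hairpin_mbps(400))   # 128.0 Mbps through the NIC
```

128 Mbps of small, latency-sensitive UDP packets is well past what a shared virtual NIC handles gracefully, which fits the “grab the whole Ethernet port on a big instance” finding.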

That’s fine; we were happy. What we wanted was an empirical number, and we had one: 400 simultaneous audio connections on an AWS large computing instance.

Our existing application was running on dedicated hardware, but with peaks and valleys in demand. A single server wasn’t even breaking a sweat at over 2,000 concurrent sessions, so going to the cloud meant we’d lose 75% of capacity (and thus need more instances; more if you’re on the metric system, or is it less, I can’t remember). Our desire, though, was to move to the cloud so we could shut down servers that weren’t needed in the off hours and spin up as demand increased. So elastic. I need to work on getting a more exciting life.
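The elasticity math is straightforward once you have an empirical per-instance ceiling. A sketch, using our measured 400-call ceiling and an invented demand curve:

```python
import math

# Sketch of the capacity trade: one dedicated box handled ~2,000 concurrent
# sessions, but a large AWS instance topped out near 400 (our measurement).
# Elasticity means buying only what the current hour needs.
AWS_CEILING = 400  # concurrent G.711 calls per large instance

def instances_needed(concurrent_calls):
    """Minimum instance count to carry the load, never fewer than one."""
    return max(1, math.ceil(concurrent_calls / AWS_CEILING))

# Hypothetical daily demand curve (illustrative numbers only)
for hour, load in [("03:00", 120), ("10:00", 900), ("14:00", 2000)]:
    print(hour, instances_needed(load))
```

Overnight you’d run one instance instead of an idle 2,000-call box; at peak you’d need five, which is where the metered bandwidth bill starts to bite.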

Unfortunately, during our testing, we found that AWS instances would simply die, stop running, or otherwise fail to respond (taking a little snooze). Our experience isn’t unique; in fact, Netflix has carefully explained what they face running at AWS. Netflix implemented Chaos Monkey so that an instance can disappear without you, the viewer, noticing. Thus we found that no single AWS instance is designed for hardened service (collectively they can be, but you need Netflix engineers working on it).

The summary we had (and again – your own mileage will vary) was:

On the $$$ front, competition has driven down the cost of rack space and bare-metal servers (not to mention server prices have stayed about the same while horsepower has increased). AWS also meters bandwidth (which isn’t friendly for a media application). Thus we did not see a brain-dead economic case for using AWS; in fact, it would probably cost more.

So for our real-time media application, cloud service providers likely weren’t an option. Having said that, we do use AWS for numerous other services, as I’m sure you do too.

As a start-up, using AWS to get going quickly makes absolute sense, and for early deployment it’s probably fine. The issue with media problems, though, is that they’re often hard to replicate and track down (as they’re often temporary). Paying customers don’t want to be your debug tool.

Now that hits on the very topic of QoS metrics; operations people love this sort of stuff and love to monitor it as well. My question to them (they mostly don’t seem to like me very much) is “so what are you going to do about it?” They often give me that “I have a special place for you in my cellar” look. Unless something is consistently misbehaving, chasing out gremlins is at best educated guessing.

My answer (and they eventually go along with this) is also guessing, but it has served me well: always put more horsepower in the servers than you think you need, have more Internet paths than you’d imagine, serve it all from different data centers, and eliminate as many components as possible between you and the end customer (reduce the miles as well).
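The “more Internet paths” part implies your client has to actually try them. A minimal failover sketch (endpoint names are hypothetical; the connector is injected so the selection logic stays testable):

```python
# Sketch of "multiple paths, fail fast": walk the candidate media endpoints
# in order and move on at the first failure rather than retrying in place.
def pick_endpoint(endpoints, connect):
    """Return the first endpoint that accepts a connection, else None."""
    for ep in endpoints:
        try:
            connect(ep)
            return ep
        except OSError:
            continue  # one strike and it's out; try the next path
    return None

# Fake connector for illustration: only the second data center is up.
def fake_connect(ep):
    if ep != "dc2.example.net":
        raise OSError("unreachable")

print(pick_endpoint(["dc1.example.net", "dc2.example.net"], fake_connect))
```

In a real client the `connect` callable would be a short-timeout ICE/TURN allocation attempt; the point is that the fallback list exists before the failure does.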

Now if you’re running your little WebRTC app in the cloud and all is well, good for you. However, since I depend on customers paying me (I haven’t figured out Internet economics yet), I couldn’t take the risk with AWS. Time will probably change my opinion (who wants to run servers?).

 

Comments 8
  • sthustfo

    So what do you use for deploying your app for paying customers?


  • Chip Wilcox

    Great post Chris.

Temasys is often billed as the “AWS of WebRTC”. Moreover, we had the distinction of being on stage with Werner Vogels at AWS Re:Invent last November, as one of five global startups (and the only WebRTC one). We make no secret of the fact that our WebRTC platform Skylink is built on top of AWS’ cloud-hosted infrastructure.

    Is Temasys biased towards AWS? The answer is a very strong “Yes!”

    When we saw this post, we felt it might help for us to provide our perspective, since we can credibly claim to have at least as much experience working with AWS and WebRTC, together, as anyone else.

    Using AWS has several merits: Global presence, reliability/security, flexibility, and solid customer support. Moreover, AWS offers many different options for all types of instances and servers to run, and while many others rest on their laurels, Amazon is launching new services all the time.

AWS has also actively helped Temasys. From the start of our engagement with them, they have provided recommendations on how to leverage different components or AWS products more effectively, with the aim of improving performance and driving cost savings. Frankly, we have not gotten the same level of interest or support from other vendors we evaluated or tried to work with in the past. Lastly, AWS is extremely supportive of startups, which earns them a couple of +1’s from our perspective as well.

    All that’s great. But is AWS the best suited vendor to support WebRTC PaaS solutions?

This isn’t, as you point out, a “yes” or “no” question. And, rather than answer the question with “it depends,” we would advocate (as we always do) the “test it for yourself” approach. In doing so, whatever solution one is trying to provide, it’s important to tackle it with cloud deployment in mind from day one.

    Next question: Is AWS perfect? No. But, no one platform ever is. That’s why third parties, middleware, APIs and developers come in.

    To your credit, you’ve provided a fairly thorough assessment, which we agree with, by and large. For all of AWS’s merits, including the ones you highlighted, (signaling, RTC capabilities, auto-scaling), AWS is not without its flaws. With our day-to-day experience we have certainly encountered intermittent failures – and perhaps more often than we’d like. However, if your WebRTC platform is designed in a fashion suited for dedicated rack-mounted servers then you will encounter the same problems, frequently, as well.

    As you have so aptly put it, developers will need to take the “Netflix” approach to availability and have redundancies when crippled with failures, including:
    * Eliminate single points of failure by always having an alternate waiting. The need for redundancy doesn’t go away.
    * Expect failures and architect your services to self monitor and self heal. Again, a fully redundant architecture is needed, and it’s easier to create in the cloud than it ever was before.
    * Utilize queuing where it makes sense, and design for distributed processing
    * Give your application multiple paths to resources, rerouting the traffic immediately when issues reveal themselves
    * Don’t assume you have the liberty of taking the “Three strikes” approach: Fail once and you’re out! Take the endpoint out of production for inspection, and bring up a replacement immediately
    * Make sure that you are actively monitoring your instances and using appropriate types of instances for the task at hand. For example, a common problem is under spec’ing instances, and then hitting load-imposed throttling thresholds from being the noisy neighbor

    This is “motherhood and apple pie” talk for anyone working with any cloud-based platform, whether it’s Rackspace/OpenStack, Google Cloud, Azure, etc. They all behave in similar fashion and mitigation of risks is approached the same way with each.

    All that said, there’s one dimension to working with cloud vendors and AWS that your post did not touch on: The ability to offer multi-region deployments, redundancy within and across regions, and the reliability and performance that this enables.

    This is where cloud deployments have everyone else beat, hands down. In no universe we’ve visited is co-location in multiple data centers (with hot swappable slave hardware, reserve hardware waiting to come online, reserve routers, reserve switches, disk arrays, ongoing upgrades and maintenance contracts, 24/7 on-demand skilled staff, to maintain and monitor it all, along with trying to maintain a global presence) going to be as cost-effective as any one of the major cloud providers.

    In the cloud, solving less-than-perfect reliability and latency issues is an easier problem to overcome, if you want to be a global player. What it comes down to is approaching the problem with the right perspective.


  • Philippe Clavel

    Chip is really right: it depends on your needs. At Rabbit we have our own MCU for WebRTC and started out hosting it on AWS. Bandwidth was fairly expensive given our use case (not much peer-to-peer); the network was fine. We then moved to GCE to try it out, and same result: bandwidth becomes expensive. We are moving to SoftLayer, where we can use fewer servers for an overall cheaper price and almost no bandwidth cost.
    Our use case is very specific; if you do WebRTC traditionally and just use AWS for a TURN server, your costs will not skyrocket as quickly, since most of your traffic will be peer-to-peer…


  • Brian Fields

    @Chip Wilcox: Regarding the line “Unfortunately, during our testing, we found that AWS instances would simply die, stop running or otherwise fail to respond (taking a little snooze)”: we’ve noticed the “little snooze” issue when hosting a WebRTC forwarding server in AWS. The problem pops up often enough that it’s likely to disrupt a call lasting an hour. Do you notice this issue?


  • Chip Wilcox

    @Brian Fields: Thanks for the follow up question!

    As we’ve acknowledged already, there are problems, but with a little bit of time and effort we can always find a way to avoid them. The key is not to approach this on a case-by-case basis, but to look for a way to solve the problems systemically, so we reach a state where the impact falls way down in the noise and isn’t disruptive to the customer experience.

    With regard to your comment about “snoozing”, generally we have found these types of issues to be caused by over-saturation or throttling when using burstable instances, or sometimes from Node code problems blocking the call stack.

    These issues are caused by other factors and can be identified and resolved. In our own experience, we have had issues with bad EBS volumes and kernel driver incompatibilities when running Ubuntu instances. We have switched to running Amazon Linux on most instances these days. There hasn’t been a situation so far where we have not been able to find a solution to resolve or work around the issue on our own, or with the assistance of AWS architects and support services.

    Another factor, often overlooked, is that AWS is often a target for mischief like DoS attacks. The IP pool of AWS is known. Port scans and other probing threats are never-ending, so keeping your instances secured is key. Allowing only the traffic you want to hit your instances, and being able to handle the bad actors without failure, is just something you have to plan for.


  • benstokes

    Hi, nice article. Can you suggest a few considerations for running WebRTC on AWS?


  • Chris Koehncke

    Thanks Ben! Indeed, I just wrote a new article about Google Cloud Platform and the value of the underlying network. However, WebRTC’s natural tilt toward P2P means you don’t necessarily need a heavy-duty cloud provider (it depends, though, on your application).

