3rd Party SDKs are a mainstay in our applications: These dependencies give us the ability to do so many different things, but they come at a cost.
Last Wednesday, we saw a multi-hour outage from Slack that caused "message failures and timeouts". The week before that, we saw another big company with an outage—Facebook—that caused more widespread issues among different applications that required log-ins through Facebook.
These events truly show how much our applications rely on so many different moving parts. If even one of these parts fails, so too might our app.
Slack might sit a little further from the mobile conversation, though its outage still affects mobile teams. Facebook hits much closer to home.
At Embrace, we monitor mobile apps, so we noticed immediately when our partners’ apps started experiencing the outages caused by the Facebook iOS SDK. It’s also important to note two things. Firstly, it’s not the first time this type of crash has been caused by the Facebook SDK. Last year, TechCrunch reported that users of multiple applications experienced crashes on startup or when attempting to access features implemented through the Facebook SDK. Secondly, Facebook isn’t always the one that causes this issue. Other SDKs can too.
So why’s this such a big problem? If any SDK can cause a crash and SDKs seem mandatory in the mobile space, why should we worry about an unsolvable problem? (Spoiler: It’s not unsolvable, though there’s no set solution that doesn’t come with disadvantages.)
Anyway, think about it in this sense: The Slack outage disrupted business functions because team members were no longer able to communicate with one another. This outage is fairly tangible for everyone in the organization: They log on to notify their coworker they finished some work or to request feedback from a supervisor and are instead greeted with an outage notification or inability to connect.
The consequences of the Facebook SDK issue might seem less tangible to someone not in the mobile space: Outages are inevitable, and the only effect is that users are unable to access an application for a short period of time. They can go without music or something minor for a few hours, right?
Wrong. Framing the issue as a minor outage fails to recognize the severity of the problem. A multi-hour outage for many eCommerce applications could result in a loss of millions of dollars. That’s a massive amount for a few hours and one failing dependency.
Even then, the mobile industry can count itself lucky that the issue could be resolved so quickly. Imagine if Facebook had to build a completely new SDK: that could take 3 days at best, and who knows how many in the worst case.
The goal should always be to improve application stability and minimize these kinds of crashes. It is even worse when an organization cannot predict an outage or inform its users in advance.
When we as a community use a 3rd Party SDK, we’re placing a lot of trust in that SDK to be stable and secure, and an unexpected crash interrupts business across a variety of verticals.
Next, this is an outage at a large company, and smaller developers have little to no control over it.
So, what are some of the approaches to avoiding or solving crashes caused by 3rd Party SDKs like the Facebook one?
Remove the SDK Entirely
One philosophy is to minimize our use of 3rd Party SDKs.
There are a few issues with this approach though:
According to the same TechCrunch article, users who utilize the Facebook log-in feature are less likely to delete their accounts and more likely to stay on the application.
Additionally, the Facebook SDK isn’t only used for log-ins. Companies could lose out on ad revenue if they are unable to implement the Facebook SDK.
Finally, the biggest drawback is that 3rd Party SDKs trim development time by a massive amount. They allow developers to do amazing things effectively out of the box. Losing access to these tools would be a huge detriment to any organization.
Comment out the Implementation
A similar but more temporary solution is to comment out the offending lines of code as soon as the crash is detected, while keeping the dependency in place so the calls can be restored once the crash is corrected. This would require a vigilant dev team that is responsive at all hours (because remember, this is an unpredictable outage) and a great alerting tool or crash reporting solution.
An obvious issue inherent to this solution is that users would need to update their applications for it to take effect. Given typical version adoption rates, this is a multi-day process, and it is only a reactive measure.
There’s another issue hidden here. Some developers reported that commenting out the Facebook implementations while keeping the dependency still resulted in crashes caused by the SDK. This means the SDK still does some work even after its calls are commented out, because +load methods and other constructor implementations run when the library is loaded, whether or not your code ever calls into it.
Again, when using 3rd Party SDKs, we must place a lot of trust in their stability and security.
Developers should do a static audit of all dependencies, checking for these entry points: constructors and +load methods. This will let you hold vendors accountable for methods you don’t approve of, or at least keep you aware of such implementations.
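As a rough illustration of one piece of such an audit, the sketch below flags +load methods in a list of symbol names. It assumes the names have already been extracted from the vendor binary with a symbol-listing tool and appear in the `+[ClassName load]` form that such tools commonly print for unstripped Objective-C code. The function names are hypothetical, and the check does not cover C or C++ constructors, which would need a separate pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true when `symbol` looks like an Objective-C +load method,
 * e.g. "+[VendorTracker load]" or
 * "+[UIViewController(VendorMonitoring) load]".
 * Assumption: the input is a single symbol name from a symbol listing. */
bool is_objc_load_symbol(const char *symbol) {
    assert(symbol != NULL);
    static const char prefix[] = "+[";
    static const char suffix[] = " load]";
    size_t len = strlen(symbol);
    size_t prefix_len = sizeof prefix - 1;
    size_t suffix_len = sizeof suffix - 1;

    /* Require at least one character of class name between the two. */
    if (len <= prefix_len + suffix_len) {
        return false;
    }
    return strncmp(symbol, prefix, prefix_len) == 0 &&
           strcmp(symbol + len - suffix_len, suffix) == 0;
}

/* Counts how many entries in `symbols` are +load methods. */
size_t count_load_symbols(const char *const *symbols, size_t count) {
    size_t hits = 0;
    for (size_t i = 0; i < count; i++) {
        if (is_objc_load_symbol(symbols[i])) {
            hits++;
        }
    }
    return hits;
}
```

A non-zero count for a vendor library is not a problem by itself, but it tells you which classes will run code at load time and are worth raising with the vendor.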
Sandbox Your Applications and SDKs
In the case that you’re forced to include an SDK, a last possibility we propose is to sandbox your applications and SDKs.
We can actually take advantage of the same methods SDK developers use. With these tools, we can regain control of our applications by remotely disabling a 3rd Party SDK without having to push a new build of the app.
The key is to remember that, as the application module, your code will always win and run first in Objective-C terms.
For example, say an SDK has a +load method on UIViewController for monitoring. If you include a category of your own, your category goes first. You’ll have the chance to swizzle against the vendor SDK: swizzle their call and wrap it with a configuration setting under your control. Now, if your config is on, you can let the vendor call through. If it’s off, the vendor code never runs and you’re safe.
By running this code in a generic module, you will have fully sandboxed the SDK.
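The wrap-and-gate idea can be sketched without the Objective-C runtime. The example below is a plain C analogue, not real swizzling: a function pointer stands in for the method table entry that a swizzle would rewrite, and a boolean stands in for the remotely controlled configuration setting. Every name in it (vendor_monitor, install_gate, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*hook_fn)(void);

/* Stand-in for the vendor's monitoring call (hypothetical). */
static int vendor_calls = 0;
static void vendor_monitor(void) { vendor_calls++; }

/* This slot plays the role of the method table entry that swizzling
 * would rewrite. It starts out pointing at the vendor code. */
static hook_fn view_did_load_hook = vendor_monitor;

/* Flag under the app's control. In practice it would be refreshed from
 * a remote configuration service, which is not shown here. */
static bool vendor_sdk_enabled = true;

/* Original implementation, saved by install_gate(). */
static hook_fn original_hook = NULL;

/* Wrapper that forwards to the vendor only while the flag is on. */
static void gated_hook(void) {
    if (vendor_sdk_enabled && original_hook != NULL) {
        original_hook();
    }
}

/* Swap the wrapper into the slot once, remembering the original.
 * Calling it again is a no-op, so the wrapper never wraps itself. */
void install_gate(void) {
    if (view_did_load_hook != gated_hook) {
        original_hook = view_did_load_hook;
        view_did_load_hook = gated_hook;
    }
}

void set_vendor_sdk_enabled(bool enabled) { vendor_sdk_enabled = enabled; }

/* Simulates the system invoking the hooked method. */
void trigger_view_did_load(void) {
    assert(view_did_load_hook != NULL);
    view_did_load_hook();
}

int vendor_call_count(void) { return vendor_calls; }
```

After install_gate() runs, calling set_vendor_sdk_enabled(false) stops the vendor code from executing on the next trigger, and turning the flag back on restores it, all without rebuilding the binary.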
For constructors, which run at load time (before your module), you can link your own module and ensure your module has a higher load command than the vendor.
Now, you have the chance to make changes to how the vendor’s module will load by intercepting dyld calls or running a thread to monitor their load process.
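To make the load-time behavior concrete, here is a small sketch of constructor functions using the GCC and Clang `constructor` attribute. It shows two things under stated assumptions: constructors run before main without any call from application code, and within a single binary a lower priority number runs earlier. Ordering across separately linked libraries is decided by the dynamic loader and is not modeled here. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_EVENTS 4

/* Log of which initializers ran, in order. */
static const char *load_events[MAX_EVENTS];
static size_t load_event_count = 0;

static void record(const char *name) {
    assert(load_event_count < MAX_EVENTS);
    load_events[load_event_count++] = name;
}

/* Hypothetical vendor initializer: runs at load time even though
 * nothing in the application calls it. */
__attribute__((constructor(200)))
static void vendor_sdk_init(void) { record("vendor"); }

/* Hypothetical app-side guard. Its lower priority number makes it run
 * before the vendor initializer within this binary. */
__attribute__((constructor(101)))
static void app_guard_init(void) { record("app_guard"); }

/* Returns the position at which `name` was recorded, or -1 if the
 * initializer with that name never ran. */
int load_position(const char *name) {
    assert(name != NULL);
    for (size_t i = 0; i < load_event_count; i++) {
        if (strcmp(load_events[i], name) == 0) {
            return (int)i;
        }
    }
    return -1;
}
```

By the time main starts, both initializers have already run, which is why commenting out calls to an SDK does not stop its load-time code.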
This outline of possible solutions is by no means final, and is really only meant to start a conversation about the security and stability issues that affect almost every mobile application on the market. As an industry, we need to be having these conversations not only internally but also with one another.