Debugging a Non Reproducible Crash

Authored by blog.bam.tech and submitted by almouro
image for Debugging a Non Reproducible Crash

October 10th 2018, our team release a new version of our React Native app. We're happy and proud to deliver new features to our users.

A few hours after the release, we're suddenly receiving an increased number of Android crashes.

Our crash reporting tool, Sentry is on fire!!

All the new errors are of the type JSApplicationIllegalArgumentException Error while updating property 'left' in shadow node of type: RCTView" .

In React Native, this can usually happen if you set a property with a wrong type. But how come we didn't catch this error when testing the app? Our new releases are well tested by every member of the team on a panel of multiple devices.

Also the errors seem quite random, they seem to happen for any combination of property and type of shadow node. These are the first 3 that were occurring for instance:

Error while updating property 'paddingTop' in shadow node of type: RCTView

Error while updating property 'height' in shadow node of type: RCTImageView

Error while updating property 'fill' of a view managed by: RNSVGPath

It also seems to happen on any device and any Android version, based on the Sentry report

A majority of Android 8.0.0 crashes but this is consistent with our user base

So the first step when you want to fix a bug is reproducing, right? Fortunately, thanks to our Sentry logs, we know what the users are doing before triggering the crash.

Welp, for the vast majority of crashes, users are just opening the app and kaboum, the crash happens.

Okay, let's try to replicate then. With the store app installed on 6 different Android devices, let's try to reproduce by opening and quitting the app a few times. No luck crashing the app! All the more impossible to reproduce locally in dev mode.

All right, this seems pointless. The crashes seem quite random anyway. The crash rate is raising at about 10%. Seems like you basically have 1 chance out of 10 to have the app crash on you when starting it.

To be able to reproduce this crash, let's try to understand it, where it comes from, how it thinks...

Well the thing is, as mentioned before, we have a few different errors. And they all have similar yet not entirely similar stacktraces.

Ok then, let's take the first one and analyse:

java.lang.ArrayIndexOutOfBoundsException: length=10; index=-1 at android.support.v4.util.Pools$SimplePool.release(Pools.java:116) at com.facebook.react.bridge.DynamicFromMap.recycle(DynamicFromMap.java:40) at com.facebook.react.uimanager.LayoutShadowNode.setHeight(LayoutShadowNode.java:168) at java.lang.reflect.Method.invoke(Method.java) ... java.lang.reflect.InvocationTargetException: null at java.lang.reflect.Method.invoke(Method.java) ... com.facebook.react.bridge.JSApplicationIllegalArgumentException: Error while updating property 'height' in shadow node of type: RNSVGSvgView at com.facebook.react.uimanager.ViewManagersPropertyCache$PropSetter.updateShadowNodeProp(ViewManagersPropertyCache.java:113) ...

Ok this is where it's happening then, in android/support/v4/util/Pools.java .

Hmm, we're pretty deep into the Android support library, not sure we can deduce much from this at this point.

Another way to find the root cause of the bug is to check the new changes we made in this release. Especially changes that would affect native Android code. 2 hypotheses emerge:

we upgraded Native Navigation, a navigation solution that used native fragments on Android for each screen

we upgrade react-native-svg. There were a few exceptions related to SVG components but some had no relation with it, so maybe not a good bet

At this point since we cannot reproduce the bug, our best bet is:

downgrading one of the 2 libs roll out to 10% users, since it's trivial to do so with the Play Store Check back in a few users if the crash happens in the new release. That way we validate or invalidate our hypothesis.

But how to choose which library to downgrade?

One solution to decide is playing "head or tail", but do we really want to resolve to that?

Getting to the bottom of this

Ok, let's dig deeper in the stacktrace we had previously picked, to see if we can determine which library to select.

/** * Simple (non-synchronized) pool of objects. * * @param The pooled type. */ public static class SimplePool implements Pool { private final Object[] mPool; private int mPoolSize; ... @Override public boolean release(T instance) { if (isInPool(instance)) { throw new IllegalStateException("Already in the pool!"); } if (mPoolSize < mPool.length) { mPool[mPoolSize] = instance; mPoolSize++; return true; } return false; }

This above was where the crash was occurring. The error was java.lang.ArrayIndexOutOfBoundsException: length=10; index=-1 meaning mPool is an array of size 10, but mPoolSize=-1 .

All right, how can get there with mPoolSize=-1 ? In addition to the recycle method above, the only place where mPoolSize is modified is in the acquire method of the SimplePool class:

public T acquire() { if (mPoolSize > 0) { final int lastPooledIndex = mPoolSize - 1; T instance = (T) mPool[lastPooledIndex]; mPool[lastPooledIndex] = null; mPoolSize--; return instance; } return null; }

So the only way mPoolSize can be -1 is if hitting the line mPoolSize-- while mPoolSize=0 . But how is it possible with the condition above mPoolSize > 0 ?

Let's put a breakpoint in Android Studio and check what's going on when we start the app. I mean, there was an if condition, this piece of code could not malfunction!

See, DynamicFromMap has a static reference to a SimplePool

private static final Pools.SimplePool<DynamicFromMap> sPool = new Pools.SimplePool<>(10);

After a few dozen hits of the play button, with carefully placed breakpoints, we can see that SimplePool.acquire and SimplePool.release are called on the mqt_native_modules threads by React Native to manage style properties of React component (below the width of a component)

But they are also called on the main thread!

Above we can see that it is used to update a fill prop on the main thread, typically for a react-native-svg component! Indeed, it turns out react-native-svg was using DynamicFromMap only since version 7 of the lib to improve the performance of native svg animations.

Aaaand the function can actually be called from 2 threads but DynamicFromMap is not using SimplePool in a thread safe manner. "Thread safe", you say?

So JavaScript is single threaded, so JavaScript developers usually have no need to deal with thread safety.

Java, on the other hand, supports the concept of concurrent, or multi-threaded, programs. Several threads can run within a single program and can potentially access a common data structure, which can lead to unexpected results.

Let's take a simple example, in the image below, both Thread A and Thread B both:

read the integer into their memory

Thread B could potentially access the data value before Thread A was done updating it. We expected two separate increments, to result in a final value of 19 . What we potentially get instead, was 18 . Such a scenario, where the final state of data depends on the relative order of thread operations, is called a race condition. The problem with race conditions is that they don't necessarily happen all the time. It's possible in the case above that Thread B had more work to do before incrementing the value, giving enough time to Thread A to update the value. That explains the randomness and the impossibility to reproduce our crash.

A data structure is said to be thread safe, if operations can be done concurrently by many threads, without the risk of race conditions.

When one thread is performing read on a particular data element, no other thread should be allowed to modify or delete this element (this is called atomicity). In our earlier example, if the update cycles were atomic, the race condition would have been avoided. Thread B would have waited for Thread A to complete its increment, and then it would have gone ahead.

In our case, here's what could happen:

Since DynamicFromMap is holding a static reference to the SimplePool , multiple DynamicFromMap called from different threads can concurrently call the acquire method in SimplePool .

In the image above, Thread A is calling the method, evaluating the condition to be true, but hasn't yet decreased the value of mPoolSize (which is shared with Thread B) while Thread B is concurrently calling the method and already evaluating the condition which is then true also. Then each separate call would subsequently decrease the value of mPoolSize and that's how you get an impossible value.

Looking into how to fix it, we found a PR on react-native that was not merged making it thread safe.

We then deployed a patched version of react native rolling it out to our users. The crash was finally fixed, hooray!

Finally, thanks to the help of Janic Duplessis (React Native core contributor), and Mikael Sand ( react-native-svg maintainer), the fix will be included in the next 0.57 minor for React Native

Betsy-DevOps on December 5th, 2018 at 16:06 UTC »

Another strategy here, when you've narrowed it down to one of two library upgrades.... immediately go to each library's github and look for other people reporting the same exception being thrown. In this case, there was already an open PR that fixed their bug.

A good tool for github would be something that automatically compares your stack trace with open PRs to see if one of them touches a line in your stack. Maybe if I get the time over Christmas vacation I'll try that.

tansim on December 5th, 2018 at 15:16 UTC »

tldr thread-safety

rebel_cdn on December 5th, 2018 at 14:56 UTC »

Sounds to me like it was reproduced about 20,000 times. 😊