This is part of the Testing at scale series of articles where we asked industry experts to share their testing strategies. In this article, Ken Yee, Senior Engineer at Netflix, tells us about the challenges of testing a playback app at a huge scale and how they've evolved their testing strategy since the app was created 14 years ago!
Testing at Netflix continuously evolves. In order to fully understand where it's going and why it's in its current state, it's also important to know the historical context of where it has been.
The Android app was started 14 years ago. It was initially a hybrid application (native + webview), but it was converted over to a fully native app because of performance issues and the difficulty of creating a UI that felt/acted truly native. As with most older applications, it's in the process of being converted to Jetpack Compose. The current codebase is roughly 1M lines of Java/Kotlin code spread across 400+ modules and, like most older apps, there is also a monolith module because the original app was one big module. The app is handled by a team of roughly 50 people.
At one point, there was a dedicated mobile SDET (Software Development Engineer in Test) team that handled writing all device tests by following the usual flow of working with developers and product managers to understand the features they were testing and to create test plans for all their automation tests. At Netflix, SDETs were developers with a focus on testing; they wrote automation tests with Espresso or UIAutomator; they also built frameworks for testing and integrated third-party testing frameworks. Feature developers wrote unit tests and Robolectric tests for their own code. The dedicated SDET team was disbanded a few years ago and the automation tests are now owned by each of the feature subteams; there are still 2 supporting SDETs who help out the various teams as needed. QA (Quality Assurance) manually tests releases before they're uploaded as a final "smoke test".
In the media streaming world, one interesting challenge is the huge ecosystem of playback devices using the app. We like to support a good experience on low-memory/slow devices (e.g. Android Go devices) while providing a premium experience on higher-end devices. For foldables, some don't report a hinge sensor. We support devices back to Android 7.0 (API 24), but we're setting our minimum to Android 9 soon. Some manufacturer-specific versions of Android also have quirks. As a result, physical devices are a huge part of our testing.
As mentioned, feature developers now handle all aspects of testing their features. Our testing layers look like this:
However, because of our heavy usage of physical device testing and the legacy parts of the codebase, our testing pyramid looks more like an hourglass or inverted pyramid depending on which part of the code you're in. New features do have the more typical testing pyramid shape.
Our screenshot testing is also done at multiple levels: UI component, UI screen layout, and device integration screen layout. The first two are really unit tests because they don't make any network calls. The last is a replacement for most manual QA testing.
Unit tests are used to test business logic that isn't dependent on any specific device/UI behavior. In older parts of the app, we use RxJava for asynchronous code and the streams are tested. Newer parts of the app use Kotlin Flows and Composables for state flows, which are much easier to reason about and test compared to RxJava.
The frameworks we use for unit testing are:
- Strikt: for assertions, because it has a fluent API like AssertJ but is written for Kotlin
- Turbine: for the missing pieces in testing Kotlin Flows
- Mockito: for mocking any complex classes not relevant to the current unit of code being tested
- Hilt: for substituting test dependencies in our dependency injection graph
- Robolectric: for testing business logic that has to interact in some way with Android services/classes (e.g., Parcelables or Services)
- A/B test/feature flag framework: allows overriding an automation test for a specific A/B test or enabling/disabling a specific feature
Developers are encouraged to use plain unit tests before switching to Hilt or Robolectric because execution time goes up 10x with each step when going from plain unit tests -> Hilt -> Robolectric. Mockito also slows down builds when using inline mocks, so inline mocks are discouraged. Device tests are several orders of magnitude slower than any of these kinds of unit tests. Speed of testing is important in large codebases.
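To make that concrete, here is a minimal sketch of a plain JVM unit test in the style described above, using Strikt for assertions and Turbine for Flow testing; the `CounterViewModel` class is a made-up stand-in, not Netflix code.

```kotlin
import app.cash.turbine.test
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.test.runTest
import org.junit.Test
import strikt.api.expectThat
import strikt.assertions.isEqualTo

// Hypothetical class under test: exposes its state as a Kotlin Flow.
class CounterViewModel {
    private val _count = MutableStateFlow(0)
    val count: StateFlow<Int> = _count
    fun increment() { _count.value++ }
}

class CounterViewModelTest {
    @Test
    fun `increment emits updated count`() = runTest {
        val viewModel = CounterViewModel()

        viewModel.count.test {
            // Turbine collects the flow; Strikt provides the fluent assertions.
            expectThat(awaitItem()).isEqualTo(0)
            viewModel.increment()
            expectThat(awaitItem()).isEqualTo(1)
            cancelAndIgnoreRemainingEvents()
        }
    }
}
```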
Because unit tests are blocking in our CI pipeline, minimizing flakiness is extremely important. There are generally two causes of flakiness: leaving some state behind for the next test and testing asynchronous code.
JVM (Java Virtual Machine) unit test classes are created once and then the test methods in each class are called sequentially; instrumented tests, in comparison, are run from scratch and the only time you can save is APK installation. Because of this, if a test method leaves some modified global state behind in dependent classes, the next test method can fail. Global state can take many forms, including files on disk, databases on disk, and shared classes. Using dependency injection or recreating anything that's modified solves this issue.
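For example, recreating a shared dependency in a @Before block (rather than reusing one instance across test methods) keeps each test hermetic; the `InMemoryCache` class below is purely illustrative.

```kotlin
import org.junit.Before
import org.junit.Test
import strikt.api.expectThat
import strikt.assertions.isNull

// Hypothetical shared dependency that holds mutable state.
class InMemoryCache {
    private val entries = mutableMapOf<String, String>()
    fun put(key: String, value: String) { entries[key] = value }
    fun get(key: String): String? = entries[key]
}

class ProfileRepositoryTest {
    private lateinit var cache: InMemoryCache

    @Before
    fun setUp() {
        // A fresh instance per test method, so no state leaks between tests.
        cache = InMemoryCache()
    }

    @Test
    fun `cache starts empty`() {
        expectThat(cache.get("profile")).isNull()
    }
}
```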
With asynchronous code, flakiness can always happen as multiple threads change different things. Test Dispatchers (Kotlin Coroutines) or Test Schedulers (RxJava) can be used to control time in each thread to make things deterministic when testing a specific race condition. This makes the code less realistic and can possibly miss some test scenarios, but it prevents flakiness in the tests.
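For coroutine-based code, kotlinx-coroutines-test provides that control over virtual time; here is a rough sketch (the 5-second delay and the class name are illustrative):

```kotlin
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.test.advanceTimeBy
import kotlinx.coroutines.test.runTest
import org.junit.Test
import strikt.api.expectThat
import strikt.assertions.isEqualTo

@OptIn(ExperimentalCoroutinesApi::class)
class VirtualTimeTest {
    @Test
    fun `delayed work becomes deterministic with virtual time`() = runTest {
        var result = 0
        launch {
            delay(5_000) // would be a real 5-second wait outside of runTest
            result = 42
        }

        // Advance the test scheduler's virtual clock instead of sleeping.
        advanceTimeBy(5_001)
        expectThat(result).isEqualTo(42)
    }
}
```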
Screenshot testing frameworks are important because they test what's visible vs. testing behavior. As a result, they're the best replacement for manual QA testing of any screens that are static (animations are still difficult to test with most screenshot testing frameworks unless the framework can control time).
We use a variety of frameworks for screenshot testing:
- Paparazzi: for Compose UI components and screen layouts; network calls can't be made to download images, so you have to use static image resources or an image loader that draws a pattern for the requested images (we do both)
- Localization screenshot testing: captures screenshots of screens in the running app in all locales for our UX teams to verify manually
- Device screenshot testing: device testing used to test visual behavior of the running app
- Espresso accessibility testing: this is also a form of screenshot testing where the sizes/colors of various elements are checked for accessibility; it has also been somewhat of a pain point for us because our UX team has adopted the WCAG 44dp standard for minimum touch target size instead of Android's 48dp.
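As a rough sketch of what the component-level Paparazzi layer can look like (the composable content and device config below are just examples, not our actual components):

```kotlin
import androidx.compose.material.Text
import app.cash.paparazzi.DeviceConfig
import app.cash.paparazzi.Paparazzi
import org.junit.Rule
import org.junit.Test

class TitleCardScreenshotTest {
    @get:Rule
    val paparazzi = Paparazzi(deviceConfig = DeviceConfig.PIXEL_5)

    @Test
    fun `title card renders correctly`() {
        // Renders the composable on the JVM (no device or network)
        // and compares the result against a recorded golden image.
        paparazzi.snapshot {
            Text(text = "Coming Soon")
        }
    }
}
```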
Finally, we have device tests. As mentioned, these are orders of magnitude slower than tests that can run on the JVM. They're a replacement for manual QA and are used to smoke test the overall functionality of the app.
However, since running a fully working app in a test has external dependencies (backend, network infra, lab infra), the device tests will always be flaky in some way. This can't be emphasized enough: despite having retries, device automation tests will always be flaky over an extended period of time. Further below, we'll cover what we do to deal with some of this flakiness.
We use these frameworks for device testing:
- Espresso: the majority of device tests use Espresso, which is Android's main instrumentation testing framework for user interfaces
- PageObject test framework: internal screens are written as PageObjects that tests can control to ease migration from XML layouts to Compose (see below for more details)
- UIAutomator: a small "smoke test" set of tests uses UIAutomator to test the fully obfuscated binary that will get uploaded to the app store (a.k.a. Release Candidate tests)
- Performance testing framework: measures load times of various screens to check for any regressions
- Network capture/playback framework: allows playback of recorded API calls to reduce instability of device tests
- Backend mocking framework: tests can ask the backend to return specific results; for example, our home page has content that's completely driven by recommendation algorithms, so a test can't deterministically look for specific titles unless it asks the backend to return specific videos in specific states (e.g. "leaving soon") and specific rows filled with specific titles (e.g. a Coming Soon row with specific videos)
- A/B test/feature flag framework: allows overriding an automation test for a specific A/B test or enabling/disabling a specific feature
- Analytics testing framework: used to verify a sequence of analytics events from a set of screen actions; analytics are the most prone to breakage when screens are changed, so this is an important thing to test.
The PageObject design pattern started as a web pattern, but has been applied to mobile testing. It separates test code (e.g. click on the Play button) from screen-specific code (e.g. the mechanics of clicking on a button using Espresso). Because of this, it lets you abstract the test from the implementation (think interfaces vs. implementation when writing code). You can easily swap the implementation as needed when migrating from XML layouts to Jetpack Compose layouts, but the test itself (e.g. testing login) stays the same.
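A minimal sketch of the pattern might look like the following; the screen, matcher strings, and method names are hypothetical, and the Espresso calls are just one possible implementation behind the abstraction:

```kotlin
import androidx.test.espresso.Espresso.onView
import androidx.test.espresso.action.ViewActions.click
import androidx.test.espresso.action.ViewActions.typeText
import androidx.test.espresso.assertion.ViewAssertions.matches
import androidx.test.espresso.matcher.ViewMatchers.isDisplayed
import androidx.test.espresso.matcher.ViewMatchers.withHint
import androidx.test.espresso.matcher.ViewMatchers.withText

// Hypothetical PageObject: the test only calls intent-level methods,
// while the Espresso mechanics stay hidden inside the implementation.
class LoginPage {
    fun enterEmail(email: String) = apply {
        onView(withHint("Email")).perform(typeText(email))
    }

    fun tapSignIn() = apply {
        onView(withText("Sign In")).perform(click())
    }

    fun assertProfileSelectorShown() = apply {
        onView(withText("Who's watching?")).check(matches(isDisplayed()))
    }
}

// The test reads the same whether the screen is XML or Compose;
// only the PageObject implementation changes during a migration.
fun loginFlow() {
    LoginPage()
        .enterEmail("user@example.com")
        .tapSignIn()
        .assertProfileSelectorShown()
}
```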
In addition to using PageObjects to define an abstraction over screens, we have a concept of "Test Steps". A test consists of test steps. At the end of each step, our device lab infra automatically creates a screenshot. This gives developers a storyboard of screenshots that shows the progress of the test. When a test step fails, it's also clearly indicated (e.g., "could not click on Play button") because a test step has a "summary" and "error description" field.
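Our test-step infrastructure is internal, but conceptually a step wrapper could look something like this hypothetical sketch (all names here are invented, not the real API):

```kotlin
// Hypothetical sketch only: a test-step wrapper that records a summary,
// reports a clear error description when the step fails, and marks the
// point where the device lab would capture a storyboard screenshot.
fun testStep(summary: String, errorDescription: String, block: () -> Unit) {
    try {
        block()
    } catch (t: Throwable) {
        throw AssertionError("$summary failed: $errorDescription", t)
    } finally {
        // In the real infra, a screenshot is captured here at the end of the step.
    }
}

fun playbackSmokeTest() {
    testStep(summary = "Tap Play", errorDescription = "could not click on Play button") {
        // PageObject call would go here, e.g. a details page's tapPlay()
    }
}
```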
Netflix was probably one of the first companies to have a dedicated device testing lab; this was before third-party services like Firebase Test Lab were available. Our lab infrastructure has a lot of the features you'd expect:
- Target specific types of devices
- Capture video from running a test
- Capture screenshots while running a test
- Capture all logs
Interesting device tooling features that are uniquely Netflix:
- Cellular tower so we can test wifi vs. cellular connections; Netflix has its own physical cellular tower in the lab that the devices are configured to connect to
- Network conditioning so slow networks can be simulated
- Automated disabling of system updates to devices so they can be locked at a specific OS level
- Only uses raw adb commands to install/run tests (all this infrastructure predates frameworks like Gradle Managed Devices or Flank)
- Running a set of automated tests against an A/B test
- Test hardware/software for verifying that a device doesn't drop frames, so our partners can verify their devices support Netflix playback properly; we also have a qualification program for devices to make sure they support HDR and other codecs properly
If you're curious about more details, take a look at Netflix's tech blog.
As mentioned above, test flakiness is one of the hardest problems with inherently unstable device tests. Tooling has to be built to:
- Minimize flakiness
- Identify causes of flakes
- Notify teams that own the flaky tests
Tooling that we've built to address the flakiness:
- Automatically identifies the PR (Pull Request) batch that a test started to fail in and notifies the PR authors that they caused a test failure
- Tests can be marked stable/unstable/disabled instead of using @Ignore annotations; this is used to temporarily disable a subset of tests if there is a backend issue so that false positives are not reported on PRs (see the sketch after this list)
- Automation that figures out whether a test can be promoted to Stable by using spare device cycles to automatically evaluate test stability
- Automated IFTTT (If This Then That) rules for retrying tests, ignoring temporary failures, or repairing a device
- Failure reports let us easily filter failures according to which device maker, OS, or cage the device is in, e.g. how often a test fails over a period of time for these environmental factors
- Failure reports let us triage error history to identify the most common failure causes for a test, including screenshots
- Tests can be manually set up to run multiple times across devices, OS versions, or device types (phone/tablet) to reproduce flaky tests
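The stable/unstable/disabled marking mentioned above is also internal tooling, but the idea could be sketched as an annotation that CI reads to decide whether a failure should block a PR (all names here are invented):

```kotlin
// Hypothetical sketch: an annotation the CI could read to decide whether a
// failing test blocks the PR (STABLE), runs but only reports (UNSTABLE),
// or is skipped entirely (DISABLED). Netflix's real mechanism is internal.
enum class Stability { STABLE, UNSTABLE, DISABLED }

@Retention(AnnotationRetention.RUNTIME)
@Target(AnnotationTarget.CLASS, AnnotationTarget.FUNCTION)
annotation class TestStability(val value: Stability)

@TestStability(Stability.UNSTABLE)
class HomeRowDeviceTest {
    // Failures here would be reported to the owning team
    // but would not fail the PR, unlike tests marked STABLE.
}
```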
We have a typical PR (Pull Request) CI pipeline that runs unit tests (including Paparazzi and Robolectric tests), lint, ktlint, and Detekt. Running roughly 1000 device tests is part of the PR process. In a PR, a subset of smoke tests is also run against the fully obfuscated app that would be shipped to the app store (the other device tests run against a partially obfuscated app).
Additional device automation tests are run as part of our post-merge suite. Whenever batches of PRs are merged, additional coverage is provided by automation tests that can't be run on PRs because we try to keep the PR device automation suite under 30 minutes.
In addition, there are Daily and Weekly suites. These are used for much longer-running automation tests because we try to keep our post-merge suite under 120 minutes. Automation tests that go into these are typically long-running stress tests (e.g., can you watch a season of a series without the app running out of memory and crashing?).
In an ideal world, you have infinite resources to do all your testing. If you had infinite devices, you could run all your device tests in parallel. If you had infinite servers, you could run all your unit tests in parallel. If you had both, you could run everything on every PR. But in the real world, you have a balanced approach that runs "enough" tests on PRs, post-merge, etc. to prevent issues from getting out into the field, so your customers have a better experience while you also keep your teams productive.
Coverage on devices is a set of tradeoffs. On PRs, you want to maximize coverage but minimize time. On post-merge/Daily/Weekly, time is less important.
When testing on devices, we have a two-dimensional matrix of OS version vs. device type (phone/tablet). Layout issues are fairly common, so we always run tests on phone + tablet. We're still adding automation for foldables, but they have their own challenges, like being able to test layouts before/after/during the folding process.
On PRs, we usually run what we call a "narrow grid", which means a test can run on any OS version. On post-merge/Daily/Weekly, we run what we call a "full grid", which means a test runs on every OS version. The tradeoff is that an OS-specific failure may look like a flaky test on a PR and won't be detected until later.
Testing continuously evolves as you learn what works and as new technologies and frameworks become available. We're currently evaluating using emulators to speed up our PRs. We're also evaluating Roborazzi to reduce device-based screenshot testing; Roborazzi allows testing of interactions while Paparazzi doesn't. We're building up a modular "demo app" system that allows for feature-level testing instead of app-level testing. Improving app testing never ends…