This is part 2 of the Testing at scale collection of articles where we asked industry experts to share their testing strategies. In this article, Ryan Harter, Staff Engineer at Dropbox, shares how the shape of Dropbox's testing pyramid changed over time, and what tools they use to get timely feedback.
With more than a billion downloads, the Dropbox app for Android has to maintain a high quality bar for a diverse set of use cases and users. With fewer than 30 Android engineers, manual testing and #yolo isn't enough to maintain confidence in our codebase, so we employ a variety of different testing strategies to ensure we can continually serve our users' needs.
Since Dropbox makes it easy to access your files across all of your devices, the Android app has to support viewing as many of those files as possible, including media files, documents, photos, and all the variations within those categories. Additionally, features like Camera Uploads, which automatically backs up all of your most important photos, require deep integration with the Android OS in ways that have changed significantly over time and across Android versions. All of this needs to continuously work for our users, without them having to worry about the complexity, because the last thing anyone wants is to worry that they might lose their files.
While the size and distribution of the Android team at Dropbox has changed throughout the years, it's critical that we're able to consistently build and refine features within the app while maintaining the level of trust from our users that we've become known for. To help underscore how Dropbox has been able to foster that trust, I'd like to share some of the ways our testing strategies have changed over time.
While automated testing has always been an important part of engineering culture at Dropbox, it hasn't always been easy on Android. Years ago Dropbox invested in testing infrastructure that leaned heavily on End-to-End (E2E) testing. Built on Android's instrumentation tests, we developed test helpers for features in the app following the test robot pattern. This enabled a large suite of tests to be created that could simulate a user moving throughout the app, but it came with its own significant costs.
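To illustrate the test robot pattern, here's a minimal sketch built on Espresso's view matchers. The screen and robot names (`FileListRobot`, `FileViewerRobot`) are hypothetical, not Dropbox's actual helpers:

```kotlin
import androidx.test.espresso.Espresso.onView
import androidx.test.espresso.action.ViewActions.click
import androidx.test.espresso.assertion.ViewAssertions.matches
import androidx.test.espresso.matcher.ViewMatchers.isDisplayed
import androidx.test.espresso.matcher.ViewMatchers.withText

// A robot encapsulates how to interact with one screen, so tests read as
// user journeys rather than view-matching details.
class FileListRobot {
    fun openFile(name: String): FileViewerRobot {
        onView(withText(name)).perform(click())
        return FileViewerRobot()
    }
}

class FileViewerRobot {
    fun assertTitleShown(name: String) = apply {
        onView(withText(name)).check(matches(isDisplayed()))
    }
}

// An E2E test then chains robots to simulate a user moving through the app:
// FileListRobot().openFile("report.pdf").assertTitleShown("report.pdf")
```

Because each robot returns the robot for the next screen, a long user flow stays readable while the Espresso details stay in one place per screen.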
Like many Android projects at the time, the Dropbox app started out as a monolithic app module, but that wasn't sustainable in the long run. Work was done to decompose the monolith into a more modular architecture, but the E2E test suite wasn't prioritized in this effort because of the complex interplay of dependencies. This left our E2E test suite as a monolith of its own, resulting in test code that didn't live alongside the feature code it exercised, allowing those tests to easily be missed and become outdated.
Additionally, the long build times that come with monolithic modules with many dependencies, combined with the tests being executed on emulators in our custom continuous integration (CI) environment, meant that the feedback cycle for these E2E tests was slow. This resulted in engineers feeling incentivized to remove failing tests instead of updating them.
As the Android ecosystem embraced automated testing more and more, with the introduction of helpful libraries like Espresso and Robolectric, and support for unit testing built directly into Gradle, Dropbox kept up with these changes by shifting from the heavy reliance on E2E tests toward more and more unit tests, filling out the bottom layer of the previously inverted testing pyramid. This was a significant win for test coverage within the app, and allowed us to roll out quality assurance practices like code coverage baselines, to ensure that we continually improved the reliability of the product as it moved forward.
Over time, as unit testing became easier and easier and engineers became more and more frustrated with the slow feedback cycles of E2E tests, our testing pyramid became lopsided in the other direction. We had confidence in our unit tests and the infrastructure supporting them, but our E2E tests aged without much support, becoming more and more unreliable, to the point that we mostly ignored their failures. Tests that can't be trusted end up becoming a maintenance burden and provide little value, so we recognized that something needed to change.
Over the past year we've doubled down on our focus on reliability. We've invested in our test infrastructure to ensure that engineers are not only able, but incentivized, to write valuable tests across all layers of the testing pyramid. In addition to technical investment in code and tooling, that has also required that we take the time to evaluate the things we test and how we test them, and make sure the entire team has a better understanding of which tools to use when.
Unit testing
We continue to spend most of our efforts writing unit tests. These are fast, focused tests that provide quick feedback, and they serve as our first line of defense against regressions. We write JUnit tests whenever we can, and fall back to instrumentation tests when we need to. Robolectric's interoperability with AndroidX Test has allowed us to move many of our instrumentation tests to JVM-based unit tests, making it even easier to meet our test coverage targets.
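As a sketch of what that interoperability looks like in practice, the same AndroidX Test APIs resolve to Robolectric when run on the JVM. The class under test (`UrlFormatter`) is hypothetical:

```kotlin
import android.content.Context
import androidx.test.core.app.ApplicationProvider
import androidx.test.ext.junit.runners.AndroidJUnit4
import org.junit.Assert.assertEquals
import org.junit.Test
import org.junit.runner.RunWith

// AndroidJUnit4 delegates to Robolectric in a unit test source set, so this
// test needs a Context but runs on the JVM, with no emulator involved.
@RunWith(AndroidJUnit4::class)
class UrlFormatterTest {

    @Test
    fun `formats a file path into a share url`() {
        val context = ApplicationProvider.getApplicationContext<Context>()
        val formatter = UrlFormatter(context)
        assertEquals("dbx://files/report.pdf", formatter.format("report.pdf"))
    }
}
```

Because the test uses only AndroidX Test entry points, the same code could be moved to an instrumentation source set and run on a device unchanged.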
Speaking of test coverage targets, the unit testing layer is the only layer that we use to determine our code coverage. By default we target 80% test coverage, though we have a process to override this target for cases in which unit testing is either not valuable or infeasible.
- Note: While we use standard JaCoCo tooling to evaluate our test coverage, its lack of deep understanding of Kotlin presents some challenges. For instance, we haven't yet found a way to tell JaCoCo that the generated accessors, toString and hashCode of behaviorless data classes don't require test coverage. We've been experimenting with and considering alternatives to ensure that we're not writing brittle tests that don't provide value, but for now we're stuck issuing coverage overrides for these cases.
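An 80% line-coverage gate like the one described can be expressed with Gradle's built-in JaCoCo tasks. This is an illustrative fragment, not Dropbox's actual build configuration; the task name and exclusion patterns are examples:

```kotlin
// build.gradle.kts (module) — assumes the "jacoco" plugin is applied.
tasks.register<JacocoCoverageVerification>("unitTestCoverage") {
    dependsOn("testDebugUnitTest")
    violationRules {
        rule {
            limit {
                counter = "LINE"
                minimum = "0.80".toBigDecimal() // the default 80% target
            }
        }
    }
    classDirectories.setFrom(
        fileTree("build/tmp/kotlin-classes/debug") {
            // Exclusions are file/class-grained: JaCoCo can't skip only the
            // generated members of a Kotlin data class, which is why
            // coverage overrides are needed for those cases.
            exclude("**/*Test*")
        }
    )
}
```

The coarse granularity of `exclude` is exactly the limitation noted above: a whole class can be excluded, but not just its generated `equals`/`hashCode`/accessors.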
E2E testing
Over the past several months we've been renewing investment in our automated E2E test suite. This test suite is able to alert us to extremely important issues that unit tests simply can't identify, like OS integration problems or unexpected API responses. Therefore we've worked hard to improve our infrastructure to make tests easier for engineers to run locally, we've audited and removed flaky or invalid tests, and we've worked on documentation and training to ensure that we support our engineers in the creation and maintenance of our E2E test suite.
Change in E2E test counts before and after the test suite improvement effort.
As I mentioned above, our E2E tests simulate a user moving throughout the app. That means the task of defining our E2E test cases is more than simply an engineering problem. Therefore, we developed guidance to help engineers work with product and design partners to define test cases that represent true use cases.
We recently introduced the practice of using a proper Definition of Done for development work. This amounts to a checklist of items that must be completed in order for a project to be considered "done", defined and agreed upon at the start of the project. Our standard checklist includes the declaration of E2E test cases for the project, which ensures that we're adding test cases in a thoughtful manner, taking into account the value and purpose of those tests, instead of targeting arbitrary coverage numbers.
Screenshot testing
Another dimension of our tests that we've ramped up recently is screenshot testing. Screenshot tests allow us to validate against visual regressions, ensuring that views render properly in light and dark mode, in different orientations, and on different form factors.
In unit tests we leverage Paparazzi for screenshot testing. This allows us to write fast, isolated tests, and we find it's best suited to testing individual view or composable layouts, including our design system components.
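A minimal Paparazzi sketch for a design-system component might look like the following; the `BadgeView` component is hypothetical:

```kotlin
import app.cash.paparazzi.DeviceConfig
import app.cash.paparazzi.Paparazzi
import com.android.resources.NightMode
import org.junit.Rule
import org.junit.Test

class BadgeViewScreenshotTest {

    // Paparazzi renders on the JVM against a simulated device config,
    // so no emulator or device is needed.
    @get:Rule
    val paparazzi = Paparazzi(
        deviceConfig = DeviceConfig.PIXEL_5.copy(nightMode = NightMode.NIGHT)
    )

    @Test
    fun badgeRendersInDarkMode() {
        val view = BadgeView(paparazzi.context)
        // Compares against a recorded golden image and fails on pixel diffs.
        paparazzi.snapshot(view)
    }
}
```

Varying `deviceConfig` (night mode, orientation, screen size) is how a single test exercises the light/dark and form-factor permutations mentioned above.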
We also find value in executing screenshot tests in more full-featured instrumentation tests. For this, we use our own Dropshots library, which supports screenshot testing on devices and emulators. Since Dropshots executes screenshot tests on real (or emulated) devices, it's a great way to validate system integrations like edge-to-edge display, the default window mode on Android 15 devices.
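As a sketch, a Dropshots instrumentation test snapshots a launched activity on the device; `MainActivity` and the snapshot name here are illustrative:

```kotlin
import androidx.test.ext.junit.rules.ActivityScenarioRule
import com.dropbox.dropshots.Dropshots
import org.junit.Rule
import org.junit.Test

class EdgeToEdgeScreenshotTest {

    @get:Rule
    val scenarioRule = ActivityScenarioRule(MainActivity::class.java)

    @get:Rule
    val dropshots = Dropshots()

    @Test
    fun rendersEdgeToEdge() {
        scenarioRule.scenario.onActivity { activity ->
            // Captures the real device/emulator frame, including window
            // insets, and compares it to the recorded reference image.
            dropshots.assertSnapshot(activity, name = "main_edge_to_edge")
        }
    }
}
```

Because the capture happens on an actual (or emulated) device, OS-level behavior like edge-to-edge insets is part of what gets verified, which JVM-based screenshot tests can't observe.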
Manual testing
With all of the investment we've made in automated testing, you'd be forgiven for thinking that we do no manual testing, but even today that's simply not feasible. There are many workflows for which automated tests would either be too hard to write or too hard to validate. For example, we have both unit and E2E tests to validate that the app behaves correctly when rendering file content, but it can be hard to programmatically validate file content, and screenshot tests can sometimes prove too flaky.
For these cases, we use a web-based test case management tool to maintain a complete set of manual test cases, and a third-party testing service to execute the tests prior to each release. This allows us to catch issues for which we haven't yet written tests, or which require human judgement.
Testing has proven invaluable in identifying quality issues before they make it to users, allowing us to earn our customers' trust. Given that value, we intend to continue investing in testing to ensure that we can continue to maintain high quality and reliability. There are a few things that we're looking forward to in the future.
I'm currently in the process of expanding the functionality of Dropshots to support multiple device configurations, which will allow us to perform screenshot tests across a broad range of devices with a single set of tests. Since the Dropbox app works across many different form factors, it will be valuable for us to simultaneously run our screenshot test suite on a variety of devices or emulators to prevent regressions on less common form factors.
Additionally, we're beginning to experiment with Compose Preview Screenshot Testing, which allows our Compose Preview functions to serve double duty: speeding up development cycles while also being used to protect against regressions.
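With the Compose Preview Screenshot Testing Gradle plugin, a preview placed in the `screenshotTest` source set doubles as a screenshot test. A hedged sketch, where `FileCard` is a hypothetical composable:

```kotlin
// src/screenshotTest/kotlin — picked up by the Compose Preview Screenshot
// Testing plugin and rendered as a golden-image test.
import android.content.res.Configuration
import androidx.compose.runtime.Composable
import androidx.compose.ui.tooling.preview.Preview

@Preview(name = "FileCard - dark", uiMode = Configuration.UI_MODE_NIGHT_YES)
@Composable
fun FileCardDarkPreview() {
    // The same preview that aids development in the IDE becomes the
    // reference rendering that guards against visual regressions.
    FileCard(title = "report.pdf")
}
```

No separate test class is needed; the plugin's record and verify tasks generate and compare the reference images from the annotated previews.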
Finally, we intend to continue ensuring that we have a good balance of the right kinds of tests, keeping our testing pyramid balanced so that our investment in testing serves our reliability goals instead of chasing arbitrary coverage targets. We've already seen the value that a healthy test suite can provide, and we'll continue investing in this area to ensure that we continue to be worthy of trust.