Flaky tests - what they are and how to deal with them?

Published on: 12/14/2023

a dark skin pirate woman, with short dark hair and a wide round hat in a blue low top open back dress with thin vertical blue and teal stripes with a small parrot on her arm sitting behind wooden desk weighting gold coins on an old fashioned scale, camera centered front facing, watercolor

Introduction
What are flaky tests?
How to spot flaky tests?
Example of flaky test
Understand flaky tests
Conclusion

Introduction

As we are building our treasure island we've come to a moment where we need to be certain our safety measures are adequate. But as more and more adversaries are attacking we need to strengthen our defenses. Our tests are not reliable enough.

If you have followed the previous tutorials to the letter you probably had no issues with the tests from last article about creating end to end tests with Playwright.

If on the other hand you have deviated ex. limited number of pages scraped. You might have experiences what is called a flaky test.

What are flaky tests?

Flaky tests are ones that return positive or negative results when external factors change even when state of the application or codebase does not change. One such example can be when test depends on a timed function, when they are run on a different operating system or when changes to underlying data cause the test to fail.

How to spot flaky tests?

These can be easy of very hard to find. If you run a test and it fails you will usually check if re-running it turns the test green. If it does it can indicate a flaky test and it's a false negative. False positive tests pose more problems as they pass and seem fine. In that case they may fail during the build, on the CI or in the deployment or for another developer.

Example of flaky test

We have exactly that kind of tests in our codebase.

Here the test results depend on what is returned by the database. If we change what is inside the DB (ex. by changing the amount of data scraped) this test will fail although the functionality will not be hindered.

What is worse it will not really fail for us at the moment (it's a false-positive) but it can fail for other developers.

So how do we fix it?

Understand flaky tests

The first step to fixing this kind of test is to understand what we actually want to test?

It's not the amount of products we are looking for. We expect to find more products with higher price range than with the lower one. Let's try rewriting it.

To verify that we need to remember the number of products that have been found with the initial search and then compare that value against the number of products that are returned when the max price changes to a higher value.

Then we need to compare those values and make sure that we have more products than we originally received.

Below is the finished version with few console.log statements so you can see what is found. We'll go through the test below and a complete version with no "bloat" is at the end.

There are a few things that are crucial here:

use of numberOfInitialProducts and numberOfProductsAfterPriceChange to remember number of products returned with initial values that will fail the test.
Fail test when either are null from await statements.
Detection of change in the contents of the text done via not parameter with previous value:

Ad. 1

Declare variables that will hold the test values to be compared.

Ad. 2

Read more in the documentation about conditionally failing the test .

Ad. 3

Take a note of use of not in the statement - that way we can wait for the text or value to change on the page in the test.

While writing the test I had tried few things to detect changes on the page: await page.waitForTimeout but as per docs: "Never wait for timeout in production. Tests that wait for time are inherently flaky. Use Locator actions and web assertions that wait automatically.". Waiting for result will rightly lead to flaky test which we absolutely want to avoid.

Afterwards I found that we can wait for the network requests to complete with await page.waitForLoadState('networkidle') and again if you check the documentation this method is discouraged as it waits until there are no network connections for at least 500 ms. So it's the same as previous option. Docs state not to use this method for testing and rely on web assertions to assess readiness instead.

If you read through the docs there are no methods that expect the contents to be changed. By inverting the logic we can wait for the content not to be the same as with previous assertion (effectively be different) to get the new value. That way we get automatic wait assertion by Playwright and test should run smoothly.

And here's the full test without the debug logs:

Conclusion

As you can see we have made our test both more complex and when comparing only lines of code it is much larger. On the other hand we have saved ourselves and our colleagues frustration of finding the test unreliable and a need to fix it later with limited time.

I hope you found it valuable and enlightening.

And as always: You are doing great so keep it up!

PS. if you want to test what you have learned today try to change the sorting changes the order of products test in the same file and send me the test via a contact form.

Back to Articles

Sebastian Pieczynski's website and blog

Published on: yyyy.mm.dd

Back to Articles