Tests give us confidence in our code, but they can also slow us down. How do we strike the right balance between running them often enough to catch problems and not letting them get in the way of progress?
Even relatively simple serverless applications can be complex. With complexity comes the risk that the system doesn’t behave as intended. We have two weapons against this risk:
- Testing: catching as many issues as we can before releasing software
- Observability: making sure that when issues are released (no software is ever flawless!) we can spot and diagnose them as soon as possible
Both of these paradigms are about learning, specifically learning how the system behaves.
Observability is fascinating and of paramount importance, but the focus for me here is Testing. Testing is a means to attain confidence in a system before releasing it.
Whenever we build software, we need to make decisions about how we will ensure the software is working as intended. This means writing tests which mimic real-world scenarios and then running those tests at the right moment of the delivery lifecycle so that they catch problems.
So what is it about Testing that can give us confidence that our release will go well?
- Lots of tests?
- Lots of code coverage?
- Testing interactions between parts of the system?
Maybe. These are all potential indicators of good testing, but they come at a cost: Time. They cost time in two ways:
- Execution time: tests take time to run and, depending on the type of test, that can be a long time
- Maintenance time: having lots and lots of tests means there are lots and lots of tests to maintain
So what if it takes time?
The trouble is that if tests take too long to run, one of two things will happen: you will often skip them, or you will be unable to release as often as you should because you are waiting for them to finish.
If you skip the tests, then what’s the point of having them? If you don’t skip them, you’re stuck waiting for them to finish and your delivery slows down.
The reason that this matters so much is that poor/slow testing directly affects ALL 4 of the DORA metrics. And the reason DORA metrics matter so much is that DevOps maturity directly correlates with Business success. In other words, if you are smashing DevOps, then you have a much better chance of being successful as a company.
Here are the four DORA metrics:
- Lead Time: Time from the first commit to deploy in production — long-running tests can increase this metric
- Deploy Frequency: How often deployments to production happen — long-running tests inhibit the ability to deploy often and will probably lead to bigger batches and lower deploy frequency
- Time to restore: From when an issue first occurs, how long does it take to restore the service — the longer it takes for tests to run, the more time it takes to get from fix identified to fix deployed
- Change fail percentage: What percentage of deploys to production results in the introduction of a service-affecting issue — tests which do not validate the right behaviour (or aren’t being run!) will result in more issues making it into production
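To make the four definitions concrete, here is a minimal Python sketch of how they could be calculated from a list of deployment records. The `Deploy` record, its field names and the `dora_summary` function are all hypothetical, invented for this illustration, and the averaging is a simplification rather than an official DORA calculation.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass
class Deploy:
    """Hypothetical record of one production deployment."""
    first_commit_at: datetime
    deployed_at: datetime
    # Both fields stay None when the deploy caused no service-affecting issue.
    incident_started_at: Optional[datetime] = None
    restored_at: Optional[datetime] = None


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def dora_summary(deploys: list[Deploy], period_days: int) -> dict[str, float]:
    """Summarise the four metrics for the deploys made during a period."""
    if not deploys or period_days <= 0:
        raise ValueError("need at least one deploy and a positive period")
    failed = [d for d in deploys if d.incident_started_at is not None]
    restore_hours = [
        hours_between(d.incident_started_at, d.restored_at)
        for d in failed
        if d.restored_at is not None
    ]
    return {
        # Lead Time: first commit to deploy in production.
        "lead_time_hours": mean(hours_between(d.first_commit_at, d.deployed_at) for d in deploys),
        # Deploy Frequency: how often deployments happen.
        "deploys_per_day": len(deploys) / period_days,
        # Time to restore: from the issue first occurring to service restored.
        "time_to_restore_hours": mean(restore_hours) if restore_hours else 0.0,
        # Change fail percentage: share of deploys that introduced an issue.
        "change_fail_percentage": 100 * len(failed) / len(deploys),
    }
```

Slow tests show up here directly: every minute of test execution is added to `lead_time_hours` for each deploy, and to `time_to_restore_hours` whenever the fix has to go through the same pipeline.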
Testing is integral to business success and this leads me to the key point of this post:
Valuable testing is a balance between well-considered, useful test scenarios and fast execution to achieve short cycle times with a high level of confidence for every release.
It’s a balance. The sweet spot in the balance depends on a couple of things:
- How tolerant can we be of failure?
- How fast do we need to react to change?
How tolerant can we be of failure?
A system like “Smart motorways” simply cannot tolerate failures, because a failure might result in people being seriously harmed. Because of the large impact of any potential issues, it makes sense to spend more time ensuring the system works correctly before a change is released.
However, many systems have a much smaller impact if they fail — and so spending a large amount of time testing them before release is a less valuable use of time.
How fast do we need to react to change?
In some circumstances, a system may remain fairly static for long periods. In these cases, it’s acceptable for changes to take a long time, so testing in turn can take longer.
However, in a competitive marketplace, having the ability to quickly adapt to changing landscapes might be the difference between success and failure. In these cases, it may be a bigger risk to a company to spend too much time on testing than the negative impact of Users experiencing issues would be.
Whatever the “right balance” is for you, it is important to understand what to test, when and in how much detail. Let’s start by defining some test phases:
- Unit Tests
- System Integration Test (incl Contract Tests)
- End-to-End Tests
In my opinion, these are the fundamental types and phases of testing. (I am deliberately ignoring Non-functional Requirements testing, e.g. Performance testing, as it is less generic across use cases.)
As you work down the list the following things happen:
- Tests take increasingly longer to write and maintain. For example, a Unit Test will likely be very short and simple, with one specific thing it is testing for, whereas an E2E Test will be long and complex and may validate a number of things
- Tests take increasingly longer to execute, e.g. an E2E Test will take much longer to run than a Unit Test
- Tests increase in value as they get closer to reflecting true user scenarios. As careful as we may be when we write Unit Tests, they fundamentally don’t exercise the code in the way that a real User would. The more “realistic” and “user-like” a test is, the more valuable it is
- Tests rely more and more upon environmental conditions to be able to run. For example, in Unit Tests, 3rd parties and external services would be mocked, whereas in an E2E Test you would not mock and would therefore require the environment to be in good working order
To put it simply, the earlier in the development lifecycle that you test something, the cheaper it is to both run the test and fix any problems it finds. However, the less likely that test is to truly mimic how a User will experience the system.
So what does all this mean for my Serverless project? One of the key benefits of Serverless is that you can keep each component part simple. Each FaaS Function should have a single responsibility and single side-effect. Therefore its code should also be simple. And that matters a lot for testing.
Unit Tests for FaaS Functions
In the vast majority of cases a FaaS Function’s process flow will go something like this:
- Input: Receive payload (event, API request)
- Validate the payload (is it structured correctly, does it meet certain business rules)
- Do some business logic
- Side-effect: Perform a side-effect (store something in a database, emit an event, return a response, etc)
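The four steps above can be sketched as a small Python function. This is only an illustration under my own assumptions: the event shape (a JSON string under `body`), the `order_id` and `quantity` fields, the fixed price and the injected `publish` callable are all hypothetical and not tied to any particular cloud provider.

```python
import json
from typing import Any, Callable

UNIT_PRICE_PENCE = 250  # hypothetical fixed price, for illustration only


def make_handler(publish: Callable[[dict], None]) -> Callable[..., dict]:
    """Build a handler whose side-effect is injected, so tests can pass a fake."""

    def handler(event: dict, context: Any = None) -> dict:
        # 1. Input: receive the payload (here an API-style event with a JSON body).
        try:
            payload = json.loads(event.get("body") or "")
        except (json.JSONDecodeError, TypeError):
            return {"statusCode": 400, "body": "body must be valid JSON"}

        # 2. Validate: structure first, then a simple business rule.
        if not isinstance(payload, dict) or not isinstance(payload.get("order_id"), str):
            return {"statusCode": 400, "body": "order_id is required"}
        quantity = payload.get("quantity")
        if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity < 1:
            return {"statusCode": 400, "body": "quantity must be a positive integer"}

        # 3. Business logic.
        message = {
            "order_id": payload["order_id"],
            "total_pence": quantity * UNIT_PRICE_PENCE,
        }

        # 4. Side-effect: emit the result, then report back to the caller.
        publish(message)
        return {"statusCode": 202, "body": json.dumps(message)}

    return handler
```

Passing `publish` in from the outside is a design choice for the sketch: the deployed Function would hand over something that really writes to a queue, while a test can hand over `some_list.append`.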
Because the FaaS Function is so simple, I would always recommend performing simple Black Box tests. Don’t go writing unit tests for every method in the Function. That will just result in brittle tests which fail every time you make a code change. We don’t want that.
By writing Unit Tests which provide a FaaS Function with a known input (payload) and expect a known output (side-effect), it’s possible to achieve high coverage (~100%) with few tests, perhaps 8 or 10, depending on the number of validation scenarios.
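As a rough sketch of what such Black Box tests might look like, the example below uses Python's standard `unittest` and `unittest.mock.Mock`. The function under test, `handle_signup`, and its `save` dependency are hypothetical stand-ins for a real Function and its side-effect. Note that no test reaches inside the function; each one only supplies an input and checks the result and the side-effect.

```python
import unittest
from unittest.mock import Mock


def handle_signup(event: dict, save) -> dict:
    """Hypothetical function under test: validate, normalise, then save once."""
    email = event.get("email")
    if not isinstance(email, str) or "@" not in email:
        return {"status": "rejected", "reason": "invalid email"}
    save({"email": email.strip().lower()})  # the single side-effect
    return {"status": "accepted"}


class HandleSignupBlackBoxTest(unittest.TestCase):
    """Known input in, known output and side-effect out. Nothing internal is inspected."""

    def test_valid_payload_is_saved(self):
        save = Mock()
        result = handle_signup({"email": " Ada@Example.com "}, save)
        self.assertEqual(result, {"status": "accepted"})
        save.assert_called_once_with({"email": "ada@example.com"})

    def test_missing_email_is_rejected_without_side_effect(self):
        save = Mock()
        result = handle_signup({}, save)
        self.assertEqual(result["status"], "rejected")
        save.assert_not_called()

    def test_malformed_email_is_rejected_without_side_effect(self):
        save = Mock()
        result = handle_signup({"email": "not-an-email"}, save)
        self.assertEqual(result["status"], "rejected")
        save.assert_not_called()
```

Three scenarios cover every branch of this tiny function. Saved to a file, they can be run with `python -m unittest`.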
The key thing with this approach is that there isn’t much wiggle room. You can’t increase the number of tests unless you test things multiple times (don’t do that!). But also, you can’t reduce the number of tests without resulting in a lack of coverage. It makes decision-making around Unit Tests easy:
- Do the tests hit the coverage target for the project?
  - Yes. We’re done.
  - No. Add the missing black box test scenarios.
It is at this stage that you are testing Business Logic.
System Integration Testing (SIT)
Now that the FaaS Function code is tested, we need to validate our deployment. There are tools for performing this type of test locally, however, I recommend always performing System Integration Tests in a valid deployment.
Post-deploy SIT should be easy enough if you are using Infrastructure as Code and a CI/CD pipeline to ensure your deployments are fully automated. Once the deployment has completed, we need to make sure it works as intended.
So what are we testing? Well, let’s start with what we are NOT testing. We are NOT testing Business Logic. Unit Testing has covered our Business Logic, so re-testing it post-deployment is unnecessary duplication. Instead, we should focus on checking the deployment environment, i.e. the “connective tissue”.
- Permissions: can the cloud services we need be invoked?
- Routing: do events/requests end up at the intended destination?
The focus here should be on the fewest scenarios possible which cover all of the resources a project uses.
Take, for example, an API endpoint which triggers a Function, which in turn writes a message to a queue.
The internal bits of the Function were already tested by the Unit Tests, so we don’t test them here. That means that only one SIT scenario is required.
Again, Black Box testing is the way to go here. A simple test which calls the API endpoint and then checks the queue for a message is good enough. Don’t even inspect the content of the message (that Business Logic was covered in the Unit Tests).
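A minimal sketch of that single scenario in Python might look like the following. The two callables are assumptions of mine: `call_endpoint` would wrap whatever HTTP client you use and return the status code, and `count_messages` would wrap your queue's client. Neither is shown here, because they depend entirely on your provider and deployment.

```python
import time
from typing import Callable


def api_call_reaches_queue(
    call_endpoint: Callable[[], int],
    count_messages: Callable[[], int],
    timeout_seconds: float = 10.0,
    poll_interval_seconds: float = 0.5,
) -> bool:
    """Return True if calling the endpoint leads to a new message on the queue."""
    before = count_messages()

    status = call_endpoint()
    if not 200 <= status < 300:
        # The request itself was refused: a routing or permissions problem.
        return False

    # The message may arrive asynchronously, so poll until the deadline.
    deadline = time.monotonic() + timeout_seconds
    while True:
        if count_messages() > before:
            return True  # content is deliberately not inspected
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_seconds)
```

The check only asks "did a message arrive?". Whether the message is correct is a Business Logic question, and that belongs to the Unit Tests.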
Make sure that SIT scenarios are as isolated as possible. A universal truth of good tests is that when a test scenario fails, it implicitly indicates which part of a system is at fault.
End-to-End Tests (E2E)
Once all of these tests have been run, in theory we should have confidence that the system works! The trouble is that, although a combination of Unit Tests and SIT provides good coverage of the system, none of these tests acts as a User.
End-to-End Tests should mimic how Users interact with the system. To do that, each E2E scenario will likely cross several vertical slices. This makes E2E tests expensive: they take a long time to run and can sometimes be flaky.
This is the main area to focus on when choosing your balance…
I’m sure you will have heard of the Testing Pyramid. Essentially it is a pictorial representation of the types of testing I mentioned earlier. It represents the types in a pyramid to reflect that the earlier tests should be the most numerous and later tests should be fewer.
Using the approach described above, the number of Unit and SIT scenarios written is fairly well prescribed. There is a logical maximum number of tests to achieve coverage, without duplication.
Reducing the number of tests is therefore a decision about which parts of the system I should not test. To put it another way: “I don’t care if this Business Logic doesn’t work” or “I don’t care if I can’t call this API endpoint”. Sometimes this is going to be ok, especially in a proof of concept or a little hobby project. But in reality, for most systems, there isn’t much of an option. I need confidence that all my Business Logic is working and that all my “connective tissue” is working.
The Lever to Pull
The biggest room for manoeuvre when designing your test approach is going to be in End to End Tests. To achieve maximum coverage in E2E Tests will often require lots of scenarios, LOTS. Not just a few dozen, but potentially hundreds or thousands.
Think about it: the first scenario can probably cover quite a lot of the system, and maybe so can the second. But by the time you get to scenario 1,000, you will be reaching into a very specific corner of the system. There are diminishing returns when adding more E2E scenarios.
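One way to picture those diminishing returns is to count, for each scenario in turn, how many parts of the system it touches that no earlier scenario already touched. The Python sketch below does just that. Representing a scenario as a set of component names is my own simplification for illustration, and the function name is hypothetical.

```python
def marginal_coverage(scenarios: list[set[str]]) -> list[int]:
    """For each scenario, in order, count the components it covers for the first time."""
    covered: set[str] = set()
    gains: list[int] = []
    for components in scenarios:
        newly_covered = components - covered
        gains.append(len(newly_covered))
        covered |= newly_covered
    return gains


# Three made-up user journeys and the components each one passes through.
journeys = [
    {"api", "auth", "orders", "queue"},
    {"api", "auth", "billing"},
    {"api", "orders", "billing"},
]
print(marginal_coverage(journeys))  # prints [4, 1, 0]
```

The third journey still costs time to write, maintain and run, yet it touches no component that the first two had not already covered.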
Remember the more E2E scenarios you write:
- the longer it takes to write them
- the more time is required to maintain them
- the longer the tests take to run
- the more likely a defect will result in multiple test failures (making the investigation more difficult)
All these factors will have a significant effect on your DORA metrics. So the decision to take here is:
What is the minimum number of End-to-End scenarios I need to have enough* confidence that my system is working?
*Enough confidence means that I know the main User journeys are working and that I have enough Observability in place to know quickly if anything goes wrong in Production.
So when we are building Serverless systems our test approach can remain fairly simple.
- Employ a black box testing approach to Unit Tests and SIT
- (Usually) aim for ~100% coverage of code in Unit Tests
- (Usually) aim for ~100% coverage of Connective Tissue in SIT
- (Very Rarely) aim for 100% coverage of code in E2E
Adjust the number of your End-to-End tests to match your risk profile. In other words, if you are very risk-averse, have a larger number of E2E scenarios. Instead of a Testing Pyramid, you then have an “Irregular Hexagon of Testing”!
Aiming for 100% coverage in Unit Testing is probably controversial. But let me explain my reasoning. I prefer to aim for a high coverage level and then explicitly and deliberately omit certain files or chunks of code from a coverage report. This is more sensible than reducing the global coverage level for a whole project.
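As one possible illustration, assuming a Python project measured with coverage.py, a deliberate omission can be marked right next to the code it applies to. The `# pragma: no cover` comment is coverage.py's default exclusion marker, and placing it on a `def` line excludes the whole function. The two functions themselves are hypothetical examples.

```python
import os


def load_table_name() -> str:
    """Part of the running system, so it stays inside the coverage report."""
    return os.environ.get("TABLE_NAME", "orders")


def debug_dump(state: dict) -> None:  # pragma: no cover
    # Deliberately excluded: a local debugging aid that production never calls.
    for key, value in sorted(state.items()):
        print(f"{key}={value}")
```

Whole files can be left out through coverage.py's `omit` setting, and its `fail_under` setting can keep the project-wide target high. Either way the omission is visible and reviewable, rather than hidden inside a lowered global threshold.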
Furthermore, there may be projects in which achieving ~100% coverage results in very long-running tests. I hate to break it to you, but in that case, your project is probably too big. The required Cognitive Capacity to maintain the project is likely too high. Split the project up using Domain Driven Design.