I'm a big fan of property-based testing. When I'm writing property-based tests, I usually follow a process like this:
- Think of a simple unit test example, e.g. `assert 1 in my_sort([1, 2, 3])`
- Turn constants into arguments, e.g. `assert x in my_sort([x, y, z])`
- Generalise any irrelevant details, e.g. `assert x in my_sort(prefix + [x] + suffix)`
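With Hypothesis, that last step might look roughly like this (a minimal sketch; `my_sort` is the hypothetical function under test):

```python
from hypothesis import given, strategies as st

@given(st.integers(), st.lists(st.integers()), st.lists(st.integers()))
def test_element_is_kept(x, prefix, suffix):
    # The constant from the original unit test is now an arbitrary integer,
    # and the surrounding list is arbitrary too.
    assert x in my_sort(prefix + [x] + suffix)
```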
Property-based testing also complements static typing: if we find ourselves restricting a test generator, like avoiding zero in the article, we can instead try to strengthen our function's argument types (e.g. giving one argument a `PositiveInteger` type). That way, not only have we avoided a crash in our tests; we've also eliminated it from all real usage sites too :)
My favorite part of Hypothesis (also present in similar frameworks) is stateful testing [1]. Instead of just generating inputs to functions for you, it can also generate a sequence of actions for a stateful system. This is useful even when dealing with "pure" functions, like transformation systems. If there is a sequence of transforms that can be selected at runtime and you want to know that they always either lead to a valid state or to some signal that the state is invalid (an exception is thrown, or something like a Result type containing either a valid value or an error value), the RuleBasedStateMachine will work for that, too.
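A rough sketch of what that can look like (the transforms, `make_state`, `is_valid`, and `is_error` are hypothetical placeholders for your own system):

```python
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant

class TransformMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.state = make_state()  # placeholder: initial state of the system

    @rule(factor=st.integers(min_value=-10, max_value=10))
    def scale(self, factor):
        self.state = apply_scale(self.state, factor)   # hypothetical transform

    @rule()
    def normalise(self):
        self.state = apply_normalise(self.state)       # hypothetical transform

    @invariant()
    def valid_or_explicitly_invalid(self):
        # After any sequence of rules, we are either in a valid state
        # or we hold an explicit error value; anything else is a bug.
        assert is_valid(self.state) or is_error(self.state)

TestTransforms = TransformMachine.TestCase
```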
I've always liked this concept, but I've only ever seen it demonstrated with simple pure functions like gcd and sort. How practical is generating inputs for payroll_get_benefits_adjustment(org, employee, jurisdiction, period, *args, **kwargs)?
If I have to teach Hypothesis (by writing strategies) which pieces of these complex values matter and which combinations of those pieces are needed to cover the space, is it still worth it? Am I chopping up essential complexity and pushing it around, or am I gaining coverage? Brevity? Clarity? Correctness? A promotion?
Will it hit the database a hundred extra times to cover combinations that don't matter when I could have just enumerated ones that do?
Will the next maintainer know if tests aren't failing when they should, because the strategies I taught Hypothesis were subtly wrong?
I haven't tried the approach in Python, but example generation is compositional: generate arbitrary credit cards when you're testing the credit card code. Then later, when you're testing the customer code, generate a customer who has 0..n credit cards.
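In Hypothesis, that composition might look roughly like this (`CreditCard` and `Customer` are hypothetical domain classes):

```python
from hypothesis import strategies as st

credit_cards = st.builds(
    CreditCard,
    number=st.text(alphabet="0123456789", min_size=16, max_size=16),
    expiry=st.dates(),
)

customers = st.builds(
    Customer,
    name=st.text(min_size=1),
    cards=st.lists(credit_cards, max_size=5),  # a customer with 0..n credit cards
)
```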
"But isn't it hard to set up your input data exactly right to make the test pass?"
That's the point! You see what your system does when you give it data you didn't expect to give it.
I'll start by just saying I've never introduced property tests without finding a bug in either the specification or implementation. I have repeatedly spent time figuring out where the bug was in my property testing library before realising that it was correct and the obviously correct code I was testing was actually wrong.
I often find it easier in tests to start with the default strategies, then construct the world inside the test. For example, instead of creating strategies for user accounts and monetary transfers, I generated lists of (integer, integer, integer) and interpreted each tuple as (from_account, to_account, amount).
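Concretely, that might look like this (a sketch; `apply_transfer` stands in for the system under test):

```python
from hypothesis import given, strategies as st

transfer = st.tuples(st.integers(min_value=0, max_value=9),     # from_account
                     st.integers(min_value=0, max_value=9),     # to_account
                     st.integers(min_value=1, max_value=1000))  # amount

@given(st.lists(transfer))
def test_transfers_conserve_total(transfers):
    accounts = [1000] * 10                      # construct the world in the test
    for src, dst, amount in transfers:
        apply_transfer(accounts, src, dst, amount)
    assert sum(accounts) == 10 * 1000           # money is neither created nor destroyed
```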
You could have some assertions like these (I'm afraid I don't know what a benefits adjustment means, but I know it's just an example):
* This benefits adjustment will always be 0 or positive
* It will only ever return a value or raise one of the following explicit exceptions <- "my code doesn't crash" is a good base assertion
* Benefits adjustments over longer periods are never less (e.g. adjustment for the time range 2020-2023 >= 2020-2021)
* If I create a second organisation with employees, the result of checking the first one never changes.
Your tests don't have to check exact results, otherwise you end up rebuilding the logic you're trying to test just to work out what the result should be. You can check general properties that should remain true.
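For example, the third bullet above might become something like this (a sketch; `orgs()` and `employees()` are hypothetical strategies, and I'm guessing at the call signature):

```python
from datetime import date
from hypothesis import given, strategies as st

@given(org=orgs(), emp=employees(),
       start_year=st.integers(min_value=2000, max_value=2020),
       short_len=st.integers(min_value=1, max_value=3),
       extra=st.integers(min_value=0, max_value=3))
def test_longer_periods_never_adjust_less(org, emp, start_year, short_len, extra):
    short = (date(start_year, 1, 1), date(start_year + short_len, 1, 1))
    long_ = (date(start_year, 1, 1), date(start_year + short_len + extra, 1, 1))
    # Adjustments over a longer period should never be less than over a shorter one.
    assert (payroll_get_benefits_adjustment(org, emp, period=long_)
            >= payroll_get_benefits_adjustment(org, emp, period=short))
```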
You can extend this further, making it more stateful in a sense. I built a UI library for TVs, and programs using it may add and remove elements at any time. A powerful test was:
1. Given an empty UI
2. And someone makes a series of API calls where each is [add_element(...), remove_element(...), user_left(), user_right(),...]
3. One and only one element is in focus
And another
1. Given an arbitrary UI created with those calls
2. If the user moves right and the focus changes
3. Then when the user moves left they will go back to the item they were previously on
It was very hard to write the library so that it passed these tests yet didn't do what we wanted. I found a case where an explicit unit test and a part of the spec couldn't pass while these tests were passing, because the spec was inconsistent.
I think these tests were actually clearer to write than a series of explicit cases, too.
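In Hypothesis terms, the first of those tests is roughly an invariant checked after an arbitrary sequence of rules (a sketch; the `make_empty_ui` object and its methods are placeholders for the real library):

```python
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant

class FocusMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.ui = make_empty_ui()            # "Given an empty UI"

    @rule(name=st.text(min_size=1))
    def add_element(self, name):
        self.ui.add_element(name)

    @rule()
    def user_left(self):
        self.ui.user_left()

    @rule()
    def user_right(self):
        self.ui.user_right()

    @invariant()
    def exactly_one_element_focused(self):
        elements = self.ui.elements()
        if elements:                         # once anything is on screen...
            assert sum(e.focused for e in elements) == 1

TestFocus = FocusMachine.TestCase
```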
> Will it hit the database a hundred extra times to cover combinations that don't matter when I could have just enumerated ones that do?
Yes. The benefit is in trying things you think probably work but don't actually know they do, and cases you wouldn't have thought about but are totally possible. They're not a replacement for explicit tests, they're another tool.
I have had them fail and reveal a problem in some text extraction, which boiled down to the fact that lowercasing a Turkish İ (in Python, at least) results in more than one character, so positions in the text after it had been lowercased were not the same as before. I had absolutely not considered that.
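For the curious, the behaviour in question on CPython 3: lowercasing the dotted capital İ produces 'i' plus a combining dot above, so the string gets longer.

```python
>>> s = "\u0130"          # 'İ', LATIN CAPITAL LETTER I WITH DOT ABOVE
>>> len(s), len(s.lower())
(1, 2)
```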
A finishing thought - if you can't say a general property that should remain true given some arbitrary input, how hard is it to understand what your function will do when you see a call to it?
> I have had them fail and reveal a problem in some text extraction, which boiled down to the fact that lowercasing a Turkish İ (in Python, at least) results in more than one character, so positions in the text after it had been lowercased were not the same as before. I had absolutely not considered that.
Oof, this reminds me of an obscure problem caught by a property test (ScalaTest + ScalaCheck) a few years ago: we had to parse files containing fixed-width fields, with optional parts, padded with whitespace, etc. The customer couldn't provide any documentation for the format, so we had to figure it out based on a thousand files captured off their system.
One day a colleague ran a build on their machine, and hit a test failure. It turned out that they were using a different JVM (it was written for JDK11, but they were using JDK14); we were generating strings which include/exclude whitespace based on the 'Char::isWhitespace' method; but our parser combinators were matching whitespace using regular expressions.
This discrepancy caused the test failure, since ScalaCheck had generated a string containing a "mongolian vowel separator". That character was considered whitespace in older Unicode standards, but not in newer ones. The Char::isWhitespace method seemed to take that into account, but the regexp matcher didn't (e.g. see https://unicode-explorer.com/c/180E ).
In principle, yes, it's worth it even for more complex data.
In practice, hypothesis is not good at this (I would argue it is broken currently, as specifications that are valid can fail for… reasons of implementation details) and the maintainers don’t think there’s any problem with this limitation.
I’ve had much better luck with generative testing in clojure and haskell.
Hypothesis will generate a lot of things for you; it can be used for stuff like names, e-mail addresses, passwords, etc. Instead of having to populate your test data with stuff like John Doe, Doe Street 1111, Doe City, you can just use it to generate that on the fly.
It will then tirelessly try different permutations in the never-ending search for more bugs.
Hypothesis also has two different ways to generate data:
- One is to write things like in the article: accept data via arguments, and specify a strategy for each up-front.
- Another is to be more ad-hoc: use the 'data' strategy to get a data generator which can be called on-the-fly at the point it's needed inside our test. This is useful if our test is complicated, e.g. with a bunch of conditional logic where each branch needs something different; or the data generation depends on some run-time values (e.g. the number of results we got from the database); etc. Hypothesis will still keep track of the random seeds, perform shrinking, etc. as usual.
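A small sketch of that second style, drawing values on the fly with the 'data' strategy:

```python
from hypothesis import given, strategies as st

@given(st.data())
def test_lookup_roundtrip(data):
    # Draw values at the point they're needed, instead of up-front in the decorator.
    size = data.draw(st.integers(min_value=1, max_value=10), label="size")
    keys = data.draw(st.lists(st.text(min_size=1), min_size=size, max_size=size,
                              unique=True), label="keys")
    table = {k: len(k) for k in keys}
    for k in keys:
        assert table[k] == len(k)
```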
Then I talk about randomly sampling from a partition as a way to eventually detect "hidden" partitions you didn't expect, like Hypothesis does.
When dealing with many parameters, each with several partitions, the variations multiply quickly. To deal with combinatoric explosion, I talk about all pairs testing, also known as "pairwise" testing.
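To make the explosion concrete (hypothetical parameters; the "as few as 9" comes from a standard pairwise covering design for four 3-valued parameters):

```python
from itertools import product

partitions = {
    "os":      ["linux", "mac", "windows"],
    "browser": ["chrome", "firefox", "safari"],
    "locale":  ["en", "tr", "ja"],
    "plan":    ["free", "pro", "enterprise"],
}

full = list(product(*partitions.values()))
print(len(full))  # 81 exhaustive combinations
# All-pairs ("pairwise") testing instead picks a much smaller set of cases, chosen
# so that every pair of values across any two parameters appears at least once;
# for four 3-valued parameters that can be done in as few as 9 cases.
```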
Manual testers should definitely know about equivalence class partitioning, one of the first things I learned when I switched to being a tester.
And as for "pairwise" testing, it's been a while since I heard that mentioned!
I think more emphasis should be put on something I only realised recently, which hampered me a lot with respect to adopting property testing, and which I still have a hard time with: property tests are not parametrised tests. It's OK if they can only check some properties of the computation (e.g. post-conditions); that is still valuable.
For the longest time I was convinced I should have "perfect" oracles (unit-test style), because that's what simple examples usually show off; but outside of trivial examples that's only possible in limited cases, unless you have a complete and perfect oracle (usually an existing known-good implementation).
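For instance, a round-trip post-condition needs no oracle for the intermediate result at all:

```python
import zlib
from hypothesis import given, strategies as st

@given(st.binary())
def test_compress_roundtrip(payload):
    # We have no idea what the compressed bytes "should" be, and we don't care;
    # the post-condition is enough.
    assert zlib.decompress(zlib.compress(payload)) == payload
```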
Indeed, I tend to avoid writing a single test that captures 'all' of a function/unit's behaviour (like the article is doing).
Instead, I would write multiple tests, mostly just asserting one thing each. For example: `test_gcd_always_positive(m, n)`, `test_gcd_divides_arguments(m, n)`, `test_gcd_is_largest_divisor(m, n)`, etc.
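Using the standard library's gcd as a stand-in, those might look like:

```python
import math
from hypothesis import given, strategies as st

positive = st.integers(min_value=1, max_value=1000)

@given(positive, positive)
def test_gcd_always_positive(m, n):
    assert math.gcd(m, n) > 0

@given(positive, positive)
def test_gcd_divides_arguments(m, n):
    g = math.gcd(m, n)
    assert m % g == 0 and n % g == 0

@given(positive, positive)
def test_gcd_is_largest_divisor(m, n):
    g = math.gcd(m, n)
    # No larger value divides both arguments.
    assert not any(m % d == 0 and n % d == 0 for d in range(g + 1, min(m, n) + 1))
```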
I also think in terms of a 'threat model' for our colleagues or past/future selves:
- They're probably not trying to write broken code, so we don't need to catch all incorrect implementations. In other words, they're not adversarial.
- They're probably lazy or mistaken, so we should try to catch incorrect implementations which are simple to write, make obvious mistakes, etc.
For example, `my_sort = lambda _: []` is a really lazy way to implement the right type signature. Checking that the length and the set of elements are preserved are good tests to catch that, so they seem worth writing. Likewise, checking that the elements are sorted is a good test to avoid the lazy approach of `my_sort = lambda l: l`.
However, the implementation which sorts the set of elements and pads with the largest is pretty convoluted, so a lazy developer will likely just write a basic sort algorithm at that point (e.g. `my_sort = lambda l: [x for y in sorted(set(l)) for x in l if x == y]`). Also, the test is a bit tricky; if that counting function didn't already exist, I probably wouldn't bother writing that test at all (there are equivalents which might be simpler to write, e.g. for each input element: assert it appears in the output, then remove its first occurrence; and finally assert the output ends up empty, as sketched below).
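That simpler equivalent looks something like this (`my_sort` again being the hypothetical function under test):

```python
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_my_sort_is_a_permutation(xs):
    out = list(my_sort(xs))
    for x in xs:
        assert x in out   # every input element appears in the output...
        out.remove(x)     # ...at least as many times as it appeared
    assert out == []      # ...and nothing extra was added
```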
I'd love property-based testing in big data cluster computing applications done in Databricks/Spark. However, (Py)Spark is very high-latency, even in local mode, so Hypothesis would have to generate a list (DataFrame) of thousands of test cases to be evaluated in parallel. That is where I got stuck. Has anybody ever done this successfully?
There are lots of dials that can be tweaked; e.g. timeouts, number of tests, etc.
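For an expensive system like Spark, those dials might be set something like this (a sketch, not a recommendation for specific values):

```python
from hypothesis import given, settings, strategies as st

@settings(max_examples=20, deadline=None)   # fewer examples, no per-example time limit
@given(st.lists(st.integers(), min_size=1))
def test_spark_roundtrip(xs):
    ...  # build a DataFrame from xs, run the pipeline, assert on the result
```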
For example, I've used Hypothesis to test some browser-automation, which uses the ChromeController package to launch a Chrom(ium) browser to take screenshots and print-to-PDF. The tests do things like:
- Generate random HTML
- Write it to a temp file
- Launch Chromium, set its window width+height, and navigate to that file:// address
- Take a PNG screenshot
- Use a PNG library to assert we've got a valid PNG, of the given width + height
There are similar tests for print-to-PDF (checking that it's a valid PDF with at least one page), etc.
The only fiddling I had to do was put `deadline=1000` in the `@settings` decorators. That raises the per-example time limit, so Hypothesis doesn't flag these slow browser runs as failures, and the suite still finishes within a reasonable time frame.
These sorts of tests are good for sanity-checking that we're plugging things together in the right way; but I wouldn't rely on them checking enough times to e.g. catch arithmetic edge-cases, etc.
The assertions, test suite, etc. are handled by PyTest.
Hypothesis is taking care of generating arguments, and supplying them to the test. The '@given' decorator will wrap up the given test, to make a new one which:
- Calls the original test multiple times, with randomly generated arguments
- If the original test ever fails (throws an exception), the arguments which caused the failure will be "shrunk" to get a list of "smaller" arguments (e.g. smaller numbers; lists with fewer elements, whose elements have also been shrunk; etc.). The original test will be called with those "smaller" arguments, to see if they also cause a failure: if they all pass (or the "shrunk" list is empty), then the original failing arguments are reported; if the same exception gets thrown for some "smaller" arguments, we try again by shrinking those into even smaller values (and so on); if a different exception is thrown, both failures are reported.
- The wrapped function also keeps a buffer of "notes", to help with debugging failures. Normal 'printf debugging' will be confusing to follow, since the original function will be called many times, so we'll get many copies of the debug output, and it's hard to know which ones came from successes and which ones from failures. If we use the 'note' function, Hypothesis will put them in a buffer, which is reset for each run. When a counterexample/failure is reported, only those notes from that run will be printed.
- We can tell Hypothesis to skip certain situations; e.g. the test in the article could have caught ZeroDivisionErrors and told Hypothesis to skip. That doesn't count as a failure, but also doesn't count as a success. The test will just be run again with different arguments.
- There is a threshold on the number of skipped tests; crossing that threshold is treated as a failure, since we aren't seeing enough successful runs to be confident that things are working.
- There's also a timeout, to avoid bad interactions between the strategies/generators, tests, skipping, etc.
There are a few more things Hypothesis will do, like maintaining a database of previous failures, so it can retry those values as a form of regression test. It's a really nice tool :)
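For instance, the skipping and the 'note' buffer described above look roughly like this, using the article's division example:

```python
from hypothesis import assume, given, note, strategies as st

@given(st.integers(), st.integers())
def test_divmod_identity(x, y):
    assume(y != 0)                 # skipped: neither a pass nor a failure
    note(f"trying x={x}, y={y}")   # only shown for the failing (shrunk) example
    assert (x // y) * y + x % y == x
```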
Right, it's not going to exhaust the search space of inputs. That would be computationally infeasible for some kinds of inputs.
It also doesn't exhaust the set of all possible code execution branches, for example, like an intelligent fuzzer would.
Saying that it's "just" bruteforcing tests still seems a bit dismissive for what you get, at least to my eyes. Property generators are good at exploring interesting neighbourhoods of the search space. When they find a failure, they're also usually good at reducing the error input to something small.
As an example, it might learn that "@#*CH@R822cr21;'c09J@)RH0 92hr19h" causes an error, but then be able to reduce the input to learn that it's the apostrophe that causes the issue -- a strong hint that you've messed up parameter escaping somewhere in your function.
> Property generators are good at exploring interesting neighbourhoods of the search space.
AFAICT hypothesis is not doing SMT, it's not doing intelligent fuzzing and not doing bruteforcing. What algorithm is it using to explore the search space then?
Having a very effective algorithm is crucially important, but the documentation does not compare the tool with other methodologies.
Simply generating random inputs can be highly effective. For a constant investment of human effort, you can get billions of test cases. Quantity has a quality all its own.
The place where I've applied this (not using Hypothesis, mind you) is testing Common Lisp compilers. There are some general techniques for biasing the inputs to encourage certain kinds of programs, but overall just smothering a compiler in a tidal wave of tests exposes all sorts of weird bugs you'd never think of until you saw them. For those sorts of bugs, trying to partition the input space ahead of time is just useless.
One of the amazing things about Hypothesis is that it's very good at generating weird inputs that trigger bugs in cases when it would be impossible to test more than a tiny fraction of all possible inputs.
There is very, very clever code that does things like create weird dicts where the keys are weird strings and the values are other weird dicts or arrays or strings.
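You can also ask for that kind of nested structure yourself; something like:

```python
from hypothesis import strategies as st

# Strings at the leaves; lists and string-keyed dicts of those, nested arbitrarily.
weird = st.recursive(
    st.text(),
    lambda inner: st.lists(inner) | st.dictionaries(st.text(), inner),
    max_leaves=20,
)
```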
If it finds a test failure, it then applies a reduction step, where it replaces the extremal test case with something slightly more benign and checks if the test still fails. This allows it to generate a test case which is just hairy enough to trigger the bug, but no more. This makes it easier to understand why that specific test case fails.