Bombadil: Property-based testing for web UIs

(github.com)

207 points | by Klaster_1 4 days ago

19 comments

  • IanCal 8 hours ago
    I'm a huge fan of property based testing, I've built some runners before, and I think it can be great for UI things too so very happy to see this coming around more.

    Something I couldn't see was how those examples actually work, there are no actions specified. Do they watch a user, default to randomly hitting the keyboard, neither and you need to specify some actions to take?

    What about rerunning things?

    Is there shrinking?

    edit - a suggestion for examples, have a basic UI hosted on a static page which is broken in a way the test can find. Like a thing with a button that triggers a notification and doesn't actually have a limit of 5 notifications.

    • owickstrom 8 hours ago
      Hey, yeah the default specification includes a set of action generators that are picked from randomly. If you write a custom spec you can define your own action generators and their weights.

      Rerunning things: nothing built for that yet, but I do have some design ideas. Repros are notoriously shaky in testing like this (unless run against a deterministic app, or inside Antithesis), but I think Bombadil should offer best-effort repros if it can at least detect and warn when things diverge.

      Shrinking: also nothing there yet. I'm experimenting with a state machine inference model as an aid to shrinking. It connects to the prior point about shaky repros, but I'm cautiously optimistic. Because the speed of browser testing isn't great, shrinking is also hard to do within reasonable time bounds.

      Thanks for the questions and feedback!
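
The reply above says actions are picked at random from weighted generators. A minimal sketch of that idea follows, under loud assumptions: the action names and weights are hypothetical, and this is plain Python, not Bombadil's spec language or its actual default generators.

```python
import random
from collections import Counter

# Hypothetical action names and weights, chosen only for illustration.
WEIGHTED_ACTIONS = {
    "click": 5,
    "type_text": 3,
    "scroll": 1,
    "navigate_back": 1,
}

def pick_actions(rng, count):
    """Draw `count` action names, each chosen independently according to its weight."""
    names = list(WEIGHTED_ACTIONS)
    weights = list(WEIGHTED_ACTIONS.values())
    return rng.choices(names, weights=weights, k=count)

rng = random.Random(42)  # fixed seed so the sample is repeatable
sample = pick_actions(rng, 10_000)
counts = Counter(sample)
print(counts.most_common())
```

With these weights roughly half of the draws should be clicks, which is the kind of knob that keeps a random tester from spending most of its time scrolling.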

      • theptip 6 hours ago
        For re-running, I assume you want to do this all on a review app with a snapshot of the DB, so you start with a clean app state.

        Should be pretty easy to make it deterministic if you follow that precondition.

        (How I had my review apps wired up was I dumped the staging DB nightly and containerized it, I believe Neon etc make it easy to do this kind of thing.)

        Ages ago I wired up something much more basic than this for a Python API using hypothesis, and made the state machine explicit as part of the action generator (with the transitions library), what do you think about modeling state machines in your tests? (I suppose one risk is you don’t want to copy the state machine implementation from inside the app, but a nice fluent builder for simple state machines in tests could be a win.)

        • owickstrom 6 hours ago
          That's true, clean app state gets you far. And that's something I'm going to add to Bombadil once it gets an ability to run many tests (broad exploration, reruns, shrinking), i.e. something in the spec where you can supply reset hooks, maybe just bash commands.

          Regarding state machines: yeah, it can often become an equally complex mirror of the system you're testing, if the system has a large complicated surface. If on the other hand the API is simple and encapsulates a lot of complexity (like Ousterhout's "Deep Modules") state machine specs and model-based testing make more sense. Testing a key-value store is a great example of this.

          If you're curious about it, here's a very detailed spec for TodoMVC in Bombadil: https://github.com/owickstrom/bombadil-playground/blob/maste... It's still work-in-progress but pretty close to the original Quickstrom-flavored spec.

    • danbruc 7 hours ago
      How effective is property based testing in practice? I would assume it has no trouble uncovering things like missing null checks or an inverted condition because you can cover edge cases like null, -1, 0, 1, 2^n - 1 with relatively few test cases and exhaustively test booleans. But beyond that, if I have a handful of integers, dates, or strings, then the state space is just enormous and it seems all but impossible to me that blindly trying random inputs will ever find any interesting input. If I have a condition like (state == "disallowed") or (limit == 4096) when it should have been 4095, what are the odds that a random input will ever pass this condition and test the code behind it?

      Microsoft had a remotely similar tool named Pex [1], but instead of randomly generating inputs, it instrumented the code so that it could also be executed symbolically and then used their Z3 theorem prover to systematically find inputs that make each encountered condition either true or false, and with that incrementally explored all possible execution paths. If I remember correctly, it then generated a unit test for each discovered input with the corresponding output and you could then judge if the output is what you expected.

      [1] https://www.microsoft.com/en-us/research/publication/pex-whi...
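
The odds question in the comment above can be made concrete with a little arithmetic. The sketch below assumes a single 32-bit integer input drawn uniformly and computes the chance of ever hitting one specific value such as 4096. It then shows one common mitigation, a generator biased toward values next to powers of two. The function names and the 50/50 bias are assumptions for illustration, not the behaviour of any particular library.

```python
import random

def hit_probability(domain_size, trials):
    """Chance that at least one of `trials` uniform draws equals one fixed value."""
    return 1.0 - (1.0 - 1.0 / domain_size) ** trials

# Uniform draws over all 32-bit integers: a million trials rarely hit 4096.
p_uniform = hit_probability(2**32, 1_000_000)
print(f"uniform, 1e6 trials: {p_uniform:.6f}")

def boundary_biased_int(rng, bits=32):
    """Half the time return 2**e - 1, 2**e or 2**e + 1; otherwise a uniform value."""
    if rng.random() < 0.5:
        exponent = rng.randrange(bits)
        return 2**exponent + rng.choice([-1, 0, 1])
    return rng.randrange(2**bits)

rng = random.Random(0)
draws = [boundary_biased_int(rng) for _ in range(10_000)]
hits = draws.count(4096)
print(f"biased generator hit 4096 {hits} times in 10,000 draws")
```

Biasing helps with numeric boundaries like 4095 versus 4096. It does nothing for a condition like state == "disallowed" unless the generator is told which strings are meaningful in the domain.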

      • IanCal 2 hours ago
        In practice I’ve found that property based testing has a very high ratio of value per effort of test written.

        Ui tests like:

        * if there is one or more items on the page one has focus

        * if there is more than one then hitting tab changes focus

        * if there is at least one, focusing on element x, hitting tab n times and then shift tab n times puts me back on the original element

        * if there are n elements, n>0, hitting tab n times visits n unique elements

        Are pretty clear and yet cover a remarkable range of issues. I had these for a UI library, which came with the start of “given a UI built with arbitrary calls to the API, those things remain true”

        Now it’s rare it’d catch very specific edge cases, but it was hard to write something wrong accidentally and still pass the tests. They actually found a bug in the specification which was inconsistent.

        I think they often can be easier to write than specific tests and clearer to read because they say what you actually are testing (a generic property, but you had to write a few explicit examples).

        What you could add though is code coverage. If you don’t go through your extremely specific branch that’s a sign there may be a bug hiding there.
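
The four focus properties listed in the comment above can be checked against a toy model. This is a sketch under stated assumptions: `FocusRing` is a hypothetical stand-in for a real UI, `SkippingRing` is a deliberately broken variant, and the random inputs come from the standard library rather than a property-testing framework.

```python
import random

class FocusRing:
    """Toy model of a page with n focusable elements visited in tab order."""
    def __init__(self, n):
        self.n = n
        self.focused = 0 if n > 0 else None

    def tab(self):
        if self.n > 0:
            self.focused = (self.focused + 1) % self.n

    def shift_tab(self):
        if self.n > 0:
            self.focused = (self.focused - 1) % self.n

class SkippingRing(FocusRing):
    """Deliberately broken: Tab never reaches the last element."""
    def tab(self):
        if self.n > 0:
            self.focused = (self.focused + 1) % max(self.n - 1, 1)

def check_focus_properties(make_ring, rng, runs=200):
    """Check the four properties on `runs` randomly sized rings."""
    for _ in range(runs):
        n = rng.randint(1, 20)
        ring = make_ring(n)
        for _ in range(rng.randrange(n)):  # move to an arbitrary starting element
            ring.tab()
        start = ring.focused
        assert start is not None, "one element must have focus"
        if n > 1:
            ring.tab()
            assert ring.focused != start, "Tab must change focus"
            ring.shift_tab()
        k = rng.randint(0, 50)
        for _ in range(k):
            ring.tab()
        for _ in range(k):
            ring.shift_tab()
        assert ring.focused == start, "n Tabs then n Shift+Tabs must return to start"
        visited = set()
        for _ in range(n):
            ring.tab()
            visited.add(ring.focused)
        assert len(visited) == n, "n Tabs must visit n unique elements"
    return runs

print(check_focus_properties(FocusRing, random.Random(0)))
```

Running the same check with `SkippingRing` raises an `AssertionError`, which illustrates the point that it is hard to write something wrong by accident and still pass.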

      • spooneybarger 6 hours ago
        An important step with property based testing and similar techniques is writing your own generators for your domain objects. I have used it to incredible effect for many years in projects.

        I work at Antithesis now so you can take that with a grain of salt, but everything changed for me over a decade ago when I started applying PBT techniques broadly and widely. I have found so many bugs that I wouldn't have otherwise found until production.

      • kqr 5 hours ago
        "Exhaustively covering the search space" or "hitting specific edge cases" is the wrong way to think about property tests, in my experience. I find them most valuable as insanity checks, i.e. they can verify that basic invariants hold under conditions even I wouldn't think of testing manually. I'd check for empty strings, short strings, long strings, strings without spaces, strings with spaces, strings with weird characters, etc. But I might not think of testing with a string that's only spaces. The generator will.
      • kwillets 5 hours ago
        One of the founders of Antithesis gave a talk about this problem last week; diversity in test cases is definitely an issue they're trying to tackle. The example he gave was Spanner tests not filling its cache due to jittering near zero under random inputs. Not doing that appears to be a company goal.

        https://github.com/papers-we-love/san-francisco/blob/master/...

        • wwilson 4 hours ago
          Glad you enjoyed the talk! Making Bombadil able to take advantage of the intelligence in the Antithesis platform is definitely a goal, but we wanted to get a great open source tool into peoples’ hands ASAP first.
      • skybrian 7 hours ago
        One thing you can find pretty quickly with just basic fuzzing on strings is Unicode-related bugs.
  • inaseer 1 hour ago
    Bombadil takes a fresh approach to UI testing I haven't seen before: online monitoring through LTL formulas. Unlike model-checking (say by TLC), LTL formulas over here unfold in lock-step with the UI and allow users to express interesting temporal properties during testing.

    The other intriguing aspect was how state is modeled (or rather, maybe not explicitly modeled?). A lot of the examples show the state extracted from the DOM and temporal properties indicating what the next (or eventual) state _should be_. If we want to look at the existing state (according to the model/spec) when predicting the next state (similar to how you can use s when specifying s' in TLA+), there seems to be no direct way to do that. It should of course be possible to capture the state at an earlier time in a closure and use it in a thunk at a later point, so it should be possible to work around this but that can be a little awkward, maybe. I'm working on a project in this space (primarily geared towards backend API model-based testing) and the state of the _real_ system isn't globally inspectable, unlike a web page, so I took a different route there. Having said that, this is a very interesting design choice that's very intriguing (in a good way).

    • owickstrom 48 minutes ago
      Yes, and in fact, capturing "prior" values with bindings and closing over them in temporal operator thunks is how you talk about some relation between s and s' in a Bombadil formula (not having that particular syntax though). It's a deliberate way of embedding this LTL flavor in JS/TS in the most natural and ergonomic way I could think of. I didn't want a deep EDSL or even a new bespoke spec language that people (and LLMs) would have to learn, and to have to write tools for. Now you can write Bombadil specs with a good LSP and be able to import packages off of npm or whathaveyou. Most web devs will probably be comfy with JS or TS, so that's why I chose that style.

      I hope that makes sense?

      Thanks for the nice feedback!
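
The idea of binding a prior value and closing over it inside a temporal operator thunk can be sketched outside Bombadil. The following is an assumption-heavy illustration in plain Python: `always` and `next_state` are hypothetical helper names, the formula is evaluated offline over a finished list of states rather than monitored online, and none of it is Bombadil's API.

```python
def always(formula_at):
    """formula_at(state) builds the formula to check from that state onward."""
    def check(trace, i=0):
        return all(formula_at(trace[j])(trace, j) for j in range(i, len(trace)))
    return check

def next_state(pred):
    """Holds at position i if pred holds at i + 1 (vacuously true at the end of the trace)."""
    def check(trace, i):
        return i + 1 >= len(trace) or pred(trace[i + 1])
    return check

# `s` is captured by the inner lambda, which plays the role of s' relating to s.
counter_never_decreases = always(
    lambda s: next_state(lambda s_next: s_next["count"] >= s["count"])
)

good_trace = [{"count": 0}, {"count": 1}, {"count": 1}, {"count": 3}]
bad_trace = [{"count": 0}, {"count": 2}, {"count": 1}]

print(counter_never_decreases(good_trace))  # True
print(counter_never_decreases(bad_trace))   # False
```

The outer lambda receives the current state and the inner one receives the following state, so the relation between the two is expressed with ordinary closures and no extra syntax.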

      • inaseer 13 minutes ago
        Yes, I understand why you made these design decisions. And I also agree that sticking to JS/TS keeps things simple (for humans, and LLMs). I generally default to the s and s' way of specifying things (in a C# property-based testing framework I'm working on) but looking at how you approached things here gives me another angle to think about.

        Good work!

        • owickstrom 7 minutes ago
          Cool. Is this publicly available?
  • warpspin 8 hours ago
    I especially like that it's a single executable according to the docs.

    Recently evaluated other testing tools/frameworks and if you're not already running the npm-dependencyhell-shitshow for your projects, most tools will pull in at least 100 dependencies.

    I might be old fashioned but that's just too much for my taste. I love single-use tools with limited scope like e.g. esbuild or now this.

    Will give this a try, soon.

    • owickstrom 7 hours ago
      Glad you noticed! I've been putting quite some energy into keeping things this way. VERY worth it, IMO.
  • thibran 7 hours ago
    I've been doing property-based testing for frontend stuff for years. The hardest part is that there is so much between the test inputs and the application under test that, 50% of the time, the problems I find are in the frontend test frameworks/libs and not in our code.
    • nz 6 hours ago
      And sometimes you find errors in code that absolutely should never have errors: I found an (as of yet not-root-caused) error in sqlite (no crash or coredump, just returns the wrong data, and only when using sqlite in ram-only-mode). Had to move to postgres for that reason alone. This is part of the reason why I have a strong anti-library bias (and I sound like a lunatic to most colleagues because they "have never had a problem" with $favorite_library -- to which my response is: "how do you _know_?"[0], which often makes me sound like I'm being unreasonably difficult).

      Sometimes, the only thing you can do is let the plague spread, and hope that the people who survive start showering and washing their hands.

      [0]: I once interviewed at a company that sold a kind of on-prem VM hosting and storage product. They were shipping a physical machine with Linux and a custom filesystem (so not ZFS), and they bragged about how their filesystem was very fast, much faster than ZFS or Btrfs on SSDs. I asked them, if they were allowed to tell me how they achieved such consistent numbers. They listed a few things, one of which was: "we disabled block-level check-summing". I asked: "how do you prevent corruption?". They said: "we only enable check-summing during the nightly tests". So, a little unsettled, I asked: "you do not do _any_ check-summing at any point in production"? They replied: "Exactly. It's not necessary". So, throwing caution to the wind (at this point I did not care to get the job), I asked: "And you've never had data corruption in production"? They said: "Never. None". To which I replied: "But how do you _know_"? My no-longer-future-coworker thought for a few seconds, and realization flashed across his face. This was a company that had actual customers on 2 continents, and was pulling in at least millions per year. They were probably silently corrupting customer data, while promising that they were the solution -- a hospital selling snake-oil, while thinking it really is medicine.

      • sealeck 2 hours ago
        > I found an (as of yet not-root-caused) error in sqlite (no crash or coredump, just returns the wrong data, and only when using sqlite in ram-only-mode).

        You should report this to the SQLite developers - they are very smart and very interested in fixing SQLite correctness bugs!

    • terpimost 7 hours ago
      Are you talking about user flows and multiple interactions that are happening and data exchange that PBT before that wasn't able to address?
      • thibran 7 hours ago
        PBT allows us to test more combinations without writing hundreds of tests. Yes, it's about user flow inside a single module of our gigantic application.
    • owickstrom 7 hours ago
      Interesting. What kind of properties are you checking?
      • thibran 7 hours ago
        I use quicktheories (Java) and generate a deterministic random test scenario, then I generate input values and run the tests. This way I can create tests that should fail or succeed, but differ in the steps executed and in the order with "random input".
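
A deterministic random scenario of the kind described above usually comes down to deriving everything from one seed. A minimal sketch, assuming hypothetical step names and using Python's standard library rather than QuickTheories:

```python
import random

# Hypothetical user-flow steps, for illustration only.
STEPS = ["open_form", "fill_name", "fill_email", "submit", "cancel"]

def generate_scenario(seed, length=8):
    """Build a list of (step, input value) pairs that depends only on the seed."""
    rng = random.Random(seed)
    return [(rng.choice(STEPS), rng.randint(0, 999)) for _ in range(length)]

first = generate_scenario(seed=1)
again = generate_scenario(seed=1)
other = generate_scenario(seed=2)

print(first == again)  # the same seed reproduces the same scenario
print(first == other)  # a different seed gives a different scenario
```

Logging the seed with each failure is what makes the scenario replayable later, provided the application under test starts from the same state.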
        • owickstrom 6 hours ago
          OK. What kind of problems do you hit from third-party libraries with that?
          • thibran 6 hours ago
            Escaping problems or wrong handling of non-visible characters.
  • owickstrom 8 hours ago
    Author here, happy to answer questions about Bombadil! :)
    • degenerate 6 hours ago
      From a project management perspective, the 5 examples don't help me understand how/why I might switch from Playwright/Cypress to this framework. It seems like Bombadil is a much lower-level test framework focusing on DOM properties but in the "Why Bombadil?" introduction you say "maintaining suites of Playwright or Cypress tests takes a lot of work" ... I'd like if there was an example showing how this is true, perhaps a 1:1 example of Playwright vs Bombadil for testing something such as notifications clearing when I click clear. Basically, beefing up examples with real-world ones that Playwright users might have written is a good way to foster adoption.
      • owickstrom 6 hours ago
        This is a great point. Bombadil _is_ also tied to the DOM much like those tools, but as you focus on providing just a set of generators (which can be largely the defaults already supplied by Bombadil), you get a lot of testing from a small spec. You might need to specify some parts in terms of DOM selectors and such, and that has coupling, but I think the power-to-weight ratio is a lot better because of PBT.
    • javierhonduco 4 hours ago
      Super cool project! Curious on how this compares to Meticulous in terms of functionality and approach. Thanks!
    • bombcar 7 hours ago
      All I can think of is "the Token Ring had no power over him" but then I realized that "token ring" has a completely different meaning now in the age of AI.

      Nice name, now who is he?

  • logicprog 6 hours ago
    This looks genuinely awesome! I've been thinking about how to do good property based testing on UIs, and this elegantly solves that problem — I love the language they've designed here. It really feels like model checking or something.
    • owickstrom 6 hours ago
      Thanks, glad you like it! Do you mean the temporal logic aspect of it?
      • logicprog 6 hours ago
        Yeah, the temporal logic aspect of it is exactly what I was referring to :)
        • owickstrom 5 hours ago
          Cool. I think that is a very neat way of expressing properties of UIs (and stateful systems more generally) that works out nicely in testing. There are some gotchas related to the finiteness of testing, but it's manageable.
  • jryio 7 hours ago
    Hey Oskar ~ great project and looks promising. I would be curious to hear what is still work-in-progress for Bombadil.

    It's helpful to know what the tool maintainers see as upcoming or incomplete work. It also saves a consultant like me a lot of time to evaluate new tools for clients if I also know the limitations before diving in. Maybe a section in the manual for "What Bombadil can't do".

    Great work!

    • owickstrom 6 hours ago
      Good feedback! Short answer: a lot of stuff is remaining. It's a very new project and I've been trying to cover the basics. There's a ton to do around better state space exploration, reporting/debugging (working on this now!), integration with other tools and platforms like CI, etc. But a living section in the README or the Manual for "planned but not yet built" probably makes sense.
  • elcapitan 7 hours ago
    "Bombadil" means that I'll probably skip most of these tests.
  • xaviervn 6 hours ago
    The major breakthrough here is that they managed to write a project in Rust and not mention it on the headline!

    Jokes aside, great project and documentation (manual)! Getting started was really simple, and I quickly understood what it can and cannot do.

    • owickstrom 6 hours ago
      :) Thank you for trying it!
  • rienbdj 4 hours ago
    Congrats to the team!

    Is there a video showing someone spinning this up and finding a bug in a simple app?

    A broken counter app maybe?

  • tom-blk 6 hours ago
    Very cool stuff, will apply this to my next project
    • owickstrom 6 hours ago
      Let me know what bugs you find (in Bombadil itself or your next project!)
  • picardo 7 hours ago
    For most static UI surfaces, I probably wouldn't use it, but I can see a use case in this for testing generative UI workloads.
    • owickstrom 6 hours ago
      Sure, server-side or client-side generated UIs tend to have a lot more interesting complexity to test. But I do want to bring up that with the specification language being TypeScript, you can validate some basic properties even for statically generated sites. I wrote a spell checker for Bombadil that uses https://github.com/wooorm/nspell and a custom dictionary and found a bunch of old typos on my own blog.
  • wittjeff 3 hours ago
    From your title my immediate thought was "cool, maybe this will move us a bit closer to making components (or the testing thereof) cover accessibility thoroughly by default".

    See, the idea with the semantic web, and the ARIA markup equivalents, is that things should have names, roles, values, and states. Devs frequently mess up role/keyboard interaction agreement (example: role=menu means a list will drop on Enter keypress, and arrow keys will change the active element), and with ensuring that state info (aria-expanded, aria-invalid, etc.) is updated when it should be.

    Then I checked the Antithesis website. They don't even have focus state styling on any of the interactive elements. sigh

    • terpimost 2 hours ago
      Hey, sorry for antithesis.com. Thank you for noticing this!

      The bug is subtle: we do have styles for that, but in some change, which I probably made, we started using a CSS variable that isn't there...

      We will fix that soon.

  • sequoia 7 hours ago
    Struggling to understand what this is or how it works.
    • owickstrom 5 hours ago
      Did you have a look at the intro in the manual? https://antithesishq.github.io/bombadil/1-introduction.html

      If that's not clear, please let me know how we can improve it!

      • maxwg 30 minutes ago
        > Bombadil itself decides what is an interesting event and when to capture state.

        It would be very nice to have a little more detail on how this works. Though I guess we can figure it out via code/trialing it out.

        I'll give it a go later today!

        • owickstrom 23 minutes ago
          Go for it! I've been meaning to do an "architecture of Bombadil" blog post that'd likely answer this question. It's not super advanced by any means, but it's a mindset shift to how you might think about browser testing coming from the mainstream frameworks like Playwright.
      • darkvertex 3 hours ago
        A hello-world example of what it looks like, and what running it would validate, would be nice.

        It's all very abstractly described in the README and that intro page you linked.

        • owickstrom 2 hours ago
          Fair enough! I'm probably doing a demo video that'd help with this. Maybe can condense that in text form into the readme as well.
  • css_apologist 6 hours ago
    very cool! does this work? can you describe the kinds of real bugs you've caught with this?
    • owickstrom 6 hours ago
      I'd say it does! Bombadil is very new but its predecessor has found complicated and very real bugs in my work projects, and in the paper we wrote about it, we found bugs in more than half of the TodoMVC example apps: https://arxiv.org/abs/2203.11532
  • orliesaurus 8 hours ago
    Bombadillo Crocodillo

    Ok I will see myself out

    (Yes I know it's actually from the Tolkien book)

  • Darmani 2 hours ago
    Timely! I spent a good chunk of yesterday investigating Bombadil. I'm a huge fan of PBT and a huge fan of temporal logic (though not linear temporal logic) and we're currently trying to level up our QA, so it seemed like a good match.

    Unfortunately, I concluded that Bombadil is a toy. Not as in "Does some very nice things, but missing the features for enterprise adoption." I mean that in a very strong sense, as in: I could not figure out how to do anything real with it.

    The place where this is most obvious is in its action generator type. You give it a function that generates actions, where each action is drawn from a small menu of choices like "scroll up this many pixels" or "click on this pixel." You cannot add new types of actions. If you want to click on a button, you need it to generate a sequence of actions to first scroll down enough, and then look up a pixel that's on that button.

    Except that it selects actions randomly from your list, so you somehow need the action generator, when run the first time, to generate only the scroll action, and then have it, when run the second time, generate only the click action. If you are silly enough to have an action generator that, you know, actually gives a list of reasonable actions, you'll get a tester that spends all its time going in circles clicking on the nav bar.

    (Something in the docs claimed that actions are weighted, but I found nothing about an actual affordance for doing that. Having weights would make this go from basically impossible to somewhat impossible.)

    (Edit: Found the weighted thing.)

    I am terrified to imagine how to get Bombadil to fill out a form. At least that seems possible -- you can inspect the state of the web page to figure out what's already been filled out. But if you want the state to be based on anything not on the current page, like the action that you took on the previous page, or gasp the state on disk (as for an Electron app), that seems completely impossible. Action generators are based on the current state, and the state must be a pure function of the web page.

    Its temporal logic has a cool time-bounded construct, but it's missing the U (until) operator. One of their few examples is "If you show the user a notification, it disappears within 5 seconds." But I want to say "When you click the Generate button, it says 'Generating...' up until it's finished generating." And I can't.

    (Note: everything above is according to the docs. Hopefully everything I said is a limitation of the docs, not an actual limitation of the framework.)

    I shared my comments with the author yesterday on LinkedIn, but he hasn't responded yet. Maybe I'll hear from him here.

    I have a pretty positive opinion of Antithesis as a company and they seem to be investing seriously in it, and generally see it as a strong positive sign when someone knows what temporal logic is, so I have hopes for this framework. I am nonetheless disappointed that I can't use it now, especially because I was supposed to finish an internal GUI testing tool this week and my god I'm behind.
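
The "Generating..." property from the comment above is what the U (until) operator expresses. Here is a minimal sketch of strong until evaluated over a finite recorded trace. The state shape (`label`, `done`) is hypothetical and the code says nothing about how Bombadil would implement the operator.

```python
def holds_until(trace, p, q):
    """Strong until on a finite trace: q must hold at some state, and p at every state before it."""
    for state in trace:
        if q(state):
            return True
        if not p(state):
            return False
    return False  # the trace ended before q ever held

def shows_generating(state):
    return state["label"] == "Generating..."

def is_done(state):
    return state["done"]

busy = {"label": "Generating...", "done": False}
idle = {"label": "Generate", "done": False}
finished = {"label": "Generate", "done": True}

print(holds_until([busy, busy, finished], shows_generating, is_done))  # True
print(holds_until([busy, idle, finished], shows_generating, is_done))  # False
print(holds_until([busy, busy], shows_generating, is_done))            # False
```

The last case shows a finiteness gotcha: when a test run stops before generation finishes, the monitor has to decide whether that counts as a failure. This sketch treats it as one.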

    • owickstrom 59 minutes ago
      Hey, yeah, so actions are indeed independent and you can't express fixed sequences of them (e.g. fill out this form by doing X, then Y, then Z). If you want to enforce certain sequentiality you'd need to precondition your generators. This is admittedly limited right now. I do want to add QoL things like `filter` on the generators to make that more ergonomic.

      Also, as you note, you can't implement custom actions. And that's just something that I haven't gotten to yet. It'd be quite straightforward to add a plain function (toStringed) as one of the variants of the Action type and send that to the browser for eval. (Btw, we're taking contributions!)

      Weighting is also important right now. There's no smart exploration in Bombadil yet, only blind random, so manual weighting becomes pretty crucial to have more effective tests (i.e. unlikely to run in circles). I'd like to both make the Bombadil "fuzzer" better at this, but eventually you might want to run Bombadil inside Antithesis instead to get a much better exploration, even in a single campaign.

      The Until operator is also one of those things that I haven't gotten to yet. I actually didn't expect someone to hit this as a major limitation of the tool so quickly, which is why I've focused on other things. Surprising!

      To add some more context, Bombadil is an OSS project with one developer (but we're hiring!) and it's 4 months old and marked experimental. I'm sorry you were disappointed by its current state, but it's very early days, so expect a lot of these things to improve. And your feedback will be taken into account. Thanks!
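
Preconditioning generators, as mentioned in the reply above, means each action is only offered when the current state allows it, so ordering emerges without scripting a fixed sequence. A sketch under assumptions: the state fields, action names, and weights are all hypothetical, and the page is simulated by a small Python function instead of a browser.

```python
import random

def enabled_actions(state):
    """Return (action, weight) pairs whose preconditions hold in this state."""
    actions = [("reload", 1)]  # always possible
    if not state["button_visible"]:
        actions.append(("scroll_down", 3))
    if state["button_visible"] and not state["submitted"]:
        actions.append(("click_submit", 3))
    return actions

def apply(state, action):
    """Simulated page: compute the state that follows an action."""
    if action == "reload":
        return {"button_visible": False, "submitted": False}
    new_state = dict(state)
    if action == "scroll_down":
        new_state["button_visible"] = True
    elif action == "click_submit":
        new_state["submitted"] = True
    return new_state

def run(rng, steps):
    """Take `steps` random actions, choosing only among the enabled ones."""
    state = {"button_visible": False, "submitted": False}
    log = []
    for _ in range(steps):
        options = enabled_actions(state)
        names = [name for name, _ in options]
        weights = [weight for _, weight in options]
        action = rng.choices(names, weights=weights, k=1)[0]
        log.append(action)
        state = apply(state, action)
    return log

log = run(random.Random(7), 200)
print(log[:10])
```

Because `click_submit` is only enabled once the button is visible, every click in the log directly follows a `scroll_down`, even though no sequence was ever written down.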

  • NoraCodes 7 hours ago
    My kingdom for a way to stop this godforsaken industry from stripping Tolkien's fiction for parts.
    • magicalhippo 32 minutes ago
      Indeed. Why not follow Douglas Adams' example and use obscure place names instead[1].

      [1]: https://en.wikipedia.org/wiki/The_Meaning_of_Liff

    • nosman 6 hours ago
      The bad guys are using Tolkien references. The good guys are probably using Hitchhiker's Guide references...
    • jkestner 7 hours ago
      Let's start naming things after Iain Banks ships.
      • patapong 7 hours ago
        I am in support. In general he was really good with names I thought, they always had an otherworldly flair while being clear to pronounce. Skaffen-Amtiskaw, Anaplian, Elethiomel...
        • chrisweekly 6 hours ago
          Yes!

          "Just Another Victim Of The Ambient Morality" is one of my favorites.

      • nozzlegear 5 hours ago
        Elon Musk has already had a head start on that idea for over 10 years, unfortunately:

        https://scifi.stackexchange.com/q/81164

        • rienbdj 4 hours ago
          Weird - Iain M. Banks would’ve found Musk’s politics abhorrent
    • akshayshah 6 hours ago
      To be fair, the name was a joking response to https://x.com/patrickc/status/2015562569105465347…and then the temporary skunkworks name stuck, as it always does.
    • bevr1337 2 hours ago
      The AI generated illustration and satire of the poem hurt my soul. Tom doesn’t even have pupils! This output is low quality even for AI generated content.
    • pythonaut_16 7 hours ago
      Makes me want to name a project or company Sauron in response.
      • terpimost 6 hours ago
        The Eye of Sauron for some Observability tool
        • dymk 6 hours ago
          Palantir
      • ffsm8 6 hours ago
        I've worked at a company that had a team call themselves Sauron before

        So occasionally I got mails by "some colleague on behalf of Sauron" back then

    • threatofrain 6 hours ago
      All the companies you’re thinking of are likely paying the Tolkien estate a very fat fee. This repo will likely have to rename.
      • NoraCodes 5 hours ago
        Yeah. Unfortunately the people running the estate today appear to share ~none of the late Professor's values.
    • testdelacc1 5 hours ago
      I do appreciate you quoting Shakespeare.
    • eclectician 7 hours ago
      We can go strip Shakespeare instead.
      • el_nahual 5 hours ago
        (For those who don't get it, "My kingdom for ___" is from Richard III: "my horse, my horse, my kingdom for a horse")
    • jzelinskie 7 hours ago
      I'm just waiting for them to exhaust LotR and move on to Roverandom
    • sifar 6 hours ago
      No, not bombadil. I'll donate mine too. :)
    • paulnpace 6 hours ago
      Bad actors use Tolkien. Good actors use Orwell.