December 5, 2022

Live training: How to benchmark an existing site structure using Treejack

If you missed our live training, don’t worry, we’ve got you covered! In this session, our product experts Katie and Aidan discuss why, how and when to benchmark an existing structure using Treejack.

They also talk through some benchmarking use cases, demo how to compare tasks between different studies, and explain which results are most helpful.

Author: Sarah Flutey

Related articles


How to Spot and Destroy Evil Attractors in Your Tree (Part 1)

Usability guru Jared Spool has written extensively about the 'scent of information'. This term describes how users are always 'on the hunt' through a site, click by click, to find the content they’re looking for. Tree testing helps you deliver a strong scent by improving organisation (how you group your headings and subheadings) and labelling (what you call each of them).

Anyone who’s seen a spy film knows there are always false scents and red herrings to lead the hero astray. And anyone who’s run a few tree tests has probably seen the same thing: headings and labels that lure participants to the wrong answer. We call these 'evil attractors'.

In Part 1 of this article, we’ll look at what evil attractors are, how to spot them at the answer end of your tree, and how to fix them. In Part 2, we’ll look at how to spot them in the higher levels of your tree.

The false scent — what it looks like in practice

One of my favourite examples of an evil attractor comes from a tree test we ran for consumer.org.nz, a New Zealand consumer-review website (similar to Consumer Reports in the USA). Their site listed a wide range of consumer products in a tree several levels deep, and they wanted to try out a few ideas to make things easier to find as the site grew bigger.

We ran the tests and got some useful answers, but we also noticed there was one particular subheading (Home > Appliances > Personal) that got clicks from participants looking for very different things (mobile phones, vacuum cleaners, home-theatre systems, and so on):

pic1

The website intended the Personal appliance category to be for products like electric shavers and curling irons. But apparently, Personal meant many things to our participants: they also went there for 'personal' items like mobile phones and cordless drills that actually lived somewhere else.

This is the false scent: the heading that attracts clicks when it shouldn’t, leading participants astray. Hence this definition: an evil attractor is a heading that draws unwanted traffic across several unrelated tasks.

Evil attractors lead your users astray

Attracting clicks isn’t a bad thing in itself. After all, that’s what a good heading does — it attracts clicks for the content it contains (and discourages clicks for everything else). Evil attractors, on the other hand, attract clicks for things they shouldn’t. These attractors lure users down the wrong path, and when users find themselves in the wrong place they'll either back up and try elsewhere (if they’re patient) or give up (if they’re not). Because these attractor topics are magnets for the user’s attention, they make it less likely that your user will get to the place you intended. The other evil part of these attractors is the way they hide in the shadows. Most of the time, they don’t get the lion’s share of traffic for a given task. Instead, they’ll poach 5–10% of the responses, luring away a fraction of users who might otherwise have found the right answer.

Find evil attractors easily in your data

The easiest attractors to spot are those at the answer end of your tree (where participants ended up for each task). If we can look across tasks for similar wrong answers, then we can see which of these might be evil attractors.

In your Treejack results, the Destinations tab lets you do just that. Here’s more of the consumer.org.nz example:

Pic2

Normally, when you look at this view, you’re looking down a column for big hits and misses for a specific task. To look for evil attractors, however, you’re looking for patterns across rows. In other words, you’re looking horizontally, not vertically. If we do that here, we immediately notice the row for Personal (highlighted yellow). See all those hits along the row? Those hits indicate an attractor — steady traffic across many tasks that seem to have little in common. But remember, traffic alone is not enough. We’re looking for unwanted traffic across unrelated tasks. Do we see that here? Well, it looks like the tasks (about cameras, drills, laptops, vacuums, and so on) are not that closely related. We wouldn’t expect users to go to the same topic for each of these. And the answer they chose, Personal, certainly doesn’t seem to be the destination we intended. While we could rationalise why they chose this answer, it is definitely unwanted from an IA perspective. So yes, in this case, we seem to have caught an evil attractor red-handed. Here’s a heading that’s getting steady traffic where it shouldn’t.
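If you export your destination data, the row-wise scan described above can be roughed out in a few lines of Python. This is only a sketch under assumptions: the data layout, the function name `find_evil_attractors`, the sample numbers, and the thresholds (a 5% share, borrowed from the 5–10% range mentioned earlier, across at least three tasks) are all made up for illustration and are not a Treejack feature or export format.

```python
from typing import Dict, List, Set


def find_evil_attractors(
    destinations: Dict[str, Dict[str, float]],
    correct: Dict[str, Set[str]],
    min_share: float = 0.05,  # assumed threshold: 5% of a task's responses
    min_tasks: int = 3,       # assumed threshold: unwanted hits in 3+ tasks
) -> List[str]:
    """Flag headings that draw unwanted clicks across several tasks."""
    flagged = []
    for heading, row in destinations.items():
        # Read along the row: one share of responses per task.
        unwanted = [
            task
            for task, share in row.items()
            if share >= min_share and heading not in correct.get(task, set())
        ]
        if len(unwanted) >= min_tasks:
            flagged.append(heading)
    return flagged


# Hypothetical data: heading -> {task: share of participants who ended there}
destinations = {
    "Personal": {"camera": 0.08, "drill": 0.06, "laptop": 0.10, "vacuum": 0.05},
    "Cameras": {"camera": 0.80},
    "Power tools": {"drill": 0.70, "vacuum": 0.02},
}
# Hypothetical correct answers per task
correct = {
    "camera": {"Cameras"},
    "drill": {"Power tools"},
    "laptop": {"Computers"},
    "vacuum": {"Cleaning"},
}

print(find_evil_attractors(destinations, correct))
```

A script like this only finds steady unwanted traffic. Deciding whether the tasks are genuinely unrelated is still a judgement call for you to make.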

Evil attractors are usually the result of ambiguity

It’s usually quite simple to figure out why an item in your tree is an evil attractor. In almost all cases, it’s because the item is vague or ambiguous: a word or phrase that could mean different things to different people. Look at our example above. In the context of a consumer-review site, Personal is too general to be a good heading. It could mean products you wear, or carry, or use in the bathroom, or a number of things. So, when those participants come along clutching a task, and they see Personal, a few of them think 'That looks like it might be what I’m looking for', and they go that way.

Individually, those choices may be defensible, but as an information architect, are you really going to group mobile phones with vacuum cleaners? The 'personal' link between them is tenuous at best.

Destroy evil attractors by being specific

Just as it’s easy to see why most attractors attract, it’s usually easy to fix them. Evil attractors trade in vagueness and ambiguity, so the obvious remedy is to make those headings more concrete and specific. In the consumer-site example, we looked at the actual content under the Personal heading. It turned out to be items like shavers, curling irons, and hair dryers. A quick discussion yielded Personal Care as a promising replacement, one that should deter people looking for mobile phones and jewellery and the like.

In the second round of tree testing, among the other changes we made to the tree, we replaced Personal with Personal Care. A few days later, the results confirmed our thinking. Our former evil attractor was no longer luring participants away from the correct answers:

Pic3

Testing once is good, testing twice is magic

This brings up a final point about tree testing (and about any kind of user testing, really): you need to iterate your testing. Once is not enough.

The first round of testing shows you where your tree is doing well (yay!) and where it needs more work so you can make some thoughtful revisions. Be careful though. Even if the problems you found seem to have obvious solutions, you still need to make sure your revisions actually work for users, and don’t cause further problems. The good news is, it’s dead easy to run a second test, because it’s just a small revision of the first. You already have the tasks and all the other bits worked out, so it’s just a matter of making a copy in Treejack, pasting in your revised tree, and hooking up the correct answers. In an hour or two, you’re ready to pilot it again (to err is human, remember) and send it off to a fresh batch of participants.

Two possible outcomes await.

  • Your fixes are spot-on, the participants find the correct answers more frequently and easily, and your overall score climbs. You could have skipped this second test, but confirming that your changes worked is both good practice and a good feeling. It’s also something concrete to show your boss.
  • Some of your fixes didn’t work, or (given the tangled nature of IA work) they worked for the problems you saw in Round 1, but now they’ve caused more problems of their own. Bad news, for sure. But better that you uncover them now in the design phase (when it takes a few days to revise and re-test) instead of further down the track when the IA has been signed off and changes become painful.

Stay tuned for more on evil attractors

In Part 1, we’ve covered what evil attractors are and how to spot them at the answer end of your tree: that is, evil attractors that participants chose as their destination when performing tasks. Hopefully, a future version of Treejack will be able to highlight these attractors to make your analysis that much easier.

In Part 2, we’ll look at how to spot evil attractors in the intermediate levels of your tree, where they lure participants into a section of the site that you didn’t intend. These are harder to spot, but we’ll see if we can ferret them out.

Let us know if you've caught any evil attractors red-handed in your projects.


"Could I A/B test two content structures with tree testing?!"

"Dear Optimal Workshop
I have two huge content structures I would like to A/B test. Do you think Treejack would be appropriate?"
— Mike

Hi Mike (and excellent question)!

Firstly, yes, Treejack is great for testing more than one content structure. It’s easy to run two separate Treejack studies — even more than two. It’ll help you decide which structure you and your team should run with, and it won’t take you long to set them up.

When you’re creating the two tree tests with your two different content structures, include the same tasks in both tests. Using the same tasks will give an accurate measure of which structure performs best. I’ve done it before and I found that the visual presentation of the results — especially the detailed path analysis pietrees — made it really easy to compare Test A with Test B.

Plus (and this is a big plus), if you need to convince stakeholders or teammates of which structure is the most effective, you can’t go past quantitative data, especially when it’s presented clearly. It’s hard to argue with hard evidence!

Here are two examples of the kinds of results visualizations you could compare in your A/B test. First, the pietree, which shows correct and incorrect paths, and where people ended up:

treejack pietree

And the overall Task result, which breaks down success and directness scores, and has plenty of information worth comparing between two tests:

treejack task result
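If you copy the per-task scores out of both studies, a side-by-side comparison is easy to sketch in Python. Everything here is an assumption made for illustration: the task names, the numbers, and the `compare_studies` helper are hypothetical, and the scores are entered by hand as whole percentages rather than read from any Treejack export.

```python
from typing import Dict, List, Tuple

# task -> (success %, directness %), typed in by hand from each study
TaskScores = Dict[str, Tuple[int, int]]


def compare_studies(a: TaskScores, b: TaskScores) -> List[Tuple[str, int, int]]:
    """Return (task, success change, directness change) as B minus A.

    Only tasks present in both studies are compared, which is why both
    tests should use the same tasks.
    """
    shared = sorted(set(a) & set(b))
    return [(t, b[t][0] - a[t][0], b[t][1] - a[t][1]) for t in shared]


# Hypothetical results for two content structures
test_a = {"Find a laptop review": (60, 50), "Compare vacuums": (75, 70)}
test_b = {"Find a laptop review": (80, 65), "Compare vacuums": (70, 70)}

for task, d_success, d_direct in compare_studies(test_a, test_b):
    print(f"{task}: success {d_success:+d} pts, directness {d_direct:+d} pts")
```

A positive number means structure B did better on that task. With small participant groups, treat differences of a few points as noise rather than a verdict.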

Keep in mind that running an A/B tree test will affect how you recruit participants — it may not be the best idea to have the same participants complete both tests in one go. But it’s an easy fix — you could either recruit two different groups from the same demographic, or test one group and have a gap (of at least a day) between the two tests.

I’ve one more quick question: why are your two content structures ‘huge’?

I understand that sometimes these things are unavoidable — you potentially work for a government organization, or a university, and you have to include all of the things. But if not, and if you haven’t already, you could run an open card sort to come up with another structure to test (think of it as an A/B/C test!), and to confirm that the categories you’re proposing work for people.

You could even run a closed card sort to establish which content is more important to people than others (your categories could go from ‘Very important’ to ‘Unimportant’, or ‘Use every day’ to ‘Never use’, for example). You might be able to make your content structure a bit smaller, and still keep its usefulness. Just a thought. And of course, you could try to get this information from your analytics (if available), but be cautious: analytics can only tell you what people did, not what they wanted to do.

All the best, Mike!


Collating your user testing notes

It’s been a long day. Scratch that - it’s been a long week! Admit it. You loved every second of it.

Twelve-hour days, the mad scramble to get the prototype ready in time, the stakeholders poking their heads in occasionally, dealing with no-show participants and the excitement around the opportunity to speak to real-life human beings about product or service XYZ. Your mind is exhausted but you are buzzing with ideas and processing what you just saw. You find yourself sitting in your war room with several pages of handwritten notes, and with your fellow observers you start popping open individually wrapped lollies left over from the day’s sessions. Someone starts a conversation around what their favourite flavour is and then the real fun begins. Sound familiar? Welcome to the post user testing debrief meeting.

How do you turn those scribbled notes and everything rushing through your mind into a meaningful picture of the user experience you just witnessed? And then when you have that picture, what do you do next? Pull up a bean bag, grab another handful of those lollies we feed our participants and get comfy, because I’m going to share my idiot-proof, step-by-step guide for turning your user testing notes into something useful.

Let’s talk

Get the ball rolling by holding a post session debrief meeting while it’s all still fresh in your collective minds. This can be done as one meeting at the end of the day’s testing or you could have multiple quick debriefs in between testing sessions. Choose whichever option works best for you, but keep in mind this needs to be done at least once and before everyone goes home and forgets everything. Get all observers and facilitators together in any meeting space that has a wall-like surface that you can stick post its to - you can even use a window! And make sure you use real post its - the fake ones fall off!

Mark your findings (Tagging)

Before you put sharpie to post it, it’s essential to agree as a group on how you will tag your observations. Tagging the observations now will make the analysis work much easier and help you to spot patterns and themes. Colour coding the post its is by far the simplest and most effective option and how you assign the colours is entirely up to you. You could have a different colour for each participant or testing session, you could have different colours to denote participant attributes that are relevant to your study (e.g. senior staff and junior staff), or you could use different colours to denote specific testing scenarios that were used. There are many ways you could carve this up and there’s no right or wrong way. Just choose the option that suits you and your team best because you’re the ones who have to look at it and understand it. If you only have one colour of post it (e.g. yellow), you could colour code the pens you use to write on the notes or include some kind of symbol to help you track them.

Processing the paper (Collating)

That pile of paper is not going to process itself! Your next job as a group is to work through the task of transposing your observations to post it notes. For now, just stick them to the wall in any old way that suits you. If you’re the organising type, you could group them by screen or testing scenario. The positioning will all change further down the process, so at this stage it’s important to just keep it simple. For issues that occur repeatedly across sessions, just write them down on their own post its - doubles will be useful to see further down the track.

In addition to holding debrief meetings, you also need to round up everything that was used to capture the testing session/s. And I mean EVERYTHING.

Handwritten notes, typed notes, video footage and any audio recordings need to be reviewed just in case something was missed. Any handwritten notes should be typed to assist you with the completion of the report. Don’t feel that you have to wait until the testing is completed before you start typing up your notes because you will find they pile up very quickly and if your handwriting is anything like mine… Well, let’s just say my short term memory is often required to pick up the slack and even that has its limits. Type them up in between sessions where possible and save each session as its own document. I’ll often use the testing questions or scenario based tasks to structure my typed notes and I find that makes it really easy to refer back to.

Now that you’ve processed all the observations, it’s time to start sorting them to surface behavioural patterns and make sense of it all.

Spotting patterns and themes through affinity diagramming

Affinity diagramming is a fantastic tool for making sense of user testing observations. In fact it’s just about my favourite way to make sense of any large mass of information. It’s an engaging and visual process that grows and evolves like a living creature taking on a life of its own. It also builds on the work you’ve just done which is a real plus!

By now, testing is over and all of your observations should be stuck to a wall somewhere. Get everyone together again as a group and step back and take it all in. Just let it sit with you for a moment before you dive in. Just let it breathe. Have you done that? Ok, now as individuals working at the same time, start by grouping things that you think belong together. It’s important to just focus on the content of the labels and try to ignore the colour coded tagging at this stage, so if session one was blue post its, don’t group all the blue ones together just because they’re all blue! If you get stuck, try grouping by topic, or create two groups (e.g. issues and wins) and then chunk the information up from there.

You will find that the groups will change several times over the course of the process and that’s ok because that’s what it needs to do.

While you do this, everyone else will be doing the same thing - grouping things that make sense to them. Trust me, it’s nowhere near as chaotic as it sounds! You may start working as individuals but it won’t be long before curiosity kicks in and the room is buzzing with naturally occurring conversation.

Make sure you take a step back regularly and observe what everyone else is doing, and don’t be afraid to ask questions and move other people’s post its around - no one owns it! No matter how silly something may seem, just put it there because it can be moved again. Have a look at where your tagged observations have ended up. Are there clusters of colour? Or is it more spread out? What that means will depend largely on how you decided to tag your findings. For example, if you assigned each testing session its own colour and you have groups with lots of different colours in them, you’ll find that the same issue was experienced by multiple people.

Next, start looking at each group and see if you can break them down into smaller groups, and at the same time consider the overall picture for bigger groups (e.g. can the wall be split into, say, three high level groups?).

Remember, you can still change your groups at any time.
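If your notes end up in a spreadsheet rather than on a wall, the colour check described above can be sketched in Python. This assumes each session was tagged with its own colour. The group names, colours, and the `sessions_per_group` helper are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sessions_per_group(notes: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count how many different session colours appear in each group."""
    colours = defaultdict(set)
    for group, colour in notes:
        colours[group].add(colour)
    return {group: len(found) for group, found in colours.items()}


# Hypothetical wall: (group the note ended up in, session colour)
notes = [
    ("Can't find help", "blue"),
    ("Can't find help", "green"),
    ("Can't find help", "pink"),
    ("Likes the search", "blue"),
    ("Likes the search", "blue"),
]

print(sessions_per_group(notes))
```

A group with many colours points to something several participants ran into. A single-colour group may be one person's experience, which can still matter.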

Thinning the herd (Merging)

Once you and your team are happy with the groups, it’s time to start condensing the size of this beast. Look for doubled up findings and stack those post its on top of each other to cut the groups down. Just make sure you can still see how many there were. The point of merging is to condense without losing anything, so don’t remove something just because it only happened once. That one issue could be incredibly serious. Continue to evaluate and discuss as a group until you are happy. By now clear and distinct groups of your observations should have emerged and at a glance you should be able to identify the key findings from your study.
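For typed-up notes, the same merging idea (stack the doubles, keep the count, drop nothing) can be sketched with Python's standard `collections.Counter`. The observations are made up, and matching doubles by lower-cased text is a simplifying assumption. Real notes rarely repeat word for word, so expect to do most of the matching by eye.

```python
from collections import Counter

# Hypothetical typed-up observations, one per post it
observations = [
    "Missed the help button",
    "missed the help button ",
    "Checkout crashed",
    "Missed the help button",
]

# Stack doubles on top of each other while remembering how many there were.
merged = Counter(text.strip().lower() for text in observations)

for finding, count in merged.most_common():
    print(f"{count} x {finding}")
```

The one-off finding stays in the list with a count of 1, so nothing is removed just because it only happened once.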

A catastrophe or a cosmetic flaw? (Scoring)

Scoring relates to how serious the issues are and how bad the consequences of not fixing them are. There are arguments for and against the use of scoring and it’s important to recognise that it is just one way to communicate your findings.

I personally rarely use scoring systems. It’s not really something I think about when I’m analysing the observations. I rarely rank one problem or finding over another. Why? Because all data is good data and it all adds to the overall picture.

I’ve always been a huge advocate for presenting the whole story and I will never diminish the significance of a finding by boosting another. That said, I do understand the perspective of those who place metrics around their findings. Other designers have told me they feel that it allows them to quantify the seriousness of each issue and help their client/designer/boss make decisions about what to do next.

We’ve all got our own way of doing things, so I’ll leave it up to you to choose whether or not you score the issues. If you decide to score your findings there are a number of scoring systems you can use and if I had to choose one, I quite like Jakob Nielsen’s methodology for the simple way it takes into consideration multiple factors. Ultimately you should choose the one that suits your working style best.

Let’s say you did decide to score the issues. Start by writing down each key finding on its own post it and move to a clean wall/window. Leave your affinity diagram where it is. Divide the new wall in half: one side for wins (i.e. findings that indicate things that tested well) and the other for issues. You don’t need to score the wins but you do need to acknowledge what went well because knowing what you’re doing well is just as important as knowing where you need to improve. As a group (wow you must be getting sick of each other! Make sure you go out for air from time to time!) score the issues based on your chosen methodology.

Once you have completed this entire process you will have everything you need to write a kick ass report.

What could possibly go wrong? (and how to deal with it)

No process is perfect and there are a few potential dramas to be aware of:

People jumping into solution mode too early

In the middle of the debrief meeting, someone has an epiphany. Shouts of 'We should move the help button!' or 'We should make the yellow button smaller!' ring out and the meeting goes off the rails.

I’m not going to point fingers and blame any particular role because we’ve all done it, but it’s important to recognise that’s not why we’re sitting here. The debrief meeting is about digesting and sharing what you and the other observers just saw. Observing and facilitating user testing is a privilege. It’s a precious thing that deserves respect and if you jump into solution mode too soon, you may miss something. Keep the conversation on track by appointing a team member to facilitate the debrief meeting.

Storage problems

Handwritten notes taken by multiple observers over several days of testing add up to an enormous pile of paper. Not only is it a ridiculous waste of paper but the notes have to be securely stored for three months following the release of the report. It’s not pretty. Typing them up can solve that issue but it comes with its own set of storage-related hurdles. Just like the handwritten notes, they need to be stored securely. They don’t belong on SharePoint or in the share drive or any other shared storage environment that can be accessed by people outside your observer group. User testing notes are confidential and are not light reading for anyone and everyone no matter how much they complain. Store any typed notes in a limited access storage solution that only the observers have access to and if anyone who shouldn’t be reading them asks, tell them that they are confidential and the integrity of the research must be preserved and respected.

Time issues

Before the storage dramas begin, you have to actually pick through the mountain of paper. Not to mention the video footage and the audio, and you have to chase up that sneaky observer who disappeared when the clock struck 5. All of this takes up a lot of time. Another time-related issue comes in the form of too much time passing in between testing sessions and debrief meetings. The best way to deal with both of these issues is to be super organised and hold multiple smaller debriefs in between sessions where possible. As a group, work out your time commitments before testing begins and have a clear plan in place for when you will meet. This will prevent everything piling up and overwhelming you at the end.

Disagreements over scoring

At the end of that long day/week we’re all tired and discussions around scoring the issues can get a little heated. One person’s showstopper may be another person’s mild issue. Many of the ranking systems use words as well as numbers to measure the level of severity and it’s easy to get caught up in the meaning of the words and ultimately get sidetracked from the task at hand. Be proactive and as a group set ground rules upfront for all discussions. Determine how long you’ll spend discussing an issue and what you will do in the event that agreement cannot be reached. People want to feel heard and they want to feel like their contributions are valued. Given that we are talking about an iterative process, sometimes it’s best just to write everything down to keep people happy and merge and cull the list in the next iteration. By then they’ve likely had time to reevaluate their own thinking.

And finally...

We all have our own ways of making sense of our user testing observations and there really is no right or wrong way to go about it. The one thing I would like to reiterate is the importance of collaboration and teamwork. You cannot do this alone, so please don’t try. If you’re a UX team of one, you probably already have a trusted person that you bounce ideas off. They would be a fantastic person to do this with. How do you approach this process? What sort of challenges have you faced? Let me know in the comments below.
