The Roller Coaster Ride of Statistical Significance

By Adam Figueira

April 13, 2012

Last week, San Francisco blogger and entrepreneur Andrew Chen published an insightful article describing a phenomenon common to many industries and practices. Military strategists speak of it as "always being prepared to fight the last war." SEOs recount that "what worked yesterday doesn't work today—precisely because it worked yesterday."

And now for website optimization specialists, Chen has coined the colorful "Law of Shitty Clickthroughs." A key insight of the law is that visitors quickly respond to novelty but then gradually become desensitized to the tactic.

If you're involved in website testing, you've likely experienced this in the form of tests that gain—and then lose—statistical significance.

How Can Significance Ever Drop?

It's important to understand that not all tests are created equal. If your test involves simple modifications such as changes to button colors, statistical significance (once achieved) is unlikely to fall.

But let's suppose you were one of the first retailers to test Facebook's "Like" button on your product pages. Out of the gate, a large portion of your visitors probably clicked "Like," corresponding to a dramatic difference in performance (e.g., Facebook fans or inbound traffic from Facebook.com) compared to your Control. Several weeks later, though, significance began to drop—to p90, p80, and then completely off the radar. The sharp differences between your Test and Control groups were no longer apparent; the two had begun to converge once again.

What happened? Your test experienced the "Novelty Effect," or the roller coaster ride of statistical significance.


In a traditional test, once an effect is observed, the p-level tends to hold steady as data accumulates. In a novel test, initial visitor behavior may be atypical. As visitors become desensitized to the tactic being tested, the behavior of your Test group falls back in line with that of the Control.
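The roller coaster can be reproduced with a short simulation. The sketch below (the conversion rates, traffic numbers, and decay schedule are illustrative assumptions, not Monetate data) runs a pooled two-proportion z-test on cumulative counts while the Test variation's novelty lift fades back to the Control's rate:

```python
import random
from statistics import NormalDist

def z_confidence(conv_a, n_a, conv_b, n_b):
    """Two-sided confidence that two conversion rates differ
    (pooled two-proportion z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 0.0
    z = abs(p_a - p_b) / se
    return 2 * NormalDist().cdf(z) - 1

random.seed(42)
CONTROL_RATE = 0.05      # assumed baseline conversion rate
VISITORS_PER_DAY = 200   # assumed traffic per variation
conv_a = conv_b = n_a = n_b = 0
history = []             # cumulative confidence, day by day
for day in range(90):
    # Novelty lift starts at +3 points and fades over the first 10 days.
    novelty = 0.03 * max(0.0, 1 - day / 10)
    n_a += VISITORS_PER_DAY
    n_b += VISITORS_PER_DAY
    conv_a += sum(random.random() < CONTROL_RATE for _ in range(VISITORS_PER_DAY))
    conv_b += sum(random.random() < CONTROL_RATE + novelty
                  for _ in range(VISITORS_PER_DAY))
    history.append(z_confidence(conv_a, n_a, conv_b, n_b))

print(f"peak confidence: {max(history):.0%} on day {history.index(max(history)) + 1}")
print(f"confidence after 90 days: {history[-1]:.0%}")
```

Run it and confidence climbs sharply in the first couple of weeks, then sinks as the diluted lift shrinks faster than the growing sample can compensate—the same ride described above.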

Testing for the Real World

One way to think about this roller coaster ride is to understand the difference between statistical significance and practical significance. In a laboratory experiment that measures the effect of a new miracle drug, participants are screened to reduce variability and a fixed sample size is determined before the test begins. But website testing takes place in the real world and in real time—not within the four walls of a research lab or the glossy pages of a college textbook.

As marketing expert and Monetate advisor Bryan Eisenberg has written, companies need to test for the real world, not a perfect world. In addition to performance figures, that means understanding the qualitative context for each of your campaigns. Ask two questions:

  1. Are you testing a new or novel experience, or just simple changes?
  2. What non-random factors could contribute to your performance (e.g., selling snow shovels during a blizzard)?

In a novel test, conversion rate may increase suddenly and rapidly. However, as visitors become desensitized to the tactic being tested, the behavior of your Test group falls back in line with that of the Control, and may even drop below it.

To mitigate this risk, consider these tactics:

  1. Increase the required p-level of your test (e.g., if you typically use p95, use p99 instead)
  2. Disable settings that automatically push winning tests to 100% of site traffic
  3. Monitor significance and look for a stable trend before trying to draw any conclusions
  4. Bookmark this post and come back to share your experience
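The third tactic can be turned into a simple guard. Here is a minimal sketch (the function name, threshold, and 14-day window are assumptions for illustration) that declares a winner only after confidence has held above the required level for a stretch of consecutive days:

```python
def is_stable_winner(daily_confidence, threshold=0.95, window=14):
    """Declare a winner only if confidence has held at or above the
    threshold for the most recent `window` consecutive days."""
    if len(daily_confidence) < window:
        return False
    return all(c >= threshold for c in daily_confidence[-window:])

# A novelty spike that fades should not qualify...
spike = [0.99] * 5 + [0.85] * 10
# ...but a result that stays significant should.
stable = [0.90] * 5 + [0.97] * 14
print(is_stable_winner(spike))   # False
print(is_stable_winner(stable))  # True
```

The point of the window is exactly the stable-trend rule above: a few days above p95 is not evidence; a few weeks is.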

Most importantly, keep on testing.

Adam Figueira is a former product marketing director at Monetate.
