Jul 05 2007

Determining your Sample Size

Published by Wendi at 4:59 pm under statistics, A/B testing

Robbin Steif asked me today how long she needed to let her test run before she could call it a day and assume that there is really no difference between the treatments (since she isn’t seeing one right now). She sent over the following screen shot of her outcomes in Google’s Website Optimizer from the last two weeks:

[Screenshot: GA Website Optimizer A/B Test]

As you can see, right now she isn’t seeing any lift in her conversion rate. Actually she is seeing similar values and a small drop. But is the drop significant? Does she have enough data to support an outcome at this point?


Before you run a test of significance, you need to know whether you have enough data to support it. For population proportions, the formula for the sample size “n” is:

 

$$n = z^2 \cdot \frac{pq}{\delta^2}$$

where

p = % of Success (conversions in this example)

q = % of Failures (i.e. 1 – p)

*Note: use the conversion rate from your control landing page.

To finish out this equation you need to make a few assumptions.

Assumptions

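1. The Significance Level - α (alpha): one minus the confidence level you want (for example, α = 0.05 for 95% confidence)

2. Error - δ (delta): the margin of error that you are willing to accept, expressed as a proportion (0.01 means plus or minus one percentage point)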

With these assumptions set, the last step is to calculate the Z value for your confidence level. It’s easy to do in Excel with the NORMSINV() function. Because we are testing for the existence of a “difference” among the conversion rates, rather than whether a conversion rate is specifically higher or lower than the control, we need to divide alpha in half for a two-sided test structure.

=ABS(NORMSINV(α/2))
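If you are working outside Excel, the same Z calculation can be sketched in Python. This is only an illustration: it assumes Python 3.8 or later (for `statistics.NormalDist`), and the function name `z_value` is a hypothetical helper, not something from the spreadsheet.

```python
from statistics import NormalDist


def z_value(alpha: float) -> float:
    """Two-sided critical value, the equivalent of =ABS(NORMSINV(alpha/2))."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be between 0 and 1")
    # inv_cdf(alpha / 2) is negative for small alpha, hence the abs()
    return abs(NormalDist().inv_cdf(alpha / 2))


print(round(z_value(0.05), 2))  # 1.96
```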

In this example, with α = 0.05, our Z = 1.96. Now we have all the pieces in our formula to calculate the needed sample size.

$$n = z^2 \cdot \frac{pq}{\delta^2} = (1.96)^2 \cdot \frac{0.0472 \times 0.9528}{(0.01)^2} \approx 1728$$

Thus Robbin is going to need 1,728 page views before she can determine whether the treatments she is testing made a difference in her conversion rate. You can download this Excel file I put together (nothing fancy), where you can toggle the alpha and delta values to see how each one impacts the needed sample size.
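The same calculation can be sketched in a few lines of Python. This is a minimal sketch and not a replacement for the spreadsheet: `sample_size` is a hypothetical helper name, it assumes Python 3.8 or later, and it rounds the result up to the next whole page view.

```python
import math
from statistics import NormalDist


def sample_size(p: float, delta: float, alpha: float = 0.05) -> int:
    """Sample size for a proportion p with margin of error +/- delta (two-sided)."""
    z = abs(NormalDist().inv_cdf(alpha / 2))  # 1.96 when alpha = 0.05
    q = 1 - p
    return math.ceil(z ** 2 * p * q / delta ** 2)


# Control conversion rate of 4.72% and a margin of error of 0.01
print(sample_size(0.0472, 0.01))  # 1728
```

Changing `delta` or `alpha` in the call shows how each assumption moves the required sample size.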

I also included a reference to the maximum sample size you would need if you don’t have a control to set your “p” and “q” values. The product pq is largest when p = q = 0.5, so that is the worst case. It’s rather astonishing, but if you are conservative you can always fall back on this calculation and know that with approximately 10,000 samples you are good to go.
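As a rough check on that fallback number, here is a small Python sketch of the worst case. It assumes p = q = 0.5 and the same α = 0.05 and δ = 0.01 used in the example; `max_sample_size` is a hypothetical name chosen for illustration.

```python
import math
from statistics import NormalDist


def max_sample_size(delta: float, alpha: float = 0.05) -> int:
    """Worst-case sample size: p * q peaks at 0.25 when p = q = 0.5."""
    z = abs(NormalDist().inv_cdf(alpha / 2))
    return math.ceil(z ** 2 * 0.25 / delta ** 2)


print(max_sample_size(0.01))  # 9604
```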

Part II of this question is: is there a difference? This is different from asking how many samples you need to determine whether there is a difference. A hypothesis test is needed to determine whether the difference in the conversion rates is statistically significant. You can read more about how to do this in my previous post about A/B testing. That post includes a downloadable Excel file in which you can toggle various sample sizes and determine whether the conversion rates are different, statistically speaking.
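To give a feel for what such a hypothesis test looks like, here is a hedged Python sketch of one common choice, a pooled two-proportion z-test. It is an assumption that this matches the approach in the earlier post, which may use a different procedure. The function name and the counts in the usage line are made up for illustration and are not Robbin’s data.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Pooled two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so no detectable difference
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up counts for illustration only
z, p_value = two_proportion_z_test(50, 1000, 100, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.5f}")
```

A p-value below your chosen alpha would suggest the two conversion rates really do differ.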

 

Until next time… safe analyzing.

 

*UPDATE* 7/16/07: Make sure that you use the first page view per visitor (“Unique Page Views” in Google Analytics terms) when making this calculation. The sample size calculation assumes that the events are independent of each other.

Thank You to Mike & Chris!
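If you are counting from a raw log instead of a Google Analytics report, the idea of keeping one observation per visitor can be sketched like this. The log format here, a chronological list of `(visitor_id, clicked)` pairs, is purely hypothetical, as is the helper name.

```python
def first_view_per_visitor(page_views):
    """Keep only the first page view seen for each visitor."""
    first_views = {}
    for visitor_id, clicked in page_views:
        # setdefault leaves the earliest entry in place and ignores repeats
        first_views.setdefault(visitor_id, clicked)
    return first_views


# Hypothetical log in chronological order: (visitor_id, clicked)
log = [("a", False), ("b", True), ("a", True), ("c", False), ("b", False)]
unique_views = first_view_per_visitor(log)
n = len(unique_views)                  # sample size counts visitors, not views
conversions = sum(unique_views.values())
print(n, conversions)  # 3 1
```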

15 Responses to “Determining your Sample Size”

  1. Steve on 05 Jul 2007 at 8:26 pm

    Thanks Wendi, at this rate I may as well just automatically “star” all your postings in Google Reader… ;-)

    I would interpret this to mean that Robbin possibly does have enough info to make a determination (3 x ~ 250 + other combos?). Does that work? Or should we have 1700 for those three only?

    Assuming that to be the case, should she stick with the original or???

    I see myself torn between the original and #1.
    #1 Reduces the error range - not by much. That seems to imply to me that it is a more guaranteed combination? And hence has a value in and of itself???

    Or am I dreaming? :-)

    I could always handle the math side of statistics (I can just about read ancient greek purely from studying engineering @ uni. ;-) ). Knowing how/when to apply it was ever my problem…

    Cheers, and Thanks!

  2. Wendi on 06 Jul 2007 at 9:23 am

    Hi Steve, Thanks for the positive feedback and glad you find this blog useful.
    Actually Robbin needs a full 1700 with those three combinations. I don’t believe that she was running any others. But yes, she just needs a total of 1700 for the entire sample. So if she did run other combinations then she would have been closer to what she needed.
    As for picking the right one in the end, well, that is up to her since there didn’t seem to be any impact from the treatments. It would be more of a personal preference question at that point. Maybe over time the conversion will experience a lift just by refreshing the page anyway - who knows. But right now, if it takes an act of Congress to update the page, I would lean toward not making any changes. In some organizations there tends to be a struggle for IT resources, so if that exists, based on the tested outcomes she may not be wise to make any changes at this point. But really we are too early in the game to make a sound decision. I guess it’s just a waiting game at this point.

    Take Care, Wendi

  3. Steve on 09 Jul 2007 at 3:05 am

    Hi Wendi, thanks again (very muchly!) for that.

    Can I please query one more issue (hopefully… ;-) )
    You’ve set the margin of error to 0.01. I gather from reading elsewhere that implies a margin of error of 1%.
    If we increase that error to 1.5% we drop the number of pages required to 768. Which she already has (783).

    So the question: Why would we continue to 1% when by accepting just an extra 1/2 a percent, we’re already there? Obviously office/client politics, but are there formulaic reasons why that would be preferable? Is it tied to the Z-value in some way?

    If I may be allowed a bonus question? :-)
    To try and put it terms I am familiar with: Are we already at a level of confidence of 95% +/- 0.32%?
    Or have I totally got it all wrong? (again)

    I think I’m actually starting to understand this stuff… :-)

    Cheers! and again, Thanks!

  4. Robbin Steif on 09 Jul 2007 at 8:09 pm

    Wendi - is the “margin of error” a margin of the control, i.e. 1% of the 4.72% conversion rate, or additive, 5.72%? I am guessing the former, since the latter seems too detectable to be considered a tie.

  5. Wendi on 09 Jul 2007 at 10:17 pm

    Hi Steve - Yes, by toggling the width of your precision you can reduce the amount of work needed to accomplish your goal. It just depends on how much you are willing to give up in your precision. Remember that the margin of error is plus or minus so the more you give the less precision you have. And yes, you are almost correct in your margin of error estimate for the current conditions. It’s actually +/- 3.2% not 0.32%. In many cases you will see polls refer to the standard +/- 3% error rate and right now Robbin is pretty much inline with the standard needs of acceptability.

    Hope this helps.
    Wendi

  6. Wendi on 09 Jul 2007 at 10:26 pm

    Robbin - The margin of error is the window within which the true conversion rate should fall, with 95% confidence. In other words, you are 95% confident that the true (population) conversion rate is somewhere between [3.72%, 5.72%] based on your sampling (test).

    Cheers! Wendi

  7. Wendi on 10 Jul 2007 at 7:29 am

    All Readers - I made a booboo - I mistyped the formula above. I forgot to include the squaring of the delta. The Excel file is correct, and the value originally calculated and posted was also correct. The post has been updated and is now correct. Sorry for this.

    Thanks Robbin!!

  8. Michael Helbling on 14 Jul 2007 at 3:29 pm

    I am really enjoying reading your posts. This one was excellent, but I have one issue. You mention that the number of page views should be 1,728. In reality shouldn’t it be 1,728 visits or sessions that enter on this page? Since a single visit could realistically see that page 2-4 times, I would think you would only want to count the first page view in a session toward the sample size.

  9. Steve on 15 Jul 2007 at 4:56 pm

    Apologies for late reply. Thanks Wendi. I did wonder about the 0.32 - it felt wrong. :-)

    Helps hugely! Again (x 2, x3 , x4 …) thanks!
    Cheers!

  10. Wendi on 16 Jul 2007 at 7:46 am

    Hi Michael, Very good point…. but:
    Since Robbin is measuring CTR - Click Thru Rate where
    CTR = Clicks / Page Views (aka Impressions)
    Then looking at Page Views makes more sense. Actually in Google Analytics you can pull Unique Page Views which is what you should look at to be most accurate. Robbin could look at the number of reloads and back those out as well.

    Thanks for the comment.

    Cheers!
    Wendi

  11. chrisg on 16 Jul 2007 at 1:06 pm

    I think Michael’s point is that the power calculation formula assumes independence of all of the observations, and two views of the same page by one person are not independent observations. To get the best possible estimate of sample sizes needed, you’d have to use only one data point per visit. Even though your statistic (CTR) is based on page views, your probability stats, which is what your power calculations are based on, have to use no more than one event (page view) per visit. If it were my analysis, I would set it up to use only the first view of the page in a visit, both for the sample size formulas and the estimate of CTR itself.

    Regards, Chris

  12. Wendi on 16 Jul 2007 at 1:43 pm

    Hey Chris. Absolutely. I am following you and Mike. That is why I included the usage of Unique Page Views that Google Analytics provides in my reply to Mike (and is what Robbin is using on their site). I agree totally with both of you. I’ll post an update to make that point clear in the post.

    Thanks again!
    Wendi

  13. Erik on 28 Aug 2007 at 5:49 pm

    Thanks Wendi, This has been a really helpful article as I get prepared for a testing program. I have a question regarding the number of variants (or ‘combinations’ as google calls them).

    I read the above N as 1728 total views, or an average of 576 views for each of the three variants. What happens if we are testing 6 different variants or even 9 (as in some multivariate testing)? If your N is fixed at 1728, the required views per variant would decrease as variants are added. At some point it seems that you would get too few views per variant to have a statistically significant test.

    It seems to me the number of variants should be factored into the total sample size. In reading about the Power formula, it seems it is based on a 2 sample test. If that’s the case should the above formula be multiplied by (r / 2), where ‘r’ is the number of variants? Hope I’m on the right track with this, if not please let me know,
    Thanks!

  14. Rachel on 26 Sep 2007 at 7:32 am

    I am still a bit confused on the sample size. Does she need 1700 visitors for each experiment or in total for all three? Right now in total she is at 783 and I’m trying to figure out if she needs ~1000 more or 4300 more. Thanks.

  15. seby kallarakkal on 26 Nov 2007 at 8:49 pm

    Hi Wendi,
    Great post. Thanks for sharing. I came across your blog a few days back and have been catching up on the older posts. Your perspective on using stats for analytics is proving to be really useful.
