Archive for June, 2007

Jun 17 2007

Comparing Population Proportions – A/B Testing

Published by Wendi under statistics, web analytics


Many metrics in web analytics are conveyed as percentages, or population proportions as statisticians like to call them. As I mentioned in my previous post on Statistics for People Who (Think They) Hate Statistics, percentages are useful in the real world (business data), and I was surprised the book had no section dedicated to this topic. So I thought I would write a post on comparing population proportions, namely conversion rates for landing pages.

Landing page optimization is one aspect of testing in web analytics. It’s great. You can test almost anything: layout, content, color, tag line, call to action, media, etc. In my scenario we were testing the tag line. Since we are only testing one aspect of the page, you can refer to this testing methodology as A/B testing. This is very different from multivariate testing, where numerous “variables” or parts of the page are tested at once; I’ll leave that for another conversation. So for now, we tested one variable: the tag line. The call to action defined as the measure of success was the submission of an online lead generation form. Since the form was small, it was part of the landing page itself, and there were no interim steps or pages that might have increased conversion failure.

In testing, you must first define your hypothesis. The hypothesis in this case is that landing page #1 outperformed landing page #2. In metrics terms, we are saying that the conversion rate for landing page #1 was better than that of landing page #2 (with statistical significance). Formally, I set this up as a two-sided test of whether the two conversion rates differ at all.

Null Hypothesis H0: p1 = p2 (or can be written as p1 - p2 = 0)

“conversion was not different”

Alternative Hypothesis Ha: p1 ≠ p2 (or can be written as p1 - p2 ≠ 0)

“conversion is different”

Alpha = .05

Traffic was split roughly equally between the two pages. The delivery counts differ slightly, and that difference is included in our calculation.

Landing page #1

Delivered 6,906 times

Conversion yield = 1.71%

Landing page #2

Delivered 6,534 times

Conversion yield = 1.44%

Some might just stop here, say landing page #1 outperformed landing page #2, and move on. But is that really a valid inference? Let’s see.

To test two population proportions you use the following equation:

[Equation image: two population proportions test]

The p with the ^ on top is referred to as “p-hat”. “P-hat” is the sample proportion (the %’s from your data) and is used to estimate the true population proportion.
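In symbols, one common form of this test statistic is the pooled version sketched below. The pooled form and the symbols x1, x2 (conversion counts) and n1, n2 (page deliveries) are assumptions made for illustration; the spreadsheet may use the unpooled variant, which gives a very similar result for these data.

```latex
\hat{p}_1 = \frac{x_1}{n_1}, \qquad
\hat{p}_2 = \frac{x_2}{n_2}, \qquad
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}

z = \frac{\hat{p}_1 - \hat{p}_2}
         {\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
```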

All the calculations from the above formula can easily be done in Excel and can be seen in a sample file here.

The calculated z-score is 1.2563. In Excel you can calculate the p-value with the NORMSDIST() function. You can determine the critical region, sometimes referred to as the “rejection region”, for the null hypothesis from the z-score alone, but from an interpretation standpoint it’s easier to compare the p-value to the previously defined alpha. Calculating the p-value will help you understand whether the difference between the two percentages is statistically significant. Because we are testing whether conversion is different in either direction, the p-value is the two-tailed value 2*[1 - NORMSDIST(z-score)].

= 2*[1 - NORMSDIST(1.2563)]

= 2*[0.1045]

= .209

Now, our alpha value was set at .05 per our testing criteria listed previously. Since our p-value > alpha (0.209 > .05), we Fail to Reject the Null Hypothesis. What does that mean? It means that the difference between the two conversion rates is not statistically significant. Thus, even though the conversion rate for landing page #1 was higher than that of landing page #2, the data do not provide enough evidence to say that one page has the “better” tag line.
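For readers who would rather check the numbers outside of Excel, here is a minimal Python sketch of the same test. It assumes the pooled form of the z statistic and works from the rounded conversion rates quoted above, so the result lands close to, but not exactly on, the 1.2563 from the spreadsheet. The function names are made up for this example.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal CDF, the same quantity Excel's NORMSDIST() returns."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def two_proportion_z_test(p1, n1, p2, n2):
    """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
    # Pooled proportion: total conversions over total deliveries.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    # Standard error of the difference under the null hypothesis p1 = p2.
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # Two-tailed p-value, i.e. 2 * [1 - NORMSDIST(|z|)].
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p_value


# Rounded rates and delivery counts from the two landing pages.
z, p_value = two_proportion_z_test(0.0171, 6906, 0.0144, 6534)
alpha = 0.05

print(f"z = {z:.4f}, p-value = {p_value:.3f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Taking the absolute value of z before the lookup keeps the two-tailed p-value correct no matter which page is entered first.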

Until next time… safe analyzing.


Jun 17 2007

Statistics Book Review

Published by Wendi under statistics

I have had a few weeks to read through the book I purchased and wanted to share my thoughts on its level of readability, coverage, and ease.  At first glance, it grabs the reader’s attention and puts you at ease with the subject.  Statistics isn’t scary, and the authors try to remind readers of that throughout the book.  They throw in warm, fuzzy happy faces as a difficulty rating for each chapter/topic (cute for high school, maybe not so much for adult readers).

The tips intertwined in the chapters are nice and sometimes inform the reader of advanced topics beyond the book.  But I might have found some tips confusing had I not had prior knowledge of the topic at hand (easy enough though, just ignore what you don’t grasp - those tips are not applicable to the content of the book anyway).  All in all, the book is basic in nature but it does go beyond what I was expecting.  For example, the book covers Factorial ANOVAs (analysis of variance).  It doesn’t go into deep detail on factorial ANOVA, but I was surprised there was a section dedicated to this advanced method.

One thing I found missing was testing with population proportions.  From an applied statistics perspective, I would find a chapter on population proportions to be very helpful in the business world.  I wouldn’t say this missing chapter is a show stopper for recommending the book, but I might hold off and see if they are adding it in the third edition, due out this year.

Also, the ordering of the chapters seems a little odd to me in that they jump into correlation coefficients in chapter 5, then skip around and put off discussing linear regression until chapter 14 (they reintroduce correlation coefficients in chapter 13).  This may be something that will be changed in the third edition (per some comments I see on Amazon.com).

Overall, the content is easy to read and comprehend, but there is certainly some room for improvement (most books are always a work in progress).  If you are looking to understand how to do everything in Excel for work, I might suggest getting the Excel Edition, but keep in mind that Excel doesn’t hold the tools for advanced statistical analysis.

Until next time… safe analyzing.


Jun 11 2007

Relationships Take 2

Published by Wendi under statistics

As promised, I wanted to take a deeper dive into the regression data posted in my previous discussion. To bring you back to the topic, I was discussing the relationship between the percent of searches for new SEM keywords and site bounce rate. In my example I regressed these two variables and determined there appeared to be a strong linear relationship between the two. Of course there are other ways of analyzing the response to new keywords in your paid search campaign, but with this approach you can derive a statistically sound relationship at a high campaign level, which can point you in the direction to look deeper. That way you do not have to look at every single keyword, you minimize unnecessary work, and more importantly you may save time in determining your course of action.

Initially, I calculated the parameters of a regression line: the slope and the y-intercept. From a calculation standpoint it is easier to calculate the slope first and then derive the y-intercept from it. But you don’t have to do this by hand using long algebra, as I did in my Excel sample. Instead, Excel has built-in functions that make your life so much easier: SLOPE(known Y’s, known X’s) and INTERCEPT(known Y’s, known X’s).

In addition to the linear regression parameters, Excel provides a quick function for deriving the Pearson r correlation coefficient: CORREL(known Y’s, known X’s). From this calculation, it is very easy to calculate the coefficient of determination, which tells you how strong the relationship expressed by the correlation coefficient really is. Most people can eyeball the strength, but if you want to get down to an actual measurement of strength I would advise you to calculate the coefficient of determination, especially since it’s relatively easy to compute. All you have to do is square the CORREL() value and you’re done. Easy as that. So not only do you know the exact relationship being expressed between two variables, but you also know how strong that relationship really is.
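As a rough illustration of what CORREL() and the squaring step are doing, here is a small Python sketch. The data values are hypothetical, chosen only to keep the arithmetic easy to follow; they are not the numbers from my spreadsheet, and the function name is made up for this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient, the quantity Excel's CORREL() returns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical illustration data (NOT the values from the post's spreadsheet).
pct_searches = [1.0, 2.0, 3.0, 4.0, 5.0]
bounce_rate = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(pct_searches, bounce_rate)
r_squared = r ** 2  # coefficient of determination

print(f"r = {r:.4f}, r squared = {r_squared:.4f}")
```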

OK, back to my example. In my sample I calculated the correlation coefficient and found that r = 0.9297. The rule of thumb for correlation strength is roughly as follows (it applies to both negative and positive r values):

· Between .8 and 1.0 very strong

· Between .6 and .8 moderately strong

· Between .4 and .6 moderate

· Between .2 and .4 moderately weak

· Between 0 and .2 very weak

In my case, I have a pretty strong relationship by initial review, but how strong is it really? Calculating the coefficient of determination will tell us: r² = 0.8645. Interpreting this value means that 86.4% of the variance in bounce rate can be explained by percent of searches of the new SEM terms. Or you can look at it in the opposite fashion and say that 13.6% of the variability in bounce rate is unexplained at this point. Technically speaking, 86% coverage of variability is pretty darn good. This would give me enough reason to dig a little deeper into the new search terms to find the true culprits behind the increasing bounce rate on my site.
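The rule of thumb and the squaring step can be sketched in a few lines of Python. Two assumptions to flag: the function name is made up for this example, and the bands are treated as contiguous (so .4 up to .6 counts as moderate). Squaring the rounded r of 0.9297 gives roughly 0.864, a hair off the spreadsheet figure, presumably because the spreadsheet squares the unrounded value.

```python
def correlation_strength(r):
    """Label the size of r using rule-of-thumb bands (sign is ignored)."""
    size = abs(r)
    if size >= 0.8:
        return "very strong"
    if size >= 0.6:
        return "moderately strong"
    if size >= 0.4:
        return "moderate"
    if size >= 0.2:
        return "moderately weak"
    return "very weak"


r = 0.9297          # rounded correlation coefficient from the example
r_squared = r ** 2  # coefficient of determination

print(correlation_strength(r))
print(f"explained: {r_squared:.1%}, unexplained: {1 - r_squared:.1%}")
```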

Until next time… safe analyzing.


Jun 03 2007

Relationships are a thing of the past…

Published by Wendi under statistics, web analytics

Or are they? I would argue that at the heart of statistics is a line. A line that best fits a sample of data. What is this line? Well, it’s called a Least Squares Regression Line.

A linear regression line can be used to describe a collection of data points that have a linear relationship … no more, no less. There is power in knowing whether your data has a linear relationship or not and, if it does, how good a fit this line really is to the data.

Say you wanted to understand why you were seeing various spikes in bounce rates on your site and you had a hunch that the paid keywords you recently added were the culprit. To prove, or better yet hopefully disprove, this hypothesis, run a regression on a few numbers to see if there is a relationship between the two.

If you were awake in algebra II back in high school, you may remember the equation of a line, but just in case you don’t, here it is…

y = mx + b where m = slope and b = y-intercept

In most statistics books you probably won’t find a line written with the same letters, but what they substitute still retains the same meaning. Since statisticians like to separate themselves from mathematicians, we like to have our own way of writing an equation. So from here on out I’ll refer to the slope as b1 and the y-intercept as b0.
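Written in that notation, the fitted line and the standard least-squares estimates look like this. The sample means (x-bar and y-bar) and the index i running over the n observations are notation introduced here for illustration.

```latex
\hat{y} = b_0 + b_1 x

b_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}
           {n \sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```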

So, back to my problem: I think that the new keywords I added to my SEM campaign are driving my site bounce rate up, but I am not really sure. To check this out, I took % of Searches (delivery of impressions) and Bounce Rates for a 10 week period since the new keywords were added and regressed site bounce rates against % of searches. What I found out was what I feared: there was a direct relationship between the two. So in essence, these new keywords had a negative impact on my site traffic. Yikes! I am going to take those down right now!

OK, back to the regression details. To run your regression, there are a few simple calculations you need to prepare that can then be inserted into the bigger formula. In my Excel file Least Squares Regression you can walk through each step with formulas as I walk through them here:

Regression Variables

In the Excel file all the variables and calculations are laid out piece by piece, and initially it helps to calculate each X*Y, X*X (aka X²), etc… then sum or multiply through where needed. Excel even provides shortcut formulas, but it helps to understand what they are doing before you use them. I have also included a few Excel shortcut formulas for calculating the slope, y-intercept, and the correlation coefficient.
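The piece-by-piece approach can be sketched in Python as well. This is a minimal version of the standard least-squares sums, not a copy of my spreadsheet. The data values are hypothetical and the function name is made up for this example.

```python
def least_squares(xs, ys):
    """Return (b0, b1): the y-intercept and slope of the least squares line."""
    n = len(xs)
    # The building blocks: column sums, as laid out piece by piece in a sheet.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Slope first, then the intercept derived from the slope.
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1


# Hypothetical illustration data (NOT the values from the post's spreadsheet).
pct_searches = [1.0, 2.0, 3.0, 4.0, 5.0]
bounce_rate = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = least_squares(pct_searches, bounce_rate)
print(f"bounce rate = {b0:.2f} + {b1:.2f} * pct_searches")
```

The two returned values correspond to what Excel’s INTERCEPT() and SLOPE() give for the same columns.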

Take some time to dig through the excel file and next time I am going to go further into what all the pieces mean.

Until next time… safe analyzing.


Jun 03 2007

Books, To be continued…

Published by Wendi under statistics

I have had several inquiries about good books on introductory statistics, and to be honest I don’t really own any basic statistics books. A majority of the books I have collected over the years were chosen by my professors and read like, well, math books (sorry to all my wonderful professors - but really, if you had chosen anything less, I guess I would have had no reason to come listen to all of your insightful lectures). So from the long list of those that I have studied from, I am not sure I would really recommend any of them as a starter book.

I do know of a great on-line resource book published by StatSoft that I have referenced in the past. The great thing is that the on-line book is free, and it walks you through statistical terminology and basic statistics and goes through more advanced methodologies as well. The down side is that the examples included in the material are focused on the StatSoft software, but the theories and practices are a great starter, especially if you are not sure you are ready to invest in a hardback resource.

In an effort to stock my library with more “user friendly” statistics books I recently purchased Statistics for People Who (Think They) Hate Statistics with SPSS Student Version 13.0 2nd Edition. Once it arrives and I have had some time to thumb through it a bit, I will let you know how this book reads. The one reason why I chose this book was its title… very catchy. OK, not really, but it is catchy. I chose it because it comes with supporting software that is a great tool for analyzing web data from a behavioral standpoint. SPSS stands for Statistical Package for the Social Sciences, and I have always believed that analyzing web data is the same as analyzing a data set from a social sciences experiment. I look forward to reviewing the book and will let you know my thoughts very soon. Note: If the book is good, there is a 3rd edition soon to be released, so if you are interested in this book you may want to wait for the 3rd edition.

For those of you who are eager to purchase now…. As a rule of thumb, I find that statistics books pertaining to “business” statistics seem to be fairly basic in nature.

Happy reading.

Until next time… safe analyzing.
