Archive for the 'statistics' Category

Jul 09 2007

The Butterfly Effect.. or is it just coincidence?

The NY Times had an interesting article today about the eerie postings of death predictions in Wikipedia, like the most recent one regarding the death of Nancy Benoit. The article moves into discussing the fine line between real-time, late-breaking news and predicting future events. I have to admit that I find this article disturbing, yet on some level intriguing. The article gets better once you get past all the weird mentions of notable deaths. One thing it reminded me of was Bill Tancer’s ‘searchonomics’ theory.

Bill Tancer, GM of Hitwise, first floated his thoughts on ‘searchonomics’ back in 2005, using search history on the technical term “H5N1” and its more consumer-friendly version “bird flu” to gauge consumer interest in, or rather public fear of, a possible epidemic outbreak. He has also dabbled with more fun data, predicting winners for American Idol and the UK version of Dancing with the Stars, and he was right on the money with both.

I have found ‘searchonomics’ such an interesting phenomenon that I thought I’d start my own predictions to see if search data has any predictive power for the 2008 presidential candidates.

Unfortunately I don’t work for Hitwise, nor do I own a membership, so I am limited to free versions of similar data, which limits my visibility a little. Using Google Trends you can see search interest in some of the top Democratic candidates over the first few months of the year:

Google Trends Democratic Presidential Candidates

 

Based on the traffic so far, it looks like it’s going to be a close race at this point. I’ll make my predictions on the Democratic side as soon as Google Trends decides to update their data a bit more (or if someone is willing to pull more up-to-date data from some fancier tool and send it my way, I might be able to make my prediction sooner).

Is this ‘searchonomics’ phenomenon the result of a “Butterfly Effect” or is it just a set of data points that are merely related by coincidence?

Until next time… safe analyzing.

No responses yet

Jul 05 2007

Determining your Sample Size

Published by Wendi under statistics, A/B testing

Robbin Steif asked me today how long she needed to let her test run before she could call it a day and assume that there is really no difference between the treatments (since she isn’t seeing one right now). She sent over the following screen shot of her outcomes in Google’s Website Optimizer from the last two weeks:

GA Website Optimizer A/B Test

As you can see, right now she isn’t seeing any lift in her conversion rate. Actually she is seeing similar values and a small drop. But is the drop significant? Does she have enough data to support an outcome at this point?


Before you run a test of significance you first need to know if you have enough data to support the test in the first place. For population proportions the formula for sample size “n” is:

 

$n = z^2 \cdot \frac{pq}{\delta^2}$

where

p = % of Success (conversions in this example)

q = % of Failures (i.e. 1 – p)

*note: use the Conversion Rate from your control landing page

To finish out this equation you need to make a few assumptions.

Assumptions

1. The Significance Level - α (alpha): the chance of a false positive that you are willing to accept; the Confidence Level is 1 - α

2. Error - δ (delta): the margin of error that you are willing to accept

With these assumptions set, the last step is to calculate the Z value from α. It’s easy to do in Excel with the NORMSINV() formula. Since we are testing whether the conversion rates differ at all, rather than whether one is specifically higher or lower than the control, we need to divide alpha in half for a two-sided test.

=ABS(NORMSINV(α/2))

In this example α = 0.05, so our Z = 1.96. Now we have all the pieces in our formula to calculate the needed sample size.
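For readers working outside Excel, the same critical value can be sketched in Python. This is only an illustration: it assumes Python 3.8 or later (for `statistics.NormalDist`) and α = 0.05, the level that corresponds to Z = 1.96. The function name `z_critical` is a made-up helper, not something from the post.

```python
from statistics import NormalDist


def z_critical(alpha: float) -> float:
    """Two-sided critical value, the equivalent of =ABS(NORMSINV(alpha/2))."""
    return abs(NormalDist().inv_cdf(alpha / 2))


print(round(z_critical(0.05), 2))  # 1.96
```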

$n = z^2 \cdot \frac{pq}{\delta^2}$

$= (1.96)^2 \cdot \frac{(0.0472)(0.9528)}{(0.01)^2}$

$\approx 1728$

Thus Robbin is going to need 1,728 page views before she can make the determination that the treatments she is testing did or did not make a difference in her conversion rate. You can download this excel file I put together (nothing fancy) where you can toggle the alpha and delta values so that you can see how each one impacts the needed sample size.

I also included a reference to the maximum sample size you would need if you don’t have a control to set your “p” and “q” values (the worst case is p = q = 0.5). It’s rather astonishing, but if you are conservative you can always fall back on this calculation and know that if you get approximately 10,000 samples you are good to go.
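The sample size calculation can also be sketched in a few lines of Python, for anyone who prefers a script to a spreadsheet. This is a minimal sketch, not the contents of the Excel file: the function name `sample_size` is hypothetical, and the inputs are the figures used above (a 4.72% control conversion rate, a 1% margin of error and Z = 1.96).

```python
import math


def sample_size(p: float, delta: float, z: float = 1.96) -> int:
    """n = z^2 * p * q / delta^2, rounded up to a whole observation."""
    q = 1.0 - p
    n = z ** 2 * p * q / delta ** 2
    # Round first to absorb floating-point noise, then round up.
    return math.ceil(round(n, 6))


control = sample_size(p=0.0472, delta=0.01)  # control conversion rate of 4.72%
worst_case = sample_size(p=0.5, delta=0.01)  # p = q = 0.5 maximizes p * q
print(control, worst_case)  # 1728 9604
```

Tightening `delta` or raising `z` increases the required sample quickly, because `delta` enters the formula squared.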

Part II of this question is: is there a difference? This is different from asking how many samples you need to determine if there is a difference. A hypothesis test is needed to actually determine whether the difference in the conversion rates is statistically significant. You can read more about how to do this in my previous post about A/B testing. That post includes a downloadable Excel file in which you can toggle various sample sizes and determine if the conversion rates are different, statistically speaking.

 

Until next time… safe analyzing.

 

*UPDATE* 7/16/07: Make sure that you use the first page view per visitor (“Unique Page Views” in Google Analytics terms) when making this calculation. The sample size calculation assumes that the events are independent of each other.

Thank You to Mike & Chris!

15 responses so far

Jun 17 2007

Comparing Population Proportions – A/B Testing

Published by Wendi under statistics, web analytics


Many metrics in web analytics are conveyed as percentages, or population proportions as statisticians like to call them. As I mentioned in my previous post on Statistics for People Who (Think They) Hate Statistics, percentages are useful in the real world (business data) and I was surprised the book had no section dedicated to this topic. So I thought I would write a post on comparing population proportions, namely conversion rates for landing pages.

Landing page optimization is one aspect of testing in web analytics. It’s great. You can test almost anything: layout, content, color, tag line, call to action, media, etc. In my scenario we were testing the tag line. Since we are only testing one aspect of the page, you can refer to this testing methodology as A/B testing. This is very different from multivariate testing, where numerous “variables” or parts of the page are tested at once; I’ll leave that for another conversation. So for now, we tested one variable, the tag line. The call to action that defined success was the submission of an online lead generation form. Since the form was small, it was part of the landing page itself, and there were no interim steps or pages that might have increased conversion failure.

In testing, you must first define your hypothesis. The hunch in this case is that landing page #1 outperformed landing page #2. In metrics terms, we are asking whether the conversion rate for landing page #1 differed from that of landing page #2 with statistical significance.

Null Hypothesis H0: p1 = p2 (or can be written as p1 - p2 = 0)

“conversion was not different”

Alternative Hypothesis Ha: p1 ≠ p2 (or can be written as p1 - p2 ≠ 0)

“conversion is different”

Alpha = .05

The pages were delivered in roughly equal numbers, and the slight difference in delivery counts is included in our calculation.

Landing page #1

Delivered 6,906 times

Conversion yield = 1.71%

Landing page #2

Delivered 6,534 times

Conversion yield = 1.44%

Some might just stop here, say landing page #1 outperformed landing page #2, and move on. But is that really a valid inference? Let’s see.

To test two population proportions you use the following equation:

two population proportions test

The p with the ^ on top is referred to as “p-hat”. “P-hat” is the sample population proportion (the %’s from your data) and is used to estimate the true population proportion.

All the calculations from the above formula can be easily done in excel and can be seen in a sample file here.

The calculated z-score is 1.2563. In Excel you can calculate the p-value with the NORMSDIST() formula. You can decide whether the null hypothesis falls in the critical region (sometimes referred to as the “rejection region”) from the z-score alone, but from an interpretation standpoint it’s easier to compare the p-value to the previously defined alpha. The p-value tells us whether the difference between the two percentages is statistically significant. Because this is a two-sided test, the p-value is 2*[1 - NORMSDIST(z-score)].

= 2*[1 - NORMSDIST(1.2563)]

= 2*[0.105]

= .209

Now, our alpha value was set at .05 per our testing criteria listed previously. Since our p-value > alpha (0.209 > .05), we Fail to Reject the Null Hypothesis. What does that mean? It means that the difference between the two conversion rates is not statistically significant. Thus, even though the conversion rate for landing page #1 was higher than for landing page #2, the difference is not large enough, given the sample sizes, to conclude that one page has a “better” tag line.
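As a cross-check outside Excel, here is a rough Python sketch of the whole test. It assumes the pooled form of the two-proportion z-test, which is the common textbook version and may differ slightly from the formula in the sample file. It also uses the rounded conversion rates quoted above, so the z-score comes out near, rather than exactly at, 1.2563. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(p1, n1, p2, n2):
    """Pooled two-sided z-test for the difference between two proportions."""
    # Pooled proportion under H0: p1 = p2 (assumed form of the test).
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    # Two-sided p-value, the equivalent of 2*(1 - NORMSDIST(|z|)).
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = two_proportion_z_test(0.0171, 6906, 0.0144, 6534)
alpha = 0.05
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Returning both the z-score and the p-value keeps the decision rule (compare the p-value to alpha) separate from the arithmetic, which mirrors the way the post interprets the result.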

Until next time… safe analyzing.

4 responses so far

Jun 17 2007

Statistics Book Review

Published by Wendi under statistics

I have had a few weeks to read through the book I purchased and wanted to share my thoughts on its readability, coverage, and ease. At first glance, it grabs the reader’s attention and puts you at ease with the subject. Statistics isn’t scary, and the authors try to remind readers of that throughout the book. They throw in warm, fuzzy happy faces as a difficulty rating for each chapter/topic (cute for high school, maybe not so much for adult readers).

The tips intertwined in the chapters are nice and sometimes inform the reader of advanced topics beyond the book.  But I might find some tips confusing had I not had prior knowledge of the topic at hand (easy enough though, just ignore what you don’t grasp - those tips are not applicable to the content of the book anyway).  All in all, the book is basic in nature but it does go beyond what I was expecting.  For one example the book covers Factorial ANOVAs (analysis of variance).  They don’t go into deep detail of a factorial ANOVA but I was surprised there was a dedicated section to this advanced method.

One thing I found missing was testing with population proportions. From an applied statistics perspective, I would find a chapter on population proportions to be very helpful in the business world. I wouldn’t say this missing chapter is a show stopper, but I might hold off and see if they add it in the third edition due out this year.

Also, the ordering of the chapters seems a little odd to me in that they jump into correlation coefficients in chapter 5, then skip around and hold off on discussing linear regression until chapter 14 (they reintroduce correlation coefficients in chapter 13). This may be something that will be changed in the third edition (per some comments I see on Amazon.com).

Overall, the content is easy to read and comprehend, but there is certainly some room for improvement (as most books are always a work in progress).  If you are looking to understand how to do everything in excel for work, I might suggest getting the Excel Edition but keep in mind that excel doesn’t hold the tools for advanced statistical analysis.

Until next time… safe analyzing.

No responses yet

Jun 11 2007

Relationships Take 2

Published by Wendi under statistics

As promised, I wanted to take a deeper dive into the regression data posted in my previous discussion. To bring you back to the topic, I was discussing the relationship between the percent of searches for new SEM keywords and site bounce rate. In my example I regressed these two variables and determined there appeared to be a strong linear relationship between the two. Of course there are other ways of analyzing the response to new keywords in your paid search campaign, but with this approach you can derive a statistically sound relationship at a high campaign level, which can point you in the direction to look deeper. That way you do not have to look at every single keyword, you avoid unnecessary work, and more importantly you can save time in determining your course of action.

Initially, I calculated the variables of a regression line: the slope and the y-intercept. From a calculation standpoint it is easier to calculate the slope first and then derive the y-intercept from the slope. But you don’t have to do this by hand using long algebra as I did in my Excel sample. Instead, Excel has built-in formulas that make your life so much easier: SLOPE(known Y’s, known X’s) and INTERCEPT(known Y’s, known X’s).

In addition to the linear regression variables, Excel provides a quick formula for deriving the Pearson r correlation coefficient: CORREL(known Y’s, known X’s). From this value it is very easy to calculate the coefficient of determination, which tells you how much of the variation in one variable is explained by the other. Most people can eyeball the strength, but if you want an actual measurement of strength I would advise you to calculate the coefficient of determination, especially since it’s so easy to compute. All you have to do is square the CORREL() value and you’re done. Easy as that. So not only do you know the direction of the relationship between the two variables, you also know how strong that relationship really is.

OK, back to my example. So in my sample I calculated the correlation coefficient and I found that the r = 0.9297. By inspection the rule of thumb of correlation strength is roughly (this applies to negative and positive r values):

· Between .8 and 1.0 very strong

· Between .6 and .8 moderately strong

· Between .4 and .6 moderate

· Between .2 and .4 moderately weak

· Between 0 and .2 very weak

In my case, I have a pretty strong relationship by initial review, but how strong is it really? Calculating the coefficient of determination will tell us: $r^2 = 0.8645$. This means that about 86.45% of the variance in bounce rate can be explained by the percent of searches for the new SEM terms. Or you can look at it in the opposite fashion and say that about 13.55% of the variability in bounce rate is unexplained at this point. Practically speaking, explaining 86% of the variability is pretty darn good. This would give me enough reason to dig a little deeper into the new search terms to find the true culprits behind the increasing bounce rate on my site.
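To make the rule of thumb and the squaring step concrete, here is a small Python sketch. The helper `strength` is hypothetical, and it assumes the bands are contiguous (so any |r| from .4 up to .6 counts as moderate). It starts from the rounded r = 0.9297, so the squared value lands slightly below the 0.8645 quoted above, which was presumably computed from the unrounded coefficient.

```python
def strength(r: float) -> str:
    """Label |r| using the rule-of-thumb bands, treated as contiguous."""
    size = abs(r)
    if size >= 0.8:
        return "very strong"
    if size >= 0.6:
        return "moderately strong"
    if size >= 0.4:
        return "moderate"
    if size >= 0.2:
        return "moderately weak"
    return "very weak"


r = 0.9297                   # rounded correlation coefficient from the post
r_squared = r ** 2           # share of variance explained
unexplained = 1 - r_squared  # share left unexplained

print(strength(r))
print(f"explained = {r_squared:.4f}, unexplained = {unexplained:.4f}")
```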

Until next time… safe analyzing.

No responses yet

Jun 03 2007

Relationships are a thing of the past…

Published by Wendi under statistics, web analytics

Or are they? I would argue that at the heart of statistics is a line. A line that best fits a sample of data. What is this line? Well, it’s called a Least Squares Regression Line.

A (linear) regression line can be used to describe a collection of data points that have a linear relationship … no more, no less. There is power in knowing whether your data has a linear relationship or not and, if it does, how good a fit this line really is to the data.

Say you wanted to understand why you were seeing various spikes in bounce rates on your site and you had a hunch that the paid keywords that you recently added were the culprit. To prove, or better yet hopefully disprove, that this hypothesis is true, run a regression on a few numbers to see if there is a relationship between the two.

If you were awake in algebra II back in high school, you may remember the equation of a line, but just in case you don’t, here it is…

y = mx + b where m = slope and b = y-intercept

In most statistics books you probably won’t find a line written with the same letters, but what they substitute still retains the same meaning. Since statisticians like to separate themselves from mathematicians, we like to have our own way of writing an equation. So from here on out I’ll refer to the slope as b1 and the y-intercept as b0.

So, back to my problem: I think that the new keywords I added to my SEM campaign are driving my site bounce rate up, but I am really not sure. To check this out, I took % of Searches (delivery of impressions) and Bounce Rates for a 10 week period since the new keywords were added and regressed site bounce rate against % of searches. What I found was what I feared: there was a direct relationship between the two. So in essence, these new keywords had a negative impact on my site traffic. Yikes! I am going to take those down right now!

Ok, back to the regression details. To run your regression there are a few simple calculations that you need to prepare, which are then inserted into the bigger formula. In my Excel file Least Squares Regression you can step through each formula as I walk through them here:

Regression Variables

In the Excel file all the variables and calculations are laid out piece by piece. Initially it helps to calculate each X*Y, X*X (aka X^2), etc., then sum or multiply through where needed. Excel even provides shortcut formulas, but it helps to understand what they are doing before you use them. I have also included a few Excel shortcut formulas for calculating the slope, y-intercept, and the correlation coefficient.
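For anyone who wants to see the piece-by-piece approach outside a spreadsheet, here is a minimal Python sketch built on the same running sums. It uses the standard textbook least squares formulas, which are assumed (not confirmed) to match the ones in the Excel file. The function name and the tiny data set are made up for illustration; the data lie exactly on y = 2x + 1 so the answer is easy to verify by eye.

```python
from math import sqrt


def least_squares(xs, ys):
    """Return (b1, b0, r) from the sums of x, y, x*y, x^2 and y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)

    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)  # slope
    b0 = (sum_y - b1 * sum_x) / n                                  # y-intercept
    r = (n * sum_xy - sum_x * sum_y) / sqrt(
        (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
    )
    return b1, b0, r


# Hypothetical data on the line y = 2x + 1.
b1, b0, r = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b1, b0, r)  # 2.0 1.0 1.0
```

Computing the slope first and then the intercept from it follows the order suggested in the post.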

Take some time to dig through the excel file and next time I am going to go further into what all the pieces mean.

Until next time… safe analyzing.

3 responses so far

Jun 03 2007

Books, To be continued…

Published by Wendi under statistics

I have had several inquiries about good books on entry-level statistics and to be honest I don’t really own any basic statistics books. A majority of the books I have collected over the years were chosen by my professors and read like, well, math books (sorry to all my wonderful professors, but really, if you had chosen anything less, I guess I would have had no reason to come listen to all of your insightful lectures). So from the long list of those that I have studied from, I am not sure I would really recommend any of them as a starter book.

I do know of a great on-line resource book published by StatSoft that I have referenced in the past. The great thing is that the on-line book is free; it walks you through statistical terminology and basic statistics, and goes on to more advanced methodologies as well. The down side is that the examples in the material are focused on the StatSoft software, but the theories and practices are a great starter, especially if you are not sure you are ready to invest in a hardback resource.

In an effort to stock my library with more “user friendly” statistics books I recently purchased Statistics for People Who (Think They) Hate Statistics with SPSS Student Version 13.0 2nd Edition. Once it arrives and I have had some time to thumb through it a bit I will let you know how this book reads. The one reason why I chose this book was its title… very catchy. Ok, not really, but it is catchy. I chose it because it comes with supporting software that is a great tool for analyzing web data from a behavioral standpoint. SPSS stands for Statistical Package for the Social Sciences and I have always believed that analyzing web data is the same as analyzing a data set from a social sciences experiment. I look forward to reviewing the book and will let you know my thoughts very soon. Note: there is a 3rd edition soon to be released, so if you are interested in this book you may want to wait for it.

For those of you who are eager to purchase now…. As a rule of thumb, I find that statistics books pertaining to “business” statistics seem to be fairly basic in nature.

Happy reading.

Until next time… safe analyzing.

4 responses so far

May 25 2007

Predicting their Next Move

Published by Wendi under statistics, predictive analysis

Most sites have a target goal or call to action, be it visiting a landing page, submitting an online form or, better yet, purchasing products. Whatever the end goal is, there are paths that users take to navigate through the site and, hopefully, ultimately find what they are looking for before leaving. Conducting a pathing experiment to understand how users click through from page to page on your site is a good exercise.

Knowing where they have been is great but what if you could predict where new visitors might go in the future to determine the best placements for promotions, coupons, contact forms, etc…

Say, for example, you know that users who visit the CD content area and then click through to CD Players most often go on to visit several other pages like Music CDs, CD disk covers, and DVDs. The order in which they visit these other pages is not much of a factor; what matters is that they visited those pages at some point in their session. If this were the only path users could take on this site, then we’d be done. Place some promos for discount blank CDs and we’re done!

Ok, so realistically you have an insane number of possible paths users can take on your site, and you want to know when would be a good time to include an internal promotion (like a discount coupon that links to a product page with savings) given that they “show interest” in those types of products.

So here lies the problem: how do you know which page has the highest probability of being seen next, given that the visitor has seen some group of previous pages that are related to the promo? Statisticians call this Conditional Probability.

Conditional probability is denoted by P(B|A) and is read as the “probability of event B given event A”.

Since we know that certain pages on the site have been viewed we know a little more about the visitor and what types of interests they may have on the site. Thus the probability of what they might see next has been affected.

Definition: P(B|A) = P(A and B) / P(A)

This means that the probability of event B given that event A occurred is the probability of both events occurring (their intersection) divided by the probability of just event A occurring.

So what do I do with this, you say? Well, here is an example of the application.

Say you have the frequency of top paths for some pages of interest. Example of fictitious data is below and our goal is to determine the placement for a promo linked to Page C:


Calculate the percentages of all the joint probabilities as well as marginal probabilities as shown in the resulting data cube below:


So by inspection of the data you can determine the probability with which each event occurs individually or jointly based on the summary data. But what we really want to know is the conditional probability of event B given event A: if the visitor sees a certain selection of pages, what is the probability that they will then view Page A, Page B, or Page C? In the end we will have determined the best placement for our promotion.

In my example, to calculate each of the conditional probabilities you would take the joint probabilities and divide them by the marginal probabilities.

P(PgA | Pg1/Pg2) = P(Pg1/Pg2 and PgA) / P(Pg1/Pg2)

= 0.161 / 0.415

= 0.389
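The same calculation can be sketched in Python for a whole table of paths at once. The counts below are hypothetical, invented only to show the mechanics; they are not the figures behind the example above, and the helper name `conditional` is made up.

```python
# Hypothetical path counts (previous pages -> next page), for illustration only.
counts = {
    "Pg1/Pg2": {"PgA": 50, "PgB": 30, "PgC": 20},
    "Pg2/Pg3": {"PgA": 10, "PgB": 20, "PgC": 70},
}
total = sum(sum(row.values()) for row in counts.values())


def conditional(path: str, page: str) -> float:
    """P(page | path) = P(path and page) / P(path)."""
    joint = counts[path][page] / total             # P(path and page)
    marginal = sum(counts[path].values()) / total  # P(path)
    return joint / marginal


# Path with the highest chance of leading to the goal page, Page C.
best_path = max(counts, key=lambda path: conditional(path, "PgC"))
print(best_path, round(conditional(best_path, "PgC"), 3))
```

Dividing by the marginal probability is what restricts attention to visitors who actually followed that path, so each row of conditional probabilities sums to 1.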

Here is a chart of all the resulting conditional probabilities:


To visualize this data better let’s look at the conditional probabilities in a basic bar chart in excel:

Conditional Probabilities - http://sheet.zoho.com


You can see that path Page 2 / Page 3 has the highest probability that the visitor would then click through to the goal page C.

So now you can go back to your marketing manager and tell them that the promotion should be posted on Page 3 for those visitors who have viewed page 2 previously.

Until next time… safe analyzing.

4 responses so far

May 20 2007

Statistics 101

Published by Wendi under statistics

Hello & Welcome!

There are a lot of web analytics tool packages available in the market and some even come with *advanced* metrics built right into the interface. You can think of these metrics as statistical measurements that explain user behavior and can even predict what users will do next. But what do they all mean and how can I get the best insights from what I have access to?

In this blog I want to take the statistician’s approach to web analytics and discuss how anyone can leverage data currently available to dig deeper and pull out amazing things. I want to look beyond the basics of web analytics and develop a way of thinking that is beyond the canned reports included in the packages.

Stat 101

In this first post I’d like to cover a few very basic metrics that every analyst should embrace and that I will build on in the future. In most introductory statistics courses, one of the first things you learn is the types of methodologies, one being Descriptive Statistics. In descriptive statistics you look at the data set to describe what it looks like and to describe the basic trends in the data. There are several steps to this methodology, including collecting the data, summarizing the data, understanding the underlying distribution, and visually displaying the data in a graph. I won’t go into detail on all these steps in this post; rather, I’ll focus on the summary statistics that are used in this process. To summarize the data you look to understand the location, dispersion and shape of the distribution.

Six basic statistical measures used in summarizing the location or central tendency of the data set are:

· Minimum = Smallest value in the data set; X(1) where {X(1), X(2), …, X(n)} is the ordered data set

· 1st Quartile (25th Percentile) = X.25; data value at the boundary of 25% of the data

· Mean (average) = (X1 + X2 + … + Xn) / n

· Median (50th Percentile) = if n is odd then X((n+1)/2); if n is even then (X(n/2) + X(n/2+1))/2; data halfway through the ordered data

· 3rd Quartile (75th Percentile) = X.75; data value at the boundary of 75% of the data

· Maximum = Largest value in the data set; X(n) where {X(1), X(2), …, X(n)} is the ordered data set

Application

Now that you know a few new metrics to look at, like the 1st and 3rd quartiles, why would you look at them and what are they telling you? For example, let’s take “Time on Page”. Say you have a sample data set that looks like the following:

Day Time on Page
1 2.45
2 1.75
3 2.66
4 4.98
5 1.69
6 1.89
7 2.33
8 2.48
9 2.15
10 2.01

Summary Statistics:

Minimum 1.69
Q1 1.92
Mean 2.44
Median 2.24
Q3 2.47
Maximum 4.98
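These six measures can be reproduced with a short Python sketch using only the standard library (Python 3.8 or later for `statistics.quantiles`). Quartile conventions differ between tools, so the quartiles depend on the interpolation method chosen; the `method="inclusive"` setting used here is assumed to be the one that lines up with Excel’s QUARTILE() function.

```python
from statistics import mean, median, quantiles

# Time on Page for the ten days in the sample data set.
time_on_page = [2.45, 1.75, 2.66, 4.98, 1.69, 1.89, 2.33, 2.48, 2.15, 2.01]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3.
q1, q2, q3 = quantiles(time_on_page, n=4, method="inclusive")

summary = {
    "Minimum": min(time_on_page),
    "Q1": q1,
    "Mean": mean(time_on_page),
    "Median": median(time_on_page),
    "Q3": q3,
    "Maximum": max(time_on_page),
}
for name, value in summary.items():
    print(f"{name:<8}{value:.2f}")
```

On a data set 100 times larger the code stays the same, which is the practical appeal of summary statistics over eyeballing the raw values.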

As you inspect the data it may be easy enough to see a large spike in time on page on Day 4, but imagine this data set was real and was 100 times the volume. Then a quick glance at the data isn’t so easy any more. This is where summary statistics come in handy.

Looking at the summary statistics, the first thing that should come to mind is ‘why is the maximum so high compared to the other values?’ This should lead to deeper inspection of the site metric and a review of any major changes to explain the large spike on Day 4. Maybe there was a special campaign, press release, or release of new features that caused this large value. In any case, knowing why there was a large jump in a particular metric can build a path for better insights. This can even bring up the notion of ‘outliers’ but I’ll leave that for another discussion.

Till’ next time…. I wish you safe analyzing.

7 responses so far