Archive for May, 2007

May 25 2007

Predicting their Next Move

Published by Wendi under statistics, predictive analysis

Most sites have a target goal or call to action, whether it is visiting a landing page, submitting an online form or, better yet, purchasing products. Whatever the end goal is, there are paths that users take to navigate through the site and, hopefully, ultimately find what they are looking for before leaving. Conducting a pathing experiment to understand how users click through from page to page on your site is a good exercise.

Knowing where they have been is great, but what if you could predict where new visitors might go in the future to determine the best placements for promotions, coupons, contact forms, etc.?

Say, for example, you know that users who visit the CD content area and then click through to CD Players most often go on to visit several other pages like Music CDs, CD disk covers, and DVDs. The order in which they visit these other pages down the road is not much of a factor; what matters is that they visited those pages at some point in their session. If this were the only path users could take on this site, we'd be done. Place some promos for discount blank CDs and that's it!

OK, so realistically you have an insane number of possible paths users can take on your site, and you want to know when would be a good time to include an internal promotion (like a discount coupon that links to a product page with savings) given that they “show interest” in those types of products.

So here lies the problem: how do you know which page has the highest probability of being seen next, given that the visitor has already seen some group of previous pages related to the promo? Statisticians call this conditional probability.

Conditional probability is denoted by P(B|A) and is read as the “probability of event B given event A”.

Since we know that certain pages on the site have been viewed we know a little more about the visitor and what types of interests they may have on the site. Thus the probability of what they might see next has been affected.

Definition: P(B|A) = P(A and B) / P(A)

This means that the probability of event B given that event A occurred is the probability of both events occurring together (their intersection) divided by the probability of event A occurring.

So what do I do with this, you ask? Well, here is an example of the application.

Say you have the frequency of top paths for some pages of interest. An example of fictitious data is below; our goal is to determine the placement for a promo linked to Page C:
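As a minimal sketch of the definition, the snippet below estimates P(B|A) from session counts. The counts and the function name `conditional_probability` are made up for illustration and do not come from any real site.

```python
def conditional_probability(count_a_and_b, count_a, total):
    """Estimate P(B|A) = P(A and B) / P(A) from raw session counts."""
    if count_a == 0:
        raise ValueError("P(A) is zero, so P(B|A) is undefined")
    p_a_and_b = count_a_and_b / total  # joint probability P(A and B)
    p_a = count_a / total              # marginal probability P(A)
    return p_a_and_b / p_a

# Hypothetical counts: 1,000 sessions, 400 viewed page A, 100 viewed both A and B.
p_b_given_a = conditional_probability(count_a_and_b=100, count_a=400, total=1000)
print(round(p_b_given_a, 3))
```

Note that the total cancels out, so P(B|A) is simply the share of page A sessions that also included page B.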


Calculate the percentages of all the joint probabilities as well as marginal probabilities as shown in the resulting data cube below:


So by inspection of the data you can determine the probability with which each event occurs individually or jointly, based on the summary data. But what we really want to know is the conditional probability of event B given event A. That is, if the visitor sees a certain selection of pages, what is the probability that they will then view Page A, Page B, or Page C? In the end we will have determined the best placement for our promotion.

In my example, to calculate each of the conditional probabilities you would divide the joint probability by the corresponding marginal probability.

P(PgA | Pg1/Pg2) = P(Pg1/Pg2 and PgA) / P(Pg1/Pg2)

= 0.161 / 0.415

= 0.389
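The same calculation can be repeated for every path and next page. The sketch below does this under stated assumptions: the session counts are invented for illustration (they are not the data from this post), `Pg1/Pg3` is a hypothetical third path, and `conditional_probabilities` is just a helper name chosen here.

```python
from collections import defaultdict

# Hypothetical session counts keyed by (previous path, next page).
# These numbers are invented for illustration only.
path_counts = {
    ("Pg1/Pg2", "PgA"): 40, ("Pg1/Pg2", "PgB"): 35, ("Pg1/Pg2", "PgC"): 25,
    ("Pg2/Pg3", "PgA"): 20, ("Pg2/Pg3", "PgB"): 30, ("Pg2/Pg3", "PgC"): 50,
    ("Pg1/Pg3", "PgA"): 30, ("Pg1/Pg3", "PgB"): 45, ("Pg1/Pg3", "PgC"): 25,
}

def conditional_probabilities(counts):
    """Return P(next page | path) for every (path, next page) pair."""
    total = sum(counts.values())
    # Marginal probability of each path: sum of its joint probabilities.
    marginal = defaultdict(float)
    for (path, _next_page), n in counts.items():
        marginal[path] += n / total
    # Conditional = joint / marginal.
    return {
        (path, next_page): (n / total) / marginal[path]
        for (path, next_page), n in counts.items()
    }

cond = conditional_probabilities(path_counts)

# Pick the path most likely to lead to the goal page, Page C.
best_path = max(
    (path for path, next_page in cond if next_page == "PgC"),
    key=lambda path: cond[(path, "PgC")],
)
print(best_path, round(cond[(best_path, "PgC")], 3))
```

With real data you would replace `path_counts` with the path frequencies exported from your analytics tool.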

Here is a chart of all the resulting conditional probabilities:


To visualize this data better, let’s look at the conditional probabilities in a basic bar chart in Excel:

Conditional Probabilities - http://sheet.zoho.com


You can see that the path Page 2 / Page 3 has the highest probability that the visitor will then click through to the goal page, Page C.

So now you can go back to your marketing manager and tell them that the promotion should be posted on Page 3 for those visitors who have previously viewed Page 2.

Until next time… safe analyzing.


May 20 2007

Statistics 101

Published by Wendi under statistics

Hello & Welcome!

There are a lot of web analytics tool packages available on the market, and some even come with *advanced* metrics built right into the interface. These metrics are statistical measurements that explain user behavior and can even predict what users will do next. But what do they all mean, and how can I get the best insights from what I have access to?

In this blog I want to take the statistician’s approach to web analytics and discuss how anyone can leverage data currently available to dig deeper and pull out amazing things. I want to look beyond the basics of web analytics and develop a way of thinking that is beyond the canned reports included in the packages.

Stat 101

In this first post I’d like to cover a few very basic metrics that every analyst should embrace and that will be built on in the future. In most introductory statistics courses, one of the first things you learn is the types of methodologies, one being Descriptive Statistics. In descriptive statistics you look at the data set to describe what it looks like and try to describe the basic trends in the data. There are several steps to this methodology, including collecting the data, summarizing the data, understanding the underlying distribution, and visually displaying the data in a graph. I won’t go into detail on all these steps in this post; rather, I’ll focus on the summary statistics that are used in this process. To summarize the data you look to understand the location, dispersion and shape of the distribution.

Six basic statistical measures used in summarizing the location or central tendency of the data set are:

· Minimum = Smallest value in the data set; X(1) where {X(1), X(2), …, X(n)} is the ordered data set

· 1st Quartile (25th Percentile) = X.25; data value at the boundary of 25% of the data

· Mean (average) = (X1 + X2 + … + Xn) / n

· Median (50th Percentile) = if n is odd then X((n+1)/2); if n is even then (X(n/2) + X(n/2+1)) / 2; data value halfway through the ordered data

· 3rd Quartile (75th Percentile) = X.75; data value at the boundary of 75% of the data

· Maximum = Largest value in the data set; X(n) where {X(1), X(2), …, X(n)} is the ordered data set

Application

Now that you know a few new metrics to look at, like the 1st and 3rd quartiles, why would you look at them and what are they telling you? For example, let’s take “Time on Page”. Say you have a sample data set that looks like the following:

Day Time on Page
1 2.45
2 1.75
3 2.66
4 4.98
5 1.69
6 1.89
7 2.33
8 2.48
9 2.15
10 2.01

Summary Statistics (quartiles computed by linear interpolation between the ordered values):

Minimum 1.69
Q1 1.92
Mean 2.44
Median 2.24
Q3 2.47
Maximum 4.98
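For readers who want to compute these measures themselves, here is a minimal sketch using Python's standard `statistics` module on the ten Time on Page values from the sample table above. The helper name `six_number_summary` is made up for illustration, and the quartiles use the "inclusive" linear interpolation convention, which is an assumption; other tools and conventions can report slightly different Q1 and Q3 values.

```python
import statistics

# Time on Page for Days 1-10, copied from the sample table above.
time_on_page = [2.45, 1.75, 2.66, 4.98, 1.69, 1.89, 2.33, 2.48, 2.15, 2.01]

def six_number_summary(values):
    """Return the six summary measures described in this post."""
    # method="inclusive" interpolates linearly between ordered values,
    # treating the data as the whole population (an assumed convention).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Minimum": min(values),
        "Q1": q1,
        "Mean": statistics.mean(values),
        "Median": median,
        "Q3": q3,
        "Maximum": max(values),
    }

summary = six_number_summary(time_on_page)
for name, value in summary.items():
    print(f"{name:8s} {value:.2f}")
```

The same function works unchanged on a data set 100 times larger, which is where summary statistics start to pay off.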

As you inspect the data it may be easy enough to see a large spike in time on page on Day 4; but imagine this data set was real and 100 times the volume. Then a quick glance at the data isn’t so easy any more. This is where summary statistics come in handy.

Looking at the summary statistics, the first thing that should come to mind is ‘why is the maximum so high compared to the other values?’ This should lead to a deeper inspection of the site metric and a review of any major changes to explain the large spike on Day 4. Maybe there was a special campaign, press release, or release of new features that caused this large value. In any case, knowing why there was a large jump in a particular metric can build a path to better insights. This can even bring up the notion of ‘outliers’, but I’ll leave that for another discussion.

Till next time… I wish you safe analyzing.
