Jul 02 2007
Stop Collecting So Much Data…
… and stop misusing data mining - is Peter Fader’s message to CIOs. The CIO Insight interview with Peter highlights the strengths and weaknesses of applied data mining in the business world, and I have to agree with some of his thoughts, especially on the topic of using probabilities to measure the propensity of behavior.
Measuring the probability of users’ actions can be powerful if done properly, and it can easily be done in Excel.
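For readers who would rather script it than use a spreadsheet, here is a minimal sketch of the same idea in Python: the propensity of a behavior is estimated as the share of users in a group who actually did it. The function name, the visit buckets, and the sample records are all made up for illustration; they are not from the interview or from any real data set.

```python
from collections import defaultdict


def propensity_by_segment(records):
    """Estimate the probability of an action per segment.

    records: iterable of (segment, acted) pairs, where acted is True/False.
    Returns {segment: share of users in that segment who acted}.
    """
    counts = defaultdict(lambda: [0, 0])  # segment -> [acted, total]
    for segment, acted in records:
        if acted:
            counts[segment][0] += 1
        counts[segment][1] += 1
    return {seg: acted / total for seg, (acted, total) in counts.items()}


# Hypothetical sample: (visit bucket, did the user purchase?)
sample = [
    ("1-2 visits", False), ("1-2 visits", False),
    ("1-2 visits", True), ("1-2 visits", False),
    ("3+ visits", True), ("3+ visits", True),
    ("3+ visits", False), ("3+ visits", True),
]

rates = propensity_by_segment(sample)
for segment, rate in rates.items():
    print(f"{segment}: {rate:.0%} purchased")
```

With counts this small the estimates are very noisy, so treat this as the shape of the calculation rather than something to act on.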
The trap I see so many people fall into is trying to analyze too many variables at once and not taking the time to even look at what they are throwing into the model. If you really want to, you could probably find a relationship between how fast the sun rises and the stock market’s closing prices, but does that make any logical sense? Then why would you try to build relationships between buying behavior and whether the customer owns an Apple iPhone if you are selling shoes? So many marketers want to know every little detail about their customer – demographics, psychographics, what kind of car they drive, etc…
When you throw too much data at a problem, you will have a hard time keeping your variables independent of one another. You need to consider the structure of your data carefully; otherwise your predictions can lead you to false conclusions.
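To see how easily a "relationship" turns up when you test enough variables, here is a small simulation sketch. Everything in it is random noise by construction: the target, the 200 candidate variables, and the sample of 30 customers are all arbitrary choices of mine, so any correlation it finds is meaningless.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(2007)  # fixed seed so the run is repeatable
n_customers, n_variables = 30, 200

# A target and 200 candidate variables that are pure noise.
target = [rng.gauss(0, 1) for _ in range(n_customers)]
noise = [[rng.gauss(0, 1) for _ in range(n_customers)]
         for _ in range(n_variables)]

correlations = [abs(pearson(column, target)) for column in noise]
strongest = max(correlations)

print(f"first variable alone: |r| = {correlations[0]:.2f}")
print(f"strongest of {n_variables}:     |r| = {strongest:.2f}")
```

The strongest of many unrelated variables will usually look more convincing than any single one deserves, which is why it pays to ask whether a variable makes logical sense before it goes into the model.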
Some rules of thumb from my perspective:
- Enhance your data with the voice of the customer (VOC) – run surveys online or by phone (mailed surveys are costly and too time consuming). This is a great way to get anecdotal data you don’t see in clickstream data.
- Familiarize yourself with all the variables and truly understand what they mean – not what you think they mean.
- Don’t use variables that you can’t reproduce easily. If it’s too hard to calculate, find, or collect from the database then you probably shouldn’t use it. It’s impractical.
- Only include variables that make sense. Add questionable variables later and determine whether they degrade or enhance the predictability. In the end you may not even find a reason to test those questionable variables.
- Make sure not to include variables that are variations of each other. If you include % of visits this month, then don’t include the frequency of visits this month too. This can cause problems with multicollinearity.
- Save enough data for testing! A 90/10 split is the bare minimum, but I recommend at least 80/20. That is, use at most 90% of your data for development and hold back the remaining 10% or more to validate the model. You need to know how predictive your model is before you take it to market.
- If you want to get fancy, look at a repeated measures DOE structure for analyzing transactional data.
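Two of the rules above, holding data back for validation and avoiding variables that are variations of each other, are easy to sketch in code. This is a rough illustration only: the function names, the 0.9 correlation cutoff, and the sample columns are my own assumptions, and a pairwise correlation check is just one simple way to spot redundant variables.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and hold back test_fraction of them for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (development, validation)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def redundant_pairs(columns, threshold=0.9):
    """Return pairs of variable names whose |correlation| >= threshold."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                flagged.append((a, b))
    return flagged


# Hypothetical data for ten customers.
visits = [2, 5, 1, 8, 4, 10, 3, 6, 7, 9]
columns = {
    "visits_this_month": visits,
    # Same information expressed as a percentage of all visits.
    "pct_of_visits_this_month": [100 * v / sum(visits) for v in visits],
    "days_since_last_order": [12, 3, 30, 7, 7, 45, 2, 21, 14, 5],
}

flagged = redundant_pairs(columns)
print("variables that are variations of each other:", flagged)

rows = list(range(10))  # stand-in for ten customer records
development, validation = holdout_split(rows, test_fraction=0.2)
print(len(development), "rows for development,", len(validation), "for validation")
```

Here the count and the percentage of visits get flagged as a redundant pair, so only one of them should go into the model, and two of the ten rows are set aside and never touched while the model is being built.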
Until next time… safe analyzing.