Jul 02 2007
Stop Collecting So Much Data…
… and stop misusing data mining: that is Peter Fader’s message to CIOs. A CIO Insight interview with Peter highlights the strengths and weaknesses of applied data mining in the business world, and I have to agree with some of his thoughts, especially on the topic of using probabilities to measure the propensity of behavior.
Measuring the probability of users’ actions can be powerful if used properly, and it can easily be done in Excel.
The trap I see so many people fall into is trying to analyze too many variables at once and not taking the time to even look at what they are throwing into the model. If you really wanted to, you could probably find a relationship between how fast the sun rises and stock market closing prices, but does that make any logical sense? Then why would you try to build a relationship between buying behavior and owning an Apple iPhone if you are selling shoes? So many marketers want to know every little detail about their customers: demographics, psychographics, what kind of car they drive, etc.
When you throw too much data at a problem, you will have a hard time keeping your variables independent. You need to consider the structure of your data carefully; otherwise your predictions can lead to false conclusions.
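Here is a minimal sketch of that trap, assuming nothing but random noise: a made-up "target" and a pile of made-up "variables" that have no real relationship to it. The function names and the sample size of 30 customers are my own choices for illustration. The more noise variables you screen, the stronger the best "relationship" you will stumble on.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_noise_correlation(n_customers, n_variables, seed=0):
    """Largest |r| between a random target and n_variables random predictors."""
    rng = random.Random(seed)
    # Hypothetical target: pure noise, unrelated to anything.
    target = [rng.gauss(0, 1) for _ in range(n_customers)]
    best = 0.0
    for _ in range(n_variables):
        noise = [rng.gauss(0, 1) for _ in range(n_customers)]
        best = max(best, abs(pearson(noise, target)))
    return best


if __name__ == "__main__":
    # Screen more and more meaningless variables against the same target.
    for k in (1, 10, 100, 1000):
        print(k, "variables -> strongest |r| =",
              round(strongest_noise_correlation(30, k), 2))
```

None of these variables means anything, yet with enough of them one will look predictive. That is why each variable should earn its place on logic before it goes into the model.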
Some rules of thumb from my perspective:
- Enhance your data with the VOC (voice of the customer): take surveys online or by telephone (mailed surveys are costly and too time-consuming). This is a great way to get anecdotal data you don’t see in clickstream data.
- Familiarize yourself with all the variables and truly understand what they mean – not what you think they mean.
- Don’t use variables that you can’t reproduce easily. If it’s too hard to calculate, find, or collect from the database then you probably shouldn’t use it. It’s impractical.
- Only include variables that make sense; add questionable variables later and determine whether they degrade or enhance predictive power. In the end you may not even find a reason to test those questionable variables. Make sure not to include variables that are variations of each other: if you include % of visits this month, don’t also include the frequency of visits this month. This can cause problems with multicollinearity.
- Save enough data for testing! The minimum split is 90/10, but I recommend at least 80/20. That is, at a minimum use 90% of your data for development and hold back the remaining 10% for validation of the model. You need to know how predictive your model is before you take it to market.
- If you want to get fancy, look at a repeated measures DOE structure for analyzing transactional data.
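The rule above about variables that are variations of each other can be screened for before any modeling. This is a rough sketch, not a full multicollinearity diagnostic: it only flags pairs of columns whose correlation is very high. The column names, the toy numbers, and the 0.9 cutoff are all assumptions for illustration, and I am reading "% of visits" as each customer's share of total visits.

```python
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def redundant_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical toy data: one value per customer.
visits_this_month = [1, 4, 2, 8, 5, 3, 7, 6]
total_visits = sum(visits_this_month)
columns = {
    "visit_count": visits_this_month,
    # The same information rescaled to a percentage of all visits.
    "visit_share_pct": [100 * v / total_visits for v in visits_this_month],
    "days_since_signup": [40, 12, 300, 75, 9, 180, 33, 210],
}

for name_a, name_b, r in redundant_pairs(columns):
    print(f"{name_a} and {name_b} look redundant (r = {r}); keep only one")
```

A pairwise check like this will miss redundancy that is spread across three or more variables, so treat it as a first pass only.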
Until next time… safe analyzing.
Hi Wendi,
Nice post. I just read that article in CIO Insight after reading Jim Novo’s blog. I think you illustrate well how false dependencies can be made.
Your point:
1. is right on: I have been an advocate for adding attitudinal analysis to the behavioral one for a long time. But wouldn’t “reasons” for visits add to the extra variables? Isn’t there a risk of adding to false dependencies? This being said, I am a big proponent of free will, and believe there is still some left in us, the consumers. This means I find a lot of explanation in the “why” people say they do what they do.
3. Nice advice. Too much friction definitely doesn’t help the whole process.
4. “variables that make sense”, yes, but I think this is the whole question here: how does one recognize what makes sense, I mean, with the stuff that’s not obvious (e.g. the weather, etc.)? Isn’t there an element of discovery and learning?
5. Could you explain a little more?
6. ?
Hi Jacques, Thanks for the thoughts.
4. When building models you don’t want to include variables that may not make sense when you try to explain the model. In some cases you have to be cognizant of classes protected by law. You can’t build a credit score with demographics like age, race, etc. Also, I am thinking of this from a practicality standpoint. Try to make the model as simple as possible. That makes it easier to apply to future events. But you are right that it does take away some of the element of surprise.
5. When you use data mining techniques to build models you need to test the predictive accuracy, computational speed, robustness, scalability, and interpretability (point #4 above). Think of this as taking a test twice. If the second test has the same questions, you’d expect a better score the second time around. If it covers the same topic with a different set of questions, you may or may not do better. Scoring a model on the data it was built with is like retaking the same questions. So many people don’t hold back enough data to conduct a sound validation of the model. The validation process helps identify the accuracy of the overall model.
6. There is a DOE (design of experiments) structure called Repeated Measures that I think applies better to transactional data, or any data that has an order/sequence. This technique is not used enough (in my opinion). So many people just use aggregate data, which may be enough, but knowing when something will happen with better precision is golden. Peter Fader touches on this in his interview.
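To make the test-taking analogy in point 5 concrete, here is a small sketch with made-up data. The "model" is deliberately silly: it just memorizes every customer it has seen. The labels are coin flips, so there is nothing real to learn. The class and function names are hypothetical, and the 80/20 split follows the rule of thumb from the post.

```python
import random
from collections import Counter


def holdout_split(rows, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (development, holdout) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


class MemorizingModel:
    """Deliberately overfit: remembers the label of every row it was fit on."""

    def fit(self, rows):
        self.seen = {features: label for features, label in rows}
        # Fall back to the most common label for customers never seen before.
        self.default = Counter(label for _, label in rows).most_common(1)[0][0]
        return self

    def predict(self, features):
        return self.seen.get(features, self.default)


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model.predict(f) == label for f, label in rows) / len(rows)


# Hypothetical data: features are (customer_id, visits); the label is random,
# so no model can genuinely predict it better than chance.
rng = random.Random(1)
rows = [((i, rng.randint(1, 10)), rng.random() < 0.5) for i in range(1000)]

development, holdout = holdout_split(rows, holdout_fraction=0.2)
model = MemorizingModel().fit(development)

print("same questions (development):", accuracy(model, development))
print("new questions (holdout):     ", accuracy(model, holdout))
```

The development score is perfect because the model has already seen every answer. Only the holdout score says anything about how it will do in the market.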
Hope this helps!
Wendi