Thursday, March 27, 2008

AAAI Social Information Processing Symposium Summary

I apologize for not getting back sooner with results and thoughts from the symposium. As I mentioned in my previous post, Matt and I attended the AAAI 2008 Social Information Processing Symposium. Matt presented on Social Capital in the Blogosphere, and it seemed to be well received by the community. The audience followed up our presentation with a number of questions about possible actions and experiments that could be taken within our framework for measuring social capital. It reinforced our opinion that the work we have done provides an intuitive way to understand a seemingly abstract topic like social capital. There is still a lot of work to be done in determining what constitutes an explicit or implicit link within the blogosphere, but we are on our way.

Here are several other thoughts specifically related to our research.
  • Bonding links (a relationship with someone similar to you) should be easier to make than bridging links (a relationship with someone different from you). Thus, in a social network representation, you should probably see substantially more bonding than bridging taking place.
  • There is a cost associated with forming a bonding or bridging link that we have not addressed up to this point. In general, this cost involves both the type (bonding/bridging) of link and also the individual social capital of the person you are attempting to form a link with.
  • Nearly everything in the social information processing domain, when graphed, seems to follow a power law. Does individual social capital follow this distribution as well, i.e., do certain individuals have much more social capital than the population at large? If so, is there any way to leverage the social capital of all the individuals found in the long tail of the graph? For example, the wisdom-of-the-masses approach is working wonderfully in Wikipedia, where information may in some cases be more accurate than that of the so-called authorities. It's all theory, but these are some ideas I've been thinking about.
  • High cost, high reward. Blogging can take a lot of time, because good writing takes time. It takes substantially more time than other activities we heard about at the conference, such as tagging, posting pictures, or rating a product. But with the high cost come high rewards: as blogging has become mainstream, it has become a powerful tool for advancing ideas, products, companies, and careers. Ultimately, we need to get some type of reward for our involvement in a social network, even if it is just personal fulfillment, or our activity will dwindle.
Other cool things about the symposium.
  • Stanford is the most beautiful campus I've ever been on.
  • We heard about a cool new project called Freebase that seems to have the potential to someday replace Wikipedia as the best source for free, open content information. It looks smooth, provides easy ways to query for information and has an awesome API. It could be the next big thing. Matt posted his thoughts about it as well at his blog.
  • Gustavo Glusman presented one of the coolest social network graphs I've ever seen (the flickrverse) which can be found here.
  • Meeting a wide variety of people. There was representation from both academia and industry from a variety of locations. Plenty of people were from California, but there was also representation from other parts of the United States, the UK, Germany, China, Taiwan, Switzerland and probably more that I am forgetting. It was a great group to become involved in.
  • Getting to learn from those who know more than I do. Everyone had their own expertise in specific social networks and with specific ideas. I learned a lot about social networking principles that are somewhat different than those found here in the blogosphere.
Matt posted some of his thoughts about specific papers on his blog here and here for those who are interested. For any out there who would consider attending next year, go for it, it was a wonderful experience.

Wednesday, March 26, 2008

Hello from Stanford

Matt Smith and I will be at the AAAI 2008 Spring Symposium which is being held at Stanford University from now until Friday. We are attending the Social Information Processing session and will be presenting on a paper we co-authored with Christophe entitled Social Capital in the Blogosphere: A Case Study. The morning and afternoon sessions were great and I'll give a rundown later on. Matt will be presenting in about an hour and a half so I'll post about how that went and everything else we are learning here.

Saturday, March 22, 2008

On Correlation versus Causation

Among the common mistakes made by data miners who lack training in statistics is the confusion between correlation and causation, often arising from the difference between observational studies and random controlled trials. To start with, here are some simple definitions.
  • There is a relation of (positive) correlation between two random variables when high values of one are likely to be associated with high values of the other.
  • There is a relation of cause and effect between two random variables when one is a determinant of the other.
It is easy to see from these definitions that:
  • Causation implies correlation, but correlation does not necessarily imply causation.
  • Correlation is easy to establish, causation is not.
As a matter of fact:
  • Random controlled trials establish causation.
  • Observational studies only bring out correlation.
One of the main reasons correlation, although often appealing, cannot be "safely" acted upon as if it were causation lies in the potential presence of confounding variables. A confounding variable is one that affects the variable of interest but has either not been considered or not been controlled for (whence the names "confounding" and "lurking"). Consider the following simple example of a confounding effect. Suppose that a very profitable customer C has placed your company BetterSoft in competition with another company GoodSoft to test your relative abilities to develop good software. The task is to design an algorithm to solve a class P of problems. Your company produces algorithm A, while the other company produces algorithm B. Both you and your competitor are asked to run your own batch of 350 tests and report how often your algorithm gave acceptable solutions (as defined by C). GoodSoft comes out on top with a score of 83% against only 78% for your algorithm. Just as C is about to award its lucrative contract to GoodSoft, you realize that the problems in class P are not all of the same complexity. In fact, it turns out that there are two clear levels of difficulty: simple and complex. You ask the customer to collect more detailed data from GoodSoft and yourself, namely splitting the 350 test problems into simple and complex problems. The results, when complexity is thus factored in, are as follows.
  • Simple problems: A solves 81 out of 87 and B solves 234 out of 270
  • Complex problems: A solves 192 out of 263 and B solves 55 out of 80
Should this additional information change C's decision as to which company to hire? Of course. Although GoodSoft does better overall, it actually performs worse on both the simple problems and the complex problems. Such situations are also known as examples of Simpson's paradox. It is a paradox because, although mathematically correct, it is somewhat counterintuitive. The "variable" complexity in this example is a confounding variable, because it interacts with the calculated outcome in a way that may easily be overlooked but may have an adverse effect on the conclusions reached. Another well-known instance of Simpson's paradox arises in some US presidential elections, where the winner does not carry the popular vote (i.e., the tally of individual votes gives the opponent as the winner). The confounding variable in this case is the electoral college.
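The arithmetic behind the reversal is easy to verify. The following sketch uses exactly the counts from the example above (only the code itself is new):

```python
# Simpson's paradox with the BetterSoft (A) vs. GoodSoft (B) counts above.
# (solved, total) per difficulty level, taken from the example in the text.
results = {
    "A": {"simple": (81, 87), "complex": (192, 263)},
    "B": {"simple": (234, 270), "complex": (55, 80)},
}

def rate(solved, total):
    """Fraction of acceptable solutions."""
    return solved / total

for algo, strata in results.items():
    for level, (solved, total) in strata.items():
        print(f"{algo} {level}: {rate(solved, total):.1%}")
    overall_solved = sum(s for s, _ in strata.values())
    overall_total = sum(t for _, t in strata.values())
    # A wins within every stratum, yet B wins overall (78% vs. ~83%),
    # because B ran far more of the simple problems.
    print(f"{algo} overall: {rate(overall_solved, overall_total):.0%}")
```

The reversal comes entirely from the unequal mix: B attempted 270 simple problems to A's 87, so B's overall score is dominated by the easy stratum.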
In the medical domain, for example, it is critical to discover causal rather than merely correlational relationships. One cannot take the risk of treating the "wrong" cause of a particular ailment. The same is true of many other situations outside of medicine. Hence, data miners would do well to understand confounding effects and use that knowledge both in the design of the experiments they run and in the conclusions they draw from experiments in general. I have been guilty of "jumping the gun" myself and reporting results that clearly ignored possible confounding effects.

Let me turn now to more mundane business applications. In reaction to a (much shorter) comment I posted about the difference between correlation and causation here, a couple of individuals reacted as follows (I reproduce some of that conversation here for completeness):

  • (Jaime) - I don't think that insurance companies, or any other business that would use data mining, would or necessarily should care about the difference between correlation and causation in factors over which they have no control (exceptions, of course, for anything medical or legal). If they can determine that people with freckles have fewer car accidents, why shouldn't they offer people with freckles lower rates?
  • (Will) - Jaime makes a good point. The question of correlation versus causation will be of only philosophical interest to a data mining practitioner, assuming that the underlying behavior being modeled does not change (and this will often be a safe bet). An illustration should make this subtlety clear. Suppose that insurance data indicates that people who play the board game Monopoly are better life insurance risks than people who do not. An insurance company might very well like to take advantage of such knowledge. Is there necessarily a causal arrow between these two items? No, of course not. Monopoly might not "make" someone live longer, and living longer may not "make" someone play Monopoly. Might there exist another characteristic which gives rise to both of these items (such as being a home-body who avoids death by automobile)? Yes, quite possibly. The insurance company does not care, as long as the relationship continues to hold.

This brings up an interesting point, of course. Is the matter of causation versus correlation only a philosophical one, with little bearing in practice? (A little bit like the No Free Lunch Theorem, which is a great theoretical result but seems to have little real impact in practical applications of machine learning; but that is another discussion: see here for details on this unrelated but interesting topic.) Let me try to address this here a little bit (much of this is also found in my response on the above blog). The statistician (I use the term loosely, as I am not a statistician myself) seeks the true cause, the one that remains valid through time. On the other hand, the (business) practitioner seeks mainly utility or applicability, which may become invalid over time but serves him/her well for some reasonable amount of time. Under this view of the world, I think it is possible to reconcile the two perspectives. Indeed, one can see that the statements "assuming that the underlying behavior being modeled does not change" and "as long as the relationship continues to hold" may be interpreted (in some way, see below) as effectively equivalent to what statisticians regard as "controlling for variables". By taking this kind of dynamic approach, where the relationship (or behavior) is "continuously" monitored for validity and the action is taken only as long as that relationship holds, the user is, in effect, relieved from the problem of lurking variables. Let me illustrate on Will's example. Statisticians would indeed argue that there may be a confounding variable that explains the insurance company's finding, one that has nothing to do with playing Monopoly. Will proposed one: "being a home-body". I'll continue the argument with that one. In this case, it may therefore be that there are more home-body Monopoly players than not, and it is the "home-bodyness" (if such a word exists) that explains the lower risk for life insurance (and not the Monopoly playing).
Now, a statistician would be right in this case, and if one had to come up with the "correct" answer and build a model that remains accurate for now AND the future, one would have to accept the statistician's approach and build the model using home-bodyness rather than Monopoly playing. There is little arguing here. I think that what Will and Jaime are getting at is that there is a way to, in some sense, side-step this issue; namely: monitor the relationship. Indeed, if I keep on looking and checking that the correlation continues to hold, then I don't care about any confounding effect. If there are no confounding effects, then the correlation also manifests a causation and I am safe; if there are some, they will become manifest over time as the observed correlation weakens. Hence, I can choose at that time to invalidate my model. But in the meantime, it served me well, was accurate, and I did not worry about controlling anything. Going back to the example, as long as the correlation is strong, I am OK. If it turns out that it is home-bodyness that causes the lower risk, I may eventually see more and more non-Monopoly players with low risk who also turn out to be home-bodies. In this case, the originally observed correlation will decrease, telling me that I may wish to discontinue the use of my model.

The distinction may be viewed as only of philosophical interest, at least in the context of such business cases. Again, in medicine, one may have a different perspective, as also pointed out by Jaime and Will. One of the drawbacks of the "correlation-driven" approach is that when the model is no longer valid (as seen by the decreasing correlation value), the practitioner has no idea what the cause may be and is thus left with no information as to where to go next. Then again, as suggested by Jaime and Will, maybe he/she does not care. From a strictly business standpoint, he/she was able to quickly build a model with high utility (even if only for a shorter period of time) instead of having to expend a lot of resources to build a "causation" model, with the risk of not doing any better, as not all confounding can ever be controlled for! (In fact, there are even situations where the controlled experiments that would be necessary cannot be run; see here for a fun example.)

After all is said and done, and more has been said than done :-), one should be aware of confounding effects (and Simpson's paradox) and know how to deal with them: 1) stick to strictly random controlled experiments; or 2) use observations, but handle them with careful and continuous monitoring.

Wednesday, March 5, 2008

Ian Ayres' Super Crunchers Book

I recently came across Ian Ayres' book: Super Crunchers. It is a nice read. Ayres essentially makes the case for number crunching (data mining for many of us) in all aspects of business and social life. The book describes a large number of case studies where number crunching has been successfully applied (e.g., wine quality, teaching methods, medical practices, etc.), often providing answers that challenge traditional wisdom. The examples are rather compelling. Most of the studies rely exclusively on random controlled trials and the use of regression techniques. Yet, I think this is a great book for people starting in data mining or looking for good reasons to begin. (The other nice thing is that the book is very cheap: less than $20 on Amazon!). Enjoy!

Spring Research Conference

The 22nd Annual Spring Research Conference for the College of Physical and Mathematical Sciences is coming up on Saturday, March 15th. The current presentation schedule for the conference can be found here. Six members of the lab will be presenting research at the conference.
We're happy about our representation and it is shaping up to be a great conference.