Data Mining Lab: February 2008

Wednesday, February 27, 2008

Data Mining Lab -- Experience is Key

My name is Nathan Davis. I've been a member of the Data Mining Lab at BYU for 3 years now, and have had a wonderful experience. In fact, I've had a lot of great experiences, many of which have prepared me for future work and research.

With respect to work, I've had a chance to conduct real-world data mining for large industry partners. In addition to learning, through experience, about the technical aspects of the data mining process, the lab has also given me an opportunity to learn about business aspects, by meeting face-to-face with industry partner representatives. Most recently we were able to meet with a Vice President of a large retail company to discuss several issues relevant to the research we conduct!

Further, the lab has provided me with great research experience. Dr. Giraud-Carrier is a tremendous academic, with a great deal of interest in his students and research assistants. Under his tutelage I've published academic papers and will soon be completing a master's degree. I even had the opportunity to travel to the Netherlands to present at an academic conference.

Currently I'm conducting a software engineering internship with Google, and my experience in the lab is helping me to be successful. For anyone interested in gaining experience that will help them succeed academically and professionally, I'd highly recommend dropping by the lab and finding out about the great experiences that await you.

Tuesday, February 26, 2008

10 Reasons Why Data Mining is Fun and Rewarding

1. You can train your computer to do things you can't.
2. The methods are complicated, but the applications are intuitive.
3. It can save/make lots of money.
4. Data mining has applications in nearly any area you can think of.
5. You get to deal with data sets larger than you could ever process in your mind.
6. There are big developments taking place in the industry.
7. Data mining algorithms attempt to model how things work in biology and the real world (e.g., neural networks and genetic algorithms).
8. There is no one-size-fits-all solution when it comes to data mining.
9. You help make the statement "I have more data than I know what to do with" obsolete.
10. Your results can make an immediate impact in whatever industry you are involved in.

Why do you like data mining today? What got you interested in the first place?

Friday, February 22, 2008

Data Mining in the Workplace

I graduate in a few months and so I've been job hunting lately. I attended the Technical Career Fair here at BYU a few weeks back and I was impressed by the number of companies that were interested in data mining. With the exception of one or two companies, they all either were currently involved in data mining or were interested in becoming involved in the near future. I think that as more and more companies amass mounds of data, they are realizing that collecting data for data's sake is useless and that they can get much more out of their data than they have in the past. Data mining is no longer 'a hiss and a byword'. I am witnessing firsthand that it is the direction that many companies are taking to improve the efficiency of their operations.

Wednesday, February 20, 2008

Our Lab in Utah CEO Magazine

The BYU Data Mining Lab is featured in an article published in this month's Utah CEO Magazine. The article, found here, includes expert opinions from our own Professor Christophe Giraud-Carrier on why finding a champion for data mining within a company is important and how successful data mining is defined. In addition, the article contains a short feature on the lab, which explains the benefits students and businesses gain from being involved with the lab. It is exciting to see outside recognition for the great work that goes on here every day.

Wednesday, February 13, 2008

Social Connections in Decline

Robert Putnam, an influential social capital researcher, visited BYU nearly two years ago to discuss how social connections are on the decline. Here is a good summary of Putnam's talk on BYU NewsNet. His research during the past decade has shown a negative trend: people are connecting socially less than they used to. The speech gave fuel to the research on social networks that we had been involved in and has been a strong motivation for our current work on social capital.

Figure 1. "The TV Connection" shows that group membership tends to decline as television viewing increases among those having twelve or more years of education. (see The Strange Disappearance of Civic America)

Empirical studies on group membership, like the study shown in the plot above, contribute to the evidence that Putnam uses to support this claim.

(Note: This article was originally posted on dmine.blogspot.com)

Data Mining Search Engine

I recently learned from the Data Mining Research blog about a data mining search engine. The search engine, which can be found here, restricts search queries so that the results come largely from a list of data mining sites. It might prove to be a useful tool for focusing your research on trusted data mining sites, or for discovering new resources in our field of interest. Give it a shot. I don't have much experience with custom Google Search engines, but it seems useful.

Monday, February 11, 2008

Resolving Blog Entities

Problem: How do you determine whether a particular url is associated with a feed? For example, if another blog posted a link to datamininglab.blogspot.com, how would you determine the feed (http://datamininglab.blogspot.com/feeds/posts/default) associated with that url?

Solution: In our research we perform two operations to determine whether a url has an associated feed. First, we determine whether the url represents an actual feed. This can usually be determined by submitting an HTTP request and checking the content-type header included in the response. If the content-type is "application/rss+xml", "application/atom+xml", "application/rdf+xml" or "text/xml" then you are probably dealing with a feed.
Second, you need to check whether the url is not itself a feed but is associated with one. This is the case when a url points to the front page or a specific entry of a blog. If the content-type in the HTTP response, as described in step one, does not indicate a feed, then you parse the "link" tags found between the "head" tags. If a "link" tag has a rel="alternate" attribute, you check its type attribute to see whether its value is "application/rss+xml" or "application/atom+xml", similar to what we did in step one. If it is, you read the value of the href attribute to retrieve the feed url associated with the original url. For example, if you look at the page source of the main page of our blog, you will see link tags for both the RSS and Atom feeds associated with our blog.
There are certainly other ways for resolving blog entities, but this seems to work fairly consistently. Feel free to chime in if you have any ideas on how to better accomplish this task.
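The two steps above can be sketched in Python roughly as follows. This is only an illustration, not the code we run in the lab: the function and class names are made up for the example, and it assumes you have already fetched the page and have its content-type header and HTML in hand.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Content types that suggest the url itself is a feed (step one).
FEED_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/rdf+xml",
    "text/xml",
}
# Types accepted on <link rel="alternate"> tags (step two).
LINK_FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


def is_feed_content_type(header_value):
    """Step one: does a content-type header value look like a feed?"""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in FEED_TYPES


class FeedLinkParser(HTMLParser):
    """Collects href values of feed <link> tags found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            attr = {name: (value or "") for name, value in attrs}
            rels = attr.get("rel", "").lower().split()
            link_type = attr.get("type", "").strip().lower()
            if "alternate" in rels and link_type in LINK_FEED_TYPES and attr.get("href"):
                self.hrefs.append(attr["href"])

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_feed_urls(page_url, content_type, html):
    """Return the feed urls associated with page_url (possibly empty)."""
    if is_feed_content_type(content_type):
        return [page_url]  # the url is already a feed
    parser = FeedLinkParser()
    parser.feed(html)
    # href values may be relative, so resolve them against the page url.
    return [urljoin(page_url, href) for href in parser.hrefs]
```

In practice the content_type and html arguments would come from an HTTP response, for example the Content-Type header and body returned by urllib.request.urlopen. Keeping the network call outside the function makes the logic easy to try on saved pages.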

Wednesday, February 6, 2008

Lab Spotlight

At first glance, the Data Mining Lab looks like your average computer science research lab. But conducting research in the data mining lab is not your average research experience. I'll explain what I mean by citing three areas that help make our lab experience great.

The People - French, Spanish, Portuguese, Korean, Hmong, Tagalog, Chinese. All languages spoken by members of the data mining lab. It is one example of the diverse capabilities belonging to members of the lab. We love computer science. We love data mining. We love python, open source and neural nets (if it's possible to love a neural net). But we also appreciate politics, religion, cooking and anything else that is meaningful...or at least interesting. I remember heated discussions last semester about state-funded private school vouchers, the political primaries and caucuses and the grammatically correct way to use the word "good." Being in the lab every day has enriched my views on the world and made me a more rounded person.

The Activities - All work and no play? Makes you want to run away. Which is why we appreciate the many activities of the data mining lab. Each week we have a lab meeting/potluck where we discuss our progress. Curry, kimchi or burritos would make any meeting more exciting. Occasionally we will attend the campus devotionals/forums or the department colloquiums together as a lab. For example, last week our lab went and listened to Paul Rusesabagina tell his inspirational story, which was portrayed in the movie Hotel Rwanda. In addition to these weekly activities, each semester we meet together at our advisor Christophe's home for a lab social. The food is always fabulous and we even get to bring along our families which helps us bond even more. All of these activities help to make our lab experience unique.

The Research Environment - When it comes down to it, the reason we are all here is because we enjoy researching data mining, and the data mining lab is the perfect place to do it. We are given flexibility to research what is most interesting to us and are given the tools to be successful. This is mostly due to our advisor Christophe, who is flexible and supportive of our aspirations, while also helping us to investigate the feasibility and usefulness of the research topics we are considering. When we run into problems in our research, there is almost always someone in the lab with a helpful idea or suggestion. You may begin discussing a question with one lab member, but it usually isn't long before the whole lab is involved. There are also plenty of opportunities to publish papers, present at conferences and work with real companies. I can't think of a better environment for conducting research than we have here.

These are just a few of the reasons why it is awesome to be a member of the data mining lab. This is the place to be.

Tuesday, February 5, 2008

Meta-learning

I just finished reading Rice's seminal paper on algorithm selection [Rice, J.R. (1976). The Algorithm Selection Problem. Advances in Computers, 15:65-118]. For obvious reasons, it does not talk about meta-learning (look at the date!) but meta-learning is clearly one natural approach to solving the algorithm selection problem.

Kate Smith-Miles recently wrote a very nice survey paper (to appear in ACM Computing Surveys) where she uses Rice's framework to review and describe most known attempts at algorithm selection.

Rice does indeed offer a very clean formalism for the problem of algorithm selection, where a problem X from some problem space P is mapped, via some feature extraction process, to a representation f(X) in some feature space F, and the selection algorithm S maps f(X) to some algorithm Y in some algorithm space A, so that the performance of Y on X (for some adequately chosen performance measure) is in some sense optimal. Hence, as pointed out, "the selection mapping now depends only on the features f(X), yet the performance mapping still depends on the problem X" and, of course, "the determination of the best (or even good) features is one of the most important, yet nebulous, aspects of the algorithm selection process."

Rice is also quick to point out that "ideally, those problems with the same features would have the same performance for any algorithm being considered." I actually also pointed that out in my recent paper [Giraud-Carrier, C. (2005). The Data Mining Advisor: Meta-learning at the Service of Practitioners. In Proceedings of the 4th International Conference on Machine Learning Applications, 113-119] where I stated that unless for all X and X' (X <> X'), f(X)=f(X') implies p(X)=p(X') (where p is the performance measure) then the meta-training set may be noisy and meta-learning may in turn be sub-optimal.
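To make the condition concrete, here is a small Python sketch that scans a meta-training set for problems that share the same features but differ in performance. It is only an illustration and does not come from either paper: the function name, the tolerance parameter, and the choice to represent f(X) as a tuple and p(X) as a single number are all assumptions made for the example.

```python
from collections import defaultdict


def find_inconsistent_features(meta_examples, tolerance=0.0):
    """Return feature vectors shared by problems whose performance differs.

    meta_examples is an iterable of (features, performance) pairs, where
    features stands in for f(X) and performance for p(X). Both
    representations are assumptions of this sketch.
    """
    # Group the observed performances by feature vector.
    by_features = defaultdict(list)
    for features, performance in meta_examples:
        by_features[tuple(features)].append(performance)

    # f(X) = f(X') should imply p(X) = p(X'); report the groups where
    # the spread in performance exceeds the tolerance.
    conflicts = {}
    for features, performances in by_features.items():
        if max(performances) - min(performances) > tolerance:
            conflicts[features] = performances
    return conflicts


examples = [((1, 0), 0.9), ((1, 0), 0.6), ((0, 1), 0.8), ((0, 1), 0.8)]
print(find_inconsistent_features(examples))
```

An empty result means the chosen features never map two differently performing problems to the same point, at least on this sample. Any entries that do come back are candidate sources of the noise described above.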

Rice's framework naturally covers various forms of selection (e.g., best algorithm, best algorithm for a subclass of problems, etc.) as well as multi-criteria performance measures.

Another important point brought out by Rice, and often overlooked in the machine learning community, is that "most algorithms are developed for a particular class of problems even though the class is never explicitly defined. Thus the performance of algorithms is unlikely to be understood without some idea of the problem class associated with their development." Foster Provost and I called that the Strong Assumption of Machine Learning in our paper on the justification of meta-learning [Giraud-Carrier, C. and Provost, F. (2005). Towards a Justification of Meta-learning: Is the No Free Lunch Theorem a Show-stopper. In Proceedings of the ICML-05 Workshop on Meta-learning, 12-19]. I (and others) have often argued that the notion of delimiting the class of problems on which an algorithm performs well is critical to advances in machine learning.

Anyways, although Rice offers no specific method to solve the algorithm selection problem, the paper is highly relevant and very well-written. A must read for anyone interested in meta-learning.

Monday, February 4, 2008

Social Capital Simulation

Our recent work has explored the concept of social capital, which I have discussed previously. Our social capital metrics, namely bonding and bridging (popularized by Robert Putnam), utilize the hybrid network methodology that we have developed for online communities.

To understand our metrics, I have created a basic social capital simulation (an Excel spreadsheet) having five nodes. The simulation allows you to change the connection strengths in both the implicit affinity network (IAN) and the explicit social network (ESN). Changing these values will give you an idea of how social capital fluctuates as the social network changes.

The figure above shows the initial configuration of the simulation. The dashed blue lines represent the IAN and the solid pink lines represent the ESN. The thicker the lines the stronger the connection. The weights for the IAN were randomly assigned, while the ESN weights were all set to one, thus creating a clique.

Initially, the bonding and bridging social capital are both 1, since everyone in the network is connected. To see how the social capital fluctuates, change the blue and/or pink values, again representing the IAN and the ESN weights respectively, in the spreadsheet.

(Note: This article was originally posted on dmine.blogspot.com)

Google Reader API

Problem: To perform social network analysis on blog data you need consistent data over a period of time. Periodically retrieving the content directly from the blog's feed has its limitations because you can only retrieve current blog content. Thus if you decide to begin retrieving content from a specific blog, you have no way of getting at the archived blog content.

Solution: Use the unofficial Google Reader API to retrieve archived feed content. The API was first documented two years ago on Niall Kennedy's blog, and its existence was confirmed by several Google employees associated with the project. Little information has been published since about an official release of the API, but the unofficial API still works well for retrieving archived feed content.

In our research, the framework we use for interacting with the API is pyrfeed. The creators of pyrfeed also wrote some additional documentation on the capabilities of the API. The Google Code site has two downloadable files. The Google Reader stand-alone package is a simple interface for interacting with the API to perform simple actions such as feed retrieval. The other file, which is the full pyrfeed release, also provides GUI and command-line interfaces for interacting with the API, as well as automated blog content storage in a mysqlite3 database. An example of how to interact with the Google Reader stand-alone package can be seen below.

In summary, if you are looking for a simple way to retrieve archived blog content, the Google Reader API and pyrfeed framework are cheap and easy tools for doing so. The blogosphere is at your fingertips.
