<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-5365131201156021161</id><updated>2011-12-31T14:25:50.687-08:00</updated><category term='pyrfeed'/><category term='Implicit Affinity Networks'/><category term='Social Information Processing'/><category term='Meta Learning'/><category term='Utah CEO Magazine'/><category term='Implicit Connections'/><category term='Data Mining Research'/><category term='Computer Science'/><category term='Google Reader'/><category term='Robert Putnam'/><category term='Data mining Applications'/><category term='Random Controlled Trials'/><category term='AAAI'/><category term='Top 10'/><category term='RSS Feed'/><category term='Machine Learning'/><category term='Data Mining'/><category term='Correlation'/><category term='Research Conference'/><category term='BYU'/><category term='Number Crunching'/><category term='Google Reader API'/><category term='data mining careers'/><category term='Artificial Intelligence'/><category term='Explicit Connections'/><category term='Stanford'/><category term='Genealogy'/><category term='social capital'/><category term='Blog Content'/><category term='Causation'/><category term='Knowledge Discovery'/><category term='Blog Entity Resolution'/><category term='Record Linkage'/><category term='Case Studies'/><category term='ATOM Feed'/><category term='Data Mining Lab'/><category term='Data Mining Search Engine'/><category term='Ian Ayres'/><category term='Blogs'/><category term='Research Opportunities'/><category term='Web Mining'/><category term='Data Mining Tools'/><category term='Blogosphere'/><category term='Transfer Learning'/><category term='Social Network 
Analysis'/><category term='Family History'/><category term='Spring Research Conference'/><category term='Problems and Solutions'/><title type='text'>Data Mining Lab</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>The Boy</name><uri>http://www.blogger.com/profile/02165028123565951287</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>28</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-746712889331892036</id><published>2009-05-16T11:10:00.000-07:00</published><updated>2009-06-01T13:04:48.063-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Machine Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Social Network Analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='Data mining Applications'/><title type='text'>Sense Networks:  Mining Location Data</title><content type='html'>&lt;span class="Apple-style-span"  style="font-size:small;"&gt;There is an &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_0"&gt;enormous&lt;/span&gt; amount of location data being generated by cell phones, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;wi&lt;/span&gt;-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;fi&lt;/span&gt; 
enabled devices, and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;GPS&lt;/span&gt; devices every minute.  Sense Networks, a company founded by &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_4"&gt;Columbia&lt;/span&gt; University and MIT faculty members, mines this location data in real time to discover behavioral patterns of mobile phone users.  One simple mobile application from the company is &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;CitySense&lt;/span&gt;, which produces an activity "heat map" of a city, showing the user where all of the busiest locations are in real time.  "&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;Citysense&lt;/span&gt; shows the overall activity level of [a] city, top activity &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_7"&gt;hot spots&lt;/span&gt;, and places with unexpectedly high activity, all in real-time. Then it links to Yelp and Google to show what venues are operating at those locations."  &lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;   At the heart of Sense Networks' technology is the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;MVE&lt;/span&gt; algorithm: &lt;/span&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Sense Networks attributes 487,500 dimensions to every place in a city, thus identifying a unique and complex 'DNA' which describes it completely...  
Proprietary &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;MVE&lt;/span&gt; (Minimum Volu&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;me Embedding) algorithms reduce the dimensionality of location and temporal data to 2 dimensions while retaining over 90% of the information.&lt;/span&gt;&lt;/blockquote&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_I3Dp_Ci1Ie0/SiQzJU5gPuI/AAAAAAAAFG4/eK3O6tW9CB4/s1600-h/citysense.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 241px;" src="http://4.bp.blogspot.com/_I3Dp_Ci1Ie0/SiQzJU5gPuI/AAAAAAAAFG4/eK3O6tW9CB4/s320/citysense.png" alt="" id="BLOGGER_PHOTO_ID_5342451293289987810" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(102, 102, 102);"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;   The company eventually plans to produce an application that learns the movement patterns of a mobile phone user over time, subsequently providing &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_10"&gt;recommendations&lt;/span&gt; for places to visit when the user visits a new city.  For example, if you like to visit ice-cream shops in your hometown, the application will automatically learn this behavior.  
When you  go to visit another city in another state, the application can automatically "sense" and report to you where the most popular ice-cream shops are in that city based on location data from other ice-cream lovers.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;   The application of this kind of technology to social networks and consumer-enriching applications is exciting, but the privacy implications can be frightening.  Sense Networks has a special executive called the CPA (the "Chief Privacy Advocate") who deals with privacy concerns.  Their philosophy is to give a user complete ownership over the data they choose to share, as well as a provision for the user to easily delete at any time the data they have already chosen to share.&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-746712889331892036?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/746712889331892036/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=746712889331892036' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/746712889331892036'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/746712889331892036'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2009/05/sense-networks-mining-location-data.html' title='Sense Networks:  Mining Location Data'/><author><name>Reed</name><uri>http://www.blogger.com/profile/07279614188035127478</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_I3Dp_Ci1Ie0/SiQzJU5gPuI/AAAAAAAAFG4/eK3O6tW9CB4/s72-c/citysense.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-4959936479221363452</id><published>2008-12-17T11:46:00.000-08:00</published><updated>2008-12-17T11:51:55.247-08:00</updated><title type='text'>Lab Members Recognized for Medical Research</title><content type='html'>&lt;a href="http://cs.byu.edu/article/2008-11-20-cs_students_take_third_place_international_competition_present_paper_washington_dc"&gt;See the story here&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-4959936479221363452?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/4959936479221363452/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=4959936479221363452' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/4959936479221363452'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/4959936479221363452'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/12/lab-gets-recognition-for-medical.html' title='Lab Members Recognized for Medical Research'/><author><name>Matt Smith</name><uri>http://www.blogger.com/profile/00366225861010849516</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' 
src='http://bp0.blogger.com/_I3Dp_Ci1Ie0/SArVcllGRfI/AAAAAAAAC08/SpDdj7CISus/S220/matt_picture2.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-8677197728121020677</id><published>2008-12-09T14:40:00.000-08:00</published><updated>2008-12-10T06:25:26.080-08:00</updated><title type='text'>Metalearning Book Available</title><content type='html'>We are pleased to announce that, after much work, the book:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;Metalearning: Applications to Data Mining&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;co-authored by Christophe and 3 of his colleagues (Pavel Brazdil, Carlos Soares and Ricardo Vilalta) is now available from Springer.&lt;br /&gt;&lt;br /&gt;See &lt;a href="http://www.springer.com/978-3-540-73262-4"&gt;http://www.springer.com/978-3-540-73262-4&lt;/a&gt; for details.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-8677197728121020677?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/8677197728121020677/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=8677197728121020677' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/8677197728121020677'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/8677197728121020677'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/12/metalearning-book.html' title='Metalearning Book Available'/><author><name>Christophe Giraud-Carrier</name><uri>http://www.blogger.com/profile/17672899844586725651</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-5962267663884711164</id><published>2008-10-10T14:39:00.000-07:00</published><updated>2008-10-10T14:44:08.987-07:00</updated><title type='text'>Information Pathways in Social Networks</title><content type='html'>The first talk presented in the social network session of KDD 2008 covered an interesting paper by G. Kossinets, J. Kleinberg, and D. Watts titled &lt;span style="font-style:italic;"&gt;The Structure of Information Pathways in a Social Communication Network&lt;/span&gt; (&lt;a href="http://www.cs.cornell.edu/home/kleinber/kdd08-bb.pdf"&gt;PDF&lt;/a&gt;).  Although I was not at KDD, I was able to watch it online at &lt;a href="http://videolectures.net/kdd08_kleinberg_sipscn/"&gt;videolectures.net&lt;/a&gt;.&lt;br /&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://1.bp.blogspot.com/_I3Dp_Ci1Ie0/SO_Jl56QgHI/AAAAAAAAEoc/8PB_Qz1tVEY/s400/Picture+96.png" alt="" id="BLOGGER_PHOTO_ID_5255640943202173042" border="0" /&gt;Kleinberg, the presenter, made some interesting observations having to do with our "rhythmic" everyday conversations.  The approach to analyzing communication within these social networks focuses on the frequency of correspondence, rather than the content conveyed.&lt;br /&gt;&lt;br /&gt;They measure "distance" between individuals by measuring the minimum time required for information to pass from one node to another.  The methodology builds on Lamport's work and on vector clocks from the area of distributed computing.&lt;br /&gt;&lt;br /&gt;Using this metric, they are able to filter a busy network (one having edges for all communication packets) into a simplified network that contains only the edges that are minimum-delay paths between a pair of nodes.  
They call this simplified network view the &lt;span style="font-style: italic;"&gt;network backbone&lt;/span&gt;.  Below is an example of such a network (along with the caption) taken from the paper.&lt;br /&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_I3Dp_Ci1Ie0/SO_FccVD3pI/AAAAAAAAEoU/0WXsHnTj1kA/s400/Picture+83.png" alt="" id="BLOGGER_PHOTO_ID_5255636382596193938" border="0" /&gt;The nodes further outside of the center of the graph are more "out-of-date" with respect to node &lt;span style="font-style: italic;"&gt;v&lt;/span&gt;, since they communicate less frequently.&lt;br /&gt;&lt;br /&gt;I found the approach to be novel and useful.  As with nearly any analysis technique, caution should be used in selecting the time-period and group size to be studied.  Recency and frequency issues come into play as correspondence is aggregated.  However, this pursuit offers another approach for more fully understanding information flow.&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-style: italic;font-size:85%;" &gt;Originally published by Matt on his blog at: &lt;/span&gt;&lt;a href="http://dmine.blogspot.com/"&gt;&lt;span style="font-style: italic;font-size:85%;" &gt;http://dmine.blogspot.com&lt;/span&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-5962267663884711164?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/5962267663884711164/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=5962267663884711164' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5365131201156021161/posts/default/5962267663884711164'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/5962267663884711164'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/10/information-pathways-in-social-networks.html' title='Information Pathways in Social Networks'/><author><name>Matt Smith</name><uri>http://www.blogger.com/profile/00366225861010849516</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' src='http://bp0.blogger.com/_I3Dp_Ci1Ie0/SArVcllGRfI/AAAAAAAAC08/SpDdj7CISus/S220/matt_picture2.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_I3Dp_Ci1Ie0/SO_Jl56QgHI/AAAAAAAAEoc/8PB_Qz1tVEY/s72-c/Picture+96.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-7336160658692894856</id><published>2008-10-09T13:30:00.000-07:00</published><updated>2008-10-09T13:36:10.159-07:00</updated><title type='text'>AMIA Competition Finalists!</title><content type='html'>Jun, Yao and Matt participated in the &lt;a href="http://www.amia.org/mbrcenter/wg/kddm/contest.asp"&gt;2008 Data Mining Competition: Discovering Knowledge in NHANES Data&lt;/a&gt;, sponsored by the AMIA Knowledge Discovery and Data Mining Working Group, and were selected as one of the finalists by the judging panel. They will be presenting their results in a dedicated session of the AMIA Annual Symposium in Washington, DC, in November 2008. 
Congratulations!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-7336160658692894856?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/7336160658692894856/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=7336160658692894856' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/7336160658692894856'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/7336160658692894856'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/10/amia-competition-finalists.html' title='AMIA Competition Finalists!'/><author><name>Christophe Giraud-Carrier</name><uri>http://www.blogger.com/profile/17672899844586725651</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-2897787370094744416</id><published>2008-09-25T18:25:00.000-07:00</published><updated>2008-10-09T13:45:44.363-07:00</updated><title type='text'>A Couple of Interesting Papers</title><content type='html'>Here are a couple of papers that others might also find interesting.&lt;br /&gt;&lt;br /&gt;Title: Information-Theoretic Definition of Similarity (&lt;a href="http://www.cs.ualberta.ca/~lindek/papers/sim.pdf"&gt;PDF&lt;/a&gt;)&lt;br /&gt;Conference: &lt;span style="font-style:italic;"&gt;ICML 1998&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The paper provides a general similarity measure applicable across many domains. 
The author argues that the formulation satisfies "universality" and "theoretical justification", whereas previous similarity measures were domain-specific. The formula is:&lt;br /&gt;&lt;pre&gt;&lt;code&gt;sim(A,B) = log P(common(A,B)) / log P(description(A,B))&lt;/code&gt;&lt;/pre&gt;where common(A,B) is a proposition that states the commonalities between A and B, and description(A,B) is a proposition that describes what A and B are.&lt;br /&gt;&lt;br /&gt;Title: An Introduction to Quantum Computing.&lt;br /&gt;Author: Noson S. Yanofsky&lt;br /&gt;&lt;br /&gt;The paper gives a taste of quantum computing targeted at computer science undergraduates (and even advanced high school students). Some of the (fun) basic points in quantum computing include the following. A quantum system can exist in SEVERAL states AT THE SAME TIME (superposition), but when it is measured, it collapses to a single state (either 0 or 1 in the case of a qubit). When two quantum states are added, their amplitudes can cancel, so the combined magnitude can decrease (interference).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-2897787370094744416?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/2897787370094744416/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=2897787370094744416' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/2897787370094744416'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/2897787370094744416'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/09/paper-for-fun-introduction-to-quantum.html' title='A Couple of Interesting 
Papers'/><author><name>좋은세상</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-6366432299358140644</id><published>2008-04-16T09:13:00.000-07:00</published><updated>2008-04-16T09:46:57.159-07:00</updated><title type='text'>Picture of our Blog</title><content type='html'>&lt;div id="bscope_widget"&gt;&lt;a target=_blank href="http://www.bscopes.com/scopefeed.html?feedid=662"&gt;&lt;img id="bscope_widget" alt="powered by bscopes.com" border="1" height="320" width="320"  src="http://www.bscopes.com/ascope.php?feedid=662"&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Here is a picture of our blog provided by &lt;a href="http://www.bscopes.com"&gt;bscopes.com&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_I3Dp_Ci1Ie0/SAYqp7vHCvI/AAAAAAAAC0o/qlE1uOp0WAo/s1600-h/Picture+34.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp2.blogger.com/_I3Dp_Ci1Ie0/SAYqp7vHCvI/AAAAAAAAC0o/qlE1uOp0WAo/s400/Picture+34.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5189882520489364210" /&gt;&lt;/a&gt;Here is the legend they provide to help interpret the graph. It is clear that not all of the blogs referencing our blog are listed (due to the sparsity of data collected at bscopes).  As is, the value of this graph is limited to showing the number of links within each of our blog entries.  
Despite the current limitations, I find the idea of providing a web service that produces a visual representation of a blog interesting.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-6366432299358140644?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/6366432299358140644/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=6366432299358140644' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/6366432299358140644'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/6366432299358140644'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/04/picture-of-our-blog.html' title='Picture of our Blog'/><author><name>Matt Smith</name><uri>http://www.blogger.com/profile/00366225861010849516</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' src='http://bp0.blogger.com/_I3Dp_Ci1Ie0/SArVcllGRfI/AAAAAAAAC08/SpDdj7CISus/S220/matt_picture2.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp2.blogger.com/_I3Dp_Ci1Ie0/SAYqp7vHCvI/AAAAAAAAC0o/qlE1uOp0WAo/s72-c/Picture+34.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-5837105795376484534</id><published>2008-04-02T09:10:00.001-07:00</published><updated>2008-04-02T09:54:41.613-07:00</updated><title type='text'>Recipe for Kim-Chee Fried Rice</title><content type='html'>One of our lab's fun traditions is our weekly potluck lunch. 
Once a week, everyone brings something from home that we put together and share for lunch. This is a great time to socialize and talk informally about our research or anything else. People typically bring leftovers, but Jun, our in-house Korean lab member, makes a point of bringing a Korean dish that he prepares especially for us every week. It is typically some curry dish or kim-chee. As we do not wish to keep this to ourselves, a picture of Jun's kim-chee and his recipe are found below. Enjoy!&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp1.blogger.com/_hEYm4Ck_VEU/R_OyABqoDRI/AAAAAAAAAAQ/8OugNY4J0zY/s1600-h/Picture+031.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; cursor: pointer;" src="http://bp1.blogger.com/_hEYm4Ck_VEU/R_OyABqoDRI/AAAAAAAAAAQ/8OugNY4J0zY/s400/Picture+031.jpg" alt="" id="BLOGGER_PHOTO_ID_5184683309550538002" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;1. Prepare the kim-chee. (You can purchase it at either a Korean market or an Asian market.)&lt;br /&gt;2. Mix it with vegetables or beef. (I recommend adding chopped onion, garlic, and tuna.)&lt;br /&gt;3. Fry the mixture in olive oil for 4-5 minutes.&lt;br /&gt;4. Done!!! 
(Isn't it so easy?)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-5837105795376484534?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/5837105795376484534/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=5837105795376484534' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/5837105795376484534'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/5837105795376484534'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/04/recipe-for-kim-chee-fried-rice.html' title='Recipe for Kim-Chee Fried Rice'/><author><name>좋은세상</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp1.blogger.com/_hEYm4Ck_VEU/R_OyABqoDRI/AAAAAAAAAAQ/8OugNY4J0zY/s72-c/Picture+031.jpg' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-3543273370599155271</id><published>2008-04-02T07:20:00.001-07:00</published><updated>2008-04-02T08:21:45.474-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Stanford'/><category scheme='http://www.blogger.com/atom/ns#' term='AAAI'/><category scheme='http://www.blogger.com/atom/ns#' term='Social Information Processing'/><category scheme='http://www.blogger.com/atom/ns#' term='Research Conference'/><title type='text'>Pictures from AAAI Symposium</title><content type='html'>Here are a few more pictures 
from the &lt;a href="http://www.isi.edu/%7Elerman/sss07/"&gt;AAAI Social Information Processing Spring Symposium&lt;/a&gt; at Stanford University.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_wzHaXkbDbnQ/R_OXPeEYj6I/AAAAAAAAAec/rU-1HtijCDU/s1600-h/Picture+021.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp2.blogger.com/_wzHaXkbDbnQ/R_OXPeEYj6I/AAAAAAAAAec/rU-1HtijCDU/s320/Picture+021.jpg" alt="" id="BLOGGER_PHOTO_ID_5184653888058855330" border="0" /&gt;&lt;/a&gt;An example of the beautiful architecture&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_wzHaXkbDbnQ/R_OXv-EYj7I/AAAAAAAAAek/g5-nAfDR-Og/s1600-h/Picture+025.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp0.blogger.com/_wzHaXkbDbnQ/R_OXv-EYj7I/AAAAAAAAAek/g5-nAfDR-Og/s320/Picture+025.jpg" alt="" id="BLOGGER_PHOTO_ID_5184654446404603826" border="0" /&gt;&lt;/a&gt;A beautiful field out in front of the symposium location&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_wzHaXkbDbnQ/R_OYC-EYj8I/AAAAAAAAAes/RpYkqLvndqQ/s1600-h/Picture+017.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp0.blogger.com/_wzHaXkbDbnQ/R_OYC-EYj8I/AAAAAAAAAes/RpYkqLvndqQ/s320/Picture+017.jpg" alt="" id="BLOGGER_PHOTO_ID_5184654772822118338" border="0" /&gt;&lt;/a&gt;The building where the symposium was held&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-3543273370599155271?l=datamininglab.blogspot.com' alt='' 
/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/3543273370599155271/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=3543273370599155271' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/3543273370599155271'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/3543273370599155271'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/04/pictures-from-aaai-symposium.html' title='Pictures from AAAI Symposium'/><author><name>The Boy</name><uri>http://www.blogger.com/profile/02165028123565951287</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp2.blogger.com/_wzHaXkbDbnQ/R_OXPeEYj6I/AAAAAAAAAec/rU-1HtijCDU/s72-c/Picture+021.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-6403407336504890705</id><published>2008-03-27T16:02:00.000-07:00</published><updated>2008-04-02T07:26:45.728-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Stanford'/><category scheme='http://www.blogger.com/atom/ns#' term='AAAI'/><category scheme='http://www.blogger.com/atom/ns#' term='Social Information Processing'/><category scheme='http://www.blogger.com/atom/ns#' term='Research Conference'/><title type='text'>AAAI Social Information Processing Symposium Summary</title><content type='html'>I apologize for not getting back sooner with results and thoughts from the symposium.  
Like I said in my previous post, Matt and I attended the &lt;a href="http://www.isi.edu/%7Elerman/sss07/"&gt;AAAI 2008 Social Information Processing Symposium&lt;/a&gt;. Matt presented on &lt;a href="http://dml.cs.byu.edu/matthewsmith/publications/IAN-SIP08.pdf"&gt;Social Capital in the Blogosphere&lt;/a&gt; and it seemed to be well received by the community. They followed up on our presentation about social capital with a number of questions regarding possible actions and experiments that could be taken within our framework for measuring social capital. It furthered our opinion that the work we have done provides an intuitive way to understand a seemingly abstract topic like social capital. There is still a lot of work to be done in determining what constitutes an explicit or an implicit link within the blogosphere, but we are on our way.&lt;br /&gt;&lt;br /&gt;Several other thoughts specifically related to our research.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Bonding links (relationships with someone similar to you) should be easier to make than bridging links (relationships with someone different from you). Thus, in a social network representation, you should probably see substantially more bonding than bridging taking place.&lt;/li&gt;&lt;li&gt;There is a cost associated with forming a bonding or bridging link that we have not addressed up to this point. In general, this cost involves both the type (bonding/bridging) of link and also the individual social capital of the person you are attempting to form a link with.&lt;/li&gt;&lt;li&gt;Nearly everything in the social information processing domain, when graphed, seems to follow a power law. Does individual social capital follow this distribution as well, i.e., do certain individuals have much more social capital than the population at large? If so, is there any way to leverage the social capital of all the individuals found in the long tail of the graph? 
For example, the wisdom of the masses approach is working wonderfully in Wikipedia, where information may in some cases be more accurate than that of the so-called authorities. It's all theory, but just some ideas I've been thinking about.&lt;/li&gt;&lt;li&gt;High cost, high reward. Blogging can take a lot of time, because good writing takes time. It takes substantially more time than other activities we heard about in the conference such as tagging, posting pictures or rating a product. But with the high cost come high rewards: as blogging has become mainstream, it has become a powerful tool for advancing ideas, products, companies and careers. Ultimately, we need to get some type of reward for our involvement in a social network, even if it is just personal fulfillment, or our activity will dwindle.&lt;/li&gt;&lt;/ul&gt;Other cool things about the symposium.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Stanford is the most beautiful campus I've ever been on. &lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_wzHaXkbDbnQ/R_OM2-EYj5I/AAAAAAAAAeU/AZaaUlrm_N0/s1600-h/Picture+007.jpg"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://bp0.blogger.com/_wzHaXkbDbnQ/R_OM2-EYj5I/AAAAAAAAAeU/AZaaUlrm_N0/s200/Picture+007.jpg" alt="" id="BLOGGER_PHOTO_ID_5184642472035782546" border="0" /&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;We heard about a cool new project called &lt;a href="http://freebase.com/"&gt;Freebase&lt;/a&gt; that seems to have the potential to someday replace Wikipedia as the best source for free, open content information. It looks smooth, provides easy ways to query for information and has an awesome API. It could be the next big thing. 
Matt posted his thoughts about it as well at his &lt;a href="http://dmine.blogspot.com/2008/03/freebase.html"&gt;blog&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;Gustavo Glusman presented one of the coolest social network graphs I've ever seen (the flickrverse) which can be found &lt;a href="http://photos4.flickr.com/9723603_f910a72200_o.jpg"&gt;here.&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Meeting a wide variety of people. There was representation from both academia and industry from a variety of locations. Plenty of people were from California, but there was also representation from other parts of the United States, the UK, Germany, China, Taiwan, Switzerland and probably more that I am forgetting. It was a great group to become involved in.&lt;/li&gt;&lt;li&gt;Getting to learn from those who know more than I do. Everyone had their own expertise in specific social networks and with specific ideas. I learned a lot about social networking principles that are somewhat different than those found here in the blogosphere.&lt;/li&gt;&lt;/ul&gt;Matt posted some of his thoughts about specific papers on his blog &lt;a href="http://dmine.blogspot.com/2008/03/social-information-processing.html"&gt;here&lt;/a&gt; and &lt;a href="http://dmine.blogspot.com/2008/04/sip-recap-thursday_01.html"&gt;here&lt;/a&gt; for those who are interested. 
For any out there who would consider attending next year, go for it, it was a wonderful experience.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-6403407336504890705?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/6403407336504890705/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=6403407336504890705' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/6403407336504890705'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/6403407336504890705'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/03/aaai-social-information-processing.html' title='AAAI Social Information Processing Symposium Summary'/><author><name>The Boy</name><uri>http://www.blogger.com/profile/02165028123565951287</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp0.blogger.com/_wzHaXkbDbnQ/R_OM2-EYj5I/AAAAAAAAAeU/AZaaUlrm_N0/s72-c/Picture+007.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-5723777854080779325</id><published>2008-03-26T15:59:00.000-07:00</published><updated>2008-04-06T11:50:49.162-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Stanford'/><category scheme='http://www.blogger.com/atom/ns#' term='AAAI'/><category scheme='http://www.blogger.com/atom/ns#' term='Spring Research Conference'/><category 
scheme='http://www.blogger.com/atom/ns#' term='Social Information Processing'/><title type='text'>Hello from Stanford</title><content type='html'>Matt Smith and I will be at the AAAI 2008 Spring Symposium which is being held at Stanford University from now until Friday. We are attending the Social Information Processing session and will be presenting on a paper we co-authored with Christophe entitled &lt;a href="http://dml.cs.byu.edu/matthewsmith/publications/IAN-SIP08.pdf"&gt;Social Capital in the Blogosphere: A Case Study&lt;/a&gt;. The morning and afternoon sessions were great and I'll give a rundown later on. Matt will be presenting in about an hour and a half so I'll post about how that went and everything else we are learning here.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_I3Dp_Ci1Ie0/R_kbNPtzbKI/AAAAAAAACy4/rWoqb8K6uQY/s1600-h/stanford+009.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp2.blogger.com/_I3Dp_Ci1Ie0/R_kbNPtzbKI/AAAAAAAACy4/rWoqb8K6uQY/s400/stanford+009.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5186206360264731810" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_wzHaXkbDbnQ/R_OGX-EYj4I/AAAAAAAAAeM/MqbnyJDA6UA/s1600-h/Picture+012.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp0.blogger.com/_wzHaXkbDbnQ/R_OGX-EYj4I/AAAAAAAAAeM/MqbnyJDA6UA/s400/Picture+012.jpg" alt="" id="BLOGGER_PHOTO_ID_5184635342390071170" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-5723777854080779325?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://datamininglab.blogspot.com/feeds/5723777854080779325/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=5723777854080779325' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/5723777854080779325'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/5723777854080779325'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/03/hello-from-stanford.html' title='Hello from Stanford'/><author><name>The Boy</name><uri>http://www.blogger.com/profile/02165028123565951287</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp2.blogger.com/_I3Dp_Ci1Ie0/R_kbNPtzbKI/AAAAAAAACy4/rWoqb8K6uQY/s72-c/stanford+009.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-9216342288089920052</id><published>2008-03-22T15:06:00.000-07:00</published><updated>2008-03-28T19:17:35.087-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Correlation'/><category scheme='http://www.blogger.com/atom/ns#' term='Random Controlled Trials'/><category scheme='http://www.blogger.com/atom/ns#' term='Causation'/><title type='text'>On Correlation versus Causation</title><content type='html'>Among the common mistakes made by data miners who lack training in statistics is the confusion between correlation and causation, often arising from the difference between observational studies and random controlled trials. 
To start with, here are some simple definitions.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;There is a relation of (positive) correlation between two random variables when high values of one are likely to be associated with high values of the other.&lt;/li&gt;&lt;li&gt;There is a relation of cause and effect between two random variables when one is a determinant of the other.&lt;/li&gt;&lt;/ul&gt;It is easy to see from these definitions that:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Causation implies correlation, but correlation does not necessarily imply causation.&lt;/li&gt;&lt;li&gt;Correlation is easy to establish; causation is not.&lt;/li&gt;&lt;/ul&gt;As a matter of fact:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Random controlled trials establish causation.&lt;/li&gt;&lt;li&gt;Observational studies only bring out correlation.&lt;/li&gt;&lt;/ul&gt;One of the main reasons correlation, although often appealing, cannot be "safely" acted upon as if it were causation lies in the potential presence of confounding variables. A confounding variable is one that affects the variable of interest but has either not been considered or not been controlled for (hence the name "confounding" or "lurking"). Consider the following simple example of a confounding effect. Suppose that a very profitable customer C has placed your company BetterSoft in competition with another company GoodSoft to test your relative abilities to develop good software. The task is to design an algorithm to solve a class P of problems. Your company produces algorithm A, while the other company produces algorithm B. Both you and your competitor are asked to run your own batch of 350 tests and report how often your algorithm gave acceptable solutions (as defined by C). GoodSoft comes out on top with a score of 83% against only 78% for your algorithm. Just as C is about to award its lucrative contract to GoodSoft, you realize that the problems in class P are not all of the same complexity. 
In fact, it turns out that there are two clear levels of difficulty: simple and complex. You ask the customer to collect more detailed data from GoodSoft and yourself, namely splitting the 350 test problems into simple and complex problems. The results, when complexity is thus factored in, are as follows.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Simple problems: A solves 81 out of 87 and B solves 234 out of 270&lt;/li&gt;&lt;li&gt;Complex problems: A solves 192 out of 263 and B solves 55 out of 80&lt;/li&gt;&lt;/ul&gt;Should this additional information change C's decision as to which company to hire? Of course. Although GoodSoft does better overall, it actually performs worse on both the simple problems (about 87% against 93% for A) and the complex problems (about 69% against 73% for A). Such situations are also known as examples of Simpson's paradox. It is a paradox because although mathematically correct, it is somewhat counterintuitive. The "variable" complexity in this example is a confounding variable, because it interacts with the calculated outcome in a way that may easily be overlooked, but may have an adverse effect on conclusions reached. Another well-known instance of Simpson's paradox arises in some of our US presidential elections where the winner does not carry the popular vote (i.e., the tally of individual votes gives the opponent as winner). The confounding variable in this case is the electoral college.&lt;br /&gt;In the medical domain, for example, it is critical to discover causal rather than only correlational relationships. One cannot take the risk of treating the "wrong" cause of a particular ailment. The same is also true of many other situations outside of medicine. Hence, data miners would do well to understand confounding effects and use that knowledge both in the design of the experiments they run and the conclusions they draw from experiments in general. 
I have been guilty of "jumping the gun" myself and reporting results that clearly ignored possible confounding effects.&lt;br /&gt;&lt;br /&gt;Let me turn now to more mundane business applications. In response to a (much shorter) comment I posted about the difference between correlation and causation &lt;a href="http://abbottanalytics.blogspot.com/2008/01/data-mining-interesting-ethical.html"&gt;here&lt;/a&gt;, a couple of individuals reacted as follows (I reproduce some of that conversation here for completeness):&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;(Jaime) - I don't think that insurance companies or any other business that would use data mining would or necessarily should care about the difference between correlation and causation in factors they don't have any control. (exceptions, of course, for anything medical or legal). If they can determine that people with freckles have less car accidents, why shouldn't they offer people with freckles lower rates?&lt;/li&gt;&lt;li&gt;(Will) - Jamie makes a good point. The question of correlation versus causation will be of only philosophical interest to a data mining practitioner, assuming that the underlying behavior being modeled does not change (and this will often be a safe bet). An illustration should make this subtlety clear. Suppose that insurance data indicates that people who play the board game Monopoly are better life insurance risk than people who do not. An insurance company might very well like to take advantage of such knowledge. Is there necessarily a causal arrow between these two items? No, of course not. Monopoly might not "make" someone live longer, and living longer may not "make" someone play Monopoly. Might there exist another characteristic which gives rise to both of these items (such as being a home-body who avoids death by automobile)? Yes, quite possibly. 
The insurance company does not care, as long as the relationship continues to hold.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;This brings up an interesting point, of course. Is the matter of causation versus correlation only a philosophical one, with little bearing in practice (a little bit like the No Free Lunch Theorem, which is a great theoretical result but seems to have little real impact in practical applications of machine learning; but that is another discussion, see &lt;a href="http://dml.cs.byu.edu/%7Ecgc/pubs/ICML2005WS.pdf"&gt;here&lt;/a&gt; for details on this unrelated but interesting topic)? Let me try to address this here a little bit (much of this is also found in my response on the above blog). The statistician (I use the term loosely, I am not a statistician myself) seeks the true cause, the one that remains valid through time. On the other hand, the (business) practitioner seeks mainly utility or applicability, which may become invalid over time but serves him/her right for some reasonable amount of time. Under this view of the world, I think it is possible to reconcile the two perspectives. Indeed, one can see that the statements "assuming that the underlying behavior being modeled does not change" and "as long as the relationship continues to hold" may be interpreted (in some way, see below) as effectively equivalent to what statisticians regard as "controlling for variables". By taking this kind of dynamic approach where the relationship (or behavior) is "continuously" monitored for validity and the action is taken only as long as that relationship holds, the user is, in effect, relieved from the problem of lurking variables. Let me illustrate on Will's example. Statisticians would indeed argue that there may be a confounding variable that explains the insurance company's finding, one that has nothing to do with playing Monopoly. Will proposed one: "being a home-body". I'll continue the argument with that one. 
In this case, it may therefore be that there are more home-body Monopoly players than not; and it is the "home-bodyness" (if such a word exists) that explains the lower risk for life insurance (and not the Monopoly-playing). Now, a statistician would be right in this case, and if one had to come up with the "correct" answer and build a model that remains accurate for now AND the future, one would have to accept the statistician's approach and build the model using home-bodyness rather than Monopoly playing. There is little arguing here. I think that what Will and Jaime are getting at is that there is a way to, in some sense, side-step this issue; namely: monitor the relationship. Indeed, if I keep on looking and checking that the correlation continues to hold, then I don't care about any confounding effect. If there are none, then the correlation also manifests a causation and I am safe; if there are some confounding effects, they will become manifest over time as the observed correlation is weakened. Hence, I can choose at that time to invalidate my model. But in the meantime, it served me right, was accurate, and I did not worry about controlling anything. Going back to the example, as long as the correlation is strong, I am OK. If it turns out that it is home-bodyness that causes the lower risk, I may eventually see more and more non-Monopoly players with low risk who also turn out to be home-bodies. In this case the originally observed correlation will decrease, telling me that I may wish to discontinue the use of my model.&lt;br /&gt;&lt;br /&gt;The distinction may be viewed as only of philosophical interest, at least in the context of such business cases. Again, in medicine, one may have a different perspective as also pointed out by Jaime and Will. 
One of the drawbacks of the "correlation-driven" approach is that when the model is no longer valid (as seen by the decreasing correlation value), the practitioner has no idea what may be the cause and is thus left with no information as to where to go next. Then again, as suggested by Jaime and Will, maybe he/she does not care. From a strictly business standpoint, he/she was able to quickly build a model with high utility (even if only for a shorter period of time) instead of having to expend a lot of resources to build a "causation" model, with the risk of not doing any better since not all confounding variables can ever be controlled for! (In fact, there are even situations where the controlled experiments that would be necessary cannot be run; see &lt;a href="http://dml.cs.byu.edu/%7Ecgc/docs/atdm/Parachute.pdf"&gt;here&lt;/a&gt; for a fun example).&lt;br /&gt;&lt;br /&gt;After all is said and done, and more has been said than done :-), one should be aware of confounding effects (or Simpson's paradox), and know how to deal with them: 1) stick to strictly random controlled experiments; or 2) use observations but handle with careful and continuous monitoring.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-9216342288089920052?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/9216342288089920052/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=9216342288089920052' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/9216342288089920052'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/9216342288089920052'/><link rel='alternate' 
type='text/html' href='http://datamininglab.blogspot.com/2008/03/on-correlation-versus-causation.html' title='On Correlation versus Causation'/><author><name>Christophe Giraud-Carrier</name><uri>http://www.blogger.com/profile/17672899844586725651</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-881346472350722697</id><published>2008-03-05T16:54:00.000-08:00</published><updated>2008-10-02T15:23:19.482-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Case Studies'/><category scheme='http://www.blogger.com/atom/ns#' term='Number Crunching'/><category scheme='http://www.blogger.com/atom/ns#' term='Ian Ayres'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><title type='text'>Ian Ayres' Super Crunchers Book</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp1.blogger.com/_I3Dp_Ci1Ie0/R-Hc1_tzbII/AAAAAAAACxw/mxn1vtYK3Ws/s1600-h/Picture+24.png"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="http://bp1.blogger.com/_I3Dp_Ci1Ie0/R-Hc1_tzbII/AAAAAAAACxw/mxn1vtYK3Ws/s200/Picture+24.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5179663866647506050" /&gt;&lt;/a&gt;I recently came across &lt;a href="http://mba.yale.edu/faculty/profiles/ayres.shtml"&gt;Ian Ayres&lt;/a&gt;' book: &lt;a href="http://islandia.law.yale.edu/ayers/"&gt;Super Crunchers&lt;/a&gt;. It is a nice read. Ayres essentially makes the case for number crunching (data mining for many of us) in all aspects of business and social life. 
The book describes a large number of case studies where number crunching has been successfully applied (e.g., wine quality, teaching methods, medical practices, etc.), often providing answers that challenge traditional wisdom. The examples are rather compelling. Most of the studies rely exclusively on random controlled trials and the use of regression techniques. Yet, I think this is a great book for people starting in data mining or looking for good reasons to begin. (The other nice thing is that the book is very cheap: less than $20 on Amazon!). Enjoy!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-881346472350722697?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/881346472350722697/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=881346472350722697' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/881346472350722697'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/881346472350722697'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/03/ian-ayres-super-crunchers-book.html' title='Ian Ayres&apos; Super Crunchers Book'/><author><name>Christophe Giraud-Carrier</name><uri>http://www.blogger.com/profile/17672899844586725651</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp1.blogger.com/_I3Dp_Ci1Ie0/R-Hc1_tzbII/AAAAAAAACxw/mxn1vtYK3Ws/s72-c/Picture+24.png' height='72' 
width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-2390830054735648690</id><published>2008-03-05T07:41:00.000-08:00</published><updated>2008-03-05T08:00:07.184-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Spring Research Conference'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><title type='text'>Spring Research Conference</title><content type='html'>The 22nd Annual Spring Research Conference for the College of Physical and Mathematical Sciences is coming up on Saturday, March 15th. The current presentation schedule for the conference can be found &lt;a href="http://cpms.byu.edu/springresearch/currentschedule"&gt;here&lt;/a&gt;. Six members of the lab will be presenting research at the conference.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://cpms.byu.edu/node/521"&gt;Can a Computer Learn to do Genealogy? - Stephen Ivie&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://cpms.byu.edu/node/586"&gt;Characterizing UCI Data Sets - Jun won Lee&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://cpms.byu.edu/node/556"&gt;An Evaluation of Name, Location, and Date Comparison Metrics for Record Linkage - Yao Huang Lin&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://cpms.byu.edu/node/540"&gt;Building Community around a Blog - Matthew Smith&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://cpms.byu.edu/node/681"&gt;Social Capital in the Blogosphere - Nathan Purser&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://cpms.byu.edu/node/519"&gt;Utilizing Stacking for Feature Reduction in Graph-Based Genealogical Entity Resolution - Stephen Ivie&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://cpms.byu.edu/node/725"&gt;Keeping it Spinning: A Background Check on Virtual Storage Providers - Anne  Roach&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;We're happy about our representation and it is shaping up to be a great conference.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/5365131201156021161-2390830054735648690?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/2390830054735648690/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=2390830054735648690' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/2390830054735648690'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/2390830054735648690'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/03/spring-research-conference.html' title='Spring Research Conference'/><author><name>The Boy</name><uri>http://www.blogger.com/profile/02165028123565951287</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-4669920589538520445</id><published>2008-02-27T19:21:00.000-08:00</published><updated>2008-02-27T19:49:38.508-08:00</updated><title type='text'>Data Mining Lab -- Experience is Key</title><content type='html'>My name is &lt;a href="http://students.cs.byu.edu/%7Ensd6"&gt;Nathan Davis&lt;/a&gt;.  I've been a member of the &lt;a href="http://dml.cs.byu.edu/"&gt;Data Mining Lab&lt;/a&gt; at &lt;a href="http://byu.edu/"&gt;BYU&lt;/a&gt; for 3 years now, and have had a wonderful experience.  In fact, I've had a lot of great experiences, many of which have prepared me for future work and research.&lt;br /&gt;&lt;br /&gt;With respect to work, I've had a chance to conduct real world data mining for large industry partners.  
In addition to teaching me, through experience, about the technical aspects of the data mining process, the lab has also given me an opportunity to learn about business aspects, by meeting face-to-face with industry partner representatives.  Most recently we were able to meet with a Vice President of a large retail company to discuss several issues relevant to the research we conduct!&lt;br /&gt;&lt;br /&gt;Further, the lab has provided me with great research experience.  &lt;a href="http://dml.cs.byu.edu/wiki/index.php/Christophe_Giraud-Carrier"&gt;Dr. Giraud-Carrier&lt;/a&gt; is a tremendous academic, with a great deal of interest in his students and research assistants.  Under his tutelage I've published academic papers and will soon be completing a Master's degree.  I even had the opportunity to travel to the Netherlands to present at an academic conference.&lt;br /&gt;&lt;br /&gt;Currently I'm doing a software engineering internship with &lt;a href="http://www.google.com/"&gt;Google&lt;/a&gt;, and my experience in the lab is helping me to be successful.  
For anyone interested in gaining experience that will help them succeed academically and professionally, I'd highly recommend dropping by the lab and finding out about the great experiences that await &lt;span style="font-style: italic;"&gt;you.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-4669920589538520445?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/4669920589538520445/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=4669920589538520445' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/4669920589538520445'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/4669920589538520445'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/02/data-mining-lab-experience-is-key.html' title='Data Mining Lab -- Experience is Key'/><author><name>Nathan Davis</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh3.googleusercontent.com/-pociCRwf_l8/AAAAAAAAAAI/AAAAAAAAZ-s/_ZKCMSvqBXI/s512-c/photo.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-3291678775711458268</id><published>2008-02-26T15:50:00.001-08:00</published><updated>2008-02-26T16:26:30.934-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining careers'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining Research'/><category scheme='http://www.blogger.com/atom/ns#' term='Data mining Applications'/><category 
scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><category scheme='http://www.blogger.com/atom/ns#' term='Top 10'/><title type='text'>10 Reasons Why Data Mining is Fun and Rewarding</title><content type='html'>&lt;span style="font-weight: bold;"&gt;1. &lt;/span&gt;You can train your computer to do things you can't.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;2.&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;span style="font-weight: bold;"&gt; &lt;/span&gt;&lt;/span&gt;The methods are complicated, but the applications are intuitive.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;3. &lt;/span&gt;It can save/make lots of money.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;4. &lt;/span&gt;Data mining has applications in nearly any area you can think of.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;5. &lt;/span&gt;You get to deal with data sets larger than you could ever process in your mind.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;6. &lt;/span&gt;There are big developments taking place in the industry.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;span style="font-weight: bold;"&gt;7.&lt;/span&gt;&lt;/span&gt; Data mining algorithms attempt to model how things work in biology and the real world (e.g., neural networks and genetic algorithms).&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;8.&lt;/span&gt; There is no one-size-fits-all solution when it comes to data mining.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;9. &lt;/span&gt;You help make the statement "I have more data than I know what to do with" obsolete.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;10. &lt;/span&gt;Your results can make an immediate impact in whatever industry you are involved in.&lt;br /&gt;&lt;br /&gt;Why do you like data mining today? 
What got you interested in the first place?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-3291678775711458268?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/3291678775711458268/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=3291678775711458268' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/3291678775711458268'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/3291678775711458268'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/02/top-10-data-mining-fun-rewarding.html' title='10 Reasons Why Data Mining is Fun and Rewarding'/><author><name>The Boy</name><uri>http://www.blogger.com/profile/02165028123565951287</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-2307310716798834837</id><published>2008-02-22T07:03:00.000-08:00</published><updated>2008-02-22T07:13:47.591-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining careers'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><title type='text'>Data Mining in the Workplace</title><content type='html'>I graduate in a few months and so I've been job hunting lately. I attended the Technical Career Fair here at BYU a few weeks back and I was impressed by the number of companies that were interested in data mining. 
With the exception of one or two companies, they all either were currently involved in data mining or were interested in becoming involved in the near future. I think that as more and more companies amass mounds of data, they are realizing that collecting data for data's sake is useless and that they can get much more out of their data than they have in the past. Data mining is no longer 'a hiss and a byword'. I am witnessing firsthand that it is the direction that many companies are taking to improve the efficiency of their operations.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-2307310716798834837?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/2307310716798834837/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=2307310716798834837' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/2307310716798834837'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/2307310716798834837'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/02/data-mining-in-workplace.html' title='Data Mining in the Workplace'/><author><name>The Boy</name><uri>http://www.blogger.com/profile/02165028123565951287</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-8988117803747794477</id><published>2008-02-20T06:45:00.000-08:00</published><updated>2008-02-20T07:06:06.497-08:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='Data Mining Lab'/><category scheme='http://www.blogger.com/atom/ns#' term='Utah CEO Magazine'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><title type='text'>Our Lab in Utah CEO Magazine</title><content type='html'>The BYU Data Mining Lab is featured in an article published in this month's &lt;a href="http://www.utahceomagazine.com/index.php"&gt;Utah CEO Magazine&lt;/a&gt;. The article, found &lt;a href="http://www.utahceomagazine.com/article.php?id=80"&gt;here,&lt;/a&gt; includes expert opinions from our own Professor Christophe Giraud-Carrier on why finding a champion for data mining within a company is important and how successful data mining is defined. In addition, the article contains a short feature on the lab which explains the benefits students and businesses gain from being involved with the lab. It is exciting to see outside recognition for the great work that goes on here every day.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-8988117803747794477?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/8988117803747794477/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=8988117803747794477' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/8988117803747794477'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/8988117803747794477'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/02/our-lab-in-utah-ceo-magazine.html' title='Our Lab in Utah CEO Magazine'/><author><name>The 
Boy</name><uri>http://www.blogger.com/profile/02165028123565951287</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-2390137646146740316</id><published>2008-02-13T10:45:00.000-08:00</published><updated>2008-02-16T01:35:09.981-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='social capital'/><title type='text'>Social Connections in Decline</title><content type='html'>Robert Putnam, an influential social capital researcher, visited BYU nearly two years ago to discuss how social connections are on the decline.  Here is a good summary of &lt;a href="http://nn.byu.edu/story.cfm/61082"&gt;Putnam's talk&lt;/a&gt; on BYU NewsNet.  His research during the past decade has shown a negative trend in that people are socially connecting less these days.  The speech gave fuel to the research on social networks that we had been involved in and has been a strong motivation for our current work on social capital.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.americanreview.us/putnam5.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px;" src="http://www.americanreview.us/putnam5.gif" alt="" border="0" /&gt;&lt;/a&gt;&lt;strong&gt;&lt;strong&gt;&lt;span style="font-size:85%;"&gt;Figure 1. "The TV Connection" shows that group membership tends to decline as television viewing increases among those having twelve or more years of education. 
&lt;/span&gt;&lt;/strong&gt;&lt;/strong&gt;&lt;span style="font-size:85%;"&gt;(see &lt;a href="http://www.americanreview.us/putnmtv4.htm"&gt;The Strange Disappearance of Civic America&lt;/a&gt;)&lt;/span&gt;&lt;strong&gt;&lt;strong&gt;&lt;br /&gt;&lt;br /&gt;&lt;/strong&gt;&lt;/strong&gt;Empirical studies on group membership, like the study shown in the plot above, contribute to the evidence that Putnam uses to support this claim.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;(Note:  This article was originally posted on &lt;a href="http://dmine.blogspot.com/"&gt;dmine.blogspot.com&lt;/a&gt;)&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-2390137646146740316?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/2390137646146740316/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=2390137646146740316' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/2390137646146740316'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/2390137646146740316'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/02/social-connections-in-decline.html' title='Social Connections in Decline'/><author><name>Matt Smith</name><uri>http://www.blogger.com/profile/00366225861010849516</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' 
src='http://bp0.blogger.com/_I3Dp_Ci1Ie0/SArVcllGRfI/AAAAAAAAC08/SpDdj7CISus/S220/matt_picture2.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-544620849200034373</id><published>2008-02-13T07:27:00.000-08:00</published><updated>2008-02-13T07:38:19.324-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining Search Engine'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining Research'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining Tools'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><title type='text'>Data Mining Search Engine</title><content type='html'>I recently learned at the &lt;a href="http://dataminingresearch.blogspot.com/"&gt; Data Mining Research blog&lt;/a&gt; about a data mining search engine&lt;a href="http://dataminingresearch.blogspot.com/"&gt;&lt;/a&gt;. The search engine, which can be found &lt;a href="http://www.google.com/coop/cse?cx=002173145610235857072%3A1agmeqbmpke"&gt;here&lt;/a&gt;, allows search queries to be performed so that the results come largely from a list of data mining sites. It might prove to be a useful tool for focusing your research on trusted data mining sites, or for discovering new resources in our field of interest. 
Give it a shot. I don't have much experience with custom Google search engines, but it seems useful.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-544620849200034373?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/544620849200034373/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=544620849200034373' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/544620849200034373'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/544620849200034373'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/02/data-mining-search-engine.html' title='Data Mining Search Engine'/><author><name>The Boy</name><uri>http://www.blogger.com/profile/02165028123565951287</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-5273583549011230617</id><published>2008-02-11T06:35:00.000-08:00</published><updated>2008-02-11T07:19:16.815-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Web Mining'/><category scheme='http://www.blogger.com/atom/ns#' term='RSS Feed'/><category scheme='http://www.blogger.com/atom/ns#' term='Blog Entity Resolution'/><category scheme='http://www.blogger.com/atom/ns#' term='Social Network Analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><category scheme='http://www.blogger.com/atom/ns#' term='ATOM Feed'/><title 
type='text'>Resolving Blog Entities</title><content type='html'>&lt;span style="font-weight: bold;"&gt;Problem: &lt;/span&gt;How do you determine whether a particular url is associated with a feed? For example, if another blog posted a link to datamininglab.blogspot.com, how would you determine the feed (http://datamininglab.blogspot.com/feeds/posts/default) associated with that url?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Solution:&lt;/span&gt; In our research we perform two operations to determine whether a url has an associated feed. First, we determine whether the url represents an actual feed. This can usually be determined by submitting an http request and checking the content-type header included in the response. If the content-type is "application/rss+xml", "application/atom+xml", "application/rdf+xml" or "text/xml" then you are probably dealing with a feed.&lt;br /&gt;Second, you need to check whether the url is not itself a feed, but is associated with one. This would be the case when a url points to the front page or a specific entry of a blog. If the content-type in the http response, as described in step one, did not indicate a feed, then you would parse the "link" tags found between the "head" tags. If a "link" tag has a "rel=alternate" attribute then you can check the type attribute to see if it has a value equal to "application/rss+xml" or "application/atom+xml", similar to what we did in step one. If it does, then you can read the value of the href attribute to retrieve the feed url associated with the url. For example, on the main page of our blog, if you look at the page source, you will see link tags to both the rss and atom feeds associated with our blog.&lt;br /&gt;There are certainly other ways of resolving blog entities, but this seems to work fairly consistently. 
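&lt;br /&gt;&lt;br /&gt;The two steps above can be sketched in Python roughly as follows. This is only a minimal sketch using the standard library, not the code we run in the lab: the names is_feed_content_type, FeedLinkParser, find_feed_links and resolve_feed are made up for illustration, and for simplicity it scans every link tag in the page rather than only those between the head tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Content types named in the post as signs that a url is itself a feed.
FEED_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/rdf+xml",
    "text/xml",
}
# Types checked on link tags in step two.
LINK_FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


def is_feed_content_type(header_value):
    """Step one: compare the media type, ignoring parameters such as charset."""
    media_type = (header_value or "").split(";")[0].strip().lower()
    return media_type in FEED_TYPES


class FeedLinkParser(HTMLParser):
    """Step two: collect href values of rel=alternate link tags with a feed type."""

    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None (bare attributes), so normalise them.
        attributes = {name: (value or "") for name, value in attrs}
        rel_values = attributes.get("rel", "").lower().split()
        link_type = attributes.get("type", "").strip().lower()
        href = attributes.get("href", "")
        if "alternate" in rel_values and link_type in LINK_FEED_TYPES and href:
            self.feed_urls.append(href)


def find_feed_links(html_text, base_url):
    """Return absolute feed urls advertised by link tags in html_text."""
    parser = FeedLinkParser()
    parser.feed(html_text)
    return [urljoin(base_url, href) for href in parser.feed_urls]


def resolve_feed(url):
    """Return candidate feed urls for url (performs a live http request)."""
    with urlopen(url, timeout=10) as response:
        content_type = response.headers.get("Content-Type", "")
        if is_feed_content_type(content_type):
            return [url]
        charset = response.headers.get_content_charset() or "utf-8"
        html_text = response.read().decode(charset, errors="replace")
    return find_feed_links(html_text, url)
```

Calling resolve_feed on a blog's front page should return the feed urls advertised in its link tags, while calling it on a url that is already served as a feed simply returns that url.&lt;br /&gt;&lt;br /&gt;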
Feel free to chime in if you have any ideas on how to better accomplish this task.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-5273583549011230617?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/5273583549011230617/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=5273583549011230617' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/5273583549011230617'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/5273583549011230617'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/02/resolving-blog-entities.html' title='Resolving Blog Entities'/><author><name>The Boy</name><uri>http://www.blogger.com/profile/02165028123565951287</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-6328281172500410946</id><published>2008-02-06T06:34:00.000-08:00</published><updated>2008-02-06T07:41:43.705-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining Lab'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><category scheme='http://www.blogger.com/atom/ns#' term='Research Opportunities'/><title type='text'>Lab Spotlight</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_wzHaXkbDbnQ/R6nTi-MEPZI/AAAAAAAAAcQ/UmzUU_u_2wk/s1600-h/Picture+004.jpg"&gt;&lt;img 
style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://bp2.blogger.com/_wzHaXkbDbnQ/R6nTi-MEPZI/AAAAAAAAAcQ/UmzUU_u_2wk/s320/Picture+004.jpg" alt="" id="BLOGGER_PHOTO_ID_5163891045519605138" border="0" /&gt;&lt;/a&gt;At first glance, the Data Mining Lab looks like your average computer science research lab. But conducting research in the data mining lab is not your average research experience. I'll explain what I mean by citing three areas that help make our lab experience great.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The People&lt;/span&gt; - French, Spanish, Portuguese, Korean, Hmong, Tagalog, Chinese. All are languages spoken by members of the data mining lab. It is one example of the diverse capabilities belonging to members of the lab. We love computer science. We love data mining. We love python, open source and neural nets (if it's possible to love a neural net). But we also appreciate politics, religion, cooking and anything else that is meaningful...or at least interesting. I remember heated discussions last semester about state-funded private school vouchers, the political primaries and caucuses, and the grammatically correct way to use the word "good." Being in the lab every day has enriched my views on the world and made me a more well-rounded person.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The Activities&lt;/span&gt; - All work and no play? Makes you want to run away. Which is why we appreciate the many activities of the data mining lab. Each week we have a lab meeting/potluck where we discuss our progress. Curry, kimchi or burritos would make any meeting more exciting. Occasionally we will attend the campus devotionals/forums or the department colloquiums together as a lab. For example, last week our lab went and listened to Paul Rusesabagina tell his inspirational story, which was portrayed in the movie Hotel Rwanda. 
In addition to these weekly activities, each semester we meet together at our advisor Christophe's home for a lab social. The food is always fabulous and we even get to bring along our families, which helps us bond even more. All of these activities help to make our lab experience unique.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The Research&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;Environment&lt;/span&gt; - When it comes down to it, the reason we are all here is that we enjoy researching data mining, and the data mining lab is the perfect place to do it. We are given flexibility to research what is most interesting to us and are given the tools to be successful. This is mostly due to our advisor Christophe, who is flexible and supportive of our aspirations, while also helping us to investigate the feasibility and usefulness of the research topics we are considering. When we run into problems in our research, there is almost always someone in the lab with a helpful idea or suggestion. You may begin discussing a question with one lab member, but it usually isn't long before the whole lab is involved. There are also plenty of opportunities to publish papers, present at conferences and work with real companies. I can't think of a better environment for conducting research than we have here.&lt;br /&gt;&lt;br /&gt;These are just a few of the reasons why it is awesome to be a member of the data mining lab. 
This is the place to be.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-6328281172500410946?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/6328281172500410946/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=6328281172500410946' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/6328281172500410946'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/6328281172500410946'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/02/lab-spotlight.html' title='Lab Spotlight'/><author><name>The Boy</name><uri>http://www.blogger.com/profile/02165028123565951287</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp2.blogger.com/_wzHaXkbDbnQ/R6nTi-MEPZI/AAAAAAAAAcQ/UmzUU_u_2wk/s72-c/Picture+004.jpg' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-7143891851928227717</id><published>2008-02-05T14:43:00.000-08:00</published><updated>2008-02-07T14:23:35.204-08:00</updated><title type='text'>Meta-learning</title><content type='html'>I just finished reading Rice's seminal paper on algorithm selection [Rice, J.R. (1976). The Algorithm Selection Problem. &lt;span class="Apple-style-span" style="font-style: italic;"&gt;Advances in Computers&lt;/span&gt;, 15:65-118]. 
For obvious reasons, it does not talk about meta-learning (look at the date!) but meta-learning is clearly one natural approach to solving the algorithm selection problem.&lt;div&gt;&lt;a href="http://www.deakin.edu.au/%7Ekatesm"&gt;Kate Smith-Miles&lt;/a&gt; recently wrote a very nice survey paper (to appear in &lt;span class="Apple-style-span" style="font-style: italic;"&gt;ACM Computing Surveys&lt;/span&gt;) where she uses Rice's framework to review and describe most known attempts at algorithm selection.&lt;div&gt;Rice does indeed offer a very clean formalism for the problem of algorithm selection, where a problem &lt;span class="Apple-style-span" style="font-style: italic;"&gt;X&lt;/span&gt; from some problem space &lt;span class="Apple-style-span" style="font-style: italic;"&gt;P&lt;/span&gt; is mapped, via some feature extraction process, to a representation &lt;span class="Apple-style-span" style="font-style: italic;"&gt;f(X)&lt;/span&gt; in some feature space &lt;span class="Apple-style-span" style="font-style: italic;"&gt;F&lt;/span&gt;, and the selection algorithm &lt;span class="Apple-style-span" style="font-style: italic;"&gt;S&lt;/span&gt; maps &lt;span class="Apple-style-span" style="font-style: italic;"&gt;f(X)&lt;/span&gt; to some algorithm Y in some algorithm space &lt;span class="Apple-style-span" style="font-style: italic;"&gt;A&lt;/span&gt;, so that the performance of &lt;span class="Apple-style-span" style="font-style: italic;"&gt;Y&lt;/span&gt; on &lt;span class="Apple-style-span" style="font-style: italic;"&gt;X&lt;/span&gt; (for some adequately chosen performance measure) is in some sense optimal. 
Hence, as Rice points out, "the selection mapping now depends only on the features &lt;span class="Apple-style-span" style="font-style: italic;"&gt;f(X)&lt;/span&gt;, yet the performance mapping still depends on the problem &lt;span class="Apple-style-span" style="font-style: italic;"&gt;X&lt;/span&gt;" and, of course, "the determination of the best (or even good) features is one of the most important, yet nebulous, aspects of the algorithm selection process."&lt;/div&gt;&lt;div&gt;Rice is also quick to point out that "ideally, those problems with the same features would have the same performance for any algorithm being considered." I actually also pointed that out in my recent paper [Giraud-Carrier, C. (2005). The Data Mining Advisor: Meta-learning at the Service of Practitioners. In &lt;span class="Apple-style-span" style="font-style: italic;"&gt;Proceedings of the 4th International Conference on Machine Learning Applications&lt;/span&gt;, 113-119] where I stated that unless for all &lt;span class="Apple-style-span" style="font-style: italic;"&gt;X&lt;/span&gt; and &lt;span class="Apple-style-span" style="font-style: italic;"&gt;X'&lt;/span&gt; (&lt;span class="Apple-style-span" style="font-style: italic;"&gt;X&lt;/span&gt; ≠ &lt;span class="Apple-style-span" style="font-style: italic;"&gt;X'&lt;/span&gt;), &lt;span class="Apple-style-span" style="font-style: italic;"&gt;f(X)=f(X')&lt;/span&gt; implies &lt;span class="Apple-style-span" style="font-style: italic;"&gt;p(X)=p(X') &lt;/span&gt;(where &lt;span class="Apple-style-span" style="font-style: italic;"&gt;p&lt;/span&gt; is the performance measure) then the meta-training set may be noisy and meta-learning may in turn be sub-optimal.&lt;/div&gt;&lt;div&gt;Rice's framework naturally covers various forms of selection (e.g., best algorithm, best algorithm for a subclass of problems, etc.) 
as well as multi-criteria performance measures.&lt;/div&gt;&lt;div&gt;Another important point brought out by Rice, and often overlooked in the machine learning community, is that "most algorithms are developed for a particular class of problems even though the class is never explicitly defined. Thus the performance of algorithms is unlikely to be understood without some idea of the problem class associated with their development." &lt;a href="http://pages.stern.nyu.edu/%7Efprovost/"&gt;Foster Provost&lt;/a&gt; and I called that the Strong Assumption of Machine Learning in our &lt;a href="http://dml.cs.byu.edu/%7Ecgc/pubs/ICML2005WS.pdf"&gt;paper on the justification of meta-learning&lt;/a&gt; [Giraud-Carrier, C. and Provost, F. (2005). Towards a Justification of Meta-learning: Is the No Free Lunch Theorem a Show-stopper. In &lt;span class="Apple-style-span" style="font-style: italic;"&gt;Proceedings of the ICML-05 Workshop on Meta-learning&lt;/span&gt;, 12-19]. I (and others) have often argued that the notion of delimiting the class of problems on which an algorithm performs well is critical to advances in machine learning.&lt;/div&gt;&lt;div&gt;Anyway, although Rice offers no specific method to solve the algorithm selection problem, the paper is highly relevant and very well-written. 
A must read for anyone interested in meta-learning.&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-7143891851928227717?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/7143891851928227717/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=7143891851928227717' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/7143891851928227717'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/7143891851928227717'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/02/i-just-finished-reading-rices-seminal.html' title='Meta-learning'/><author><name>Christophe Giraud-Carrier</name><uri>http://www.blogger.com/profile/17672899844586725651</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-3515868951897424101</id><published>2008-02-04T15:20:00.000-08:00</published><updated>2008-02-04T15:26:09.953-08:00</updated><title type='text'>Social Capital Simulation</title><content type='html'>Our recent work has explored the concept of social capital, which I have &lt;a href="http://dmine.blogspot.com/search/label/social%20capital"&gt;discussed previously&lt;/a&gt;. 
Our social capital metrics, namely bonding and bridging (popularized by &lt;a href="http://en.wikipedia.org/wiki/Robert_Putnam"&gt;Robert Putnam&lt;/a&gt;), utilize the &lt;a href="http://dml.cs.byu.edu/%7Esmitty/publications/IAN-SIP08.pdf"&gt;hybrid network methodology&lt;/a&gt; that we have developed for online communities.&lt;br /&gt;&lt;br /&gt;To help explain our metrics, I have created a basic &lt;a style="font-weight: bold;" href="http://dml.cs.byu.edu/matthewsmith/docs/social_capital_simulation.xls"&gt;social capital simulation&lt;/a&gt; (an Excel spreadsheet) having five nodes. The simulation allows you to change the connection strengths in both the implicit affinity network (IAN) and the explicit social network (ESN). Changing these values will give you an idea of how social capital fluctuates as the social network changes.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_I3Dp_Ci1Ie0/R6eXqjzZ8OI/AAAAAAAACrg/0IzwEu7V-Gg/s1600-h/sample.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp0.blogger.com/_I3Dp_Ci1Ie0/R6eXqjzZ8OI/AAAAAAAACrg/0IzwEu7V-Gg/s400/sample.png" alt="" id="BLOGGER_PHOTO_ID_5163262255224713442" border="0" /&gt;&lt;/a&gt;The figure above shows the initial configuration of the simulation.  The dashed &lt;span style="color: rgb(0, 0, 153);"&gt;blue&lt;/span&gt; lines represent the IAN and the solid &lt;span style="color: rgb(255, 102, 102);"&gt;pink&lt;/span&gt; lines represent the ESN. The thicker the line, the stronger the connection. The weights for the IAN were randomly assigned, while the ESN weights were all set to one, thus creating a clique.&lt;br /&gt;&lt;br /&gt;Initially, the bonding and bridging social capital are both 1, since everyone in the network is connected. 
To see how the social capital fluctuates, change the &lt;span style="color: rgb(0, 0, 153);"&gt;blue&lt;/span&gt; and/or &lt;span style="color: rgb(255, 102, 102);"&gt;pink&lt;/span&gt; values, again representing the IAN and the ESN weights respectively, in the spreadsheet.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;(Note:  This article was originally posted on &lt;a href="http://dmine.blogspot.com/"&gt;dmine.blogspot.com&lt;/a&gt;)&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-3515868951897424101?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/3515868951897424101/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=3515868951897424101' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/3515868951897424101'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/3515868951897424101'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/02/social-capital-simulation.html' title='Social Capital Simulation'/><author><name>Matt Smith</name><uri>http://www.blogger.com/profile/00366225861010849516</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' src='http://bp0.blogger.com/_I3Dp_Ci1Ie0/SArVcllGRfI/AAAAAAAAC08/SpDdj7CISus/S220/matt_picture2.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp0.blogger.com/_I3Dp_Ci1Ie0/R6eXqjzZ8OI/AAAAAAAACrg/0IzwEu7V-Gg/s72-c/sample.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-4899827969433141873</id><published>2008-02-04T06:15:00.000-08:00</published><updated>2008-02-04T09:45:01.985-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Blog Content'/><category scheme='http://www.blogger.com/atom/ns#' term='Google Reader API'/><category scheme='http://www.blogger.com/atom/ns#' term='pyrfeed'/><category scheme='http://www.blogger.com/atom/ns#' term='Problems and Solutions'/><category scheme='http://www.blogger.com/atom/ns#' term='Social Network Analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><category scheme='http://www.blogger.com/atom/ns#' term='Google Reader'/><title type='text'>Google Reader API</title><content type='html'>&lt;span style="font-weight: bold;"&gt;Problem&lt;/span&gt;: To perform social network analysis on blog data, you need consistent data over a period of time. Periodically retrieving the content directly from the blog's feed has its limitations because you can only retrieve current blog content. Thus, if you decide to begin retrieving content from a specific blog, you have no way of getting at the archived blog content.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Solution&lt;/span&gt;: Use the unofficial Google Reader API to retrieve archived feed content. The API was first documented two years ago at &lt;a href="http://www.niallkennedy.com/blog/2005/12/google-reader-api.html"&gt;Niall Kennedy's blog&lt;/a&gt; and its existence was confirmed by several Google employees associated with the project. Little information has been published since as to an official release of the API, but the unofficial API still works well for retrieving archived feed content.&lt;br /&gt;&lt;br /&gt;In our research, the framework we use for interacting with the API is &lt;a style="font-style: italic;" href="http://code.google.com/p/pyrfeed/"&gt;pyrfeed&lt;/a&gt;. 
The creators of &lt;span style="font-style: italic;"&gt;pyrfeed&lt;/span&gt; also wrote some &lt;a href="http://code.google.com/p/pyrfeed/wiki/GoogleReaderAPI"&gt;additional documentation on the capabilities of the API&lt;/a&gt;. The Google Code site has &lt;a href="http://code.google.com/p/pyrfeed/downloads/list"&gt;two downloadable files&lt;/a&gt;. The Google Reader standalone package is a simple interface for interacting with the API to perform simple actions such as feed retrieval. The other file, which is the full &lt;span style="font-style: italic;"&gt;pyrfeed&lt;/span&gt; release, also provides GUI and command-line interfaces for interacting with the API, as well as automated blog content storage in a SQLite3 database. An example of how to interact with the Google Reader standalone package can be seen below.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_wzHaXkbDbnQ/R6cnYeMEPXI/AAAAAAAAAcA/JUojt40MkPY/s1600-h/pyrfeedexample.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp2.blogger.com/_wzHaXkbDbnQ/R6cnYeMEPXI/AAAAAAAAAcA/JUojt40MkPY/s400/pyrfeedexample.jpg" alt="" id="BLOGGER_PHOTO_ID_5163138799177579890" border="0" /&gt;&lt;/a&gt;In summary, if you are looking for a simple way to retrieve archived blog content, the Google Reader API and &lt;span style="font-style: italic;"&gt;pyrfeed&lt;/span&gt; framework are cheap and easy tools for doing so. 
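&lt;br /&gt;&lt;br /&gt;For readers who want a feel for what a request for archived content might look like without going through pyrfeed, here is a minimal Python sketch. It only builds a request URL and sends nothing. The base path and the n (item count) and c (continuation token) parameter names are assumptions based on the unofficial documentation linked above, and the function name is hypothetical. Because the API is unofficial, any of this may change or stop working without notice.

```python
from urllib.parse import quote, urlencode

# Base path as reported in the unofficial documentation (an assumption).
READER_ATOM_FEED = "http://www.google.com/reader/atom/feed/"


def archived_feed_url(feed_url, count=20, continuation=None):
    """Build the URL for one page of archived entries of a feed.

    feed_url is percent-encoded and appended to the base path.
    count and continuation map to the assumed n and c parameters.
    """
    params = {"n": count}
    if continuation is not None:
        params["c"] = continuation
    return READER_ATOM_FEED + quote(feed_url, safe="") + "?" + urlencode(params)


url = archived_feed_url(
    "http://datamininglab.blogspot.com/feeds/posts/default", count=100
)
print(url)
```

Paging further back through the archive would then amount to passing the continuation token from one response into the next call.&lt;br /&gt;&lt;br /&gt;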
The blogosphere is at your fingertips.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-4899827969433141873?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/4899827969433141873/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=4899827969433141873' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/4899827969433141873'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/4899827969433141873'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/02/google-reader-api.html' title='Google Reader API'/><author><name>The Boy</name><uri>http://www.blogger.com/profile/02165028123565951287</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp2.blogger.com/_wzHaXkbDbnQ/R6cnYeMEPXI/AAAAAAAAAcA/JUojt40MkPY/s72-c/pyrfeedexample.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-5681158107124167734</id><published>2008-01-30T06:44:00.000-08:00</published><updated>2008-01-30T07:18:14.759-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Social Network Analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='Robert Putnam'/><category scheme='http://www.blogger.com/atom/ns#' term='social capital'/><category scheme='http://www.blogger.com/atom/ns#' term='Blogosphere'/><title type='text'>Social 
Capital?</title><content type='html'>My last post was titled &lt;a href="http://datamininglab.blogspot.com/2008/01/social-capital-in-blogosphere.html"&gt;Social Capital in the Blogosphere&lt;/a&gt; and dealt with the experiments we are conducting into the social capital found in blog networks. When you saw the title, some of you probably wondered: social capital... what's that? I do not claim to be an expert on social capital, but I have a fair idea of what it is and how it is useful. Interpretations vary, but our idea of social capital has been motivated by that of Robert Putnam, author of &lt;a href="http://www.bowlingalone.com/"&gt;Bowling Alone&lt;/a&gt;, who came and &lt;a href="http://nn.byu.edu/story.cfm/61082"&gt;spoke here at BYU last year&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;In many realms, who you know may matter just as much as what you know. The value represented by these connections in a social network is known as the social capital of that network. In our work, we compute the social capital of a blog network by using a mathematical formula that takes into account both the actual and potential bonding (connections with similar people) and bridging (connections with dissimilar people) of blog networks. A more detailed description of this formula can be found &lt;a href="http://dml.cs.byu.edu/%7Esmitty/publications/IAN-SIP08.pdf"&gt;in this paper&lt;/a&gt;. Matt also recently posted about other ways of measuring social capital. His findings can be found &lt;a href="http://dmine.blogspot.com/2008/01/social-capital-measurement.html"&gt;here&lt;/a&gt; and &lt;a href="http://dmine.blogspot.com/2008/01/measuring-social-capital-weekly-update.html"&gt;here&lt;/a&gt;. The social capital of a network can then be used to estimate how valuable it would be to build further connections in that network. In our example, it would tell you whether or not you should attempt to establish a place in a certain blog network. 
Thanks for your comments. Hopefully this gives you a good introduction to social capital in the context in which we are using it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-5681158107124167734?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/5681158107124167734/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=5681158107124167734' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/5681158107124167734'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/5681158107124167734'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/01/social-capital.html' title='Social Capital?'/><author><name>The Boy</name><uri>http://www.blogger.com/profile/02165028123565951287</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-4174666411627124097</id><published>2008-01-28T06:41:00.000-08:00</published><updated>2008-01-29T13:56:47.543-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Implicit Connections'/><category scheme='http://www.blogger.com/atom/ns#' term='Blogs'/><category scheme='http://www.blogger.com/atom/ns#' term='Implicit Affinity Networks'/><category scheme='http://www.blogger.com/atom/ns#' term='Explicit Connections'/><category scheme='http://www.blogger.com/atom/ns#' term='Social Network Analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='Data 
Mining'/><category scheme='http://www.blogger.com/atom/ns#' term='social capital'/><category scheme='http://www.blogger.com/atom/ns#' term='Blogosphere'/><title type='text'>Social Capital in the Blogosphere</title><content type='html'>For the past year, Matt and I (Nate) have been conducting research into the social capital that can be found online in blog networks. Why should you care? Well, first off, you're reading a blog, so you must have some interest in the overall blogging community. But more importantly, blogs are being used to establish the identity of people, places and products. In today's online age, the potential of blogs is tremendous. Here's a quick summary of our research.&lt;br /&gt;&lt;br /&gt;What: Analysis has been done on the explicit connections (links, comments, friend lists) between blogs. Little work has been done regarding the implicit connections (interests, hobbies, location) that exist between blog authors. We are conducting research into methods of using both explicit and implicit connections in social network analysis.&lt;br /&gt;&lt;br /&gt;How: We have retrieved a large archive of blog content for use in our research. An explicit social network is created from the hyperlinks found in the blog content. Using topic extraction methods such as &lt;a href="http://www.cs.princeton.edu/%7Eblei/papers/BleiNgJordan2003.pdf"&gt;Latent Dirichlet Allocation&lt;/a&gt;, a network of implicit connections is constructed. Overlaying the implicit network on the explicit network allows potential and actual connections, or social capital, to be identified. 
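&lt;br /&gt;&lt;br /&gt;The overlay step can be sketched roughly as follows in Python. This is a simplified illustration and not our actual pipeline: it assumes each blog already has a topic distribution (for example, from LDA), treats two blogs as implicitly connected when the cosine similarity of their distributions reaches a hypothetical threshold, and then compares those pairs against the hyperlink pairs. The function names, blog names and numbers are all made up.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def overlay(topic_mix, links, threshold=0.8):
    """Split implicitly similar blog pairs into actual and potential ties.

    topic_mix: blog name mapped to its topic distribution.
    links: set of frozenset({a, b}) pairs joined by a hyperlink.
    """
    implicit = {
        frozenset((a, b))
        for a, b in combinations(sorted(topic_mix), 2)
        if cosine(topic_mix[a], topic_mix[b]) >= threshold
    }
    actual = implicit.intersection(links)     # similar and already linked
    potential = implicit.difference(links)    # similar but not yet linked
    return actual, potential


topic_mix = {
    "blog_a": [0.9, 0.1, 0.0],
    "blog_b": [0.8, 0.2, 0.0],
    "blog_c": [0.7, 0.2, 0.1],
    "blog_d": [0.0, 0.1, 0.9],
}
links = {frozenset(("blog_a", "blog_b")), frozenset(("blog_a", "blog_d"))}

actual, potential = overlay(topic_mix, links)
print(sorted(sorted(p) for p in actual))
print(sorted(sorted(p) for p in potential))
```

A hard threshold is the simplest choice here; keeping the similarity as an edge weight instead would preserve more information about how strong each implicit tie is.&lt;br /&gt;&lt;br /&gt;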
An example graphical representation of one of these networks, which we created using &lt;a href="http://www.cytoscape.org/"&gt;Cytoscape&lt;/a&gt;, is found below.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_wzHaXkbDbnQ/R531F-MEPQI/AAAAAAAAAaw/V23uyBjeA6o/s1600-h/network.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp2.blogger.com/_wzHaXkbDbnQ/R531F-MEPQI/AAAAAAAAAaw/V23uyBjeA6o/s400/network.jpg" alt="" id="BLOGGER_PHOTO_ID_5160550230978215170" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Why: Information about actual social communities, and the implicit similarities that exist between them, can be used to recommend potentially valuable actions. For example, a politician could contact influential blogs and attempt to convince them to lead a grassroots campaign for his candidacy. A doctor could use social network analysis to identify and coordinate with colleagues in order to help patients with rare diseases. Companies could approach blogs that are found in the center of their customer market about participating in usability testing or marketing campaigns. 
Conducting social network analysis on blogging communities has valuable potential in many domains.&lt;br /&gt;&lt;br /&gt;You can learn more of the details about our research &lt;a href="http://dml.cs.byu.edu/wiki/index.php/Social_Capital_in_Online_Communities"&gt;here&lt;/a&gt; at our lab wiki.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-4174666411627124097?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/4174666411627124097/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=4174666411627124097' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/4174666411627124097'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/4174666411627124097'/><link rel='alternate' type='text/html' href='http://datamininglab.blogspot.com/2008/01/social-capital-in-blogosphere.html' title='Social Capital in the Blogosphere'/><author><name>The Boy</name><uri>http://www.blogger.com/profile/02165028123565951287</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp2.blogger.com/_wzHaXkbDbnQ/R531F-MEPQI/AAAAAAAAAaw/V23uyBjeA6o/s72-c/network.jpg' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5365131201156021161.post-8065607826734439882</id><published>2008-01-23T06:58:00.000-08:00</published><updated>2008-01-29T13:56:17.769-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='Blogs'/><category scheme='http://www.blogger.com/atom/ns#' term='Meta Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Record Linkage'/><category scheme='http://www.blogger.com/atom/ns#' term='Family History'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><category scheme='http://www.blogger.com/atom/ns#' term='BYU'/><category scheme='http://www.blogger.com/atom/ns#' term='Machine Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Artificial Intelligence'/><category scheme='http://www.blogger.com/atom/ns#' term='Blogosphere'/><category scheme='http://www.blogger.com/atom/ns#' term='Computer Science'/><category scheme='http://www.blogger.com/atom/ns#' term='Genealogy'/><category scheme='http://www.blogger.com/atom/ns#' term='Knowledge Discovery'/><category scheme='http://www.blogger.com/atom/ns#' term='Transfer Learning'/><title type='text'>What is the Data Mining Lab?</title><content type='html'>The &lt;a href="http://dml.cs.byu.edu/"&gt;Data Mining Lab&lt;/a&gt; is a research lab hosted by the &lt;a href="http://cs.byu.edu/"&gt;Computer Science Department&lt;/a&gt; at &lt;a href="http://www.byu.edu/"&gt;Brigham Young University&lt;/a&gt;. We research methods for extracting valuable knowledge from data. &lt;a href="http://en.wikipedia.org/wiki/Data_mining"&gt;Data mining&lt;/a&gt; can be applied to a wide range of business and scientific problems. 
Almost everyone gathers data, and we go about finding ways to make that data useful.&lt;br /&gt;&lt;br /&gt;Current areas of research include &lt;ul&gt;&lt;li&gt;&lt;a href="http://dml.cs.byu.edu/wiki/index.php/Social_Capital_in_Online_Communities"&gt;Social Capital in Online Communities&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://dml.cs.byu.edu/wiki/index.php/Genealogical_Record_Linkage"&gt;Genealogical Record Linkage&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Transfer_learning"&gt;Transfer Learning&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://dml.cs.byu.edu/wiki/index.php/Meta-learning"&gt;Meta Learning&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;The purpose of this blog is to establish connections and further our collaboration with others who share our interests. We will be publishing information that we have gained from our research, and we invite others to share their insights here as well. Feel free to contact us by posting comments or by email. More contact information can be found at our lab website, located at &lt;a href="http://dml.cs.byu.edu/"&gt;http://dml.cs.byu.edu&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5365131201156021161-8065607826734439882?l=datamininglab.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://datamininglab.blogspot.com/feeds/8065607826734439882/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5365131201156021161&amp;postID=8065607826734439882' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/8065607826734439882'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5365131201156021161/posts/default/8065607826734439882'/><link rel='alternate' type='text/html' 
href='http://datamininglab.blogspot.com/2008/01/what-is-data-mining-lab.html' title='What is the Data Mining Lab?'/><author><name>The Boy</name><uri>http://www.blogger.com/profile/02165028123565951287</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry></feed>
