Monday, February 11, 2008

Resolving Blog Entities

Problem: How do you determine whether a particular URL is associated with a feed? For example, if another blog posted a link to datamining.blogspot.com, how would you determine the feed (http://datamininglab.blogspot.com/feeds/posts/default) associated with that URL?

Solution: In our research we perform two checks to determine whether a URL has an associated feed. First, we determine whether the URL itself points to a feed. This can usually be determined by sending an HTTP request and checking the Content-Type header of the response. If the content type is "application/rss+xml", "application/atom+xml", "application/rdf+xml" or "text/xml", then you are probably dealing with a feed.
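As a rough illustration of this first check, here is a minimal Python sketch. The function names (is_feed_content_type, url_is_feed) and the timeout value are our own choices for the example, not part of any particular library. It also strips parameters such as "; charset=utf-8" from the header before comparing, which is an assumption about how servers typically send it.

```python
import urllib.request

# Content types that usually indicate a feed (the list from the text above).
FEED_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/rdf+xml",
    "text/xml",
}


def is_feed_content_type(content_type):
    """Return True if a Content-Type header value names one of the feed types."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in FEED_TYPES


def url_is_feed(url, timeout=10):
    """Request the URL and test the Content-Type header of the response.

    Hypothetical helper: it performs a real network request and does no
    error handling, so treat it as a sketch only.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return is_feed_content_type(response.headers.get("Content-Type"))
```

Keep in mind that "text/xml" is also used for XML documents that are not feeds, which is why the result is only a "probably".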
Second, if the URL is not itself a feed, you need to check whether it is associated with one. This is the case when the URL points to the front page or a specific entry of a blog. If the Content-Type in the HTTP response, as described in step one, does not indicate a feed, then you parse the "link" tags found between the "head" tags. If a "link" tag has a rel="alternate" attribute, check whether its type attribute is "application/rss+xml" or "application/atom+xml", similar to what we did in step one. If it is, the value of the href attribute gives the feed URL associated with the page. For example, on the main page of our blog, if you look at the page source, you will see link tags to both the RSS and Atom feeds associated with our blog.
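The second step could be sketched in Python roughly as follows, using the standard library's html.parser. The class and function names (FeedLinkParser, find_feed_links) are made up for this example. Resolving a relative href against the page URL with urljoin is our own addition, on the assumption that some blogs publish relative feed links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Feed types to look for in the "type" attribute of a link tag.
FEED_LINK_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect feed URLs from <link rel="alternate"> tags inside <head>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_head = False
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            self._check_link(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

    def _check_link(self, attrs):
        # Attribute values can be None when written without a value.
        rel = (attrs.get("rel") or "").lower().split()
        link_type = (attrs.get("type") or "").strip().lower()
        href = attrs.get("href")
        if "alternate" in rel and link_type in FEED_LINK_TYPES and href:
            # Resolve relative hrefs against the page URL (our assumption).
            self.feeds.append(urljoin(self.base_url, href))


def find_feed_links(html, base_url):
    """Return the feed URLs advertised in the head of an HTML page."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

A page that advertises both an RSS and an Atom feed will yield two entries, in the order they appear in the page source.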
There are certainly other ways to resolve blog entities, but this approach seems to work fairly consistently. Feel free to chime in if you have any ideas on how to better accomplish this task.
