URL Canonicalization: Stop the Dupes!

Aggregation is the name of the game these days, and a big problem for sites like RSSmeme and ReadBurner is dealing with duplicates. How do you know for sure that you have all the shares for a given URL? What about services like FeedBurner or TinyURL, which use redirects to get you where you want to go? Enter URL canonicalization.

Canonicalizing something means finding its “standard state”. So when you canonicalize a URL, you want to find the URL that you finally end up on after every redirect has been followed. If you don’t canonicalize URLs before aggregating, you end up with duplicates; maybe some users shared through a blog’s old RSS feed while others are using FeedBurner. When you have duplicates, the shares for one story are split across several URLs, so you can’t reliably count how many times it has been shared. This skews your data and makes features that depend on that count, like RSSmeme’s widget, impossible to build.
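To make the counting problem concrete, here is a minimal sketch in Python. The URLs and the `CANONICAL` lookup table are made up for illustration (a real aggregator would resolve redirects instead of using a hard-coded table), and `count_shares` is just a name I picked for this example.

```python
from collections import Counter

# Hypothetical mapping from a feed-proxy URL to the page it redirects to.
# A real system would discover this by following redirects.
CANONICAL = {
    "http://feeds.example.com/~r/blog/~3/123": "http://blog.example.com/post",
}

def count_shares(shared_urls, canonical=CANONICAL):
    """Count shares per story, keyed by canonical URL when one is known."""
    return Counter(canonical.get(url, url) for url in shared_urls)

shares = [
    "http://blog.example.com/post",             # shared from the blog's own feed
    "http://feeds.example.com/~r/blog/~3/123",  # shared through a feed proxy
    "http://blog.example.com/post",
]

raw_counts = Counter(shares)          # one story split across two keys
story_counts = count_shares(shares)   # one story, one key
```

Without canonicalization the story shows up as two entries with partial counts; with it, all three shares land on the same URL.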

How do you canonicalize? Well, the easiest way is to just do a HEAD request to any URL that looks fishy. On RSSmeme, if a URL starts with http://feeds. or http://rss. then I do a HEAD request to that URL and follow the redirect until I land on the canonical URL. (A HEAD request returns only the headers, so you get the redirect target without downloading the page.) If you do this for every URL you are going to have performance issues, so just choose the usual suspects.
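Here is a rough sketch of that approach using only Python's standard library. It is not RSSmeme's actual code. The function names (`looks_fishy`, `head_request`, `canonicalize`), the hop limit, and the timeout are assumptions of mine; only the two URL prefixes come from the description above.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# The "usual suspects" named above; extend as needed.
SUSPECT_PREFIXES = ("http://feeds.", "http://rss.")
REDIRECT_CODES = {301, 302, 303, 307, 308}

def looks_fishy(url):
    """True if the URL matches one of the suspect prefixes."""
    return url.startswith(SUSPECT_PREFIXES)

def head_request(url, timeout=5):
    """Send a single HEAD request; return (status, Location header or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()

def canonicalize(url, fetch_head=head_request, max_hops=5):
    """Follow redirects for suspect URLs; return other URLs unchanged."""
    if not looks_fishy(url):
        return url  # skip the network round trip for ordinary URLs
    for _ in range(max_hops):
        try:
            status, location = fetch_head(url)
        except (OSError, http.client.HTTPException):
            break  # network trouble: keep the best URL found so far
        if status not in REDIRECT_CODES or not location:
            break  # not a redirect, so this is where we ended up
        url = urljoin(url, location)  # Location may be relative
    return url
```

Calling `canonicalize(url)` on each incoming share before storing it means only URLs matching the prefixes cost a network request. The `max_hops` cap is there so a redirect loop can't hang the aggregator, and `fetch_head` is a parameter so the network call can be swapped out in tests.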

Sites like FriendFeed don’t need to do this. But RSSmeme and ReadBurner live and die by the counter. RSSmeme currently canonicalizes URLs; it doesn’t look like ReadBurner does yet, but maybe this post will enlighten them and anyone else considering entering this area.

Congratulations on the launch, ReadBurner!
