Google blocking certain User-Agents

If you're writing a tool that retrieves information from Google, make sure to explicitly set the User-Agent string to something like what a regular browser would send.

For example, Perl's LWP::UserAgent sets the agent string by default to "libwww-perl/#.#". Google apparently doesn't like that and will not return results to you. Setting it to something like "Mozilla/5.0 (X11; U; NetBSD i386; en-US; rv:" would work.
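
As a rough illustration of the same idea outside Perl, here is a minimal Python sketch using the standard library's urllib.request, which, much like LWP, identifies itself with a library-style agent string ("Python-urllib/#.#") unless told otherwise. The agent string, the helper name build_request, and the search URL are placeholders of mine, not values taken from the above; whether Google accepts any particular string is something you would have to try for yourself.

```python
import urllib.request

# Placeholder browser-style agent string (an assumption for illustration,
# not the string quoted in the post).
BROWSER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, agent=BROWSER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": agent})


# Hypothetical search URL; building the Request does not contact the network.
req = build_request("https://www.google.com/search?q=netbsd")

# urllib stores header names in capitalized form, hence "User-agent" here.
print(req.get_header("User-agent"))

# To actually send the request:
#   body = urllib.request.urlopen(req).read()
```

The explicit header replaces the library default for that one request; every other request you build needs the same treatment.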

Interestingly, Google appears to whitelist agents rather than blacklist them. That is, an arbitrary string as the user agent will not work either. This means that they are effectively saying "we will only serve content to agents that we know (and approve) of", banning any client they simply might not know of.

One would assume that they put this whitelist in place to prevent certain abuses. Then again, anybody developing tools that lend themselves to abuse could trivially reset the agent string to something Google accepts, so the win seems negligible to me, while the cost of banning unknown agents seems high.

December 3, 2007

