One customer of mine has been trying to get his content listed in the Google News index (as distinct from Google’s main web index).
Google’s general guidelines for news publishers include a collection of Technical Requirements for Article URLs.
One of those requirements strikes me as just odd:
Display a three-digit number. The URL for each article must contain a unique number consisting of at least three digits. For example, we can’t crawl an article with this URL: http://www.google.com/news/article23.html. We can, however, crawl an article with this URL: http://www.google.com/news/article234.html. Keep in mind that if the only number in the article consists of an isolated four-digit number that resembles a year, such as http://www.google.com/news/article2006.html, we won’t be able to crawl it. Please note, this rule is waived with News sitemaps.
What the heck is that for? As long as each article has its own unique URL, why must it have a unique number as part of the URL? It sounds like a pointless hoop that news sites are forced to jump through simply to demonstrate that they are “big” enough to do so. I would be grateful to anyone who can provide a definitive answer as to what purpose this requirement serves.
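Read literally, the rule can be sketched as a small check. This is only one interpretation of the quoted wording, not Google's actual crawler logic: the function name `has_qualifying_number`, the decision to look only at the URL path, and the 1900-2099 range used to decide what "resembles a year" are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse


def has_qualifying_number(url: str) -> bool:
    """Return True if the URL path contains a run of 3+ digits that is
    not just a four-digit, year-like number (assumed range: 1900-2099)."""
    path = urlparse(url).path
    for run in re.findall(r"\d+", path):
        if len(run) < 3:
            continue  # too short to count
        if len(run) == 4 and 1900 <= int(run) <= 2099:
            continue  # looks like a year, so it does not count
        return True
    return False


# The three example URLs from the quoted guideline.
print(has_qualifying_number("http://www.google.com/news/article23.html"))
print(has_qualifying_number("http://www.google.com/news/article234.html"))
print(has_qualifying_number("http://www.google.com/news/article2006.html"))
```

Under these assumptions, only the second of the three example URLs passes, which matches what the guideline says about them.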
Anyway, rather than change the entire CMS, I opted to develop a Google News sitemap, a creature that uses a different format from the standard sitemap protocol. Google Webmaster Tools confirms that the sitemap is being spidered and is error-free.
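For readers who have not seen one, a news sitemap is a standard sitemap with an extra `news:` block per URL. The sketch below builds one with Python's standard library. It is not the code used for this customer: the element names follow the News sitemap schema as it is publicly documented now, which may differ from the format in use when this was written, and `build_news_sitemap`, the dictionary keys, and the example values are hypothetical.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
NEWS = "http://www.google.com/schemas/sitemap-news/0.9"

# Serialize the sitemap namespace as the default and the news one as "news:".
ET.register_namespace("", SM)
ET.register_namespace("news", NEWS)


def build_news_sitemap(articles, publication, language="en"):
    """Build a news sitemap string from dicts with 'loc', 'date', 'title'."""
    urlset = ET.Element(f"{{{SM}}}urlset")
    for article in articles:
        url = ET.SubElement(urlset, f"{{{SM}}}url")
        ET.SubElement(url, f"{{{SM}}}loc").text = article["loc"]
        news = ET.SubElement(url, f"{{{NEWS}}}news")
        pub = ET.SubElement(news, f"{{{NEWS}}}publication")
        ET.SubElement(pub, f"{{{NEWS}}}name").text = publication
        ET.SubElement(pub, f"{{{NEWS}}}language").text = language
        ET.SubElement(news, f"{{{NEWS}}}publication_date").text = article["date"]
        ET.SubElement(news, f"{{{NEWS}}}title").text = article["title"]
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical article; note the URL carries no three-digit number.
example = build_news_sitemap(
    [{"loc": "http://example.com/news/a-story.html",
      "date": "2009-04-13",
      "title": "A story"}],
    publication="Example Times",
)
print(example)
```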
Google representatives confirmed to me by email on several occasions that the three-digit URL requirement is waived for sites that submit a news sitemap, and the spider is still successfully fetching and parsing the news sitemap file. Even so, the customer has yet to see any content show up in Google News.
It occurred to me later that since Google will probably only spider and add new content to the news index (as opposed to the standard index), we could get away with changing the scheme for new URLs only. So we bit the bullet and changed the CMS to add the digits.
It will be interesting to see if it makes a difference. If so, it undermines Google’s claim that the URL rule is waived for sites implementing a news sitemap.
2009-04-13 Update: Content is now appearing in Google News. Apparently, the three-digit rule waiver is unreliable.
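One way such a change might look, as a rough sketch rather than the actual CMS modification: append the article's numeric ID to the slug, zero-padded to five digits so the number is always at least three digits long and never a bare four-digit value that could be mistaken for a year. The function name `article_url_path`, the `/news/` prefix, and the idea that the CMS exposes an integer article ID are assumptions.

```python
def article_url_path(slug: str, article_id: int) -> str:
    """Build a URL path for a new article, ending in a padded unique ID."""
    if article_id < 0:
        raise ValueError("article_id must be non-negative")
    # Five-digit padding: ID 23 becomes 00023, ID 2006 becomes 02006.
    return f"/news/{slug}-{article_id:05d}.html"


print(article_url_path("city-budget", 23))    # /news/city-budget-00023.html
print(article_url_path("city-budget", 2006))  # /news/city-budget-02006.html
```

Existing articles keep their old URLs under this scheme; only content published after the change picks up the number.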