Dave, I just read about Google having some sort of sitemap feature that lets you ensure that their spider sees all of your pages. I kinda wonder whether this is a good thing or not. What have you been hearing about it?
I admit, I haven’t had much time to dig into exactly what it is and what it means, but I’m lucky that my colleague Michael Motherwell, head of Australian firm WMS Consulting, shared his own thoughts on this new Google feature. Here’s what he had to say:
The goal of Sitemaps is simple: it lets Google save costs, improve its product, and generate buzz.
As background, the single hardest part of a search engine is the crawl scheduler. Try to imagine a programme that runs permanently and has to schedule six billion documents for crawling: some crawled every five to ten minutes, some once a month, and many in between. Add multiple rules for NOT crawling pages, a to-crawl list that grows haphazardly every second, and thousands of servers doing the work, and, well, my head spins.
The key to better crawling has always been finding all the URLs to crawl faster and more accurately. That is why a site map has always been recommended: a list of every URL gets those URLs into the scheduler sooner. The problem is that we now have eight-million-page sites (Amazon and eBay) and millions of sites overall.
As an example of the problem this causes: if a site has 100,000 pages, how many pages does a search engine have to crawl before it knows about them all? The bottom 10,000 pages of such a site will probably be linked to just once, from a category page that is itself several links deep. That means Google needs to crawl many levels of a site before finding all the pages, and to find the web’s freshest and newest content, like, say, a new printer, Google needs to crawl deep and often. That is a heck of a lot more pages to crawl than the single XML file Sitemaps offers, that is for sure.
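To make that concrete, here is a minimal sketch of what building such a file might look like. The URLs are hypothetical, and the tag names and namespace are my reading of the Sitemaps XML format as Google documents it, so treat this as illustration rather than gospel:

```python
# A minimal sketch of building a Google Sitemaps file in Python.
# URLs are made up; the tag names and the 0.84 namespace are my
# reading of the Sitemaps XML format -- verify against Google's docs.
import xml.etree.ElementTree as ET

NAMESPACE = "http://www.google.com/schemas/sitemap/0.84"  # assumed

def build_sitemap(pages):
    """pages: iterable of (url, lastmod, changefreq) tuples."""
    urlset = ET.Element("urlset", xmlns=NAMESPACE)
    for loc, lastmod, changefreq in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = loc
        ET.SubElement(entry, "lastmod").text = lastmod
        ET.SubElement(entry, "changefreq").text = changefreq
    return ET.ElementTree(urlset)

if __name__ == "__main__":
    sitemap = build_sitemap([
        ("http://www.example.com/", "2005-06-06", "daily"),
        ("http://www.example.com/printers/new-model.html", "2005-06-06", "weekly"),
    ])
    sitemap.write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```

One file like that, fetched once, replaces thousands of page requests whose only purpose was URL discovery.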
The old way was bandwidth intensive and, therefore, costly to both sides. Now, when Google arrives at a site, it has a nice, neat list of all pages and, if people are smart, all NEW pages, so it can start with the new and work down. Sitemaps has the potential to reduce costs for Google and site owners without sacrificing quality or freshness.
How good is that? What this means for site owners is that rather than a top-down crawl starting at the home page, we might see bottom, then top, then middle, e.g. specific products, home page, categories. No more “I have 10,000 pages and only 50 are indexed”. No more robots all over a site constantly. Now sites will have closer to 100% of their pages indexed, new pages indexed sooner, and a fresher refresh date for changing content. Assuming, of course, this all works out…
It also means that rather than Google crawling the 100,000 pages of a large site daily / weekly / monthly, it can crawl less often with more certainty. That, in my book, equals less bandwidth expense per page for Google, less bandwidth expense overall for both sides (the W3C standards page that never changes will get crawled once in a blue moon, as its owners can set its recrawl frequency to “never”) and a fresher, more complete index.
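For reference, the change-frequency values the format accepts are, as I understand it, “always”, “hourly”, “daily”, “weekly”, “monthly”, “yearly” and “never”. Reusing the hypothetical build_sitemap() from the sketch above, that never-changing standards page could be listed like so:

```python
# Reusing the hypothetical build_sitemap() from the earlier sketch.
# A changefreq of "never" tells the crawler this page is essentially
# static, so it can be scheduled once in a blue moon (assumed behaviour).
sitemap = build_sitemap([
    ("http://www.example.com/standards/spec-1999.html", "1999-01-01", "never"),
])
sitemap.write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```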
If sites get this right (and they have every reason to), they can now start to tell Google which pages have changed. Have you had a product recalled? Put it in your “new pages” sitemap and it will get updated sooner. Product no longer sold? Ditto.
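As a sketch of what “telling Google a page changed” might look like in practice, again reusing the hypothetical build_sitemap() helper, the idea is simply to regenerate the entry with a fresh last-modified date:

```python
# Hypothetical: flag a recalled product page by stamping it with
# today's date so the scheduler (we assume) recrawls it sooner.
from datetime import date

def changed_page(url):
    return (url, date.today().isoformat(), "hourly")

sitemap = build_sitemap([
    changed_page("http://www.example.com/products/recalled-widget.html"),
])
sitemap.write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```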
This last point makes for a better Google index, IMHO. Finding changed, deep content and new, fresh content sooner is fantastic all round! I don’t know about anyone else, but if I were Google, being better is something I would want, not least because the PR potential is huge while Gates is crying about Google’s tricks and talking about taking them out of the picture.
As an aside, many moons ago GoogleGuy made a joking comment at WebmasterWorld: “I put a page up 7 seconds ago and Google hasn’t found it – what is going on?” At the time, I thought, “Funny, but how could Google ever get to that stage? To do so would need a recrawl rate of every page every five seconds, or webmaster help.” Well, here is that help! Google should know about a site’s newest pages ASAP, and webmasters can get timely content indexed faster and more easily. Win (Google) – win (website owner) – win (searchers).
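As I read Google’s documentation, you can also actively ping Google when your sitemap changes rather than waiting to be recrawled; the endpoint below is my assumption from those docs, so verify it before relying on it:

```python
# Hypothetical sketch of pinging Google after updating a sitemap.
# The endpoint is my assumption from Google's docs -- verify it.
import urllib.parse
import urllib.request

def ping_google(sitemap_url):
    query = urllib.parse.urlencode({"sitemap": sitemap_url})
    ping = "http://www.google.com/webmasters/sitemaps/ping?" + query
    with urllib.request.urlopen(ping) as response:
        return response.status  # 200 should mean the ping was received

print(ping_google("http://www.example.com/sitemap.xml"))
```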
So, my $0.02 (for what that is worth): no conspiracies, no ulterior motives, just good old-fashioned business logic. “Reduced costs == increased profits, a better product == market leadership and better PR, innovation == keeping Microsoft in third place”. Personally, I wish *I* had an idea for my business that reduced my costs, improved my products AND generated solid PR and plenty of press. I can’t imagine anyone needing any more motivation for an idea than that.
So, rather than look for the cloud behind this silver lining, I reckon we give it a collective go and see what happens. A world in which webmasters and crawlers are partners rather than enemies is good for all. Personally, the only issue I have is that, in true US tradition (sorry, had to chuck that in), this is a unilateral exercise. I just wish, as with the nofollow link attribute, that Google had consulted the W3C and that this were a ratified standard with multi-vendor buy-in.
If you want to learn more about Google Sitemaps, here’s the main What is Google Sitemaps page, and here’s a page that covers Frequently Asked Questions. Hope those help illuminate this interesting topic!