I’m tired from trying to get my Digital Video Recorder (Canon Optura 200MC) to actually make a video on my computer. If only it were as easy as they made it look when I bought it. It’s certainly not as easy as setting up robots.txt. Or is it?
As a consultant I charge by the hour. I charge by the hour because I know that nothing, no matter how simple it might seem to be, takes a mere half hour. Not a DVR hookup, not an SFTP connection, not a blog post, and not any of the SEO work I usually have to do for clients. When describing the steps of making a digital video, the Canon Optura manuals and website repeatedly state that things may not work properly due to the configuration of your software and computer. I know that is just another way to say “nothing takes a half hour”.
Anyway, robots.txt is a classic example. I can honestly say that I encounter an improperly configured robots.txt more often than I encounter a missing robots.txt. That doesn’t reflect on the web as a whole (where I suspect the majority of websites still have no robots.txt file) but it is telling of the SEO consulting world. It is so easy to do it wrong, and the people doing it wrong are the ones looking for advanced webmaster assistance. Is it better to have none or to have a bad one?
Robots.txt is not as simple as it seems. Google interprets robots.txt differently than Yahoo! does. Which is right for you? That depends. What about MSN and Ask? Each is slightly different from my observation, although not as different as Google appears to be from Yahoo!. Do we exclude SSL-enabled portions of a website, and if so, how? Well, that depends on your goals, because excluding SSL-enabled sections of a domain via robots.txt may not actually keep your content from getting (1) spidered, (2) indexed, or (3) displayed in the search results that search engines return to users.
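For illustration, a minimal robots.txt attempting to wall off a secure section might look like the sketch below. The /secure/ and /checkout/ paths are hypothetical, not a standard, and whether an engine honors per-bot sections this way is exactly the kind of interpretation difference I’m talking about:

```
# Applies to all compliant crawlers
User-agent: *
Disallow: /secure/

# A more specific section; most engines let the most
# specific matching User-agent block win for that bot
User-agent: Googlebot
Disallow: /secure/
Disallow: /checkout/
```

One wrinkle worth knowing: robots.txt is fetched per host (and, for some crawlers, per protocol and port), so if your SSL section lives at https://example.com it may need its own copy served at https://example.com/robots.txt rather than inheriting the one on the http side.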
And that is where the meta tag myths come into play. Most recently I read Danny Sullivan’s comment that robots.txt and the INDEX and FOLLOW meta tag directives are expected by webmasters to be redundant, and so should function equivalently for the search engines. Danny is a consummate webmaster investigative reporter, and very good at pursuing research from the webmaster perspective. But in my opinion he is not driven by the practical reality of how things work as much as the desire to understand and clarify how things work. I think that is where he differs from an SEO. As an SEO, I know that the word “index” has a different meaning to Yahoo! than to Google. And I also expect that each of the search engines can change their working definition of “to index” at any time. Will Danny actually influence them to change it, by asking the good questions and putting all of the pieces together, making a case for some standardization? Yes, maybe he can do that. In the meantime, though, we need to be able to work with it *and* be ready to adjust should those changes be made. It seems to me robots.txt and the FOLLOW, ARCHIVE, and INDEX meta tag values are not redundant but complementary.
Yahoo! seems to consider indexing to be analyzing the content and adding it into the massive framework of overlapping semantic sets and links that is TheIndex. So Yahoo! will respond to a noindex directive by not including the content in TheIndex. Will it still read that content? Sure. Will it show that content in the SERPs? Sure. The content simply won’t have any influence in the index. Not what you expected, right?
Well it’s not so bad. If you don’t want the content read by Yahoo!, exclude it via robots.txt. Yahoo! won’t access it. If you don’t want it shown in the SERPs, also assign a NOARCHIVE value to the robots meta tag. If you don’t want it analyzed (for links or semantics) assign NOINDEX. And if you don’t want a spider crawling your domain to include the links on the page as links from your domain, use NOFOLLOW. The only redundancy in there is possibly a logical overlap between robots exclusion and the on-page meta tag values. If the page is excluded, why would the page even get read? Well, because if Yahoo! ever does read your page due to some issue with robots.txt, the page will contain the desired directives. In my experience, there are often “blips” in the effectiveness of robots.txt, which, by the way, is a voluntary standard with which the search engines do not have to comply.
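As a sketch, that belt-and-suspenders approach looks like this in the page’s head (this is the standard robots meta tag syntax; which directives you combine depends on your goals for the page):

```
<head>
  <!-- Backstop for pages already excluded in robots.txt:
       don't analyze it, don't follow its links, don't cache a copy -->
  <meta name="robots" content="noindex, nofollow, noarchive">
</head>
```

The tag only does its job if the spider actually fetches the page, which is precisely why it pairs with (rather than replaces) the robots.txt exclusion: it covers the “blip” case where the exclusion fails.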
Google will usually comply with robots.txt but in my experience will use any reason available to ignore it. If even the slightest irregularity exists, robots.txt is considered invalid and ignored. In some cases there may be a gray area of interpretation (such as the use of back slashes or wildcard characters) and you should carefully follow Google’s interpretation of those issues. So if Google finds a valid robots.txt that excludes a file, it won’t access that file.
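You can watch one interpretation of the rules in action with Python’s standard-library robots.txt parser. To be clear, this is how *Python’s* parser reads the file, not how Googlebot does; the paths are hypothetical. But it makes the prefix-matching behavior concrete:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: one rule, applied to all user agents.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Disallow matches by path prefix, so anything under /private/ is blocked...
print(parser.can_fetch("Googlebot", "https://example.com/private/report.html"))  # False
# ...while everything else remains fetchable.
print(parser.can_fetch("Googlebot", "https://example.com/public/page.html"))  # True
```

Different crawlers layer their own extensions (wildcards, Allow precedence) on top of this basic prefix matching, which is where the gray areas come from.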
On the other hand, Google will list a page in the SERPs even if it has been excluded by robots.txt, if the page has been externally referenced “well enough”. So if you exclude it, but someone else links to it, your excluded page may show up anyway. Google says it’s because Google knows about the page and wants to tell people about it, even if you don’t. Fair enough, but what about on-page meta tags? Google will respect them if it encounters them. Again I would still put in the NOINDEX/NOARCHIVE/NOFOLLOW or whatever for those rare times that robots.txt is “malfunctioning”. Oh, and remember what I said about Google looking very closely and ignoring robots.txt if there is even a hint of irregularity in it? Ditto for the meta tags. But in the case of robots.txt, the logic of wildcard characters and slashes had one general solution, which is what Google follows. With meta tags, there is a conflict. Google can’t just pick a general solution to a gray area that may exist with meta tags, so it offers a unique GOOGLEBOT directive for you to use to explicitly tell the Google spider how to behave. Whew.
Let’s summarize that. Robots.txt must be perfect, and even then should not be relied upon completely. Meta tags to control FOLLOW, INDEX, and ARCHIVE should be used on all pages, even those that are excluded by robots.txt. The definition of INDEX is different at different search engines, and should be considered carefully. There are also terms like listing and referencing involved. In cases where it is uncertain how Google will interpret a meta tag, use the GOOGLEBOT meta tag *in addition* to the ROBOTS meta tag, to tell the Google spider exactly what you want it to do according to Google’s explanation of its own interpretation of INDEX, FOLLOW, and ARCHIVE directives. Which may change at any time.
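Put together, a page covered both ways might carry something like the following. The googlebot meta name is the Google-specific directive mentioned above; which directive values you use is, as always, up to your goals:

```
<head>
  <!-- Generic directive for all spiders -->
  <meta name="robots" content="noindex, nofollow, noarchive">
  <!-- Google-specific directive, spelling out the same intent
       in the terms Google says its spider understands -->
  <meta name="googlebot" content="noindex, nofollow, noarchive">
</head>
```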
Note also that above I stated that Google may choose to show your page in the SERPs (via a listing), even though you asked via robots.txt that your page be excluded. That is important. Will Google someday also generate a snippet for your excluded page, based not on your page content but on what other people say about it? I think someday they will. In fact, should Google need to add some incentive to the mix, they could simply say that unless you let them read it, they have no choice but to use the (inferior?) snippets from those rumors about your page. Look at how they used ODP data instead of titles or descriptions. Better to tell Google, eh? Maybe better to sign up for Google Sitemaps and tell Google yourself? They may do that.
Meta tags do matter, and can be very important. Robots.txt is not easy to get right. This post took more than a half hour, and I didn’t even make it instructional. Now I’ll see if I can make an instructional video for you… on my Canon Optura 200MC.