I spend a lot of time on the minutia of web publishing. I also work for International clients. Since most of what I do is minutia for International clients, I recently referred to myself as an “International Minutia Dealer” and the poor lady next to me in the group acted shocked and said “I think we have enough war in the world, thank you very much”. Whatever.
So one of the minutia I deal with is the trailing slash. When dealing with frameworks, front controllers, and other-than-Apache web servers, it can be tough to get absolute control over trailing slash minutia. If you don’t know what I mean, consider this SEO quiz:
Q: How many web resources are represented by the following list of URLs, according to Google?
- www.example.com/index.html (or index.php or index.asp, your web server’s default)
- http://www.example.com/index (your web server default is still index.html)
See what I mean about minutia? That’s 18 versions so far, and I left out some biggies. So what’s the answer? How many of those are unique URLs according to Google? And if we had another search engine today, what would the answer be for that search engine?
How about putting this into a little context. What if you have a page at www.example.com/index.html and it is very, very popular. It has 2 million back links to that exact URL, and 2 million more to www.example.com, half with a trailing slash (www.example.com/) and half without (www.example.com). Your boss implements a new site, and your job is to migrate the site to the new server. Oh, and the new server uses a front controlling system such that all urls will look like www.example.com/index/. You have been reminded that “you were not hired for your web development or web design skillz, so stay out of that kitchen but because you know SEO make sure they don’t screw up and we don’t lose any rank, ok? It’s your only responsibility, so make sure it’s done right.” Nuff said.
That simple example of minutia drives an industry of highly-prized SEO consultants that work diligently while regular SEO “consultants” argue about whether SEO or PPC is the new sliced bread of marketing, or whether or not SEO is a Ron Popeil “set it and forget it” task.
Matt Cutts almost addressed this issue (from a “Google perspective”) last year on his blog. He was trying to clean up some of the mess around Google’s handling of redirects and Google’s use of the word “canonicalization”.
Those of us who suffered through courses on Linear Systems Control Theory in college know that a canonical form is an arrangement of a system such that you represent it with the fewest parts (yet it is fully represented). Examples of canonicalization are everywhere, even if unlabeled. Here’s one:
The car has wheels and wheels have wheel covers. If you need to draw (represent) a car, you need to draw it with circles for wheels, because if you drew the car with no wheels people would not say it was a car, but would probably say “it looks like a car with no wheels”, or “it looks like a car that has no wheels”, etc. Draw the wheels and everyone will say “it’s a car”. Did you need to draw the wheel covers? No. The canonical form of that representation (in this very specific example) could be the car body plus circles for wheels. Google is staffed by a bunch o’ engineers who have probably all taken control systems theory or higher math and so when they wanted to label the idea of “how do we identify the web site resource without all the extra redundancy that might be present in default file names, extensions, meaningless subdomains like www, and trailing slashes”, they probably started simplifying with lingo like “what’s the canonical root”. Geek talk. Of course my example is a physical one for the non-Engineers. Systems theory is not about physical parts like cars and wheels but mathematical equations and representations, which can be mixed and blended as needed to come up with different forms (such as canonical forms). There are actually many kinds of canonical forms. Go figure.
By the way “canonical” is also sometimes defined as “according to the rules” (or canon), but since in this case there are no rules to follow, and the Google people were clearly trying to “figure this out” for a best way, I doubt that’s the source of the use.
Anyway Matt said this about the trailing slash:
Q: What is a canonical url? Do you have to use such a weird word, anyway?
A: Sorry that it’s a strange word; that’s what we call it around Google. Canonicalization is the process of picking the best url when there are several choices, and it usually refers to home pages. For example, most people would consider these the same urls:
But technically all of these urls are different. A web server could return completely different content for all the urls above. When Google “canonicalizes” a url, we try to pick the url that seems like the best representative from that set.
Q: So how do I make sure that Google picks the url that I want?
A: One thing that helps is to pick the url that you want and use that url consistently across your entire site. For example, don’t make half of your links go to http://example.com/ and the other half go to http://www.example.com/ . Instead, pick the url you prefer and always use that format for your internal links.
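To make "picking the best representative" a little more concrete, here is a rough Python sketch of what a home-grown canonicalizer might look like. It is not Google's algorithm (nobody outside Google knows that). The preferred host, the alias table, and the list of default document names are all assumptions I made up for the example, and the sketch deliberately leaves the extensionless and trailing-slash variants alone, since whether those are "the same" is exactly the open question.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration only: which host we prefer, which other
# host names mean the same site, and what the server's default documents are.
PREFERRED_HOST = "www.example.com"
HOST_ALIASES = {"example.com": PREFERRED_HOST}
DEFAULT_DOCUMENTS = {"index.html", "index.php", "index.asp"}


def canonicalize(url: str) -> str:
    """Map a URL variant to one chosen representative (a sketch, not a standard)."""
    # A bare "www.example.com/..." has no scheme, and urlsplit would
    # otherwise treat the host name as part of the path.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)

    # Host names are case-insensitive; fold known aliases onto the preferred host.
    host = parts.netloc.lower()
    host = HOST_ALIASES.get(host, host)

    # An empty path and "/" name the same resource.
    path = parts.path or "/"

    # Drop a default document name: /index.html becomes /
    head, _, tail = path.rpartition("/")
    if tail.lower() in DEFAULT_DOCUMENTS:
        path = head + "/"

    # Keep the query string, drop any fragment.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


variants = [
    "www.example.com",
    "example.com/",
    "http://www.example.com/index.html",
    "http://www.example.com/index",
    "http://www.example.com/index/",
]
print(len({canonicalize(u) for u in variants}))  # 3 under these assumptions
```

The first three variants collapse to http://www.example.com/, while /index and /index/ stay distinct. Change the rules and the count changes, which is the whole problem.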
A commenter apparently is also a Minutia Dealer because he followed up with this good question:
Thanks Matt for the continued explanations and advice about this stuff. I have been reading up on Canonical issues for a while (suffering from one myself due to not knowing about them before hand and not using 301 protection), I have set up a 301 and on server name resolution so that all requests for the main index page go to www.theurl.com/ (the trailing slash is always added anyway).
Google still shows www.theurl.com/ and www.theurl.com/index.php in the serps and is docking my PR due to it. Will the 301 be picked up by the main googlebot and remove the index.php reference from the results in due course?
Also I can’t fathom out why this sort of thing isn’t under the webmaster’s control? If I know that the result www.theurl.com/index.php is WRONG then there should be a system to remove JUST that reference? Is this impossible?
As far as I know that follow up question remains unanswered, but that’s not surprising to me. Google has incorporated some automated “canonicalization” checkers which are pre-programmed to handle these minutia according to “the Google Algorithm”. Matt suggested that the examples are “technically different” but also says Google tries to pick the right ones to represent the web resource. More recently, Google people (was it Matt again?) have said that Google is “pretty good at that stuff” when discussing this very issue (I have to re-locate that reference.. don’t have it handy cause it just doesn’t matter to me). Of course they are. But that’s not the question. The question is, what does Google do?
The commenter followed the rules and got stuck – he’s got duplicate entries in the index for the same web resource, due to the way Google spidered and indexed his site. He can’t remove the bad one. He can’t fix the problem.
As a professional SEO (International Minutia Dealer) I want to exercise 100% absolute control over how Google spiders, indexes, and serves up my content. I don’t want to “try and see”, and I don’t want to “find out” on a live site. And when Google changes The Algo, I want Google to change it correctly, not just from TheOldGoogleWay to TheNewGoogleWay. I am all in favor of Google getting better over time, but very much against Google just getting “different” over time. I will strive to be TheBest, and get all of my minutia in a row, and I want Google to evolve into TheBest, reward my orderly, technically-correct minutia with error-free, predictable spidering, indexing, and serving in the SERPs. Is that too much to ask?
Of course I recognize the advances Google has made with Webmaster Console (cue Vanessa?). Yes I know there is now a “www or non-www” option in Google sitemaps. That’s not the answer, however. The questions are there even when there are only a few, relatively advanced people asking them. Those questions should be answered. It is not enough to answer the most basic ones once the majority of people are encountering them (think www vs. non-www).
So about that SEO Quiz… what’s the right answer? The first twelve are almost always the same resource, although they do not have to be. Does Google assume they are? Today? The next six are a bit odd, but today’s frameworks make them more common and less unusual. Are they unique according to Google today? Will they be tomorrow? Does anybody know? Can anybody know?
It seems Google figured out the domain fairly well, so issues of http:// or not are non-issues, but https:// and http:// are different as they should be *unless* you mix them yourself and then I suspect https:// is fair game for indexing even if there is a robots exclusion. Google does a fair job of picking www or non-www from the way people link to you, the way you use it yourself for internal linking, and your preferences if you use webmaster console (in that order if you ask me, in reverse order based on my read of Google representatives’ suggestions). They still get it wrong sometimes. I still strongly recommend a hard 301 to your preferred default, and a very careful eye on your in-linking.
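For the curious, a "hard 301 to your preferred default" can be sketched in a few lines. Most people would do this in the web server config; the version below is just a minimal Python WSGI illustration of the same idea. The names redirect_target and force_canonical, the preferred host, and the default document list are all hypothetical choices for the example, not anything Google or your framework prescribes.

```python
# Assumptions for illustration only: the one host we want indexed and
# the file names the server treats as its default document.
PREFERRED_HOST = "www.example.com"
DEFAULT_DOCUMENTS = {"index.html", "index.php", "index.asp"}


def redirect_target(host, path, scheme="http"):
    """Return the URL to 301 to, or None if the request is already in preferred form."""
    path = path or "/"
    new_path = path

    # Strip a default document name: /index.php becomes /
    head, _, tail = path.rpartition("/")
    if tail.lower() in DEFAULT_DOCUMENTS:
        new_path = head + "/"

    # Any other host (including host:port forms) is sent to the preferred host.
    if host.lower() == PREFERRED_HOST and new_path == path:
        return None
    return scheme + "://" + PREFERRED_HOST + new_path


def force_canonical(app):
    """WSGI middleware: answer non-preferred URLs with a permanent redirect."""
    def wrapper(environ, start_response):
        target = redirect_target(
            environ.get("HTTP_HOST", PREFERRED_HOST),
            environ.get("PATH_INFO", "/"),
            environ.get("wsgi.url_scheme", "http"),
        )
        if target is None:
            # Already the preferred URL: hand off to the real application.
            return app(environ, start_response)

        # Carry the query string across the redirect.
        query = environ.get("QUERY_STRING", "")
        if query:
            target += "?" + query
        start_response(
            "301 Moved Permanently",
            [("Location", target), ("Content-Length", "0")],
        )
        return [b""]
    return wrapper
```

Note that this sketch leaves /index versus /index/ alone. Which of those you redirect to which is a decision you have to make for your own framework, and then link to consistently.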
What about the harder question of trailing slashes on deep resources? It’s a toss-up. Keep in mind you wield some influence over Google by the way you self link and the way others link to you, but Matt Cutts said he thinks people commonly type in domains with trailing slashes so I still worry. I suspect there are more important issues on Google burners right now, and competitive SEO types will continue to build test sites and learn how TheGoogleAlgorithm works, for themselves and for their clients.