John Andrews is a Competitive Webmaster and Search Engine Optimization Consultant in Seattle, Washington. This is John Andrews’ blog on issues of interest to the SEO community and competitive webmasters.

johnon.com  Competitive Web & SEO

SEO as International Minutia Dealer

I spend a lot of time on the minutia of web publishing. I also work for International clients. Since most of what I do is minutia for International clients, I recently referred to myself as an “International Minutia Dealer” and the poor lady next to me in the group acted shocked and said “I think we have enough war in the world, thank you very much”. Whatever.

So one of the minutia I deal with is the trailing slash. When dealing with frameworks, front controllers, and other-than-Apache web servers, it can be tough to get absolute control over trailing slash minutia. If you don’t know what I mean, consider this SEO quiz:

Q: How many web resources are represented by the following list of URLs, according to Google?

  1. http://www.example.com/
  2. http://www.example.com
  3. http://example.com/
  4. http://example.com
  5. www.example.com
  6. example.com
  7. www.example.com/
  8. example.com/
  9. www.example.com/index.html (or index.php or index.asp, your web server’s default)
  10. example.com/index.html
  11. http://www.example.com/index.html
  12. http://example.com/index.html
  13. http://www.example.com/index (your web server default is still index.html)
  14. http://www.example.com/index/
  15. http://example.com/index/
  16. http://example.com/index
  17. www.example.com/index/
  18. example.com/index

See what I mean about minutia? That’s 18 versions so far, and I left out some biggies. So what’s the answer? How many of those are unique URLs according to Google? And if we had another search engine today, what would the answer be for that search engine?
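To make the quiz concrete, here is a minimal Python sketch that folds the eighteen strings above into one form each. The rules it applies are assumptions chosen for illustration, not anything Google has published: a missing scheme is treated as http://, www and non-www are folded onto a hypothetical preferred host, and a trailing default document (index.html, index.php, index.asp) is dropped. The names `normalize`, `DEFAULT_DOCS`, and `QUIZ` are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed server default documents; a real server may use others.
DEFAULT_DOCS = {"index.html", "index.php", "index.asp"}

QUIZ = [
    "http://www.example.com/", "http://www.example.com",
    "http://example.com/", "http://example.com",
    "www.example.com", "example.com",
    "www.example.com/", "example.com/",
    "www.example.com/index.html", "example.com/index.html",
    "http://www.example.com/index.html", "http://example.com/index.html",
    "http://www.example.com/index", "http://www.example.com/index/",
    "http://example.com/index/", "http://example.com/index",
    "www.example.com/index/", "example.com/index",
]


def strip_www(host):
    """Return the host without a leading 'www.' label."""
    return host[4:] if host.startswith("www.") else host


def normalize(url, preferred_host="www.example.com"):
    """Fold a URL string into one form under the assumed rules above."""
    if "://" not in url:
        url = "http://" + url  # assumption: a bare host means http://
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Assumption: www and non-www serve the same site.
    if strip_www(host) == strip_www(preferred_host):
        host = preferred_host
    path = parts.path or "/"
    head, _, tail = path.rpartition("/")
    if tail in DEFAULT_DOCS:
        path = head + "/"  # drop the default document, keep the directory
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


distinct = {normalize(u) for u in QUIZ}
print(len(distinct), sorted(distinct))
```

Under these assumed rules the first twelve collapse into one form, while /index and /index/ stay separate, which is exactly the trailing slash question. A different rule set would give a different count, and that is the point of the quiz.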

Let’s put this into a little context. What if you have a page at www.example.com/index.html and it is very, very popular? It has 2 million backlinks to that exact URL, and 2 million more to www.example.com, half with a trailing slash (www.example.com/) and half without (www.example.com). Your boss implements a new site, and your job is to migrate the site to the new server. Oh, and the new server uses a front controlling system such that all URLs will look like www.example.com/index/. You have been reminded that “you were not hired for your web development or web design skillz, so stay out of that kitchen but because you know SEO make sure they don’t screw up and we don’t lose any rank, ok? It’s your only responsibility, so make sure it’s done right.” Nuff said.

That simple example of minutia drives an industry of highly-prized SEO consultants that work diligently while regular SEO “consultants” argue about whether SEO or PPC is the new sliced bread of marketing, or whether or not SEO is a Ron Popeil “set it and forget it” task.

Matt Cutts almost addressed this issue (from a “Google perspective”) last year on his blog. He was trying to clean up some of the mess around Google’s handling of redirects and Google’s use of the word “canonicalization”.

Those of us who suffered through courses on Linear Systems Control Theory in college know that a canonical form is an arrangement of a system such that you represent it with the fewest parts (yet it is fully represented). Examples of canonicalization are everywhere, even if unlabeled. Here’s one:

The car has wheels and wheels have wheel covers. If you need to draw (represent) a car, you need to draw it with circles for wheels, because if you drew the car with no wheels people would not say it was a car, but would probably say “it looks like a car with no wheels”, or “it looks like a car that has no wheels”, etc. Draw the wheels and everyone will say “it’s a car”. Did you need to draw the wheel covers? No. The canonical form of that representation (in this very specific example) could be the car body plus circles for wheels.

Google is staffed by a bunch o’ engineers who have probably all taken control systems theory or higher math, so when they wanted to label the idea of “how do we identify the web site resource without all the extra redundancy that might be present in default file names, extensions, meaningless subdomains like www, and trailing slashes”, they probably started simplifying with lingo like “what’s the canonical root”. Geek talk.

Of course my example is a physical one for the non-engineers. Systems theory is not about physical parts like cars and wheels but mathematical equations and representations, which can be mixed and blended as needed to come up with different forms (such as canonical forms). There are actually many kinds of canonical forms. Go figure.

By the way, “canonical” is also sometimes defined as “according to the rules” (or canon), but since in this case there are no rules to follow, and the Google people were clearly trying to “figure this out” for a best way, I doubt that’s the source of the use.

Anyway Matt said this about the trailing slash:

Q: What is a canonical url? Do you have to use such a weird word, anyway?
A: Sorry that it’s a strange word; that’s what we call it around Google. Canonicalization is the process of picking the best url when there are several choices, and it usually refers to home pages. For example, most people would consider these the same urls:

    * www.example.com
    * example.com/
    * www.example.com/index.html
    * example.com/home.asp

But technically all of these urls are different. A web server could return completely different content for all the urls above. When Google “canonicalizes” a url, we try to pick the url that seems like the best representative from that set.

Q: So how do I make sure that Google picks the url that I want?
A: One thing that helps is to pick the url that you want and use that url consistently across your entire site. For example, don’t make half of your links go to http://example.com/ and the other half go to http://www.example.com/ . Instead, pick the url you prefer and always use that format for your internal links.
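The advice about consistent internal links can be checked mechanically. The following is a rough Python sketch, not a tool Google or Matt provides: it collects the `href` values from a page and counts how often each host form of the site appears. `LinkCollector`, `host_usage`, and the `site` parameter are hypothetical names for this example, and only absolute links are counted (relative links inherit whatever host the page was served from).

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def host_usage(html, site="example.com"):
    """Count absolute internal links by host form (www vs. non-www)."""
    collector = LinkCollector()
    collector.feed(html)
    counts = Counter()
    for href in collector.hrefs:
        host = urlsplit(href).netloc.lower()
        if host in (site, "www." + site):
            counts[host] += 1
    return counts


page = (
    '<a href="http://example.com/">home</a>'
    '<a href="http://www.example.com/">home again</a>'
    '<a href="http://www.example.com/about">about</a>'
    '<a href="/contact">contact</a>'
)
print(host_usage(page))
```

If the counter shows more than one host form, the page is mixing formats in its own internal links.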

A commenter apparently is also a Minutia Dealer because he followed up with this good question:

Thanks Matt for the continued explanations and advice about this stuff. I have been reading up on Canonical issues for a while (suffering from one myself due to not knowing about them before hand and not using 301 protection), I have set up a 301 and on server name resolution so that all requests for the main index page go to www.theurl.com/ (the trailing slash is always added anyway).

Google still shows www.theurl.com/ and www.theurl.com/index.php in the serps and is docking my PR due to it. Will the 301 be picked up by the main googlebot and remove the index.php reference from the results in due course?

Also I can’t fathom out why this sort of thing isn’t under the webmaster’s control? If I know that the result www.theurl.com/index.php is WRONG then there should be a system to remove JUST that reference? Is this impossible?

As far as I know that follow-up question remains unanswered, but that’s not surprising to me. Google has incorporated some automated “canonicalization” checkers which are pre-programmed to handle these minutia according to “the Google Algorithm”. Matt suggested that the examples are “technically different” but also says Google tries to pick the right ones to represent the web resource. More recently, Google people (was it Matt again?) have said that Google is “pretty good at that stuff” when discussing this very issue (I have to re-locate that reference.. don’t have it handy because it just doesn’t matter to me). Of course they are. But that’s not the question. The question is, what does Google do?

The commenter followed the rules and got stuck - he’s got duplicate entries in the index for the same web resource, due to the way Google spidered and indexed his site. He can’t remove the bad one. He can’t fix the problem.

As a professional SEO (International Minutia Dealer) I want to exercise 100% absolute control over how Google spiders, indexes, and serves up my content. I don’t want to “try and see”, and I don’t want to “find out” on a live site. And when Google changes The Algo, I want Google to change it correctly, not just from TheOldGoogleWay to TheNewGoogleWay. I am all in favor of Google getting better over time, but very much against Google just getting “different” over time. I will strive to be TheBest, and get all of my minutia in a row, and I want Google to evolve into TheBest and reward my orderly, technically-correct minutia with error-free, predictable spidering, indexing, and serving in the SERPs. Is that too much to ask?

Of course I recognize the advances Google has made with Webmaster Console (cue Vanessa?). Yes I know there is now a “www or non-www” option in Google sitemaps. That’s not the answer, however. The questions are there even when there are only a few, relatively advanced people asking them. Those questions should be answered. It is not enough to answer the most basic ones once the majority of people are encountering them (think www vs. non-www).

So about that SEO Quiz… what’s the right answer? The first twelve are almost always the same resource, although they do not have to be. Does Google assume they are? Today? The next six are a bit odd, but today’s frameworks make them more common and less unusual. Are they unique according to Google today? Will they be tomorrow? Does anybody know? Can anybody know?

It seems Google figured out the domain fairly well, so whether or not a link includes http:// is a non-issue. But https:// and http:// are treated as different, as they should be, *unless* you mix them yourself, and then I suspect https:// is fair game for indexing even if there is a robots exclusion. Google does a fair job of picking www or non-www from the way people link to you, the way you use it yourself for internal linking, and your preferences if you use Webmaster Console (in that order if you ask me; in reverse order based on my read of Google representatives’ suggestions). They still get it wrong sometimes. I still strongly recommend a hard 301 to your preferred default, and a very close eye on your in-linking.
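As a rough illustration of what a “hard 301 to your preferred default” has to decide, here is a small Python sketch of the redirect logic only, independent of any web server. It assumes the server hosts a single site, that www.example.com is the preferred host, and that index.html and index.php are the default documents; `PREFERRED_HOST`, `DEFAULT_DOCS`, and `redirect_target` are hypothetical names for this example.

```python
PREFERRED_HOST = "www.example.com"          # assumption: the host you chose
DEFAULT_DOCS = ("index.html", "index.php")  # assumption: server defaults


def redirect_target(host, path):
    """Return the URL to answer with a 301, or None if already canonical.

    Assumes this server hosts one site only, so any other host name
    (for example the non-www form) is folded onto PREFERRED_HOST.
    """
    canonical_path = path or "/"
    for doc in DEFAULT_DOCS:
        if canonical_path.endswith("/" + doc):
            # Strip the default document but keep the directory slash.
            canonical_path = canonical_path[: -len(doc)]
            break
    if host.lower() == PREFERRED_HOST and canonical_path == path:
        return None
    return "http://" + PREFERRED_HOST + canonical_path


for host, path in [
    ("example.com", "/"),
    ("www.example.com", "/index.php"),
    ("www.example.com", "/"),
]:
    print(host, path, "->", redirect_target(host, path))
```

Note that this sketch deliberately leaves /store and /store/ alone. Choosing between them is a separate policy decision, and it is the harder question.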

What about the harder question of trailing slashes on deep resources? It’s a toss-up. Keep in mind you wield some influence over Google by the way you self-link and the way others link to you, but Matt Cutts said he thinks people commonly type in domains with trailing slashes, so I still worry. I suspect there are more important issues on Google’s burners right now, and competitive SEO types will continue to build test sites and learn for themselves how TheGoogleAlgorithm works, for themselves and for their clients.


29 Responses to “SEO as International Minutia Dealer”

  1. Lea de Groot Says:

    Interesting - I hadn’t considered www.example.com vs www.example.com/ as an issue as I generally work on Apache and it comes ‘free’ so to speak, but, yes, its obviously another one to check in a new situation.
    Thanks!
    I note that canonicalisation is obviously becoming a more visible problem when packages like wordpress have extensions to make sure each post is only ever shown from the ‘correct’ URI :)

  2. links for 2007-03-21 » mhinze.com Says:

    […] johnon.com - John Andrews - » SEO as International Minutia Dealer (tags: seo canonicalization) […]

  3. You Say Index.html, I Say Trailing Slash Says:

    […] Write down which of the above qualify as a web resource. No, don’t click yet to see his answer, write down first. Done, then go read the article to find John Andrews answer. […]

  4. |► Il Rank di un sito si basa su una "/" ? Says:

    […] Does a site’s rank depend on a “/”? I’m asking anyone who knows English better than I do to explain exactly what this is. Reading it, I translated a little, and it seems to be about how a “/” can make a website lose rank, but I wouldn’t want to be talking nonsense […]

  5. This Week In SEO - 3/23/07 - TheVanBlog Says:

    […] SEO as International Minutia Dealer […]

  6. Matt Cutts: Gadgets, Google, and SEO » Canonicalization update Says:

    […] It’s almost not worth mentioning, but I know one website noticed this, so I’ll talk about it. Last week there was an update to how we canonicalize a small number of urls. What is “canonicalization” again? Read this previous post, or see this post by John Andrews to see all the ways that you can have the same content on urls that are technically different. Some people ask “Why don’t you just assume www.example.com and example.com are the same?” The answer is that they don’t have to be, and for some websites they are different. For example, http://phpicalendar.net/ is a different page than http://www.phpicalendar.net/. This happens more often than you might think; FindWhat has different www vs. non-www pages, for example. […]

  7. john andrews Says:

    Update: Matt Cutts notes that “Last week there was an update to how we canonicalize a small number of urls.” He notes it’s a very minor issue for most people, but if you’ve been experimenting with trailing slashes and other “minutia” you might want to consider that something may have changed last week, because in experimentation land, such little changes usually make you start over :-)

    Matt also re-emphasizes how it is smart to set a standard for within-site use of the trailing slash and stick with it.

    See http://www.mattcutts.com/blog/canonicalization-update/

  8. mrg Says:

    19. index.htm

    I have seen examples of different pages at index.html and index.htm with different pr’s.

  9. Nirupam Roy Says:

    Hi

    After going through all the posts on canonicalization, i have only one thing to ask i.e. “Isn’t Google SMART enough to identify the right version of an url on its own ??”
    Being a Search Giant, everyone would expect Google to be clever enough to understand the right form of an url. Isn’t that so ???

    Regards
    Nirupam

  10. john andrews Says:

    @Nirupam: I think you misunderstand. There are many valid ways to place a web resource. What is ‘the right version”? If you want to know who should be doing things the right way, it is the web master.

  11. This Week In SEO - Business Online Blog Says:

    […] SEO as International Minutia Dealer […]

  12. Brookooly » links for 2007-03-27 Says:

    […] John Andrews - » SEO as International Minutia Dealer On canonical URLs. (tags: seo url canonique duplicate-content) […]

  13. Nirupam Roy Says:

    John

    I understood the problem. But my question is “how many webmasters are aware of canonicalization”? thatz why i said it would have been better had google do it on its own. I am very much sure they are aware of which url’s to be considered in search engine listings and which one should not.

    Regards
    Nirupam

  14. Canonicalization update · djanggo.com Says:

    […] It’s almost not worth mentioning, but I know one website noticed this, so I’ll talk about it. Last week there was an update to how we canonicalize a small number of urls. What is “canonicalization” again? Read this previous post, or see this post by John Andrews to see all the ways that you can have the same content on urls that are technically different. Some people ask “Why don’t you just assume www.example.com and example.com are the same?” The answer is that they don’t have to be, and for some websites they are different. For example, http://phpicalendar.net/ is a different page than http://www.phpicalendar.net/. This happens more often than you might think; FindWhat has different www vs. non-www pages, for example. […]

  15. Arnab Ganguly Says:

    As far as I understand, the canonical problem is an ongoing problem. Even Google, in its webmaster tools (i.e. the sitemap), has introduced the concept of making one url the main one, but still I would prefer to ask Google: are they themselves not aware which of them is most important to them? Why leave it to people who probably know little to nothing about canonical problems?

  16. johnon.com - John Andrews - » Advanced SEO, Apache Bug, and Google Says:

    […] Continuing on the concept of SEOs Dealing in Minutia, I am also marveling at how this small problem with Apache has gone un-noticed for so long in SEO world. […]

  17. |► Il Rank di un sito si basa su una "/" ? Says:

    […] Does a site’s rank depend on a "/"? I’m asking anyone who knows English better than I do to explain exactly what this is. Reading it, I translated a little, and it seems to be about how a "/" can make a website lose rank, but I wouldn’t want to be talking nonsense. […]

  18. Ubuntu Daily Says:

    Fortunately Wordpress and other software are programmed so that it does the right thing out of the box.

  19. Artem Says:

    Blogger and Wordpress have not this problem. But I can’t understand, how could be so that one page have the different Page Rank. For example w w w .site.com have PR5, but w w w.site.com/index.html have PR4?

  20. Arnab Ganguly Says:

    I hope that with the introduction of Google Webmaster Tools things will be easier, since you can actually help Google identify the right type of url for a site.

  21. Better Search Engine Rankings Says:

    The 301 redirect works very well, it also retains the history, and the back-links. What I have realized though, is that it takes a good six months for your rankings to get back, and for all of your site to get indexed.

  22. Tim Fuchs Says:

    @artem

    Every single url gets its own pagerank from google. If you don’t want to have different pageranks for those two you mentioned, just link to / instead of /index.html, the webserver software will pick up the index.html automatically.

    John replies: Yes, they are unique pages and that is why it is a problem. While you may link to just one of them, that does not solve the issue as others can link to the other version, causing both to get indexed.

  23. Donetsk Says:

    This may have been one of my biggest (most annoying) problems, whenever I changed shopping cart software it would reformat my url from say /store to /store/ or /store/index.php or /store/home.php HOW FRUSTRATING! Currently with static sites I use a couple of mod rewrites to change the htaccess moving everything to www.mysite.com
    Thanks for the post & sharing my pain lol.
    Regards, Don.

  24. Bree Falk Says:

    I didn’t know there were so many different types of web address. My blog has the trailing “/” but I think that all wordpress blogs have this. I don’t know if there is a way to get rid of it permanently. Would be nice to just have the normal site address.

  25. » Favorite SEO Blog Posts - John Andrews - johnon.com Says:

    […] SEO as International Minutia Dealer […]

  26. » UpperLeftPlacement.com and Snarky SEO Blogs - John Andrews - johnon.com Says:

    […] SEO as International Minutia Dealer […]

  27. Empresas Says:

    There are some update about this.? What better for SEO purposes? finish with slash or extension .html. Thanks.

  28. Zola Says:

    Recently I moved my blog from www to non www version. I found drastically downfall in images traffic from google. As well as down Pr3 to Pr2.

    @Zola: there are many factors involved… perhaps you had incoming links to www before, etc. 

  29. spammer Says:

    I definitely go for www.example.com… But I see more and more successful blogs without the “www”… I guess it is just for the readers… the URL seems shorter and easier to remember.

    However, I’m still wondering if Google sees www.example.com and www.example.com/ as 2 different websites…
