Continuing on the concept of “advanced SEO”, I am today marveling at the collision of the Apache web server and SEO. It is finally happening, and it is about time.
Continuing on the concept of SEOs Dealing in Minutia, I am also marveling at how this small problem with Apache has gone unnoticed for so long in the SEO world.
First, let me say that Beverly Hills is a very nice place to live. Beverly-Hills is just as nice, and Beverly+Hills is even nicer. However, Beverly%20Hills is not quite as nice, although I certainly understand the urlencoding that led to the sub-standard living conditions. To be fair, Matt Cutts has said it’s better than Beverly_Hills, but my SEO senses tell me that, too, is changing. Of course Beverly Hills and beverly Hills are the same, as are beverly Hills, and even the easily-parsed beverlyHills with or without proper Beverlyhills capitalizations. Oh, if only it were so easy. Too bad we can’t all live in easily-parsed locations like BeverlyHills, LaJolla, and NewYorkCity. We wouldn’t need fences. Life would be easy.
But the only reason those places are easily parsed is that they appear as space-separated place names in the corpus of information that is the index. The URL is the anomaly. Google needs the space-separated HTML out there in order to know that /beverlyhills/plasticsurgeons.html is semantically equivalent to /Beverly Hills/plastic surgeons. The first chicken was indeed born of an egg laid by something other than a chicken. Check Darwin’s notes on that.
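For anyone who wants to see those Beverly Hills spellings side by side, here is a minimal sketch using Python’s standard library. The `variants` dictionary and its labels are my own illustration, not anything Google or Apache defines.

```python
from urllib.parse import quote, quote_plus

place = "Beverly Hills"

# Hypothetical labels; each value is one way the space can end up in a URL.
variants = {
    "percent-encoded": quote(place),          # space becomes %20
    "plus-encoded": quote_plus(place),        # space becomes +
    "hyphenated": place.replace(" ", "-"),
    "underscored": place.replace(" ", "_"),
    "run together": place.replace(" ", ""),
}

for label, slug in variants.items():
    print(f"{label:16} /{slug}/")
```

Only the first two are produced by URL encoding itself; the rest are choices a site owner makes when designing slugs.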
But as the advanced SEO positions his search engine friendly URLs in Beverly Hills neighborhoods, he runs into this recognized Apache bug, which reveals that Apache does some unescaping of its own before the rewrite engine even gets the URL:
At the early beginning, when the internal request processing starts, apache unescapes the URL-path once. This is not done by mod_rewrite, this happens before mod_rewrite is involved and I think this is also a part of the security concept.
If you are using your rewrite rules in directory context, you have a filename (a physical path, e.g. /var/www/abc) while the per-dir prefix is stripped (so you’re matching only against the local path ‘abc’ if your rules are stored in /var/www/). How would you map some unescaped URL-path to the file system? There’s no way to make the unescaping process optional for a physical path in directory context.
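To make the consequence concrete, here is a rough simulation of that single unescape pass. It is not Apache’s code: `path_seen_by_rewrite` is a hypothetical stand-in that I am assuming behaves like a plain percent-decode, which is what `urllib.parse.unquote` does.

```python
from urllib.parse import unquote

def path_seen_by_rewrite(raw_path):
    """Hypothetical stand-in for the one-time unescape described above.

    Assumption: the server percent-decodes the URL-path exactly once
    before any rewrite pattern is matched against it.
    """
    return unquote(raw_path)

requests = ["/Beverly%20Hills/", "/Beverly+Hills/", "/Beverly%2BHills/"]

for raw in requests:
    print(f"{raw!r:24} -> {path_seen_by_rewrite(raw)!r}")
```

Under that assumption, a pattern never sees `%20`, only a literal space, and a request for `Beverly%2BHills` becomes indistinguishable from one for `Beverly+Hills` by the time a rule can look at it.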
This is not a show stopper. If you’re building front controllers you are capable of avoiding Apache’s rewrite altogether (and may now recognize this as a necessity), but it sure is inconvenient if you had planned out a site architecture with an eye on a nice, stable, data-driven virtual URL hierarchy using your own front controller in collaboration with Apache’s fast, integrated mod_rewrite.
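Here is a minimal sketch of what routing in the front controller itself might look like. I am assuming the controller can get at the raw, still-encoded request URI (many server setups expose it as `REQUEST_URI`); the `route` function is hypothetical and not taken from any particular framework.

```python
from urllib.parse import urlsplit, unquote

def route(raw_request_uri):
    """Split the raw path into segments, then decode each segment once.

    Splitting before decoding means an encoded slash (%2F) inside a
    segment stays inside that segment instead of creating a new one.
    """
    raw_path = urlsplit(raw_request_uri).path  # drops the query string
    return [unquote(segment) for segment in raw_path.split("/") if segment]

# Example: the controller, not the web server, decides what %20 and %2F mean.
print(route("/Beverly%20Hills/plastic%2Fsurgeons?page=2"))
```

The design choice is simply the order of operations: because the controller decodes after splitting, the virtual URL hierarchy never has to pass through a decode step it does not control.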
The bug reports describe this in more detail, and show a few ways to work around the problem if you are so inclined (that is, if you are so inclined to revisit your code once Apache gets patched again).
My point? This ain’t beginner-level SEO, friends. Pursuing Google-friendly URLs with a modern web infrastructure, and running into a bug in Apache? THE web server? And not just a bug, but one that demonstrates how Apache’s roots are in file systems, which we left behind a few years ago when we started using CMSs and frameworks. If you’re moving a large dynamic site to a more “search engine friendly” site architecture with semantically useful URLs, your client just got a change order and a work authorization form. And the first order of business on that agenda is not working around the problem. It’s revisiting a cost-benefit analysis. If such an obstacle is too big for your SEO boots, what will you do? Settle for good-enough? I won’t. Certainly not in Beverly Hills.
In 2002 I implemented a method of statically caching a sitemap via use of a primitive front controller with hooks into Apache’s 404 handler. The goal was the same as it is today – user and search friendly URLs with no physical file system correlates, and fast, clean structure. I presented it to a technical audience in 2005, and was asked why it was necessary at all. Even now, 5 years later, working with frameworks and application programming languages far advanced from the old days of PHP3 and 4, we have the same problems: Google gives weight to things it can’t manage properly, and everything’s running on a web server built a long time ago when “things was different”. SEO is about optimizing content as published, so that it ranks in search engines. As long as the web keeps changing, and Google does or does not, SEO will be hard.
As far as the Apache/SEO collision thing, it’s not so much this bug as the source of it: Apache protects an underlying file system, and I just don’t have any need for a file system any more.
If you’re into the minutia, here are some links: