Apache rewrite rules are almost as horrid as XSLT

It’s time for a rant. I haven’t posted in a while because I haven’t been doing much that might be of interest to anyone, in terms of coding. Apart from my internship, perhaps; I’m working at Think Wize, a company based around Django, for six weeks. They are hoping to set up a continuous integration environment of sorts, and I’ve been given a few tasks to help make this happen. The first few weeks I’ve mainly been trying to get their codebase to play nice with South, a schema migration tool for Django. It’s been quite a challenge. I’m not sure if I’m at liberty to say much more about the subject, but that’s not what this post was going to be about anyway.

Nope, this is a rant. I just spent almost two hours trying to figure out how to get Apache to remove the “www.” from URLs. mod_rewrite, goddamn. It isn’t so much the horrible syntax, or the contradictory information to be found about it, as the fact that it was basically ignoring what I wrote the whole time. Really, really frustrating. I don’t know why, either. Feel free to continue reading and let me know what I was doing wrong, if you want.

The reason I’ve been wasting my time on this is because I’m going to set up a second website on my VPS: got-djent.com. It’s going to be a community portal for people who are interested in djent. Djent is a music thing. Some would argue it is a genre on its own, but I’m not going to speak out on that. If you’re into progressive music and/or metal, you might want to check it out. Anyway, I run a fairly large last.fm group called “djent”. Recently, we decided that we’d like to have some more options available than just a shoutbox and a forum with limited functionality, so we’re going to set up a Drupal-based community portal. Perhaps I’ll write more about my Drupal adventures later.

To make this work, I had to set up Apache so there would be two virtual hosts, one for each website. My main site (this one) is the default. I set the ServerName directive to got-djent.com for the other one, and that seems to be working fine now. Everything was fine up to here.

Unfortunately, www.got-djent.com was still pointing to my main site. My guess is that this is because the hostname doesn’t match the ServerName of the virtual host, so it falls back to the default. Luckily, there is also a ServerAlias directive, so I added ServerAlias www.got-djent.com below that and it worked fine.
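
For reference, the setup described so far might look roughly like the sketch below. It is not a copy of the real configuration: the document root paths are made up, benanne.net is assumed to be the ServerName of the default site, and it assumes Apache 2.2-style name-based virtual hosts on port 80, where the first virtual host listed acts as the default.

```apache
# Assumption: Apache 2.2 name-based virtual hosting on port 80.
NameVirtualHost *:80

# Listed first, so it handles any hostname that matches no other virtual host.
<VirtualHost *:80>
    ServerName benanne.net
    # Hypothetical path
    DocumentRoot /var/www/benanne
</VirtualHost>

<VirtualHost *:80>
    ServerName got-djent.com
    # Without this alias, www.got-djent.com falls through to the default host.
    ServerAlias www.got-djent.com
    # Hypothetical path
    DocumentRoot /var/www/got-djent
</VirtualHost>
```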

Some keyboards have to endure even worse fates.

However, “www” is useless. What is it there for anyway? To remind us that we’re surfing the World Wide Web? Frankly, I think most internet denizens can tell a URL from an email address these days, without needing the “www” prefix to make it blatantly obvious. We would be better off without it. It would save us a lot of keystrokes, and put less strain on our W keys. Somehow it seems unfair that they have to endure such ongoing abuse :)

On a more serious note, the prefix affects caching; if you visit http://example.com/foo/bar.html, and then http://www.example.com/foo/bar.html later, the entire page will be reloaded (including all images hosted on the same domain) because your browser can’t tell that the URLs actually point to the same page. This means that it’s definitely a good idea to make only one of these available. Just serving the user a 404 page on the other is a bit rude though, so the nice thing to do is to install a redirect. A website has been set up to promote this cause: no-www.org [1].

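And so, I set out to make my domains class B compliant, which in no-www.org terms means that the “www” hostname still answers but redirects to the bare domain. I got there eventually, but it was quite a pain. What I needed to do, basically, was throw this into my Apache configuration somewhere: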

RewriteEngine On

RewriteCond %{HTTP_HOST} ^www.got-djent.com$ [NC]
RewriteRule ^(.*)$ http://got-djent.com/$1 [R=301,L]

RewriteCond %{HTTP_HOST} ^www.benanne.net$ [NC]
RewriteRule ^(.*)$ http://benanne.net/$1 [R=301,L]

Someone once said that mod_rewrite is voodoo. If you think these five lines aren’t particularly easy on the eyes, you can’t even begin to fathom what ghastly horrors lurk in the depths of URL rewriting. Seriously, this is nothing. Most of it can obviously be attributed to its extensive use of regular expressions, but even aside from that, it’s hardly elegant.

Let’s go through these five lines step by step. What RewriteEngine On does should be fairly obvious. RewriteCond %{HTTP_HOST} ^www.got-djent.com$ [NC] is a bit less transparent. RewriteCond describes a condition for a rewrite rule. Basically, it lets you match a variable associated with the incoming request against a regular expression. %{HTTP_HOST} is a variable that contains the host name that was used for the current request [2]. This is matched against ^www.got-djent.com$, a regular expression which is meant to match only “www.got-djent.com” itself (more on the unescaped dots further down). The [NC] is an option flag that indicates the matching should be case insensitive (it stands for no case). A RewriteCond always affects the first RewriteRule that follows it, and only that one.

Next up is RewriteRule ^(.*)$ http://got-djent.com/$1 [R=301,L]. As you can see, RewriteRule also takes two parameters: a regular expression and a… well, it’s a URL of sorts, but it has funny characters like dollar signs in it – a substitution pattern. The regular expression extracts information from the request URL [3] and this is used to interpolate the backreferences (like $1) in the substitution pattern. In this case, we have the regular expression match everything [4] and paste this after the domain name without the “www”.
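
To make that concrete, here is the same pair of lines again, annotated with what happens to one example request. This is only a sketch under one assumption: the rules live in the .htaccess file of a document root, where the pattern sees the path without its leading slash. The example path is made up.

```apache
RewriteEngine On

# Context assumption: .htaccess in a document root (per-directory context).
#
# Request:  http://www.got-djent.com/foo/bar.html
#   HTTP_HOST = www.got-djent.com  -> the condition matches
#   path seen = foo/bar.html       -> ^(.*)$ captures it as $1
#   redirect  = http://got-djent.com/foo/bar.html
RewriteCond %{HTTP_HOST} ^www.got-djent.com$ [NC]
RewriteRule ^(.*)$ http://got-djent.com/$1 [R=301,L]
```

In server or virtual host context the path keeps its leading slash, so the same substitution would produce a double slash after the host name.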

This directive also has two option flags. [L] is for last; processing will stop after applying this rule, and any following rules will be ignored. [R] is for redirect, which is what we want to do here. Note that just [R] would send out a 302 status code (Found, a temporary redirect). We explicitly specify 301 so it sends out 301 Moved Permanently. The last two lines are the same as the previous two, but for my other domain. It is probably possible to consolidate these two into a single rule, but I’ve had enough of this for a while.
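
For what it’s worth, a consolidated version might look something like the sketch below. It is not what is running on the server; it assumes the same .htaccess placement and relies on two things from the mod_rewrite docs: groups captured in a RewriteCond are available as %1, %2 and so on, and %{REQUEST_URI} holds the requested path including its leading slash.

```apache
RewriteEngine On

# Capture the bare domain after the leading "www." as %1.
# Backreferences into a RewriteCond use %N rather than $N.
RewriteCond %{HTTP_HOST} ^www\.(got-djent\.com|benanne\.net)$ [NC]
# "^" matches any path without capturing; the path comes from REQUEST_URI.
RewriteRule ^ http://%1%{REQUEST_URI} [R=301,L]
```

Replacing the group with (.+) would strip the “www” from any host name that reaches these rules, which may or may not be what you want.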

One thing I’m wondering about is why the dot wildcard (.) usually isn’t escaped in the regular expression provided to RewriteCond. Out of the stuff I found on Google, roughly half would escape them (www\.got-djent\.com) and the other half wouldn’t. I left the backslashes out here, because it looks a little bit nicer this way and they don’t seem to be doing anything. My guess is that they should actually be escaped, but because they are matched against %{HTTP_HOST} they can only ever match a dot anyway. The regular expression itself might also match, say, wwwwgot-djent_com, but that hostname doesn’t point to my server in the first place. If you have another explanation for this, or if you think mine is incorrect, please let me know in the comments.
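
For comparison, this is the escaped variant. In a regular expression a backslash-dot matches a literal dot only, so this version can no longer match oddities like the one above. In practice the behaviour should be the same for any hostname that actually resolves to the server.

```apache
# "\." matches a literal dot only; an unescaped "." matches any single character.
RewriteCond %{HTTP_HOST} ^www\.got-djent\.com$ [NC]
RewriteRule ^(.*)$ http://got-djent.com/$1 [R=301,L]
```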

Now, my main issue wasn’t the rules themselves, but where to put them. Nothing seemed to work. I put them in httpd.conf, in sites-available/ and in conf.d/ [5], and threw in some other directives without any luck. The rules just would not get picked up. I know the files in question were all being read though, because I also put RewriteLog "/var/log/rewrite.log" and RewriteLogLevel 9 in there for debugging purposes, and these definitely were being picked up. I tested this by browsing the Drupal install I’m working on, which has some rewrite rules to make its URLs prettier. Those rewrites would show up in the log just fine, but my own rules were doing nothing at all.
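
The debugging setup mentioned above, spelled out. These two directives are Apache 2.2 ones and belong in the server or virtual host configuration rather than in .htaccess. They were removed in Apache 2.4, where rewrite tracing goes through LogLevel instead; the commented-out line shows that form, with the trace level picked arbitrarily.

```apache
# Apache 2.2: write a detailed trace of rewrite processing to its own log.
# Level 9 is extremely verbose and only suitable for debugging.
RewriteLog "/var/log/rewrite.log"
RewriteLogLevel 9

# Apache 2.4 equivalent (output goes to the regular error log):
# LogLevel alert rewrite:trace6
```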

The solution, of course, is to put them inside an .htaccess file. However, not just any .htaccess file; it has to be the one in the document root of the default virtual host. There is a simple explanation for this: www.got-djent.com does not match the ServerName (I had obviously removed the ServerAlias directive earlier) of the got-djent.com virtual host, so the request never reaches it. Instead, it goes to the default virtual host. It took me the better part of two hours to finally realise this. Once I knew this, the solution was obvious.

I still don’t know why placing the directives in any of the main configuration files has no effect. According to Google this should work just fine, and to me it seems like a more natural place to put them in. I hope this post is helpful to anyone who is running into the same kind of problems. For others it will probably be a painfully boring read. I’m not sorry for this.
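
A likely explanation, and the comments below point the same way: according to the mod_rewrite documentation, rewrite settings in the main server configuration are not inherited by virtual hosts by default, so rules placed at the top level of httpd.conf or conf.d/ never run for requests that a VirtualHost handles. A sketch of how the redirect could live in the configuration instead, assuming a dedicated virtual host for the “www” name (a hypothetical layout, not something tested on this server):

```apache
# A virtual host that only exists to redirect the www. hostname.
<VirtualHost *:80>
    ServerName www.got-djent.com

    # Rewrite settings are per virtual host, so the engine is enabled here.
    RewriteEngine On
    # No RewriteCond needed: only www.got-djent.com reaches this host.
    # REQUEST_URI already starts with a slash, so it is appended directly.
    RewriteRule ^ http://got-djent.com%{REQUEST_URI} [R=301,L]
</VirtualHost>
```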

To justify the title, I should probably mention that my hate for XSLT is far worse. I respect the power of Apache’s URL rewrite rules, and the inevitable complexity that comes with this, but XSLT is just a giant sack of unnecessarily complicated shit. Aaaaaaaanyway. I am going to watch an episode of A Bit of Fry and Laurie and then go to bed. I think I’ve deserved that. Keep an eye on got-djent.com if you’re interested; it’ll be going live in a few weeks or so. Good night :)

Notes

  1. Other people think “www” is an essential part of internet culture and would rather we used it more liberally: www.www.extra-www.org :D
  2. It doesn’t contain the full request URL; it seems to be a fairly common mistake to assume that it does.
  3. Note that this URL might have been rewritten by preceding rules. Can you imagine how much fun this must be to debug!
  4. If you don’t understand why ^(.*)$ matches everything, pay a visit to www.regular-expressions.info.
  5. I’m using Ubuntu Server, and its default Apache install has a pretty intricate configuration file structure.

6 Comments

  1. Posted August 28, 2009 at 5:18 pm

    I’m not totally sure I believe in the “no www” movement. It certainly doesn’t save you typing!

    For instance, if twitter didn’t also accept http://www.twitter.com, I would have to type t-w-i-t-t-e-r-dot-c-o-m-enter. But because they DO allow www, all I have to type is t-w-i-t-t-e-r-ctrl-enter. That’s like 3 characters difference! And yes, it works in all browsers.

  2. Sander
    Posted August 28, 2009 at 5:26 pm

    The “less typing” argument was in jest ;) Well, kind of, anyway. It certainly isn’t a good enough reason to loudly campaign for the abolishment of “www”. The main reason I prefer without is because, well, it doesn’t do anything useful, and because of the caching situation.

    The ctrl+enter thing is a nice trick which I often use myself (and also its friends shift+enter and ctrl+shift+enter), but that is guaranteed to work by putting a redirect in place. Class B compliance is the best of both worlds, really.

  3. Posted August 28, 2009 at 6:00 pm

    Alright ;) in that case I agree with you. Everyone “knows” that whatever you type is on the www, unless you’re on corporate intranet or something. And it only breaks caching if you don’t redirect from one to the other. Really the strongest argument is that it’s…vestigial. There’s no reason you *need* to have it. It’s definitely a growing trend to not have it.

  4. ziggy
    Posted November 13, 2009 at 11:28 am

    In httpd.conf you have to put this in the VirtualHost, Directory section.

    Your understanding of the . wildcards is exactly correct. I see several versions of this code but no idea which one is a millisecond faster.

    One reason to do this is so that all your incoming links end up at the same place; otherwise your googlejuice is spilled over duplicate pages, which isn’t good.

  5. feniix
    Posted February 23, 2010 at 8:19 am

    You don’t need to do grouping. Use grouping when you need to mangle the URI, like exchanging the position of parameters or creating SEO-friendly URLs.
    In your case it is just a waste of CPU and memory. I recommend:
    “The Definitive Guide to Apache mod_rewrite”

    RewriteEngine On

    RewriteCond %{HTTP_HOST} ^www.got-djent.com$ [NC]
    RewriteRule .* http://got-djent.com%{REQUEST_URI} [R=301,L]

    RewriteCond %{HTTP_HOST} ^www.benanne.net$ [NC]
    RewriteRule .* http://benanne.net%{REQUEST_URI} [R=301,L]

  6. Sander
    Posted February 23, 2010 at 12:16 pm

    Thanks, that does make a lot of sense :) I guess I missed REQUEST_URI when I skimmed through the docs.
