One thing that's always bugged me is file extensions...I hate having to type ".php" or ".cgi" or ".asp" or whatever else just happens to reflect the server's implementation at the time. If they switch from Microsoft to LAMP then their URLs will probably all break needlessly.
It can be much worse though...exposing machine names, unnecessary complexity and parameters, all changing from year to year.
A server doesn't have to puke its implementation details everywhere. I can't even count all the "enterprise" apps that make that mistake (e.g. a helpdesk system that gives me "serverNameThatWillChangeNextYear.domain.net/some/unnecessarily/convoluted/path.unnecessaryExtension?whatthehellisallthis&garbage1=a&garbage2=b&finallyRelevantBugNumber" instead of a stable URL like "company.net/bugs/bug123456"). And just try E-mailing a complex URL to somebody (it wraps, and time is wasted awkwardly reassembling it).
Honestly, it's as if most web developers don't understand how powerful URLs can be. If you make URLs short and stable and use them to help look up stuff, they can be very nice. Instead in the past I've seen people E-mailing 9-step instructions on how to find something because the damn URL is unreliable.
> Honestly, it's as if most web developers don't understand how powerful URLs can be.
Or, more likely, most of the "enterprise" apps with those URLs were first released over a decade ago, when the idea of "pretty URLs" was not yet mainstream. There was not even a mention of mod_rewrite on Wikipedia at that point. The article on the front controller pattern, which most webapps that handle their own URL routing use, wasn't written until 2008.
> most of the "enterprise" apps with those URLs were first released over a decade ago, when the idea of "pretty URLs" was not yet mainstream.
Or, even more likely, the enterprise customer never specified "URLs that aren't terrible" in their 100-page RFP, and so the lowest bidder didn't bother with URLs that aren't terrible.
The thing that drives me up the wall is "index.html" - that's pretty much inexcusable in my opinion. There are even some sites where the root path on the domain REDIRECTS to example.com/index.html - don't do that.
<a href="/x/"> will work, but will break when someone decides to move "/x" to "/y".
<a href="."> will work from the server, which does an internal redirect, but not on the static version on my local drive I'm going to demo to my boss. (I also have a hunch a significant number of developers aren't aware of "." and don't know this is an option.)
So we end up with <a href="index.html"> for better or worse.
If you're having problems like this, your web development environment is total garbage and should be fixed.
99% of the time the reason for index.html is the developer was viewing static files in their browser because they didn't have a proper server or test environment. This is inexcusable in 2012.
The index object should be the default page opened by a webserver for any given unspecified path.
Eg: http://www.example.com/path/to/url/ should open the first match among the configured default objects, typically index.html, index.shtml, index.php, or similar. This is defined in your Apache conf file, or locally via .htaccess.
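For reference, that lookup order is a single directive in Apache's mod_dir (the directive name is real; the particular filename list is illustrative):

```apache
# httpd.conf or .htaccess: serve the first of these files that exists
# whenever a client requests a bare directory path like /path/to/url/
DirectoryIndex index.html index.shtml index.php
```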
I understand that: index.html is just one possible name for the "default page" filename. IMO it's perfectly valid (if redundant) to specify it explicitly in a URL. I still don't understand why this was described (not by you) as unacceptable.
In general, you want URLs that are short, clear, memorable, and stable. You want to tell people,
"Come to mydomain.com/blog," not,
"Come to http://www.wordpress.mydomain.com/stamp_collectors_heaven_bl... The latter includes lots of information that their browser + your server can figure out, so you shouldn't be burdening your (potential) users with it.
So, regarding the "index.html" portion: The index is the default. There's no point having a default if you have to specify it, and there's no point specifying it if it's the default.
A web page should have one and only one URL. If you link to the index.html version of a page you're creating two URLs for that page (the index.html one and the bare / one). The / one is shorter, easier to type and more attractive, so you should pick that one.
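If you want to enforce that one-URL rule at the server, a hedged sketch with Apache's mod_rewrite (assuming main server config context, where paths keep their leading slash):

```apache
# Permanently redirect /foo/index.html to /foo/, collapsing the
# two URLs for each index page down to the bare-directory one
RewriteEngine On
RewriteRule ^(.*/)index\.html$ $1 [R=301,L]
```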
If you'd started with the 'document.html' naming convention, you could use any of numerous webserver hacks to preserve this illusion. How you request something has little to do with what the server does to satisfy your request.
What I was distinguishing earlier was the distinction between 'return default index of this level of the path' and 'return a specific document from this level of the path'. Specifically indicating 'index.html' is a tad gauche.
I think some organizations see this kind of implementation leakage as a feature, not a bug.
I remember when http://microsoft.com/ began doing external redirects to "default.asp" circa 1997. If you were a "webmaster" (do these exist anymore?), this was a dog whistle. They were not using static .html (or .htm) but not any of the common dynamic methods like .cgi or .shtml either. And using "default" rather than "index" indicated a break from NCSA/Apache convention. They were using a different web server. Those 11 extra characters said a lot.
Bright people use other people's problems to their advantage: in pg's Viaweb they deliberately put extra `cgi-bin/` here and there just to confuse their competitors.
Problem is that most companies don't have someone who actually owns the URL namespace. Or the engineer who wrote that bug reporting tool was too shy to contact the CIO, etc.
I agree with your point, but it's also valuable to understand the structural reasons that prevent such things.
If it's bad magic, update your distro. If it's a Konqueror error, file a bug. I'd be surprised if upstream hasn't addressed this (quick DDG/Google doesn't turn up any similar complaints).
But they are zip files. (Though `file somebook.epub` tells me "data". Sigh. Desktop and OS seem to be on different pages.)
Just like .css and .log are text files, but I may not want the same default file handler.
What I like is having *.epub open in, say, calibre, but if I append a .zip extension then it will open in xarchive.
Anyways, I open most things from the command line and wrote my own version of 'open', so I get what I expect 99% of the time. :)
I'm using KDE3 so I'm not expecting any upstream fixes for this in my lifetime. I can live with crafting my own solutions.
What might work best is if mime types were used by default but forcing behavior for specific extensions was much easier. Get the best of both worlds (which I can mostly do in KDE3 Konqueror but it's tedious.)
File(1) on my Debian wheezy says they're epubs. File has dealt with nested formats for decades. It examines up to the first 1024 bytes of the file IIRC.
Other examples: tar.gz, WAR files, most ODF formats.
Or do you mean storing a database of the application associated, specifically, with any given file (so that I might open, say, a given myprog.c with vi, emacs, or textmate)?
The Mac OS resource fork model attempts the second, but it's an inherently single-user concept that leaves artifacts around for other users and/or on shared media in an annoying way. A shadow filesystem maintained on a per-user basis under their control would be a preferred solution.
Files used to be tied to applications directly. While I prefer the BeOS approach of MIME-type, it's not nearly as user-friendly as a file extension.
OS X tries to hide extensions by default, which, like the similar behavior in Windows, seems dangerous. Too many times people have been stung by sexy.jpg.exe.
I had a number of revelations through rereading this classic article:
First, holy crap, I hadn't been to http://www.w3.org/ in a long time, and it looks like they've actually made it to the 21st century!
Second, perhaps cool URIs don't change, but it seems like http://www.w3.org/Provider/Style/URI.html is kind of an unfortunate URI. What's "Provider"? Why are "Provider" and "Style" uppercase? And what's wrong with 301 redirecting (don't break old URIs, but still restructure them as your website matures and you realize a better organizational hierarchy)?
Third and perhaps most importantly, this all seems like a pretty awesome problem to have! How many websites survive more than a few years? (Geocities doesn't count.)
> Second, perhaps cool URIs don't change, but it seems like http://www.w3.org/Provider/Style/URI.html is kind of an unfortunate URI. What's "Provider"? Why are "Provider" and "Style" uppercase?
They're uppercase because they're uppercase. Asking why is like asking why Rubyists like_using_method_names_like_this and .NET devs LikeUsingMethodNamesLikeThis.
Convention is for URLs to be all lowercase, which IMO aids in readability and certainly makes them easier to type (imagine giving a URL to a friend over the phone -- remember this is 1999).
The standard is that hostnames are case-insensitive, but non-FQDN paths should allow both uppercase (especially for systems which don't include lower-case elements, yes, Virginia, they exist) and lowercase elements.
I agree that method_names_like_this is more readable, even (or, especially) after many years of .NET programming. I used the_style_like_this back in C++ days and, honestly, miss it. Perhaps, the renaissance of C++ in the form of C++11 will bring it back :)
like_using_method_names_like_this is more readable. White space has been used to separate words since the time of Alcuin, and underscores are the closest way to approximate it. Call me conservative, but I prefer to stick to the insights of the last 1400 years rather than mangle them horribly ;-)
So your eyes go zig-zag from the taller glyphs of the word down to the low glyph of the underscore (vertical zig-zag motion), instead of jumping to the next word (horizontal).
funny, I was just checking because I have never noticed my eyes doing that and afaics my eyes still scan horizontally. I don't need to look down to see the underscore. I guess YMMV?
Actually I read somewhere that the decision for Ruby using underscored names as a naming convention was because of the very large number of non-native English speakers in the early days of the language. Something along the lines of it was easier to read for non-native speakers because it more closely mimicked spaced English.
> Second, perhaps cool URIs don't change, but it seems like http://www.w3.org/Provider/Style/URI.html is kind of an unfortunate URI. What's "Provider"? Why are "Provider" and "Style" uppercase?
Take a look at w3.org/Provider and w3.org/Provider/Style. It's actually a well-constructed URL. It appears to be a "Style guide" for "web content Providers."
It's because it is a part of a collection of numerous pages named "Putting Information onto the Web". Its audience is obviously (content) providers, which the URI reflects.
However, http://www.w3.org/Provider/Style/URI/ does not work. I guess this makes some sense, but certainly violates the "typical" functionality on the web where trailing slashes are ignored.
Actually, the behavior you're referring to is a combination of trying to access a directory (to which a missing trailing slash is automatically appended) and the use of default pages (for Apache, that's the DirectoryIndex directive). Trailing slashes on files are not valid, because files are not directories.
Agreed. That’s a rule I mention in my piece, “URL as UI”, which I really want everyone to read. (If it feels like self-promotion, then please kindly ignore my username / blog masthead. I only care about these ideas.) http://alanhogan.com/url-as-ui
> Consider linking to your home page and/or site map [...]
For a second I thought you were advocating redirecting 404s to the homepage, and I was about to flame you. I hate that one, and I think it's harmful enough that you probably ought to have a point about not doing it in that article.
I agree with this article to the extent that we should choose our URIs carefully and semantically. However, it is very likely that your web-facing application, and the semantics of your resources, will change over time. What if I wanted to deprecate and eventually remove my 'Provider' or 'Style' resource? Maybe I decided to create a 'blogs' and an 'articles' resource. That would change the URI from http://www.w3.org/Provider/Style/URI.html to http://www.w3.org/blogs/articles/URI.html (I know my choice of new resources isn't the most interesting, but you get the point).

Of course, we would still support the old URI via a 301 redirect to the new one, or continue to serve up the page from the old URI and use a rel=canonical meta tag. I'd advocate for 301ing to the new URI and letting the past be the past. We shouldn't be bound to these decisions for the rest of our lives.

Unless you're sure that the focus of your web app will never change, or you plan to build your new resource URIs around old resource URIs that were chosen at a time when your new resources weren't being considered (which would lead to far worse URIs), I'd plan for your URIs to change.
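That kind of 301 is a one-liner in Apache's mod_alias. A hedged sketch for the hypothetical restructuring above (the old and new paths are just the example ones):

```apache
# Permanently redirect the old hierarchy to the new one,
# preserving whatever document name follows it
RedirectMatch 301 ^/Provider/Style/(.+)$ /blogs/articles/$1
```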
Good they mention "banner blindness" and related phenomena. I see webpages sometimes that make these mistakes, and once or twice I ever got burned by it - like when I missed a very important "Point 0" from some IRC channel's FAQ. This Point 0 was so important for channel mods, that they put it directly above the page headline, and I honestly didn't perceive it was there until I reread the page for the third time.
The problem is that given enough time, the URL will break-- you'll redo the entire site (a shiny new CMS/cloud provider/hosted solution/server!), or accidentally delete everything, or get hacked, or get acquired, or go out of business. Or even if you keep your important links up forever, various obscure bits and pieces will come and go, and years later, somebody will find an old link to one of them. Bitrot is a fundamental limitation of the web, and it sucks.
The workaround is the Wayback Machine, which is amazing, but could be more comprehensive and frequently updated. I wish someone like Google would throw more servers at it.
The point of the article is that as long as you control the domain, you have no excuse for your links breaking. Going out of business, and therefore being unable to pay for your domain, is specifically called out as a valid reason, but there's no reason you'd lose control of your domain after being acquired, even if you decided to redirect your old links to newer information. Shiny new software is also specifically called out as a bad reason to break links, since backwards-compatible redirects are trivial. And if you're capable of permanently losing all your data through accidental deletion or a server being compromised, you have much bigger problems.
All that said, you're fundamentally right -- sometimes information stops being available because it's out of date, and keeping it available would be confusing (if a product is no longer available, it would be strange to maintain a page describing it for years afterwards). Archiving through the Wayback machine is a very helpful stopgap, but expecting them to continuously archive every version of the entire Internet for all time won't scale.
What's needed is a distributed, decentralized system, ideally at the protocol level. Imagine if a GET request by default gave you the "current" version of a page, but you could send an extra header that said "give me this page, as it appeared at date-time X". This would remove the confusion caused by the existence of a page being conflated with that page being current[1], and allow sites to maintain clean navigational and data structures by flagging outdated pages as "expired" instead of completely deleting them. When a server got a request for a page that used to but no longer exists, it could respond with a new 4xx-series header, "No longer current", indicating the document is not available for the given date-time, but is available for an earlier date.
[1] I frequently get people sending me ANGRY emails about flippant, immature blog posts I wrote 10+ years ago[2]. They assume that because it's still on my website, I still stand by those statements, when in fact I'm just reluctant to delete information.
[2] The posts still get traffic, because links to them made 10+ years ago still work, despite rewriting my CMS 3 times.
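For what it's worth, the Memento effort has been working on something very close to this, using an Accept-Datetime request header. A sketch of the imagined exchange (hypothetical host, path, and date):

```http
GET /some/old/page HTTP/1.1
Host: example.com
Accept-Datetime: Thu, 01 Apr 2004 00:00:00 GMT
```

The server (or an archive acting on its behalf) would then answer with the page as it stood at that moment, or an error saying no version exists for that date.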
> What's needed is a distributed, decentralized system, ideally at the protocol level. Imagine if a GET request by default gave you the "current" version of a page, but you could send an extra header that said "give me this page, as it appeared at date-time X".
> [1] I frequently get people sending me ANGRY emails about flippant, immature blog posts I wrote 10+ years ago[2]. They assume that because it's still on my website, I still stand by those statements, when in fact I'm just reluctant to delete information.
Couldn't you implement a clumsy manual version of the 'expired' header you're proposing, by doing something like having your server precede each page with "The following has not been modified since …, and should not be regarded as current" if it is more than a certain amount of time old?
Some of it depends on the nature of the URI though. The article is about linked resources, but a lot of URI usage isn't really about linked resources. Many of these can change reasonably.
For example, is there any harm in changing where that contact us form points to? Do you have to maintain the contact us forms API to be backwards compatible forever? This really depends on what you are doing.
I agree with the article as it relates to documents, for the most part. There are cases where maintaining URL backwards compatibility is a problem due to unforeseen dead ends, but mostly it shouldn't be. Still, URIs' needs for persistence vary so widely that I don't know that one can generalize much beyond that.
> For example, is there any harm in changing where that contact us form points to?
The point could more be made: "Why not keep that URL the same?"
It's trivial to do technically, and there's no advantage to be had from moving back and forth between foo.com/contact and foo.com/contact-us. If the URL of your company's contact form gets published in a book, would that change your mind?
Sorry I was misunderstood. I was talking about the URL where the form is submitted, not the URL of the form. For the URL of the form, it's generally desirable to keep it the same all things being equal because this doesn't break other people's links.
But where it is a form submission target, the only reason to care is if you are accepting third party form submission. But we are no longer talking about hypertext resources at that point which is my main point.
Organizations with large websites use content management systems. Vendors of content management systems come and go as technology changes. Usually, changing the CMS a website uses necessitates URI structure changes depending on the conventions of the system.
Maintaining in perpetuity a complete set of redirects (or stack of rewrite rules) for every page for a website consisting of millions of pages for decades is not feasible in most cases. Things change, departments are created or disbanded or renamed. Management decides certain things should not be accessible to the public or stored in a particular location.
It's incredibly unrealistic to expect every URI to be permanent.
No, I'm saying that technology changes, the web changes, your organization changes and the content on the website changes. To act like it's even desirable to have all of the same documents at the same locations on your website in 20 years is misguided.
Jeremy Keith's talk and resulting long bet on the topic is interesting. He contends that it's not proven that data put on the internet is in fact immortalized.
> Pretty much the only good reason for a document to disappear from the Web is that the company which owned the domain name went out of business or can no longer afford to keep the server running. Then why are there so many dangling links in the world? Part of it is just lack of forethought.
Nope, it's the lack of a crystal ball. The technology world moves fast. And breaks things, as Zuck says. I have no idea how a site of mine might be structured even a year from now, whether the current URL scheme will conflict directly with its needs, or whether an old-to-new translation will be trivial or overly resource-intensive. By now, the world has mostly realized that over-planning is bad, and agile is important.
So who cares if a 5-year-old URL to a page virtually nobody ever visits anymore doesn't work. It's easy to Google the keywords in the link, and you'll probably be able to find the content if it's still around.
(Obviously big sites have an incentive to keep their links working, but they don't need an article from the W3C to tell them that.)
Oh, so you are the guy who has been causing all those 404s that I bump into.
And nice strawman, by the way. I have had some EXTREMELY popular blog posts that I bookmarked (remember NVIE’s post on git branching?) 404 just because the site changed architecture. (Now hosted on GitHub.) It’s all laziness and lack of forethought and planning. What is agile about one of the most popular blog posts in the development world 404ing?
That said, yes, it can be a bit time-consuming to map old URLs to new. I once started a project that would help considerably with this, but my employer at the time decided it wasn’t worth it (hey, would they see any more gold coins if they helped clients’ visitors’ bookmarks and search results work?). But a manual connection isn’t impossible. In some cases, it’s easy. When I moved off Drupal (damn it to hell), I redirected all those old /node/<id> URLs to their new ones. Because, damn it, I wasn’t going to add any 404s to the Internet if I could. I didn’t blog to make money. I blogged to help people. And maintaining URLs helps people.
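The /node/<id> case above can be sketched with mod_rewrite's RewriteMap (the directive names are real; the file paths and map contents are made up for illustration):

```apache
# httpd.conf — RewriteMap is only allowed in server/vhost context
RewriteEngine On
RewriteMap nodemap txt:/etc/apache2/node-redirects.txt

# If the old numeric node ID has an entry in the map, 301 to its new URL
RewriteCond ${nodemap:$1} !^$
RewriteRule ^/node/(\d+)$ ${nodemap:$1} [R=301,L]

# /etc/apache2/node-redirects.txt looks like:
#   123  /blog/cool-uris-dont-change
#   456  /blog/hello-world
```

The map file is the manual part: one line per old node, pointing at the hand-chosen new URL.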
"technology world moves fast [...] currently URL scheme will conflict directly with its needs or not"
That is your mistake. The URI should be decoupled from the technology you are using. The URI is part of the content and not of the software behind the site.
I love how this starts out by (wrongly) assuming that "you" is a single person or even stable group of people, and that the answer to a changing URI is to say "You did it wrong."
And companies that have millions of developers making money off their APIs (and, in one way or another, paying some dividend back to the API creators) indeed do tend to keep their APIs fairly backwards-compatible, with some notable exceptions.