Cool URIs don't change. (w3.org)
187 points by diwank on June 25, 2012 | hide | past | favorite | 84 comments


One thing that's always bugged me is file extensions...I hate having to type ".php" or ".cgi" or ".asp" or whatever else just happens to reflect the server's implementation at the time. If they switch from Microsoft to LAMP then their URLs will probably all break needlessly.

It can be much worse though...exposing machine names, unnecessary complexity and parameters, all changing from year to year.

A server doesn't have to puke its implementation details everywhere. I can't even count all the "enterprise" apps that make that mistake (e.g. a helpdesk system that gives me "serverNameThatWillChangeNextYear.domain.net/some/unnecessarily/convoluted/path.unnecessaryExtension?whatthehellisallthis&garbage1=a&garbage2=b&finallyRelevantBugNumber" instead of a stable URL like "company.net/bugs/bug123456"). And just try E-mailing a complex URL to somebody (it wraps, and time is wasted awkwardly stitching it back together).

Honestly, it's as if most web developers don't understand how powerful URLs can be. If you make URLs short and stable and use them to help look up stuff, they can be very nice. Instead in the past I've seen people E-mailing 9-step instructions on how to find something because the damn URL is unreliable.


> Honestly, it's as if most web developers don't understand how powerful URLs can be.

Or, more likely, most of the "enterprise" apps with those URLs were first released over a decade ago, when the idea of "pretty URLs" was not yet mainstream. There was not even a mention of mod_rewrite on Wikipedia at that point. The article on the front controller pattern, which most webapps that handle their own URL routing use, wasn't written until 2008.


> most of the "enterprise" apps with those URLs were first released over a decade ago, when the idea of "pretty URLs" was not yet mainstream.

Or, even more likely, the enterprise customer never specified "URLs that aren't terrible" in their 100-page RFP, and so the lowest bidder didn't bother with URLs that aren't terrible.


The first URLs were short and sweet.

HTML, and server paths / CGIs were largely hand-coded and short.

The URI explosion occurred in the late 1990s / early aughts for the most part with Java and Microsoft entering the fray from my recollection.


So like he said, over a decade ago.


The thing that drives me up the wall is "index.html" - that's pretty much inexcusable in my opinion. There are even some sites where the root path on the domain REDIRECTS to example.com/index.html - don't do that.


While seeing "index.html" (or "index.ANYTHING") makes me cringe, I have some sympathy for the developers who do this.

Let's say I have an old-school, all-static site with pages at http://example.com/x/index.html and http://example.com/x/about.html. I would like to make a link to the "index" page from the "about" page. What are my choices?

<a href="/x/"> will work, but will break when someone decides to move "/x" to "/y".

<a href="."> will work from the server, which does an internal redirect, but not on the static version on my local drive I'm going to demo to my boss. (I also have a hunch a significant number of developers aren't aware "." is even an option.)

So we end up with <a href="index.html"> for better or worse.


  <a href="/x/"> will work, but will break when someone decides to move "/x" to "/y".
If /x redirected to /y, that wouldn't be a problem.
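
In Apache, for instance, such a move can be covered with a one-line redirect (paths illustrative; Redirect prefix-matches, so deeper paths under /x carry over too):

```apache
# Send /x and everything beneath it to the same path under /y
Redirect permanent /x /y
```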


If you're having problems like this, your web development environment is total garbage and should be fixed.

99% of the time the reason for index.html is the developer was viewing static files in their browser because they didn't have a proper server or test environment. This is inexcusable in 2012.


I don't get it - why is index inexcusable as part of a url?


The index object should be the default page opened by a webserver for any given unspecified path.

E.g.: http://www.example.com/path/to/url/ should open the first of the specified default objects, typically index.html, index.shtml, index.php, or similar. This is defined in your Apache conf file, or locally via .htaccess.
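
In Apache that lookup order is set with the DirectoryIndex directive; a minimal sketch (the file list is illustrative):

```apache
# When a request ends in "/", serve the first of these that exists
DirectoryIndex index.html index.shtml index.php
```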

If you want to refer to a specific non-index page, you'd specify http://www.example.com/path/to/uri/somepage.html


I understand that: index.html is just one possible name for the "default page" filename. IMO it's perfectly valid (if redundant) to specify it explicitly in a URL. I still don't understand why this was described (not by you) as unacceptable.


In general, you want URLs that are short, clear, memorable, and stable. You want to tell people, "Come to mydomain.com/blog," not, "Come to http://www.wordpress.mydomain.com/stamp_collectors_heaven_bl... The latter includes lots of information that their browser + your server can figure out, so you shouldn't be burdening your (potential) users with it.

So, regarding the "index.html" portion: The index is the default. There's no point having a default if you have to specify it, and there's no point specifying it if it's the default.


A web page should have one and only one URL. If you link to the index.html version of a page you're creating two URLs for that page (the index.html one and the bare / one). The / one is shorter, easier to type and more attractive, so you should pick that one.


Because it's presenting an implementation detail.

If I change my file formats from HTML to SHTML, .jsp, .php, etc., the URL changes.

If I change from files to directories for every possible URL instance, the URL changes.

The user shouldn't care.


Earlier you said: "If you want to refer to a specific non-index page, you'd specify http://www.example.com/path/to/uri/somepage.html "

So, is this presenting an implementation detail as well?

I only asked my original question because I thought I was missing something...


Fair gripe.

If you'd started with the 'document'.'html' naming convention, you could use any of numerous webserver hacks to preserve this illusion. How you request something has little to do with what the server does to satisfy your request.

What I was drawing earlier was the distinction between 'return the default index of this level of the path' and 'return a specific document from this level of the path'. Specifically indicating 'index.html' is a tad gauche.


I think some organizations see this kind of implementation leakage as a feature, not a bug.

I remember when http://microsoft.com/ began doing external redirects to "default.asp" circa 1997. If you were a "webmaster" (do these exist anymore?), this was a dog whistle. They were not using static .html (or .htm) but not any of the common dynamic methods like .cgi or .shtml either. And using "default" rather than "index" indicated a break from NCSA/Apache convention. They were using a different web server. Those 11 extra characters said a lot.


Bright people use other people's problems to their advantage: in pg's Viaweb they deliberately put extra `cgi-bin/` segments here and there just to confuse their competitors.

http://news.ycombinator.com/item?id=4589


Problem is, most companies don't have someone who actually owns the URL namespace. Or the engineer who wrote that bug reporting tool was too shy to contact the CIO, etc.

I agree with your point, but it's also valuable to understand the structural reasons that prevent such things.


On a related tangent, can we please have an operating system that eschews extensions and instead stores the MIME type for each file?


Ugh. I have that in the Konqueror file browser. It believes that my .epub files are zip archives, so the default handler is not an epub reader.


What's file(1) say about those files?

If it's bad magic, update your distro. If it's a Konqueror error, file a bug. I'd be surprised if upstream hasn't addressed this (quick DDG/Google doesn't turn up any similar complaints).


But they are zip files. (Though `file somebook.epub` tells me "data". Sigh. Desktop and OS seem to be on different pages.)

Just like .css and .log are text files, but I may not want the same default file handler.

What I like is having *.epub open in, say, calibre, but if I append a .zip extension then it will open in xarchive.

Anyways, I open most things from the command line and wrote my own version of 'open', so I get what I expect 99% of the time. :)

I'm using KDE3 so I'm not expecting any upstream fixes for this in my lifetime. I can live with crafting my own solutions.

What might work best is if mime types were used by default but forcing behavior for specific extensions was much easier. Get the best of both worlds (which I can mostly do in KDE3 Konqueror but it's tedious.)


file(1) on my Debian wheezy says they're epubs. file has dealt with nested formats for decades. It examines up to the first 1024 bytes of the file, IIRC.

Other examples: tar.gz, WAR files, most ODF formats.
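
The kind of magic-byte check file(1) applies can be sketched in a few lines of Python. This assumes the EPUB packaging rule that the first zip entry is an uncompressed file named "mimetype" containing "application/epub+zip", which pins it to a fixed offset:

```python
import io
import zipfile

EPUB_MIME = b"application/epub+zip"

def sniff(data: bytes) -> str:
    """Guess a type from leading bytes, roughly like file(1)'s magic database."""
    if not data.startswith(b"PK\x03\x04"):
        return "not a zip"
    # EPUB requires the first zip entry to be an *uncompressed* file named
    # "mimetype"; after the 30-byte local header, its name sits at offset 30
    # and its content immediately after.
    if data[30:38] == b"mimetype" and EPUB_MIME in data[38:64]:
        return "application/epub+zip"
    return "application/zip"
```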

Edit: smartphone tyops fixed.


... and file(1) under my Ubuntu 11.10 system calls 'em zips.

Debian added upstream epub support (and backed out its own) in September 2011, so this is a pretty recent feature.


Do you mean default file magic: http://en.wikipedia.org/wiki/Magic_number_(programming)

Or do you mean storing a database of the application associated, specifically, with any given file (so that I might open, say, a given myprog.c with vi, emacs, or textmate)?

The Mac OS resource fork model attempts the second, but it's an inherently single-user concept that leaves artifacts around for other users and/or on shared media in an annoying way. A shadow filesystem maintained on a per-user basis under their control would be a preferred solution.



BeOS did that. I think it's a shame that OS X seems to be moving backwards in this respect.


Files used to be tied to applications directly. While I prefer the BeOS approach of MIME-type, it's not nearly as user-friendly as a file extension.

OS X tries to hide extensions by default, which like the behavior in Windows that's similar, seems dangerous. Too many times people have been stung by sexy.jpg.exe.


The crazy mac resource fork thing isn't exactly better.


When working with Apache, one can use the MultiViews option to perform content negotiation without a file extension in the URI.

If I do find myself working with LAMP (which a lot of people still use) then I use this option, and my URIs magically no longer end with ".php"
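
For instance, enabled per directory in an .htaccess file (a sketch; the server's AllowOverride must permit Options):

```apache
# Let a request for /about negotiate to about.php, about.html, about.en.html, …
Options +MultiViews
```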


Not only has the URL to that page not changed in 13 years, its content hasn't either. http://web.archive.org/web/19990508205057/http://www.w3.org/...


I had a number of revelations through rereading this classic article:

First, holy crap, I hadn't been to http://www.w3.org/ in a long time, and it looks like they've actually made it to the 21st century!

Second, perhaps cool URIs don't change, but it seems like http://www.w3.org/Provider/Style/URI.html is kind of an unfortunate URI. What's "Provider"? Why are "Provider" and "Style" uppercase? And what's wrong with 301 redirecting (don't break old URIs, but still restructure them as your website matures and you realize a better organizational hierarchy)?

Third and perhaps most importantly, this all seems like a pretty awesome problem to have! How many websites survive more than a few years? (Geocities doesn't count.)


> Second, perhaps cool URIs don't change, but it seems like http://www.w3.org/Provider/Style/URI.html is kind of an unfortunate URI. What's "Provider"? Why are "Provider" and "Style" uppercase?

They're uppercase because they're uppercase. Asking why is like asking why Rubyists like_using_method_names_like_this and .NET devs LikeUsingMethodNamesLikeThis.


Convention is for URLs to be all lowercase, which IMO aids in readability and certainly makes them easier to type (imagine giving a URL to a friend over the phone -- remember this is 1999).


Well, luckily for you, the URI is case-insensitive and redirects (301) to the canonical version.


The standard is that hostnames are case-insensitive, but non-FQDN paths should allow both uppercase (some systems don't even have lower-case elements; yes, Virginia, they exist) and lowercase elements.


I agree that method_names_like_this is more readable, even (or, especially) after many years of .NET programming. I used the_style_like_this back in C++ days and, honestly, miss it. Perhaps, the renaissance of C++ in the form of C++11 will bring it back :)


like_using_method_names_like_this is more readable. White space has been used to separate words since the time of Alcuin, and underscores are the closest way to get it. Call me conservative, but I prefer to stick to the insights of the last 1400 years rather than mangle them horribly ;-)

Oh, and I am not a Ruby developer ;-)


Whitespace is nice but underscores aren't whitespace. They also leave a zigzag effect that isn't wonderful.


They are the closest thing you can have to whitespace. That's the point. Actually come to think of it, they are underlined whitespace ;-)

As for a zigzag effect I think CamelCasing is at least that bad, maybe moreso.

Wondering if I get points for bringing early Medieval history into a discussion on coding conventions ;-)


What do you mean by zigzag effect?


It isn't actually

word+empty-space+word

but

word+line-down-below+word

so your eyes go zig-zag from the taller glyphs of the word down to the low glyph of the underscore (a vertical zig-zag motion), instead of jumping to the next word (horizontal).


Funny, I was just checking because I have never noticed my eyes doing that, and AFAICS my eyes still scan horizontally. I don't need to look down to see the underscore. I guess YMMV?


Actually I read somewhere that the decision for Ruby using underscored names as a naming convention was because of the very large number of non-native English speakers in the early days of the language. Something along the lines of it was easier to read for non-native speakers because it more closely mimicked spaced English.

Searching for the source now.


> Second, perhaps cool URIs don't change, but it seems like http://www.w3.org/Provider/Style/URI.html is kind of an unfortunate URI. What's "Provider"? Why are "Provider" and "Style" uppercase?

Take a look at w3.org/Provider and w3.org/Provider/Style. It's actually a well-constructed URL. It appears to be a "Style guide" for "web content Providers."


> What's "Provider"?

It's because it is a part of a collection of numerous pages named "Putting Information onto the Web". Its audience is obviously (content) providers, which the URI reflects.


I'm not sure if you did it on purpose, but

http://www.w3.org/Provider/Style/URI

works and is more in the spirit of the article than

http://www.w3.org/Provider/Style/URI.html


http://www.w3.org/Provider/Style/URI.anything gives a menu of available extensions to choose from. For example, html.es for Spanish.


However, http://www.w3.org/Provider/Style/URI/ does not work. I guess this makes some sense, but certainly violates the "typical" functionality on the web where trailing slashes are ignored.


Actually, the behavior you're referring to is a combination of trying to access a directory (to which a missing trailing slash is automatically appended) and the use of default pages (for Apache, that's the DirectoryIndex directive). Trailing slashes on files are not valid, because files are not directories.


Agreed. That’s a rule I mention in my piece, “URL as UI”, which I really want everyone to read. (If it feels like self-promotion, then please kindly ignore my username / blog masthead. I only care about these ideas.) http://alanhogan.com/url-as-ui


> Gracefully handle 404s.

> [...]

> Consider linking to your home page and/or site map [...]

For a second I thought you were advocating redirecting 404s to the homepage, and I was about to flame you. I hate that one, and I think it's harmful enough that you probably ought to have a point about not doing it in that article.


That's not true everywhere. For example, http://en.wikipedia.org/wiki/Wikipedia/.


Obviously it isn't true everywhere, it's a convention not a hard rule. If Wikipedia were friendlier they'd do a redirect there instead of 404-ing.



I agree with this article to the extent that we should choose our URIs carefully and semantically. However, it is very likely that your web-facing application, and the semantics of your resources, will change over time. What if I wanted to deprecate and eventually remove my 'Provider' or 'Style' resource? Maybe I decided to create a 'blogs' and an 'articles' resource. That would change the URI from http://www.w3.org/Provider/Style/URI.html to http://www.w3.org/blogs/articles/URI.html (I know my choice of new resources isn't the most interesting, but you get the point).

Of course, we would still support the old URI via a 301 redirect to the new one, or continue to serve up the page from the old URI and use a rel=canonical meta tag. I'd advocate for 301ing to the new URI and letting the past be the past. We shouldn't be bound to these decisions for the rest of our lives. Unless you're sure that the focus of your web app will never change, or you plan to build your new resource URIs around old resource URIs that were chosen at a time when your new resources weren't being considered (which would lead to far worse URIs), I'd plan for your URIs to change.


It's somewhat ironic that this was submitted as http://www.w3.org/Provider/Style/URI.html rather than http://www.w3.org/Provider/Style/URI.


Just saw another classic in the comments of a different story:

Top 10 mistakes of Web design http://www.useit.com/alertbox/9605.html

More classics anybody?


It's good that they mention "banner blindness" and related phenomena. I sometimes see webpages that make these mistakes, and once or twice I even got burned by it, like when I missed a very important "Point 0" in some IRC channel's FAQ. This Point 0 was so important to the channel mods that they put it directly above the page headline, and I honestly didn't notice it was there until I reread the page for the third time.


HN, what is going on here? Why are the four comments which disagree with the article (including my comment) all being downvoted?

Downvotes are for comments that don't contribute, not comments you disagree with.


The problem is that given enough time, the URL will break -- you'll redo the entire site (a shiny new CMS/cloud provider/hosted solution/server!), or accidentally delete everything, or get hacked, or get acquired, or go out of business. Or even if you keep your important links up forever, various obscure bits and pieces will come and go, and years later, somebody will find an old link to one of them. Bitrot is a fundamental limitation of the web, and it sucks.

The workaround is the Wayback Machine, which is amazing, but could be more comprehensive and frequently updated. I wish someone like Google would throw more servers at it.


The point of the article is that as long as you control the domain, you have no excuse for your links breaking. Going out of business, and therefore being unable to pay for your domain, is specifically called out as a valid reason, but there's no reason you'd lose control of your domain after being acquired, even if you decided to redirect your old links to newer information. Shiny new software is also specifically called out as a bad reason to break links, since backwards-compatible redirects are trivial. And if you're capable of permanently losing all your data through accidental deletion or a server being compromised, you have much bigger problems.

All that said, you're fundamentally right -- sometimes information stops being available because it's out of date, and keeping it available would be confusing (if a product is no longer available, it would be strange to maintain a page describing it for years afterwards). Archiving through the Wayback machine is a very helpful stopgap, but expecting them to continuously archive every version of the entire Internet for all time won't scale.

What's needed is a distributed, decentralized system, ideally at the protocol level. Imagine if a GET request by default gave you the "current" version of a page, but you could send an extra header that said "give me this page, as it appeared at date-time X". This would remove the confusion caused by the existence of a page being conflated with that page being current[1], and allow sites to maintain clean navigational and data structures by flagging outdated pages as "expired" instead of completely deleting them. When a server got a request for a page that used to exist but no longer does, it could respond with a new 4xx-series status code, "No longer current", indicating the document is not available for the given date-time, but is available for an earlier date.

[1] I frequently get people sending me ANGRY emails about flippant, immature blog posts I wrote 10+ years ago[2]. They assume that because it's still on my website, I still stand by those statements, when in fact I'm just reluctant to delete information.

[2] The posts still get traffic, because links to them made 10+ years ago still work, despite rewriting my CMS 3 times.


> What's needed is a distributed, decentralized system, ideally at the protocol level. Imagine if a GET request by default gave you the "current" version of a page, but you could send an extra header that said "give me this page, as it appeared at date-time X".

Sounds like Freenet USK's (see https://freenetproject.org/understand.html, search for USK (and boo on them for not having any anchors on that page)).


> [1] I frequently get people sending me ANGRY emails about flippant, immature blog posts I wrote 10+ years ago[2]. They assume that because it's still on my website, I still stand by those statements, when in fact I'm just reluctant to delete information.

Couldn't you implement a clumsy manual version of the 'expired' header you're proposing, by doing something like having your server precede each page with "The following has not been modified since …, and should not be regarded as current" if it is more than a certain amount of time old?


> an extra header that said "give me this page, as it appeared at date-time X"

It's your lucky day: http://www.mementoweb.org/
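
The Memento protocol linked above defines exactly such a header, Accept-Datetime; a request sketch (host and path illustrative):

```http
GET /page HTTP/1.1
Host: example.com
Accept-Datetime: Thu, 31 May 2007 20:35:00 GMT
```

A Memento-aware server answers with the archived representation closest to that date-time, or points at a TimeGate that can.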


Some of it depends on the nature of the URI though. The article is about linked resources, but a lot of URI usage isn't really about linked resources. Many of these can change reasonably.

For example, is there any harm in changing where that contact us form points to? Do you have to maintain the contact us forms API to be backwards compatible forever? This really depends on what you are doing.

I agree with the article as it relates to documents, for the most part. There are cases where maintaining URL backwards compatibility is a problem due to unforeseen dead ends, but mostly it shouldn't be. Still, persistence needs vary so widely between URIs that I don't know that one can generalize much beyond that.


> For example, is there any harm in changing where that contact us form points to?

The point could more be made: "Why not keep that URL the same?"

It's trivial to do technically, and there's no advantage to be had from moving back and forth between foo.com/contact and foo.com/contact-us. If the URL of your company's contact form gets published in a book, would that change your mind?


Sorry, I was misunderstood. I was talking about the URL the form is submitted to, not the URL of the form. For the URL of the form, it's generally desirable to keep it the same, all things being equal, because this doesn't break other people's links.

But where it is a form submission target, the only reason to care is if you are accepting third party form submission. But we are no longer talking about hypertext resources at that point which is my main point.


Organizations with large websites use content management systems. Vendors of content management systems come and go as technology changes. Usually, changing the CMS a website uses necessitates URI structure changes depending on the conventions of the system.

Maintaining in perpetuity a complete set of redirects (or stack of rewrite rules) for every page for a website consisting of millions of pages for decades is not feasible in most cases. Things change, departments are created or disbanded or renamed. Management decides certain things should not be accessible to the public or stored in a particular location.

It's incredibly unrealistic to expect every URI to be permanent.


You are describing the status quo as you see it, as if it were a rebuttal to the ideal towards which we should be striving.


No, I'm saying that technology changes, the web changes, your organization changes and the content on the website changes. To act like it's even desirable to have all of the same documents at the same locations on your website in 20 years is misguided.


Jeremy Keith's talk and resulting long bet on the topic is interesting. He contends that it's not proven that data put on the internet is in fact immortalized.

- http://vimeo.com/34269615

- http://longbets.org/601/


> Pretty much the only good reason for a document to disappear from the Web is that the company which owned the domain name went out of business or can no longer afford to keep the server running. Then why are there so many dangling links in the world? Part of it is just lack of forethought.

Nope, it's the lack of a crystal ball. The technology world moves fast. And breaks things, as Zuck says. I have no idea how a site of mine might be structured even a year from now, whether the current URL scheme will conflict directly with its needs, or whether an old-to-new translation will be trivial or overly resource-intensive. By now, the world has mostly realized that over-planning is bad, and agile is important.

So who cares if a 5-year-old URL to a page virtually nobody ever visits anymore doesn't work. It's easy to Google the keywords in the link, and you'll probably be able to find the content if it's still around.

(Obviously big sites have an incentive to keep their links working, but they don't need an article from the W3C to tell them that.)


Oh, so you are the guy who has been causing all those 404s that I bump into.

And nice strawman, by the way. I have had some EXTREMELY popular blog posts that I bookmarked (remember NVIE’s post on git branching?) 404 just because the site changed architecture. (Now hosted on GitHub.) It’s all laziness and lack of forethought and planning. What is agile about one of the most popular blog posts in the development world 404ing?

That said, yes, it can be a bit time-consuming to map old URLs to new. I once started a project that would help considerably with this, but my employer at the time decided it wasn’t worth it (hey, would they see any more gold coins if they helped clients’ visitors’ bookmarks and search results work?). But a manual mapping isn’t impossible. In some cases, it’s easy. When I moved off Drupal (damn it to hell), I redirected all those old /node/<id> URLs to their new ones. Because, damn it, I wasn’t going to add any 404s to the Internet if I could help it. I didn’t blog to make money. I blogged to help people. And maintaining URLs helps people.
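
A hypothetical sketch of that kind of mapping with Apache mod_rewrite, assuming a hand-maintained map file of old node IDs to new paths (the file name and paths are made up):

```apache
# node-map.txt holds lines like:  123 /posts/cool-uris
RewriteEngine On
RewriteMap nodemap txt:/etc/apache2/node-map.txt
# 301 each old /node/<id> to its mapped path; fall back to the front page
RewriteRule ^/node/(\d+)$ ${nodemap:$1|/} [R=301,L]
```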


An interface where an admin can incrementally add redirect URLs directly in the 404 logs, to the top most frequent misses, would go a long way.

It's the UX inertia of having to write mod_rewrite rules that ensures it's never a priority and thus never happens.
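
That tally is a few lines of Python, assuming Common Log Format access logs (the log format and function name are illustrative; the admin UI itself is out of scope):

```python
from collections import Counter

def top_404s(log_lines, n=10):
    """Tally the most-requested missing paths from Common Log Format lines."""
    misses = Counter()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 3:
            continue
        request = parts[1].split()   # e.g. ['GET', '/old/path', 'HTTP/1.1']
        status = parts[2].split()    # e.g. ['404', '1234']
        if len(request) >= 2 and status and status[0] == "404":
            misses[request[1]] += 1
    return misses.most_common(n)
```

Feed it the access log, show the top misses, and let the admin attach a redirect target to each one.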


"technology world moves fast [...] currently URL scheme will conflict directly with its needs or not"

That is your mistake. The URI should be decoupled from the technology you are using. The URI is part of the content and not of the software behind the site.


Amen. A thousand times, amen.

If your stack requires tight coupling with URL structure, you have doomed yourself already.


And if you do not decouple them, users will be more likely to use 3rd party services to effect the decoupling.


> If so, you chose them very badly.

I love how this starts out by (wrongly) assuming that "you" is a single person or even stable group of people, and that the answer to a changing URI is to say "You did it wrong."


It is like asking a developer not to change an API, or to keep the old version around.


And companies that have millions of developers making money off their APIs (and, in one way or another, paying some dividend back to the API creators) indeed do tend to keep their APIs fairly backwards-compatible, with some notable exceptions.


And the coolest URIs can't change.



