
But Unicode isn't in the business of cataloguing all the possible meanings in the world. It's a catalog of glyphs.

Looking into it further, I see that indeed "Cyrillic Capital Letter A", "Greek Capital Letter Alpha", and "Latin Capital Letter A" are different, despite not actually being different in any way. That supports the idea that "Latin small letter i" and "Turkish small letter i with dot" should be different points. But it also supports the idea that "Latin Capital Letter A", "English Capital Letter A", and "French Capital Letter A" should all be different points, which they're not. This approach is incoherent.
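
For reference, a minimal Python sketch of the three code points being compared, using only the standard `unicodedata` module. The variable names are illustrative; the check simply confirms that the three letters are separate code points and that compatibility normalization does not merge them.

```python
import unicodedata

# Three visually similar capitals, written as escapes so the code points are explicit:
# U+0041 (Latin), U+0391 (Greek), U+0410 (Cyrillic).
lookalikes = ["\u0041", "\u0391", "\u0410"]

# Official character names from the Unicode Character Database.
names = [unicodedata.name(ch) for ch in lookalikes]
for ch, name in zip(lookalikes, names):
    print(f"U+{ord(ch):04X}  {name}")

# Even the most aggressive standard normalization form (NFKC) keeps them apart.
folded = {unicodedata.normalize("NFKC", ch) for ch in lookalikes}
print(len(folded), "distinct characters after NFKC")
```

So the separation is real at the encoding level, whatever one thinks of the rationale for it.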

In Chinese writing, the same syllable shì is encoded differently depending on whether it means "be", "circumstance", "the world", "city", "style", "respectable person", "Shì, the surname", "room", "test", "look at", "indicate", "suitable", "knowledge", etc. And those examples I've listed are just the shì's that I personally know -- there are plenty more. This was, by any standard at all, a terrible, terrible decision (to be fair, Japanese writing is even worse). Why would you want Unicode to be more like that?



The reason Latin Capital Letter A is distinct from Cyrillic Capital Letter A and Greek Capital Letter Alpha but not English Capital Letter A or French Capital Letter A is that the latter two are actually the same thing whereas the first two are not.

For example, it might not be acceptable for a Cyrillic reader to find the Cyrillic A to be stylized the same way as a Latin A or Greek Alpha in multi-lingual text. Imagine a cursive handwriting font for example: the Latin letter "A" should likely appear in the same style as the other Latin letters whereas the Cyrillic letter "A" should likely appear in the same style as the other Cyrillic letters (but French and English -- e.g. in citations or quotes -- shouldn't be distinguished that way, except for punctuation).

The reason the s symbol for seconds isn't distinguished is that the SI explicitly defined the symbols as regular letters. Even the widely used cursive "l" (for litre/liter) is merely a legacy symbol (as is the capital "L", which is mostly used to avoid misreading it as the digit "1"). So while, for example, the sequence "km²" has a specific meaning of "square kilometers", it's intended to be represented as <k><m><superscript 2> not <SI prefix kilo><SI unit meter><superscript 2> (just like the English word "I" is represented as <Capital Latin Letter I>, not <English First Person Pronoun>).
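
As a quick illustration of that last point, here is a small Python sketch (standard library only, variable names are illustrative) that lists the characters making up "km²" and the English word "I". It shows ordinary letters plus a superscript digit, with no unit-specific or word-specific characters involved.

```python
import unicodedata

# "km²" typed the usual way: k, m, and U+00B2.
area_unit = "km\u00b2"
unit_names = [unicodedata.name(ch) for ch in area_unit]
print(unit_names)

# The English pronoun "I" is just the ordinary Latin capital letter.
pronoun_name = unicodedata.name("I")
print(pronoun_name)
```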

The problem with Chinese writing is that each code point corresponds to a different glyph and there are many homophonic words (with different meanings and corresponding glyphs). Han Unification and CJK in general are complex subjects of their own, however -- even the experts seem to disagree on how CJK glyphs should be treated in detail (and it's not as easy as just saying Unicode is broken because it's written by white American men, because that's not at all true).


> The reason Latin Capital Letter A is distinct from Cyrillic Capital Letter A and Greek Capital Letter Alpha but not English Capital Letter A or French Capital Letter A is that the latter two are actually the same thing whereas the first two are not.

You say this, but you give no argument at all. What makes English A and French A the same thing in a way that English A and Cyrillic A aren't? From a Unicode perspective (at least, from the perspective that Unicode takes with respect to Chinese characters -- 青 and 靑 are different code points despite being the same character), "Cyrillic Capital Letter A" and "Cyrillic Handwritten Capital Letter A" would be different code points, since they look nothing alike. (Disclaimer: I don't know whether the printed and handwritten forms of A in particular look nothing alike, but I do know that in general the handwritten form of a Cyrillic character may not resemble the printed form. Feel free to substitute some other letter where I say A.)
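
The 青 / 靑 claim is easy to check. A minimal Python sketch, assuming the two characters are exactly the ones typed in the comment above; it prints their code points and tests whether canonical normalization (NFC) treats them as the same character.

```python
import unicodedata

# The two variant forms quoted in the comment, copied as-is.
pair = ("青", "靑")

# Show each character's code point in the usual U+XXXX notation.
code_points = [f"U+{ord(ch):04X}" for ch in pair]
print(code_points)

# Canonical normalization would unify them if Unicode considered them
# canonically equivalent; here we just observe the result.
same_after_nfc = (
    unicodedata.normalize("NFC", pair[0]) == unicodedata.normalize("NFC", pair[1])
)
print("equal after NFC:", same_after_nfc)
```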

> but French and English -- e.g. in citations or quotes -- shouldn't be distinguished that way, except for punctuation

Why not?

I have seen advocacy for using the distinction that Unicode already makes (!) between a fancily curved apostrophe that is part of the spelling of a word, and the same glyph, but used to end a quotation. This is stupid. You'd need to train everyone in the world (well, the relevant part of the world) to input the same thing differently depending on how they're using it. Even if you made the effort, it will never happen reliably -- people have enough trouble spelling words differently depending on how those words are used.
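
The comment does not name the code points involved. One plausible reading, and it is only an assumption here, is the split between U+2019 (the curly mark used for closing quotes and, commonly, apostrophes) and U+02BC (an apostrophe classed as a letter). A short Python sketch of how the standard library describes them, with the typewriter apostrophe included for comparison:

```python
import unicodedata

# Candidate characters; which pair the comment means is an assumption.
candidates = {
    "typewriter": "\u0027",
    "curly": "\u2019",
    "modifier_letter": "\u02bc",
}

# Name and general category for each (P* = punctuation, L* = letter).
info = {
    label: (unicodedata.name(ch), unicodedata.category(ch))
    for label, ch in candidates.items()
}
for label, (name, category) in info.items():
    print(f"{label:16} {name:30} {category}")
```

Only the last one is categorized as a letter, which is the property the advocates of the distinction care about; the input problem described above is untouched by it.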

Finally, my original point about Chinese writing didn't refer to Unicode at all. Spoken Chinese is encoded into characters using a system that makes heavy reference to semantics. This is a terrible idea, and as a result Chinese speakers are severely delayed in learning to read. The claim I'm making is that mission-creeping Unicode to encode semantics rather than glyphs (much as written Chinese encodes semantics more than it encodes sound) is similarly a bad idea. The goal of Unicode is to represent writing, not meaning.


> You say this, but you give no argument at all. What makes English A and French A the same thing in a way that English A and Cyrillic A aren't?

"Latin" is used as the name of a writing system, which is used by both English and French. "Cyrillic" is also the name of a writing system, used by several languages. In some cases, there is a one-to-one relationship between named writing systems and languages, but not in these cases.

The distinction made in Unicode is between writing systems, not languages.

(Note that some valid renderings of a letter in one writing system may be similar to a letter in a different writing system, but the range of acceptable renderings is different between writing systems. This is generally not the case between languages using the same writing system.)


English and French do not use the same writing system. Here are some letters in the French alphabet that don't occur in the English alphabet (or the Latin alphabet!): ô é ç. (It is true that these are described in French as o, e, and c with diacritics. You can make this more rigorous by observing a language like German, or until recently Spanish, which promotes such marks as ö or "ch" to full letter status. Unless you're arguing that Spanish used to use a unique Spanish writing system, but recently switched to the more common Latin one?) If French and English use the same system, there is no argument that Turkish uses a different one.
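
For what it's worth, here is how Unicode itself files these letters, as a small Python sketch using only the standard library (variable names are illustrative). It decomposes the three French letters with NFD, looks up the names of the two Turkish i variants, and shows Python's default, locale-independent case mapping for them.

```python
import unicodedata

# The French letters from the comment: ô, é, ç.
french = "\u00f4\u00e9\u00e7"

# NFD splits each into a base letter plus a combining mark.
decomposed = {
    ch: [unicodedata.name(part) for part in unicodedata.normalize("NFD", ch)]
    for ch in french
}
for ch, parts in decomposed.items():
    print(ch, "->", parts)

# The Turkish-specific letters are named as Latin letters too.
dotless_name = unicodedata.name("\u0131")
dotted_capital_name = unicodedata.name("\u0130")
print(dotless_name)
print(dotted_capital_name)

# Default case mapping: dotless ı uppercases to plain I, and
# İ lowercases to i followed by a combining dot above.
upper_of_dotless = "\u0131".upper()
lower_of_dotted = "\u0130".lower()
print(repr(upper_of_dotless), [f"U+{ord(c):04X}" for c in lower_of_dotted])
```

In other words, Unicode treats French accented letters and the Turkish i variants alike as members of the Latin script, which is consistent with the closing remark about Turkish, though it does not by itself settle how "writing system" should be defined.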

You could argue that French A and English A are the same because they are identical by descent, but then you'd have to admit that Greek A and Cyrillic A are more of the same.

So I guess the question is: how are you defining "writing system"?



