This has been bugging me for a couple of weeks now, so I want to put it out where it might be noticed by a few more people.
Increasingly there is an assumption that one can simply use UTF8 encoding on Unix-like filesystems without any problem. I'm pretty sure this is false; moreover, there is already evidence of the problems I see with it:
- The Gtk+ file selector can be made to dump core on entering a directory with files whose names are not valid UTF8;
- Both Gtk+ and Cocoa (Mac OS X) file selectors can be confused/made to open the wrong files (or later save to the wrong file) if the filename is not normalized;
- And, as the subject suggests, all the issues we've already encountered with Punycode for DNS apply.
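The normalization confusion in the second bullet is easy to reproduce. A short Python sketch (the filename "café" is just an illustrative example) showing that the NFC and NFD spellings of the same visible name are different code point sequences, and therefore different pathnames to a byte-oriented filesystem:

```python
import unicodedata

# The "same" filename in two Unicode normalization forms:
nfc = unicodedata.normalize("NFC", "cafe\u0301")  # U+00E9, precomposed e-acute
nfd = unicodedata.normalize("NFD", "caf\u00e9")   # 'e' + U+0301 combining acute

print(nfc == nfd)           # False: distinct code point sequences
print(nfc.encode("utf-8"))  # b'caf\xc3\xa9'
print(nfd.encode("utf-8"))  # b'cafe\xcc\x81'
# To POSIX these are two unrelated filenames, yet they render identically,
# which is exactly how a file selector can be steered to the wrong file.
```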
I suspect the latter two have security implications.
The root problems are that (a) POSIX specifies that pathnames are opaque byte strings with no defined encoding (any bytes except NUL and '/'), and (b) UTF8 does not (and, because Unicode allows the same text to be written in multiple normalization forms, cannot) provide a unique and reversible mapping between displayed names and the byte sequences that represent them. This is going to blow up in someone's face sooner or later, and I hope it doesn't do so because the abovementioned confusion leads to someone being tricked into opening the wrong file. (Think symlink race conditions enabled by non-normalized UTF8 filenames, for one possibility. I'm sure the real security experts can think of more; I know just enough to be dangerous.)
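Point (a) also means a perfectly legal POSIX filename need not decode as UTF8 at all, which is what trips up naive UTF8 consumers like the core-dumping file selector above. A quick demonstration (the filename is an arbitrary example), including Python's own surrogateescape workaround for round-tripping such bytes:

```python
raw = b"report\xff.txt"  # legal POSIX filename; 0xFF can never appear in UTF8

try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("not valid UTF8:", e)

# Python's escape hatch: smuggle undecodable bytes through as lone
# surrogates so the original byte string can be recovered losslessly.
name = raw.decode("utf-8", "surrogateescape")
assert name.encode("utf-8", "surrogateescape") == raw
```

This round-trips, but the resulting string still isn't valid Unicode, so any layer that insists on real UTF8 (a GUI toolkit, a network protocol) has to fail or guess.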
I suspect the only fix for this is a new filesystem API which requires some standard encoding (UTF8, or UTF16 as on Windows, or even UTF32) and which enforces normalization of pathnames. That violates POSIX, hence a new API rather than a change to the existing one; it will also have to deal with invalid names created via the POSIX API, which in the case of non-normalized Unicode will probably still leave Punycode-like issues. And it will probably annoy CJK users, who have good reason to resent the wastefulness of UTF8 and even UTF16 (given the "astral plane" CJK characters).
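To make the shape of such an API concrete, here is a minimal Python sketch of the enforcement it would do at the boundary; safe_open is a hypothetical wrapper of my own, not any real interface, and NFC is just one plausible choice of canonical form:

```python
import unicodedata

def safe_open(path: str, mode: str = "r"):
    """Hypothetical strict-API sketch: accept only pathnames that are
    valid, NFC-normalized Unicode; reject everything else up front."""
    if unicodedata.normalize("NFC", path) != path:
        raise ValueError("pathname is not NFC-normalized: %r" % path)
    # Reject names that only exist via surrogateescape smuggling, i.e.
    # bytes that were never valid UTF8 (encode() raises on lone surrogates).
    path.encode("utf-8")
    return open(path, mode)
```

A real version would also have to decide what to report for pre-existing on-disk names created through the raw POSIX API that fail these checks, which is where the Punycode-like issues come back in.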