Safe Haskell | Safe-Inferred |
---|
Get the arguments from the command line, ensuring they are properly encoded into Unicode.
base 4.3.1.0 has a System.Environment.getArgs that does not return a Unicode string. Instead, it simply puts each octet into a separate Char. Thus its getArgs is broken on UTF-8 and on nearly any other non-ASCII encoding. As a workaround I use System.Environment.UTF8 (from the utf8-string package). The downside of this is that it requires the command line to be encoded in UTF-8, regardless of what the default system encoding is.
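To make the "one octet per Char" problem concrete, here is a minimal sketch in pure Haskell. The names brokenArg and decodeOctets are hypothetical and exist only for this illustration; the decoder is a simplified stand-in for what a UTF-8 workaround has to do, not the actual code of System.Environment.UTF8, and it does not reject overlong encodings.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)

-- Hypothetical example: roughly what an octet-per-Char getArgs would
-- return for the single argument "é" when the terminal uses UTF-8.
-- The two octets 0xC3 0xA9 arrive as two separate Chars.
brokenArg :: String
brokenArg = map chr [0xC3, 0xA9]

-- Reassemble a String whose Chars each hold one UTF-8 octet.
-- Malformed sequences are replaced with U+FFFD.
decodeOctets :: String -> String
decodeOctets [] = []
decodeOctets (c:cs)
  | b < 0x80           = c : decodeOctets cs        -- plain ASCII
  | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F)       -- 2-byte sequence
  | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F)       -- 3-byte sequence
  | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07)       -- 4-byte sequence
  | otherwise          = '\xFFFD' : decodeOctets cs -- stray byte
  where
    b = ord c
    -- Consume n continuation octets and fold their low six bits in.
    multi n start =
      let (conts, rest) = splitAt n cs
          v = foldl step start conts
      in if length conts == n && all isCont conts && v <= 0x10FFFF
           then chr v : decodeOctets rest
           else '\xFFFD' : decodeOctets cs
    isCont x = ord x .&. 0xC0 == 0x80
    step acc x = (acc `shiftL` 6) .|. (ord x .&. 0x3F)
```

In GHCi, length brokenArg is 2 even though the user typed one character, while decodeOctets brokenArg yields the one-character string "\233" (é). Note that this kind of decoding is only correct when the command line really is UTF-8, which is the downside described above.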
Unlike base 4.3.1.0, base 4.4.0.0 returns a proper Unicode string when you call System.Environment.getArgs. (base 4.3.1.0 comes with ghc 7.0.4; base 4.4.0.0 comes with ghc 7.2.) The arguments are decoded according to the default system locale. The only problem is that System.Environment.UTF8 apparently just calls System.Environment.getArgs and assumes that the string it returns has not been decoded. In other words, System.Environment.UTF8 assumes that System.Environment.getArgs is broken, so now that System.Environment.getArgs has been fixed in base 4.4.0.0, System.Environment.UTF8 will likely break.
One obvious solution to this problem is to find some other way to get the command line that will not break when base is updated. But it was not easy to find such a thing. The other libraries I saw on Hackage (as of January 6, 2012) had problems, such as breakage on ghc 7.2. There is a package that provides a simple interface to the UNIX setlocale(3) function, but I'm not sure that what it returns maps easily and reliably to character encodings that you can use with, say, iconv.
So, by use of Cabal and preprocessor macros, the code uses utf8-string if base is older than 4.4, and uses System.Environment.getArgs if base is at least 4.4.
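The version switch described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's actual source: it relies on the MIN_VERSION_base macro that Cabal defines for CPP (recent GHC versions also define it on their own), and printArgs is a hypothetical helper added only to show the imported functions in use.

```haskell
{-# LANGUAGE CPP #-}

#if MIN_VERSION_base(4,4,0)
-- base >= 4.4: getArgs already decodes with the locale's encoding.
import System.Environment (getArgs, getProgName)
#else
-- Older base: fall back to utf8-string, which assumes UTF-8 input.
import System.Environment.UTF8 (getArgs, getProgName)
#endif

-- Hypothetical helper: print the program name and each argument,
-- whichever implementation was selected at compile time.
printArgs :: IO ()
printArgs = do
  name <- getProgName
  args <- getArgs
  putStrLn (name ++ ": " ++ show (length args) ++ " argument(s)")
  mapM_ putStrLn args
```

Calling printArgs from main exercises whichever branch was compiled in. For the older branch to build, the package description would also need to list utf8-string as a dependency when base is older than 4.4, presumably through a conditional on the base version in the .cabal file.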
The GHC bug is here:
Documentation
Gets the command-line arguments supplied by the program's user. If the base package is older than version 4.4, then this function assumes the command line is encoded in UTF-8, which is true for many newer Unix systems; however, many older systems may use single-byte encodings like ISO-8859. In such cases, this function will give erroneous results.
If the base package is version 4.4.0 or newer, this function simply uses the getArgs that comes with base. That getArgs detects the system's default encoding and uses that, so it should give accurate results on most systems.
getProgName :: IO String
Gets the name of the program that the user invoked. See documentation for getArgs for important caveats that also apply to this function.