Safe Haskell | Safe |
---|---|
Language | Haskell2010 |
The conversion combinator module structure is similar to the one found in the text and bytestring packages. The modules are nested under this module:
- Data.TypedEncoding.Conv.Text
- Data.TypedEncoding.Conv.Text.Encoding
- Data.TypedEncoding.Conv.Text.Lazy
- Data.TypedEncoding.Conv.Text.Lazy.Encoding
- Data.TypedEncoding.Conv.ByteString.Char8
- Data.TypedEncoding.Conv.ByteString.Lazy.Char8
Two goals of these conversions are:
- provide added type safety for string conversions
- provide a way to easily convert encoded data directly between text and bytestring types.
Type Safety
Haskell has 3 (5 counting lazy versions) popular string types. Unfortunately, they are all quite different.
Consider these popular conversion functions:

    import qualified Data.Text as T
    import qualified Data.ByteString.Char8 as B8
    import qualified Data.Text.Encoding as TE

    T.pack        :: String -> T.Text
    T.unpack      :: T.Text -> String
    B8.pack       :: String -> B8.ByteString
    B8.unpack     :: B8.ByteString -> String
    TE.decodeUtf8 :: B8.ByteString -> T.Text
    TE.encodeUtf8 :: T.Text -> B8.ByteString

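They come in pairs but are not reversible. For example, B8.pack truncates each Char to 8 bits, and TE.decodeUtf8 throws on bytes that are not valid UTF-8.
Going from A to C depends on B; none of the following 4 diagrams commutes: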

    String -> B8.pack -> ByteString
      ^                      ^
      |                      |
      |                TE.decodeUtf8
      id                     |
      |                TE.encodeUtf8
      |                      |
      v                      v
    String -> T.pack  ->   Text

    String <- B8.unpack <- ByteString
      ^                        ^
      |                        |
      |                  TE.decodeUtf8
      id                       |
      |                  TE.encodeUtf8
      |                        |
      v                        v
    String <- T.unpack  <-   Text

All of this can lead to bugs that are hard to find and hard to troubleshoot.
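The failure modes described above can be reproduced with the plain text and bytestring functions alone. The following is a minimal sketch; the helper names (`roundTripB8`, `viaChar8`, `viaUtf8`, `lossyExample`, `nonCommutingExample`) are hypothetical and exist only for this illustration.

```haskell
import qualified Data.ByteString.Char8 as B8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- B8.pack keeps only the low 8 bits of each Char, so unpack . pack is lossy.
roundTripB8 :: String -> String
roundTripB8 = B8.unpack . B8.pack

-- Two routes from String to ByteString (the first diagram).
viaChar8 :: String -> B8.ByteString
viaChar8 = B8.pack

viaUtf8 :: String -> B8.ByteString
viaUtf8 = TE.encodeUtf8 . T.pack

-- Lossy round trip: code point 955 (Greek lambda) does not survive.
lossyExample :: Bool
lossyExample = roundTripB8 "\955" /= "\955"

-- The diagram does not commute for code point 233 (e with acute accent):
-- the Char8 route yields one byte, the UTF-8 route yields two.
nonCommutingExample :: Bool
nonCommutingExample = viaChar8 "\233" /= viaUtf8 "\233"
```

Loading this into GHCi and evaluating `lossyExample` and `nonCommutingExample` shows both mismatches without involving typed-encoding at all.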
typed-encoding provides more precise types so that all of this goes away.
Here are the type signatures simplified to one single encoding annotation:

    import qualified Data.TypedEncoding.Conv.Text as ET
    import qualified Data.TypedEncoding.Conv.ByteString.Char8 as EB8
    import qualified Data.TypedEncoding.Conv.Text.Encoding as ETE

    ET.pack        :: (Superset "r-UNICODE.D76" r) => Enc '[r] c String -> Enc '[r] c T.Text
    ET.unpack      :: (Superset "r-UNICODE.D76" r) => Enc '[r] c T.Text -> Enc '[r] c String
    EB8.pack       :: (Superset "r-CHAR8" r) => Enc '[r] c String -> Enc '[r] c B8.ByteString
    EB8.unpack     :: (Superset "r-CHAR8" r) => Enc '[r] c B8.ByteString -> Enc '[r] c String
    ETE.decodeUtf8 :: (Superset "r-UTF8" r) => Enc '[r] c B8.ByteString -> Enc '[r] c T.Text
    ETE.encodeUtf8 :: (Superset "r-UTF8" r) => Enc '[r] c T.Text -> Enc '[r] c B8.ByteString

"r-UNICODE.D76" and "r-UTF8" are considered redundant for T.Text and can be added or dropped as needed. (This library currently assumes that "r-UTF8" includes the UNICODE.D76 restriction. This works well with the assumptions made by Text.)
Corresponding pairs reverse. This should be clear, since the types are restricted to what T.Text can store or to how B8.Char works.
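The idea behind these constrained signatures can be sketched with a toy model. This is not the library's implementation: the `Enc` below carries a single annotation and no configuration parameter, the `Superset` class is an empty marker class, and `mkAscii`, `packText`, `unpackText`, `packB8`, `encodeUtf8`, and `routesAgree` are hypothetical names chosen for this illustration.

```haskell
{-# LANGUAGE DataKinds, KindSignatures, MultiParamTypeClasses #-}
{-# LANGUAGE FlexibleInstances, FlexibleContexts #-}

import Data.Char (isAscii)
import GHC.TypeLits (Symbol)
import qualified Data.ByteString.Char8 as B8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Toy stand-in for Enc: a payload tagged with ONE restriction name.
newtype Enc (r :: Symbol) a = UnsafeEnc a deriving (Eq, Show)

-- Toy stand-in for Superset: everything valid under `small` is valid under `big`.
class Superset (big :: Symbol) (small :: Symbol)
instance Superset "r-UNICODE.D76" "r-ASCII"
instance Superset "r-UTF8" "r-ASCII"
instance Superset "r-CHAR8" "r-ASCII"

-- Hypothetical smart constructor: the only safe way to get an "r-ASCII" value here.
mkAscii :: String -> Maybe (Enc "r-ASCII" String)
mkAscii s
  | all isAscii s = Just (UnsafeEnc s)
  | otherwise     = Nothing

packText :: Superset "r-UNICODE.D76" r => Enc r String -> Enc r T.Text
packText (UnsafeEnc s) = UnsafeEnc (T.pack s)

unpackText :: Superset "r-UNICODE.D76" r => Enc r T.Text -> Enc r String
unpackText (UnsafeEnc t) = UnsafeEnc (T.unpack t)

packB8 :: Superset "r-CHAR8" r => Enc r String -> Enc r B8.ByteString
packB8 (UnsafeEnc s) = UnsafeEnc (B8.pack s)

encodeUtf8 :: Superset "r-UTF8" r => Enc r T.Text -> Enc r B8.ByteString
encodeUtf8 (UnsafeEnc t) = UnsafeEnc (TE.encodeUtf8 t)

-- Both routes accept "r-ASCII" and produce the same bytes for it.
routesAgree :: Enc "r-ASCII" String -> Bool
routesAgree e = encodeUtf8 (packText e) == packB8 e
```

In this model a conversion is only callable when the annotation is known to fit the target type, which is why `unpackText . packText` behaves as the identity on every value that can be constructed safely.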
Now consider any of the above diagrams. For instance, compare:

    ETE.encodeUtf8 . ET.pack :: (Superset "r-UNICODE.D76" r, Superset "r-UTF8" r) => Enc '[r] c String -> Enc '[r] c B8.ByteString
    -- and
    EB8.pack :: (Superset "r-CHAR8" r) => Enc '[r] c String -> Enc '[r] c B8.ByteString

What is the set of common values that allows us to use either of these 2 options? "r-UNICODE.D76" is not important here (it removes a range of Unicode values way above '255'), so the question is: what is the intersection of the UTF8 and CHAR8 code point spaces? There are many character set encodings that use one byte (CHAR8). UTF8 is different from all of them, but it is backward compatible within the ASCII range of chars < 128. So the intersection should be ASCII. Let us check that:

    ghci> :t ETE.encodeUtf8 . ET.pack @'["r-ASCII"]
    ETE.encodeUtf8 . ET.pack @'["r-ASCII"]
      :: Enc '["r-ASCII"] c String -> Enc '["r-ASCII"] c B8.ByteString
    ghci> :t EB8.pack @'["r-ASCII"]
    EB8.pack @'["r-ASCII"]
      :: Enc '["r-ASCII"] c String -> Enc '["r-ASCII"] c B8.ByteString

They both accept that common denominator. We could now run a property test, but it is clear that, by design, these will match!
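The claim that the two routes coincide exactly on ASCII can be checked exhaustively against the underlying text and bytestring functions, since only 256 characters are relevant. A minimal sketch follows; `pathsAgree`, `agreeOnAscii`, and `disagreeAboveAscii` are hypothetical names used only here.

```haskell
import qualified Data.ByteString.Char8 as B8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- True when both String -> ByteString routes give the same bytes for one Char.
pathsAgree :: Char -> Bool
pathsAgree c = B8.pack [c] == TE.encodeUtf8 (T.pack [c])

-- Every ASCII character (code points 0..127) is encoded identically.
agreeOnAscii :: Bool
agreeOnAscii = all pathsAgree ['\0' .. '\127']

-- Every other 8-bit character (code points 128..255) is encoded differently,
-- because UTF-8 needs two bytes for it.
disagreeAboveAscii :: Bool
disagreeAboveAscii = not (any pathsAgree ['\128' .. '\255'])
```

Both values evaluate to `True`, which matches the intersection argument made above.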
Note that there is no Superset "r-UNICODE.D76" "r-CHAR8" mapping. "r-CHAR8" supersets any 8-bit encoding, such as the ISO/IEC 8859 family of encodings. The omission is by design, even if structurally such a definition would make sense. This choice effectively prevents anything classified under "r-CHAR8" from ending up as a visible encoding annotation in Text (since that would make little sense, as Text is UTF encoded).
This is just one example of the added type-level safety that typed-encoding provides.
Currently, "r-CHAR8" is intended only as an upper bound on "r-" encodings. There is no way to encode to it using the provided encoding mechanisms (except for unsafe options). Effectively, the type Enc '["r-CHAR8"] c str can be viewed as uninhabited.
However, Char is often used instead of Word8 for low-level ByteString programming. This is supported with the "r-ByteRep" annotation. The type Enc '["r-ByteRep"] c str can be used, because Superset "r-CHAR8" "r-ByteRep" holds! That allows the EB8 conversions to work on such data. However, there is no Superset "r-UNICODE.D76" "r-ByteRep", so these values cannot be converted to Text, which is exactly what is intended.
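How a missing Superset instance blocks a conversion at compile time can be sketched with a toy model. This is an illustration under stated assumptions, not the library's code: `Enc` carries a single annotation, `Superset` is an empty marker class, and `mkByteRep` (which here accepts any String whose characters fit in one byte), `packB8`, `unpackB8`, `packText`, and `toBytes` are hypothetical names.

```haskell
{-# LANGUAGE DataKinds, KindSignatures, MultiParamTypeClasses #-}
{-# LANGUAGE FlexibleInstances, FlexibleContexts #-}

import GHC.TypeLits (Symbol)
import qualified Data.ByteString.Char8 as B8
import qualified Data.Text as T

-- Toy stand-in for Enc: a payload tagged with ONE restriction name.
newtype Enc (r :: Symbol) a = UnsafeEnc a deriving (Eq, Show)

-- Toy stand-in for Superset.
class Superset (big :: Symbol) (small :: Symbol)
instance Superset "r-CHAR8" "r-ByteRep"
-- Deliberately absent: instance Superset "r-UNICODE.D76" "r-ByteRep"

-- Assumption of this sketch: "r-ByteRep" means every Char fits in one byte.
mkByteRep :: String -> Maybe (Enc "r-ByteRep" String)
mkByteRep s
  | all (<= '\255') s = Just (UnsafeEnc s)
  | otherwise         = Nothing

packB8 :: Superset "r-CHAR8" r => Enc r String -> Enc r B8.ByteString
packB8 (UnsafeEnc s) = UnsafeEnc (B8.pack s)

unpackB8 :: Superset "r-CHAR8" r => Enc r B8.ByteString -> Enc r String
unpackB8 (UnsafeEnc b) = UnsafeEnc (B8.unpack b)

packText :: Superset "r-UNICODE.D76" r => Enc r String -> Enc r T.Text
packText (UnsafeEnc s) = UnsafeEnc (T.pack s)

-- Type checks: the Char8 conversions are available for "r-ByteRep".
toBytes :: Enc "r-ByteRep" String -> Enc "r-ByteRep" B8.ByteString
toBytes = packB8

-- Rejected by the type checker if uncommented (no matching Superset instance):
-- toText :: Enc "r-ByteRep" String -> Enc "r-ByteRep" T.Text
-- toText = packText
```

Because every character admitted by `mkByteRep` fits in one byte, `unpackB8 . toBytes` returns the original value, while the route to Text is simply not expressible.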
Enc conversions
Consider defining a conversion function of type Enc xs c str1 -> f (Enc xs c str2).
One challenge is: how do we know that xs is also a valid encoding stack for str2?
Should we constrain that?
This is made even more difficult because this library uses (and has to use) orphan instances.
The other challenge is how to ensure that, if the destination is partially or fully decoded, then it will decode without errors and the decoding will be meaningful.
The current version does not impose any instance constraint about the existence of the stack for str2.
It is possible not to have one; in that case the decodeAll combinators will not be available.
This is still useful as the payload could be safely extracted, to save to the database or do other things with it.
Future versions of typed-encoding may provide ways to ensure validity of the encoding stack for str2.
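The shape of such a conversion, and the fact that the payload stays extractable even when nothing is known about the stack for str2, can be sketched with a toy `Enc`. Everything here is an assumption made for illustration: the data type is a simplified stand-in, and `convertWith`, `payloadOf`, and `decodeChecked` are hypothetical helpers rather than library functions.

```haskell
{-# LANGUAGE DataKinds, KindSignatures #-}

import GHC.TypeLits (Symbol)
import qualified Data.ByteString.Char8 as B8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Toy Enc: annotation stack xs, configuration c, payload str.
data Enc (xs :: [Symbol]) c str = UnsafeEnc c str deriving (Eq, Show)

-- Convert only the payload; the stack xs is carried over unchanged and
-- nothing checks that it is meaningful for str2.
convertWith :: Functor f => (str1 -> f str2) -> Enc xs c str1 -> f (Enc xs c str2)
convertWith g (UnsafeEnc c s) = UnsafeEnc c <$> g s

-- The payload can always be extracted, whatever the stack is.
payloadOf :: Enc xs c str -> str
payloadOf (UnsafeEnc _ s) = s

-- Example with f = Either String: a checked ByteString -> Text conversion.
decodeChecked :: Enc xs c B8.ByteString -> Either String (Enc xs c T.Text)
decodeChecked = convertWith (either (Left . show) Right . TE.decodeUtf8')
```

For a value holding the bytes of "abc", `fmap payloadOf (decodeChecked v)` yields the corresponding Text on the `Right`; a lone byte 255 is not valid UTF-8 and produces a `Left`.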