Safe Haskell | None |
---|---|
Language | Haskell2010 |
If you're accustomed to working with text in almost any other programming
language, you'd be aware that a "string" typically refers to an in-memory
array of characters. Traditionally this was a single ASCII byte per
character; more recently UTF-8 variable byte encodings which dramatically
complicates finding offsets but which gives efficient support for the
entire Unicode character space. In Haskell, the original text type,
String
, is implemented as a list of Char
which, because a Haskell list
is implemented as a linked-list of boxed values, is wildly inefficient at
any kind of scale.
In modern Haskell there are two primary ways to represent text.
First is via the [rather poorly named] ByteString
from the bytestring
package (which is an array of bytes in pinned memory). The
Data.ByteString.Char8 submodule gives you ways to manipulate those arrays
as if they were ASCII characters. Confusingly there are both strict
(Data.ByteString
) and lazy (Data.ByteString.Lazy
) variants which are
often hard to tell the difference between when reading function signatures
or haddock documentation. The performance problem an immutable array backed
data type runs into is that appending a character (that is, ASCII byte) or
concatonating a string (that is, another array of ASCII bytes) is very
expensive and requires allocating a new larger array and copying the whole
thing into it. This led to the development of "builders" which amortize
this reallocation cost over time, but it can be cumbersome to switch
between Builder
, the lazy ByteString
that results, and then having to
inevitably convert to a strict ByteString
because that's what the next
function in your sequence requires.
The second way is through the opaque Text
type of Data.Text from the
text package, which is well tuned and high-performing but suffers from
the same design; it is likewise backed by arrays. Rather surprisingly, the
storage backing Text objects are encoded in UTF-16, meaning every time you
want to work with unicode characters that came in from anywhere else and
which inevitably are UTF-8 encoded you have to convert to UTF-16 and copy
into a new array, wasting time and memory.
In this package we introduce Rope
, a text type backed by the 2-3
FingerTree
data structure from the fingertree
package. This is not an uncommon solution in many languages as finger trees
support exceptionally efficient appending to either end and good
performance inserting anywhere else (you often find them as the backing
data type underneath text editors for this reason). Rather than Char
the
pieces of the rope are ShortText
from the text-short
package, which are UTF-8 encoded and in normal memory managed by the
Haskell runtime. Conversion from other Haskell text types is not O(1)
(UTF-8 validity must be checked, or UTF-16 decoded, or...), but in our
benchmarking the performance has been comparable to the established types
and you may find the resultant interface for combining chunks is comparable
to using a Builder, without being forced to use a Builder.
Rope
is used as the text type throughout this library. If you use the
functions within this package (rather than converting to other text types)
operations are quite efficient. When you do need to convert to another type
you can use fromRope
or intoRope
from the Textual
typeclass.
Note that we haven't tried to cover the entire gamut of operations or
customary convenience functions you would find in the other libraries; so
far Rope
is concentrated on aiding interoperation, being good at
appending (lots of) small pieces, and then efficiently taking the resultant
text object out to a file handle, be that the terminal console, a file, or
a network socket.
Synopsis
Rope type
A type for textual data. A rope is text backed by a tree data structure, rather than a single large continguous array, as is the case for strings.
There are three use cases:
Referencing externally sourced data
Often we interpret large blocks of data sourced from external systems as
text. Ideally we would hold onto this without copying the memory, but (as
in the case of ByteString
which is the most common source of data) before
we can treat it as text we have to validate the UTF-8 content. Safety
first. We also copy it out of pinned memory, allowing the Haskell runtime
to manage the storage.
Interoperating with other libraries
The only constant of the Haskell universe is that you won't have the right
combination of {strict, lazy} × {Text
, ByteString
, String
, [Word8]
,
etc} you need for the next function call. The Textual
typeclass provides
for moving between different text representations. To convert between
Rope
and something else use fromRope
; to construct a Rope
from
textual content in another type use intoRope
.
You can get at the underlying finger tree with the unRope
function.
Assembling text to go out
This involves considerable appending of data, very very occaisionally
inserting it. Often the pieces are tiny. To add text to a Rope
use the
append
method as below or (<>
) from Data.Monoid (like you
would have with a Builder
).
Output to a Handle
can be done efficiently with hWrite
.
Instances
Eq Rope Source # | |
Ord Rope Source # | |
Show Rope Source # | |
IsString Rope Source # | |
Defined in Core.Text.Rope fromString :: String -> Rope # | |
Generic Rope Source # | |
Semigroup Rope Source # | |
Monoid Rope Source # | |
NFData Rope Source # | |
Defined in Core.Text.Rope | |
Hashable Rope Source # | |
Defined in Core.Text.Rope | |
ToJSON Rope Source # | |
Defined in Core.Encoding.Json | |
Pretty Rope Source # | |
Defined in Core.Text.Rope | |
Textual Rope Source # | |
Render Rope Source # | |
Key Rope Source # | |
Defined in Core.Data.Structures | |
type Rep Rope Source # | |
Defined in Core.Text.Rope | |
type Token Rope Source # | |
Defined in Core.Text.Utilities |
split :: Int -> Rope -> (Rope, Rope) Source #
Break the text into two pieces at the specified offset.
Examples:
λ> split 0 "abcdef" ("", "abcdef") λ> split 3 "abcdef" ("abc", "def") λ> split 6 "abcdef" ("abcdef","")
Going off either end behaves sensibly:
λ> split 7 "abcdef" ("abcdef","") λ> split (-1) "abcdef" ("", "abcdef")
insert :: Int -> Rope -> Rope -> Rope Source #
Insert a new piece of text into an existing Rope
at the specified offset.
Examples:
λ> insert 3 "Con" "Def 1" "DefCon 1" λ> insert 0 "United " "Nations" "United Nations"
contains :: Char -> Rope -> Bool Source #
Does the text contain this character?
We've used it to ask whether there are newlines present in a Rope
, for
example:
if contains
'\n' text
then handleComplexCase
else keepItSimple
Interoperation and Output
class Textual α where Source #
Machinery to interpret a type as containing valid Unicode that can be
represented as a Rope
object.
Implementation notes
Given that Rope
is backed by a finger tree, append
is relatively
inexpensive, plus whatever the cost of conversion is. There is a subtle
trap, however: if adding small fragments of that were obtained by slicing
(for example) a large ByteString we would end up holding on to a reference
to the entire underlying block of memory. This module is optimized to
reduce heap fragmentation by letting the Haskell runtime and garbage
collector manage the memory, so instances are expected to copy these
substrings out of pinned memory.
The ByteString
instance requires that its content be valid UTF-8. If not an
empty Rope
will be returned.
Several of the fromRope
implementations are expensive and involve a lot
of intermiate allocation and copying. If you're ultimately writing to a
handle prefer hWrite
which will write directly to the output buffer.
fromRope :: Rope -> α Source #
Convert a Rope
into another text-like type.
intoRope :: α -> Rope Source #
Take another text-like type and convert it to a Rope
.
Instances
Textual ByteString Source # | from Data.ByteString.Lazy |
Defined in Core.Text.Rope | |
Textual ByteString Source # | from Data.ByteString Strict |
Defined in Core.Text.Rope | |
Textual Text Source # | from Data.Text Strict |
Textual Text Source # | from Data.Text.Lazy |
Textual ShortText Source # | from Data.Text.Short |
Textual Rope Source # | |
Textual JsonKey Source # | |
Textual [Char] Source # | from Data.String |
Textual (FingerTree Width ShortText) Source # | |
unsafeIntoRope :: ByteString -> Rope Source #
If you know the input bytes are valid UTF-8 encoded characters, then
you can use this function to convert to a piece of Rope
.
hWrite :: Handle -> Rope -> IO () Source #
Write the Rope
to the given Handle
.
import Core.Text import Core.System -- re-exports stdout main :: IO () main = let text ::Rope
text = "Hello World" inhWrite
stdout
text
because it's tradition.
Uses hPutBuilder
internally which saves all kinds
of intermediate allocation and copying because we can go from the
ShortText
s in the finger tree to
ShortByteString
to
Builder
to the Handle'
s output buffer
in one go.
If you're working in the Program
monad, then
write
provides an efficient way to write a Rope
to
stdout
.
Internals
unRope :: Rope -> FingerTree Width ShortText Source #
Access the finger tree underlying the Rope
. You'll want the following
imports:
import qualified Data.FingerTree as F -- from the fingertree package import qualified Data.Text.Short as S -- from the text-short package
The length of the Rope
, in characters. This is the monoid used to
structure the finger tree underlying the Rope
.
Instances
Eq Width Source # | |
Num Width Source # | |
Ord Width Source # | |
Show Width Source # | |
Generic Width Source # | |
Semigroup Width Source # | |
Monoid Width Source # | |
Measured Width ShortText Source # | |
Defined in Core.Text.Rope | |
Textual (FingerTree Width ShortText) Source # | |
type Rep Width Source # | |
Defined in Core.Text.Rope |