Safe Haskell | Safe-Inferred |
---|---|
Language | Haskell2010 |
This module contains plain tree indexing code. The index itself is a
CACHE: you should only ever use it as an optimisation and never as primary
storage. In practice, this means that when the index format changes, the
application is expected to throw the old index away and build a fresh
one. Please note that tracking index validity is out of scope for this
module: this is the responsibility of your application. It is advisable that
your validity tracking code also checks for format validity (see
indexFormatValid) and scraps and re-creates the index when needed.
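The "check the format, scrap the index if it is stale" policy could be sketched roughly as follows. This is an illustration only: discardIfInvalid is a hypothetical helper, not part of this module, and the format check is passed in as an argument (an application would presumably pass indexFormatValid) so that the sketch does not depend on anything beyond base and directory.

```haskell
import Control.Monad (unless)
import System.Directory (doesFileExist, removeFile)

-- | Hypothetical helper: delete the index file when the supplied format
-- check rejects it. Returns True if an existing index can be reused and
-- False if the caller has to build a fresh one.
discardIfInvalid :: (FilePath -> IO Bool) -> FilePath -> IO Bool
discardIfInvalid formatValid path = do
  exists <- doesFileExist path
  if not exists
    then pure False              -- nothing on disk: a fresh index is needed
    else do
      ok <- formatValid path
      unless ok (removeFile path) -- stale format: throw the cache away
      pure ok
```

Because the index is only a cache, deleting it is always safe; the cost is a full rebuild on the next run.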
The index is a binary file that overlays a hashed tree over the working copy. This means that every working file and directory has an entry in the index that contains its path, its hash, and validity data. The validity data is a timestamp plus the file size. The file hashes are SHA-256 digests of the file's content. Each entry also contains the fileid, which is used to track moved files.
There are two entry types, a file entry and a directory entry. Both have a
common binary format (see Item). The on-disk format is described in
the section Index format below.
For each file, the index has a copy of the file's last modification
timestamp taken at the instant when the hash was computed. This means
that when the size and timestamp of a file in the working tree match those in
the index, we assume that the hash stored in the index for the given file is
valid. These hashes are then exposed in the resulting Tree object, and can
be leveraged by e.g. diffTrees to compare many files quickly.
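The size-and-timestamp shortcut can be illustrated with a small pure model. All names here (Stamp, Cached, lookupHash) are hypothetical and the hash is a plain String for brevity; the real index reads these fields from the mmapped file.

```haskell
import Data.Int (Int64)

-- | Validity data as described above: file size plus modification time.
data Stamp = Stamp
  { stampSize  :: Int64
  , stampMTime :: Int64
  } deriving (Eq, Show)

-- | A cached index entry: the stamp taken when the hash was computed,
-- together with that hash.
data Cached = Cached
  { cachedStamp :: Stamp
  , cachedHash  :: String
  } deriving (Eq, Show)

-- | Return the cached hash if the working file's current stamp matches
-- the stored one. Nothing means the file has to be re-read and re-hashed.
lookupHash :: Stamp -> Cached -> Maybe String
lookupHash current entry
  | current == cachedStamp entry = Just (cachedHash entry)
  | otherwise                    = Nothing
```

Note that this is a heuristic: a file rewritten with identical size and timestamp would not be detected as changed.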
You may have noticed that we also keep hashes of directories. These are assumed to be valid whenever the complete subtree is valid. At any point, as soon as a size or timestamp mismatch is found, the working file in question is opened, its hash (and timestamp and size) is recomputed and updated in-place in the index file (everything lives at a fixed offset and has a fixed size, so this isn't an issue). This is also true of directories: when the hash of a file in a directory changes, this triggers recomputation of the hashes of all of its parent directories; moreover this is done efficiently -- each directory is updated at most once during an update run.
Endianness
Since version 6 (magic == HSI6), the file format depends on the endianness of the architecture. To account for the (rare) case where darcs executables from different architectures operate on the same repo, we make an additional check in indexFormatValid to detect whether the file's endianness differs from what we expect. If this is detected, the file is considered invalid and will be re-created.
Index format
The index starts with a header consisting of a 4-byte magic word, followed by a 4-byte word that indicates the endianness of the encoding. This word should, when read directly from the mmapped file, be equal to 1.
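A header check along these lines might look like the sketch below. It is not the module's implementation: checkHeader, HeaderStatus and nativeOne are hypothetical names, the header is taken as a strict ByteString rather than read from an mmapped file, and the host byte order comes from GHC.ByteOrder. The magic value HSI6 is the one mentioned in the Endianness section.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import GHC.ByteOrder (ByteOrder (..), targetByteOrder)

data HeaderStatus = HeaderOk | TooShort | BadMagic | WrongEndianness
  deriving (Eq, Show)

-- | The 32-bit value 1 as it appears in memory on the host architecture.
nativeOne :: B.ByteString
nativeOne = case targetByteOrder of
  LittleEndian -> B.pack [1, 0, 0, 0]
  BigEndian    -> B.pack [0, 0, 0, 1]

-- | Check the 8-byte header: 4 bytes of magic, then the endianness word.
checkHeader :: B.ByteString -> HeaderStatus
checkHeader bs
  | B.length bs < 8                     = TooShort
  | B.take 4 bs /= BC.pack "HSI6"       = BadMagic
  | B.take 4 (B.drop 4 bs) /= nativeOne = WrongEndianness
  | otherwise                           = HeaderOk
```

An index written on a machine with the opposite byte order stores the word 1 with its bytes reversed, so it fails the third guard and would be treated as invalid.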
After the header comes the actual content of the index, which is a
sequence of Items. An Item consists of:
- size: item size, 8 bytes
- aux: timestamp (for file) or offset to sibling (for dir), 8 bytes
- fileid: inode or fhandle of the item, 8 bytes
- hash: sha256 of content, 32 bytes
- descriptor length: >= 2 due to type and null, 4 bytes
- descriptor:
  - type: D or F, one byte
  - path: flattened path, variable >= 0
  - null: terminating null byte
- alignment padding: 0 to 3 bytes
Each Item is 4-byte aligned. Thus the descriptor length must be
rounded up using align to get the position of the next item. The same
applies when determining the aux (offset to sibling) for dir items.
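The size arithmetic implied by the field list and the alignment rule can be worked through in a few lines. This is a sketch under assumptions: alignUp is a hypothetical stand-in for align (the argument order is chosen here, not taken from the module), and paths are assumed to be ASCII so that length counts bytes.

```haskell
-- | Round i up to the next multiple of boundary.
alignUp :: Integral a => a -> a -> a
alignUp boundary i = case i `rem` boundary of
  0 -> i
  r -> i + boundary - r

-- | Bytes before the descriptor: size, aux, fileid, hash, descriptor length.
fixedPartSize :: Int
fixedPartSize = 8 + 8 + 8 + 32 + 4

-- | Descriptor length: type byte, path bytes, terminating null byte.
descriptorLength :: String -> Int
descriptorLength path = 1 + length path + 1

-- | Total on-disk size of an item, including alignment padding.
itemSize :: String -> Int
itemSize path = alignUp 4 (fixedPartSize + descriptorLength path)
```

For a five-character path such as "./foo", the descriptor length is 7, the unpadded item takes 60 + 7 = 67 bytes, and one byte of padding brings it to 68. Since the fixed part is itself a multiple of 4, rounding up the total is equivalent to rounding up the descriptor length.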
With directories, the aux holds the offset of the next sibling item in the
index, so we can efficiently skip reading the whole subtree starting at a
given directory (by just seeking aux bytes forward). The items are
pre-ordered with respect to directory structure -- the directory comes first
and after it come all its items. Cf. openIndex.
For files, the aux field holds a timestamp.
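How the sibling offset lets a reader skip whole subtrees can be shown on a toy model. The sketch below is a simplification under assumptions: items live in a list instead of a byte buffer, positions are list indices instead of byte offsets, and a directory's skip distance is counted in slots relative to the directory itself. The names Item, File, Dir and children are hypothetical.

```haskell
data Item
  = File FilePath
  | Dir FilePath Int -- ^ distance (in slots) from this dir to its next sibling
  deriving (Eq, Show)

-- | Direct children of the directory at position i in a pre-ordered item
-- list. Subdirectories are returned but never descended into: their skip
-- distance takes us straight to the next sibling.
children :: [Item] -> Int -> [Item]
children items i = case items !! i of
  File _     -> []
  Dir _ skip -> go (i + 1) (i + skip)
  where
    go pos end
      | pos >= end = []
      | otherwise = case items !! pos of
          f@(File _)  -> f : go (pos + 1) end
          d@(Dir _ s) -> d : go (pos + s) end
```

With the pre-ordered list `[Dir "." 5, File "./a", Dir "./d" 2, File "./d/x", File "./b"]`, `children items 0` visits positions 1, 2 and 4 and never touches "./d/x", which is the saving the aux field provides for directories.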
Internally, the item is stored as a pointer to the first field (iBase) from which we directly read off the first three fields (size, aux, fileid), and a ByteString for the rest (iHashAndDescriptor), up to but not including the terminating null byte.
TODO
The null byte terminator seems useless.
We could as well use a single plain pointer for the item. The dumpIndex function demonstrates how this could be done.
Another possible improvement is to store only the Name of an item, not the full path. We need to keep track of the current path anyway when traversing the index.
Synopsis
- openIndex :: FilePath -> IO Index
- updateIndexFrom :: FilePath -> Tree IO -> IO Index
- indexFormatValid :: FilePath -> IO Bool
- treeFromIndex :: Index -> IO (Tree IO)
- listFileIDs :: Index -> IO [((AnchoredPath, ItemType), FileID)]
- type Index = IndexM IO
- filter :: FilterTree a m => (AnchoredPath -> TreeItem m -> Bool) -> a m -> a m
- getFileID :: AnchoredPath -> IO (Maybe FileID)
- data IndexEntry = IndexEntry {}
- dumpIndex :: FilePath -> IO [IndexEntry]
- align :: Integral a => a -> a -> a
Documentation
indexFormatValid :: FilePath -> IO Bool
Check that a given file is an index file with a format we can handle. You should remove and re-create the index whenever this is not true.
treeFromIndex :: Index -> IO (Tree IO)
Read an IndexM, starting with the root, to create a Tree.
listFileIDs :: Index -> IO [((AnchoredPath, ItemType), FileID)]
Return a list containing all the file/folder names in an index, with their respective ItemType and FileID.
filter :: FilterTree a m => (AnchoredPath -> TreeItem m -> Bool) -> a m -> a m
Given pred tree, produce a Tree that only has items for which
pred returns True.
The tree might contain stubs. When expanded, these will be subject to
filtering as well.
getFileID :: AnchoredPath -> IO (Maybe FileID)
For a given path, get the corresponding fileID from the filesystem.
data IndexEntry