Copyright	(c) Vjeran Crnjak, 2014
License	BSD3
Maintainer	vjeran.crnjak@gmail.com
Stability	experimental
Portability	portable
Safe Haskell	None
Language	Haskell2010

NLP.Morphosyntax.Analyzer

Contents

Model
Token matching
Configuration

Description

Implementation of a space-efficient morphosyntactic analyzer.

It solves the problem of providing a set of possible tags for a given word. Instead of matching only on known word-set pairs, one can assume that the suffixes of an unknown word also carry some information about its set of tags.

This library provides that kind of analysis. One example of where this might be useful is the concraft tagging library: before POS tagging, one needs a set of possible tags for each word, from which the correct one is then disambiguated.
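
The suffix idea described above can be illustrated with a deliberately naive sketch. It is not the library's implementation (in particular it is not space-efficient), and the names Tag, suffixesFrom, buildIndex and guessTags are hypothetical; it only assumes that an unknown word receives the tags of its longest indexed suffix.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S
import Data.List (tails)

-- Hypothetical stand-in for the library's Tag type.
type Tag = String

-- All suffixes of a word with at least minLen characters, longest first.
suffixesFrom :: Int -> String -> [String]
suffixesFrom minLen w = takeWhile ((>= minLen) . length) (tails w)

-- Index every suffix (of length >= minLen) of every word in the corpus,
-- merging the tag sets of words that share a suffix.
buildIndex :: Int -> [(String, [Tag])] -> M.Map String (S.Set Tag)
buildIndex minLen corpus = M.fromListWith S.union
  [ (suf, S.fromList tags) | (w, tags) <- corpus, suf <- suffixesFrom minLen w ]

-- Tags of the longest indexed suffix of the word, or the empty set.
guessTags :: Int -> M.Map String (S.Set Tag) -> String -> S.Set Tag
guessTags minLen idx w =
  case [ ts | suf <- suffixesFrom minLen w, Just ts <- [M.lookup suf idx] ] of
    (ts : _) -> ts
    []       -> S.empty
```

A known word is its own longest suffix, so it gets exactly its recorded tags, while an unseen word falls back to shorter and shorter suffixes until the minimum length is reached.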

For a sufficiently large construction corpus, this analyzer might only benefit from additional regular expressions for punctuation and number matching. The returned set of possible tags may be incomplete, that is, it may not contain the correct tag. If the construction corpus isn't sufficiently large, there might be a fair number of incomplete sets for unseen named entities (person names, corporation names, etc.).

If one needs the analyzer to be less aggressive, it is recommended to extend the functionality and remove the sets of possible tags from words which might be named entities (e.g. capitalized words in the middle of a sentence). This matters mostly when the part-of-speech tags of a language encode whether a word represents a named entity; if that is not the case, there is no need to extend the current functionality.
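
One way such an extension might look is sketched below. The heuristic (an uppercase first letter on a non-initial token) and the names possiblyNamed and conservativeTags are assumptions made for illustration; lookupTags stands in for a function such as getTags applied to an analyzer, with lists used in place of sets to keep the sketch self-contained.

```haskell
import Data.Char (isUpper)

-- Assumed heuristic: a token is possibly a named entity if it starts with
-- an uppercase letter and is not the first token of the sentence.
possiblyNamed :: Int -> String -> Bool
possiblyNamed position (c : _) = position > 0 && isUpper c
possiblyNamed _ [] = False

-- Wrap any tag lookup so that possibly named tokens get no candidate tags.
conservativeTags :: (String -> [tag]) -> [String] -> [(String, [tag])]
conservativeTags lookupTags sentence =
  [ (tok, if possiblyNamed i tok then [] else lookupTags tok)
  | (i, tok) <- zip [0 ..] sentence ]
```

Tokens that end up with an empty candidate list can then be treated like unknown words by whatever component consumes the analysis.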

A simple example of using GHCi for construction:

:set -XOverloadedStrings
import qualified Data.Map as M
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Tagset.Positional as P
import NLP.Morphosyntax.Analyzer
cfg <- readFile "tagset.cfg"
let tset = P.parseTagset "tagset1" cfg
dict <- T.readFile "fulldict.txt"
let train = map (\(word:tags) -> (word, map (P.parseTag tset) tags)) . map T.words . filter (not . T.null) . T.lines $ dict
let an = create tset (AConf 3 [] M.empty) train
save "analyzer.gz" an

It is assumed that tag attributes are separated with : for parseTag. One could write a different parsing function.
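
The construction example reads one dictionary entry per line, a word followed by its tags, and pattern-matches on the result directly, which fails on a line containing only whitespace. A more defensive line parser might look like the sketch below. The names parseEntry and parseDict are hypothetical, String is used instead of Text to keep the sketch dependency-free, tags are left as raw strings (in a real session they would still go through a function such as parseTag), and dropping entries without tags is an assumption rather than the library's behaviour.

```haskell
import Data.Maybe (mapMaybe)

-- Parse one dictionary line of the form "word tag1 tag2 ...".
-- Lines without a word or without any tag yield Nothing.
parseEntry :: String -> Maybe (String, [String])
parseEntry line = case words line of
  (word : tags@(_ : _)) -> Just (word, tags)
  _                     -> Nothing

-- Parse a whole dictionary, silently skipping lines that do not parse.
parseDict :: String -> [(String, [String])]
parseDict = mapMaybe parseEntry . lines
```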

Model

data Analyzer

Representation of the analyzer.

Instances

Eq Analyzer
Binary Analyzer

elem :: Text -> Analyzer -> Bool

Checks whether a word is in the analyzer. If it is, the set of tags returned by getTags will be non-empty.

getTags :: Analyzer -> Text -> Set Tag

Gives the set of possible tags for a given word. The set may be empty.

save :: FilePath -> Analyzer -> IO ()

Save the analyzer to a file. The data is compressed using the gzip format.

load :: FilePath -> IO Analyzer

Load the analyzer from a file.

create

Arguments

:: Tagset	Tagset used in the construction corpus.
-> AConf	Configuration of the analyzer.
-> [(Text, [Tag])]	Construction corpus.
-> Analyzer	Morphological analyzer.

Creates a morphological analyzer from a tagset, a configuration (the smallest suffix length, a list of matchers for additional matching and a separation layout) and a construction corpus.

emptyConf :: AConf

Can be used to build a dummy analyzer.

Token matching

data Matcher

Removes the need to write regular expressions for simple matching. Matches on punctuation, numbers, alphanumeric tokens, letter case, or regular expressions.

Constructors

Punct	Matches a token consisting only of punctuation characters.
Number	Matches a token consisting only of Unicode numeral characters.
AlphaNum	Matches a token consisting only of alphanumeric characters.
AnyUpper	Matches a token with at least one uppercase character.
AllUpper	Matches a token consisting only of uppercase characters.
AnyLower	Matches a token with at least one lowercase character.
AllLower	Matches a token consisting only of lowercase characters.
Capital	Matches a capitalized token.
RegExpr Text	Matches on a regular expression.
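
To make the constructor descriptions concrete, the sketch below mirrors them with predicates from Data.Char. SimpleMatcher and matches are hypothetical names, RegExpr is left out, and the exact character classes, the reading of Capital as "first character is uppercase", and the treatment of the empty token are assumptions; the library's own matching may differ.

```haskell
import Data.Char (isAlphaNum, isLower, isNumber, isPunctuation, isUpper)

-- Hypothetical mirror of the documented constructors (RegExpr omitted).
data SimpleMatcher
  = Punct | Number | AlphaNum
  | AnyUpper | AllUpper | AnyLower | AllLower | Capital
  deriving (Eq, Show)

matches :: SimpleMatcher -> String -> Bool
matches _ [] = False            -- assumption: the empty token matches nothing
matches m tok@(c : _) = case m of
  Punct    -> all isPunctuation tok
  Number   -> all isNumber tok
  AlphaNum -> all isAlphaNum tok
  AnyUpper -> any isUpper tok
  AllUpper -> all isUpper tok
  AnyLower -> any isLower tok
  AllLower -> all isLower tok
  Capital  -> isUpper c         -- assumption: only the first character counts
```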

Instances

Eq Matcher
Ord Matcher
Show Matcher
Binary Matcher

Configuration

data AConf

Configuration for the analyzer.

Constructors

AConf

Fields

suffixLen :: Int: If a word isn't known, this is the smallest suffix length that will be matched.
regexMatch :: [(Matcher, Set Tag)]: A list of matchers (including POSIX regular expressions) and their accompanying sets of tags. If a word matches a matcher, the accompanying set of tags is given as its set of possible tags.
separationLayout :: Map POS (Set POS): Provides the analyzer with the ability to analyze a word on a single POS tag when the construction corpus is incomplete (e.g. Croatian adjectives and pronouns). Words that can be adjectives can often also be pronouns. If the analyzer isn't thorough enough (the provided construction data doesn't cover all cases), one would like words that are adjectives to also be interpreted as pronouns. For instance, an unknown word may have a very long suffix that matches an adjective even though it can also be a pronoun; in that case one would like the pronoun tags too. If the construction data is very large, this doesn't have to be used.
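
The adjective and pronoun case for separationLayout can be pictured with the sketch below. It shows only one plausible reading of the field, namely that each part of speech found for a word pulls in the parts of speech the layout associates with it; how the analyzer actually applies the layout is not specified here. The name expandPOS is hypothetical, and POS is modelled as String for self-containment.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- Hypothetical stand-in for the tagset's POS type.
type POS = String

-- Extend a set of parts of speech with every POS the layout associates
-- with one of its members.
expandPOS :: M.Map POS (S.Set POS) -> S.Set POS -> S.Set POS
expandPOS layout found =
  S.union found
          (S.unions [ M.findWithDefault S.empty p layout | p <- S.toList found ])
```

With a layout that maps "adj" to the set containing "pron", a word analyzed only as an adjective would also be considered for pronoun tags, while parts of speech absent from the layout are left unchanged.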

Instances

Eq AConf
Show AConf
Binary AConf