moan-0.2.0.2: Language-agnostic analyzer for positional morphosyntactic tags

Copyright: (c) Vjeran Crnjak, 2014
License: BSD3
Maintainer: vjeran.crnjak@gmail.com
Stability: experimental
Portability: portable
Safe Haskell: None
Language: Haskell2010

NLP.Morphosyntax.Analyzer

Description

Implementation of a space-efficient morphosyntactic analyzer.

It addresses the problem of providing a set of possible tags for a given word. Rather than relying only on exact matches between a word and its tag set, it assumes that the suffixes of an unknown word also carry information about that word's possible tags.
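The suffix idea can be illustrated with a minimal sketch. It assumes a plain `Map` from suffixes to tag sets and a longest-suffix-wins rule; the names `SuffixIndex`, `suffixesFrom` and `guessTags` are hypothetical and are not part of this library, whose internal representation may differ.

```haskell
import           Data.List       (tails)
import qualified Data.Map.Strict as M
import qualified Data.Set        as S

-- Hypothetical types for illustration; tags are plain strings here.
type Tag = String
type SuffixIndex = M.Map String (S.Set Tag)

-- All suffixes of a word with at least `minLen` characters, longest first.
suffixesFrom :: Int -> String -> [String]
suffixesFrom minLen w = takeWhile ((>= minLen) . length) (tails w)

-- Tags of the longest indexed suffix, or the empty set when none is indexed.
-- The whole word counts as its own longest suffix, so known words match first.
guessTags :: Int -> SuffixIndex -> String -> S.Set Tag
guessTags minLen idx w =
  case [ts | s <- suffixesFrom minLen w, Just ts <- [M.lookup s idx]] of
    (ts : _) -> ts
    []       -> S.empty
```

With an index containing only `"ing"`, a call such as `guessTags 3 idx "singing"` would fall through the longer suffixes and return the tags stored under `"ing"`.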

This library provides that kind of analysis. One place where it might be useful is the concraft tagging library: before POS tagging, one needs a set of possible tags for each word, from which the correct tag is then disambiguated.

For a sufficiently large construction corpus, this analyzer might only benefit from additional regular expressions for matching punctuation and numbers. The returned set of possible tags may be incomplete, that is, it may not contain the correct tag. If the construction corpus isn't sufficiently large, there might be a fair number of incomplete sets for unseen named entities (person names, corporation names, etc.).

If one needs the analyzer to be less aggressive, it is recommended to extend the functionality so that the sets of possible tags are removed for words that might be names (e.g. capitalized words in the middle of a sentence). This matters mostly when the part-of-speech tags of a language encode whether a word represents a named entity; if that is not the case, there is no need to extend the current functionality.
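One way such an extension could look is sketched below. It wraps an arbitrary lookup function and returns an empty set for capitalized tokens that are not sentence-initial. The names `isCapitalized` and `cautiousTags` are hypothetical, tags are plain strings, and the rule itself is only one possible reading of the recommendation above.

```haskell
import           Data.Char (isUpper)
import qualified Data.Set  as S

-- Hypothetical tag type for illustration.
type Tag = String

-- True when the token starts with an uppercase letter.
isCapitalized :: String -> Bool
isCapitalized (c : _) = isUpper c
isCapitalized []      = False

-- Wrap any tag-lookup function so that capitalized tokens that are not
-- sentence-initial get an empty set instead of a guess.
cautiousTags :: (String -> S.Set Tag) -> [String] -> [(String, S.Set Tag)]
cautiousTags lookupTags sentence = zipWith tagAt [0 :: Int ..] sentence
  where
    tagAt i tok
      | i > 0 && isCapitalized tok = (tok, S.empty)
      | otherwise                  = (tok, lookupTags tok)
```

A gentler variant might suppress the guess only when the word is also unknown to the analyzer, which keeps tags for capitalized words seen in the construction corpus.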

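A simple example of using GHCi for construction:

:set -XOverloadedStrings
import qualified Data.Map as M
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Tagset.Positional as P
import NLP.Morphosyntax.Analyzer
-- Parse the tagset definition.
cfg <- readFile "tagset.cfg"
let tset = P.parseTagset "tagset1" cfg
-- Each non-blank line of the dictionary holds a word followed by its tags.
dict <- T.readFile "fulldict.txt"
let train = [(word, map (P.parseTag tset) tags) | (word:tags) <- map T.words (T.lines dict)]
-- Smallest suffix length 3, no extra matchers, empty separation layout.
let an = create tset (AConf 3 [] M.empty) train
save "analyzer.gz" an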

parseTag assumes that tag attributes are separated with a colon (:). A different parsing function can be written if the corpus uses another format.

Model

data Analyzer

Representation of the analyzer.

elem :: Text -> Analyzer -> Bool

Checks whether a word is in the analyzer. If it is, the set of tags returned by getTags will be non-empty.

getTags :: Analyzer -> Text -> Set Tag

Gives the set of possible tags for a given word. The set may be empty.

save :: FilePath -> Analyzer -> IO ()

Saves the analyzer to a file. Data is compressed using the gzip format.

load :: FilePath -> IO Analyzer

Loads the analyzer from a file.

create
  :: Tagset             -- Tagset used in the construction corpus.
  -> AConf              -- Configuration of the analyzer.
  -> [(Text, [Tag])]    -- Construction corpus.
  -> Analyzer           -- Morphological analyzer.

Creates a morphological analyzer given a tagset, a configuration (smallest suffix length, a list of matchers for additional matching and a separation layout) and a construction corpus.
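To make the role of the smallest suffix length concrete, here is a naive sketch of how a suffix index could be built from a construction corpus. It is an illustration only: the library is described as space-efficient and need not use a plain `Map`, and `buildIndex` and the string-based `Tag` type are hypothetical names.

```haskell
import           Data.List       (tails)
import qualified Data.Map.Strict as M
import qualified Data.Set        as S

-- Hypothetical tag type for illustration.
type Tag = String

-- Index every suffix of at least `minLen` characters (the whole word included)
-- under the union of the tags seen for corpus words ending in that suffix.
buildIndex :: Int -> [(String, [Tag])] -> M.Map String (S.Set Tag)
buildIndex minLen corpus =
  M.fromListWith S.union
    [ (suffix, S.fromList tags)
    | (word, tags) <- corpus
    , suffix <- takeWhile ((>= minLen) . length) (tails word)
    ]
```

A larger `minLen` keeps the index smaller and the guesses more conservative, since short and therefore ambiguous suffixes are never stored.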

emptyConf :: AConf

Can be used to build a dummy analyzer.

Token matching

data Matcher

Removes the need to write regular expressions for simple matching. Matches on punctuation, numbers, alphanumeric tokens, upper-case or lower-case tokens, capitalized tokens, or regular expressions.

Constructors

Punct

Matches a token consisting only of punctuation characters.

Number

Matches a token consisting only of Unicode numeral characters.

AlphaNum

Matches a token consisting only of alphanumeric characters.

AnyUpper

Matches a token with at least one uppercase character.

AllUpper

Matches a token consisting only of uppercase characters.

AnyLower

Matches a token with at least one lowercase character.

AllLower

Matches a token consisting only of lowercase characters.

Capital

Matches a capitalized token.

RegExpr Text

Matches on a regular expression.
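The non-regex constructors can be read as simple character predicates. The sketch below uses a local stand-in type, `SimpleMatcher`, rather than the library's `Matcher`, and the function `matches` is hypothetical. Two details are assumptions: the "all" matchers reject the empty token, and `Capital` only checks the first character.

```haskell
import Data.Char (isAlphaNum, isLower, isNumber, isPunctuation, isUpper)

-- Local stand-in for the constructors listed above; RegExpr is left out so
-- that the sketch needs nothing beyond base.
data SimpleMatcher
  = Punct | Number | AlphaNum
  | AnyUpper | AllUpper | AnyLower | AllLower | Capital
  deriving (Eq, Show)

-- Assumed semantics: "all" matchers require a non-empty token.
matches :: SimpleMatcher -> String -> Bool
matches m tok = case m of
  Punct    -> nonEmptyAll isPunctuation
  Number   -> nonEmptyAll isNumber
  AlphaNum -> nonEmptyAll isAlphaNum
  AnyUpper -> any isUpper tok
  AllUpper -> nonEmptyAll isUpper
  AnyLower -> any isLower tok
  AllLower -> nonEmptyAll isLower
  Capital  -> case tok of
    (c : _) -> isUpper c
    []      -> False
  where
    nonEmptyAll p = not (null tok) && all p tok
```

Note that `Data.Char.isPunctuation` excludes symbol characters such as `$` or `+`, so a real implementation may classify those differently.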

Configuration

data AConf

Configuration for the analyzer.

Constructors

AConf 

Fields

suffixLen :: Int

If a word isn't known, this is the smallest suffix length that will be matched.

regexMatch :: [(Matcher, Set Tag)]

A list of matchers (including POSIX regular expressions via RegExpr), each paired with a set of tags. If a word matches a matcher, the accompanying set of tags is given as the set of possible tags.

separationLayout :: Map POS (Set POS)

Lets the analyzer analyze a word on a single POS tag when the construction corpus is incomplete. For example, in Croatian, words that can be adjectives can often also be pronouns. If the construction data doesn't cover all such cases, one would still like words analyzed as adjectives to be interpreted as possible pronouns as well. Concretely, an unknown word may have a very long suffix that matches an adjective although the word can also be a pronoun; in that case the pronoun tags are wanted too. If the construction data is very large, this field does not need to be used.
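The documentation above does not spell out how the layout is applied internally, so the following is only one plausible reading, sketched with hypothetical names (`expandPOS`, a string-based `POS`): every part of speech guessed for a word is extended with the parts of speech the layout relates it to.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set        as S

-- Hypothetical POS type for illustration.
type POS = String

-- Add to a set of guessed parts of speech every POS that the layout
-- relates them to; POS values without an entry are left as they are.
expandPOS :: M.Map POS (S.Set POS) -> S.Set POS -> S.Set POS
expandPOS layout guessed =
  S.union guessed
          (S.unions [M.findWithDefault S.empty p layout | p <- S.toList guessed])
```

For the adjective and pronoun case, a layout mapping `"adj"` to the set containing `"pron"` would turn a guess of `"adj"` into both `"adj"` and `"pron"`, after which tags for each POS could be looked up separately.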
