meeting_2006-09-19
Meeting with Polderland 19.9.2006
Participants:
- Peter Beinema
- Marijke Koster
- Jeroen Daanen
- Thomas Omma
- Sjur Moshagen
- Tomi Pieski
Agenda
- status of the speller lexicon
- speller lexicon continuation
- hyphenation data
- questions and answers
Since last time
Divvun has sent more data to PL, essentially the whole lexicon spelled out.
PL has compiled the lexicon data. There now is a first binary lexicon of size 1.05 MB and it seems to work (full tests not completed yet).
This does not mean lexicon work is complete yet:
- compounding / PLX encoding: all words have been translated to the simplest PLX input lines possible: the codes "W" and "I" have been added to signify that each word is of unspecified word class and can occur in isolation.
- spelling correction: we have no specific phonetic rules for North Sámi yet. When a word is misspelled, the speller uses phonetic rules to determine which correct words are phonetically close to it. Having phonetic rules increases the quality of the word suggestions.
- speller hyphenation: the words in the lexicon are not hyphenated. The speller works with unhyphenated words, but if the lexicon words are hyphenated and the MS Word option for automatic hyphenation is on, words will be hyphenated at a valid hyphenation position; if the Word user enters a soft hyphen at an invalid position in a word, the speller will flag the word and suggest the correctly hyphenated form.
- locale settings: we have compiled the lexicon on a Linux system using the "default" locale settings for Dutch. This seems to work fine, but usually we compile using the locale settings that will apply for the future users. Do you know if there are Sámi-specific locale settings, or alternatively Norwegian / Swedish / Finnish locale settings?
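The spelling-correction item above can be illustrated with a minimal sketch. It assumes that "phonetic rules" can be approximated by mapping every word to a phonetic key and suggesting lexicon words that share the key of the misspelled word. The folding table and the collapsing of doubled letters are invented placeholders, not actual North Sámi rules and not Polderland's method; the names `phonetic_key`, `build_index` and `suggest` are hypothetical.

```python
from collections import defaultdict

# Placeholder folding rules, invented for illustration only.
# Real phonetic rules for North Sámi would have to come from the linguists.
FOLD = str.maketrans({"á": "a", "š": "s", "č": "c", "ž": "z",
                      "đ": "d", "ŧ": "t", "ŋ": "n"})


def phonetic_key(word):
    """Return a rough phonetic key: lowercase, fold letters, collapse doubles."""
    folded = word.lower().translate(FOLD)
    out = []
    for ch in folded:
        # Skip a character that repeats the previous one (e.g. "ll" -> "l").
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def build_index(lexicon):
    """Group the correct words of the lexicon by their phonetic key."""
    index = defaultdict(list)
    for word in lexicon:
        index[phonetic_key(word)].append(word)
    return index


def suggest(misspelled, index):
    """Return the lexicon words whose phonetic key equals that of the input."""
    return index.get(phonetic_key(misspelled), [])


index = build_index(["giella", "viessu", "beana"])
print(suggest("giela", index))   # ['giella']
```

A production speller would presumably also rank candidates by edit distance; the sketch only shows why rules of this kind improve the suggestions.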
Hyphenation
Present hyphenation markup in the Divvun output:
# = word boundary
^ = possible hyphenation point
- = hard hyphen
These are fine for Polderland.
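The markup can be read with a few lines of code. The following is a minimal sketch, assuming that `#` and `^` are pure markup (removed from the spelling) while the hard hyphen `-` remains part of the written word. The function name `parse_hyphenation_markup` and the sample strings are made up for illustration and are not taken from actual Divvun output.

```python
def parse_hyphenation_markup(marked):
    """Split a marked-up word into its plain spelling and marker positions.

    Positions are character offsets into the plain spelling.
    Assumption: '#' and '^' are dropped, '-' is kept in the spelling.
    """
    plain = []
    word_boundaries = []
    hyphenation_points = []
    hard_hyphens = []
    for ch in marked:
        pos = len(plain)
        if ch == "#":
            word_boundaries.append(pos)
        elif ch == "^":
            hyphenation_points.append(pos)
        elif ch == "-":
            hard_hyphens.append(pos)
            plain.append(ch)  # the hard hyphen stays in the written form
        else:
            plain.append(ch)
    return {
        "plain": "".join(plain),
        "word_boundaries": word_boundaries,
        "hyphenation_points": hyphenation_points,
        "hard_hyphens": hard_hyphens,
    }


result = parse_hyphenation_markup("sá^me#giel^la")
print(result["plain"])               # sámegiella
print(result["word_boundaries"])     # [4]
print(result["hyphenation_points"])  # [2, 8]
```

Whether a word boundary should also count as a permitted hyphenation position is left open here; the sketch keeps the two kinds of marker apart so that either choice can be made later.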
Locales for Sámi
Microsoft
See the codes Sjur sent earlier
Linux/Unix layer on Mac
Anything in UTF-8 should work just fine, e.g. no_NO.UTF-8.
Sámi-specific alternatives:
- se_NO.UTF-8 - North Sámi in Norway
- se_SE.UTF-8 - North Sámi in Sweden
- se_FI.UTF-8 - North Sámi in Finland
- smj_NO.UTF-8 - Lule Sámi in Norway
Debian supports North Sámi (se); other distributions possibly do as well.
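Whether one of these locales is actually installed on the compile machine can be checked before building. The sketch below uses Python's standard `locale` module; the candidate list repeats the names from these minutes, and the helper names `locale_available` and `first_available` are hypothetical.

```python
import locale

# Candidate names as listed in the minutes, Sámi-specific ones first.
CANDIDATES = [
    "se_NO.UTF-8",
    "se_SE.UTF-8",
    "se_FI.UTF-8",
    "smj_NO.UTF-8",
    "no_NO.UTF-8",
]


def locale_available(name):
    """Return True if the C library accepts `name` for LC_CTYPE."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # always restore


def first_available(candidates):
    """Return the first installed locale from `candidates`, or None."""
    for name in candidates:
        if locale_available(name):
            return name
    return None


print(first_available(CANDIDATES))
```

The result depends on which locales have been generated on the system, so the output will differ between machines.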
Mac (non-Unix layer)
Microsoft applications: see MS above
Carbon/Cocoa applications: these have full Unicode support by default, and one can specify North Sámi as the preferred language. It is not known whether more support is needed.
Compounding
Compounding is in principle free for nominal stems. Nouns can compound in the nominative or genitive singular, using either the full stem, or with a reduced
Some "stems" require a hyphen(?) when compounding, usually abbreviations (e.g.
The compounding properties need to be specified in the data sent to Polderland,
TODO
- send correction/phonetic rules to Thomas (Peter)
- review the processed data sets (Thomas)
- make speller lexicon data for Polderland, with POS, compounding properties and
- try to send a first binary to Sjur (Peter)
- check whether a hyphen is used when compounding abbreviations (e.g. TV-stuolla)