Meeting with Polderland 26.9.2006


  • Peter Beinema
  • Thomas Omma
  • Sjur Moshagen
  • Tomi Pieski


  • since last time
  • hyphenation data
  • questions and answers

Since last time

Peter sent phonetic speller correction rules, and Thomas has been working on them.

A binary version of the speller for Mac is not yet available; the old version has to be updated to be Unicode-compatible with Office 2004/Mac. It is in the works.

Test results: the regression testing didn't cope well with the rather large data set (it got into memory swapping :-( ), so it was run on a smaller sample instead. The test code will be rewritten to deal better with data in the 600+ MB range.
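One way the rewritten test code could avoid swapping is to read the word list as a stream instead of loading it all at once. The following is only a sketch under assumptions: the names `iter_words`, `run_regression` and the `accepts` callback are hypothetical and stand in for whatever the real test harness and speller interface look like, and the input is assumed to be one word per line.

```python
import io
from typing import Callable, Iterable, Iterator, List, TextIO, Tuple


def iter_words(stream: TextIO) -> Iterator[str]:
    """Yield one word per non-empty line without reading the whole file."""
    for line in stream:
        word = line.strip()
        if word:
            yield word


def run_regression(
    words: Iterable[str], accepts: Callable[[str], bool]
) -> Tuple[int, List[str]]:
    """Check every word with `accepts` (a hypothetical speller callback).

    Only the rejected words are kept in memory, so memory use grows with
    the number of failures, not with the size of the data set.
    """
    total = 0
    rejected: List[str] = []
    for word in words:
        total += 1
        if not accepts(word):
            rejected.append(word)
    return total, rejected


# Small in-memory demonstration; a real run would pass an open file object.
sample = io.StringIO("stuolla\n\nTV-stuolla\nJuvvelan*gorsa\n")
total, rejected = run_regression(iter_words(sample), lambda w: "*" not in w)
```

The lambda in the demonstration is a stand-in check that only rejects words containing an asterisk; it is not the speller's actual behaviour.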

Test findings up to now: all words tested are OK, except for:

- words containing colons (":"), such as
    "ABC-company:aid", ...
- words containing asterisk characters ("*"), such as
    "Juvvelan*gorsa" - this is a typo, the only one; it is already fixed
- multi-word expressions, such as
    "Beakka Hánno gieddila...":


# = word boundary
^ = possible hyphenation point
- = hard hyphen
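
The notation above could be read by a small helper like the one below. This is a sketch, not the actual Divvun or Polderland code: the function name `parse_hyphenation` is hypothetical, it assumes that `#` and `^` are zero-width annotations that do not belong to the surface form, and the sample strings are made up for illustration rather than being verified hyphenations.

```python
from typing import List, Tuple


def parse_hyphenation(marked: str) -> Tuple[str, List[int], List[int]]:
    """Split a marked-up entry into its plain form and marker positions.

    Returns (plain, hyphenation_points, word_boundaries), where both lists
    hold character offsets into `plain`.
    Assumption: '^' and '#' are annotations only, while the hard hyphen '-'
    is part of the written word and is kept.
    """
    plain: List[str] = []
    hyphenation_points: List[int] = []
    word_boundaries: List[int] = []
    for ch in marked:
        if ch == "^":
            hyphenation_points.append(len(plain))
        elif ch == "#":
            word_boundaries.append(len(plain))
        else:
            plain.append(ch)  # includes the hard hyphen '-'
    return "".join(plain), hyphenation_points, word_boundaries


# Illustrative input only; the marker placement is not a verified hyphenation.
plain, points, boundaries = parse_hyphenation("TV-stuol^la")
```

Storing offsets rather than split strings keeps the original spelling intact, which makes it easy to compare the plain form against the speller lexicon.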

Hyphenation at Divvun has improved, but there are still issues, which will be investigated.

Tasks since last time

  • send correction/phonetic rules to Thomas (Peter)
    • done
  • review the processed data sets (Thomas)
    • no need to do this, as the delivered data isn't split any more.
  • make speller lexicon data for Polderland, with POS, compounding properties and hyphenation (Divvun, mainly Tomi)
    • started
  • try to send a first binary to Sjur (Peter)
    • we'll wait for the Mac version
  • check whether hyphen is used when compounding abbreviations (TV-stuolla) (Thomas)
    • yes, hyphen


  • continue to improve hyphenation (Sjur and Thomas)
  • continue with speller data generation/conversion (Tomi)