meeting_2006-11-21

Meeting with Polderland 21.11.2006

Participants:

  • Peter Beinema
  • Sjur Moshagen
  • Thomas Omma

Agenda

  • since last time
  • questions and answers

Since last time

Polderland:

  • speller: include large lexicon
    • -> adapt "mklex" lexicon compiler for large files (found bug in gcc 3.2)
  • mklex: port mklex to Mac OS X, to be delivered to Divvun
    • make Sami-specific; leave out some general functionality
    • create scripts to run mklex on your side
  • hyphenator:
    • insert lexicon lookup-step before pattern matching
    • adapt pattern matching (currently ASCII-based, not UTF-8/UCS-2)
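
A minimal Python sketch of why ASCII-based pattern matching needs adapting: it assumes one byte per letter, which does not hold for Sámi letters in UTF-8. The sample word is chosen only for illustration and is not taken from the Polderland or Divvun data.

```python
# Sketch: letter count versus byte count for a word with non-ASCII letters.
# The word is written with escapes so the code points are unambiguous:
# U+010D (c with caron) and U+00E1 (a with acute), followed by "hci".
word = "\u010d\u00e1hci"

utf8_bytes = word.encode("utf-8")
ucs2_bytes = word.encode("utf-16-le")  # identical to UCS-2 for these BMP letters

letters = len(word)         # number of letters (code points)
utf8_len = len(utf8_bytes)  # the two non-ASCII letters take 2 bytes each
ucs2_len = len(ucs2_bytes)  # every letter takes exactly 2 bytes

print(letters, utf8_len, ucs2_len)
```

A pattern position counted in bytes therefore no longer lines up with a position counted in letters once the input is UTF-8.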

Divvun:

  • first test run of PLX data done, including hyphenation points

Alpha version

Spellers

Both sme and smj. The sme version will use the latest 20 Gb lexicon, if possible.

Hyphenators

Will use only the limited data delivered to Polderland, with the fallback algorithm handling all words not in the lexicon. It will provide a nice test case for the fallback algorithm :-)
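
The lookup-then-fallback order described here can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the sample lexicon entry and the toy fallback rule are hypothetical and do not represent the Polderland pattern matcher.

```python
from typing import Callable, Dict


def hyphenate(word: str, lexicon: Dict[str, str],
              fallback: Callable[[str], str]) -> str:
    """Use the lexicon hyphenation if the word is listed, else the fallback."""
    if word in lexicon:
        return lexicon[word]
    return fallback(word)


def toy_fallback(word: str) -> str:
    """Placeholder for pattern matching (NOT the real algorithm): insert a
    soft hyphen (^) before a single consonant between two vowels."""
    vowels = set("aeiou")
    out = []
    for i, ch in enumerate(word):
        if (0 < i < len(word) - 1 and word[i - 1] in vowels
                and ch not in vowels and word[i + 1] in vowels):
            out.append("^")
        out.append(ch)
    return "".join(out)


# Hypothetical lexicon with one pre-hyphenated entry.
sample_lexicon = {"xyzzy": "xyz^zy"}

from_lexicon = hyphenate("xyzzy", sample_lexicon, toy_fallback)
from_fallback = hyphenate("banana", sample_lexicon, toy_fallback)
print(from_lexicon, from_fallback)
```

With only limited lexicon data, most words take the second path, which is what makes the alpha version a test of the fallback.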

Divvun hyphenation marks:
#   - word boundary
^   - soft hyphen
-   - hard hyphen

Polderland hyphenation marks:
--  - hard hyphen
-   - soft hyphen
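
A possible conversion between the two notations, sketched in Python. The function and table names are hypothetical. Polderland has no word boundary mark in the list above, so dropping `#` is an assumption made only for this sketch.

```python
# Divvun mark -> Polderland mark, as listed in these notes.
DIVVUN_TO_POLDERLAND = {
    "^": "-",   # soft hyphen
    "-": "--",  # hard hyphen
    "#": "",    # word boundary: ASSUMPTION, dropped (no Polderland mark listed)
}


def divvun_to_polderland(marked: str) -> str:
    """Convert one marked word character by character.

    A single pass is used on purpose: chained string replacements could
    re-convert a '-' that was just produced from '^'.
    """
    return "".join(DIVVUN_TO_POLDERLAND.get(ch, ch) for ch in marked)


converted = divvun_to_polderland("ab^cd-ef")
print(converted)
```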

Possible issues

Clitics

Clitics should be applied to all inflected forms. These are normally marked as rightmost only.

How can we specify the clitics such that they can be combined with inflected forms? Or do we have to pregenerate all word forms with clitics? If so, the size of the generated data will grow more than tenfold (> 250 Gb!).
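
To illustrate why pregeneration is costly, here is a small Python sketch of the expansion. The forms and clitics are placeholders, not real Sámi data, and the model (each clitic attaches once to every form) is a simplifying assumption.

```python
from itertools import product
from typing import List


def pregenerate(forms: List[str], clitics: List[str]) -> List[str]:
    """Every inflected form on its own, plus every form followed by each clitic."""
    return list(forms) + [form + clitic for form, clitic in product(forms, clitics)]


forms = ["formA", "formB", "formC"]  # placeholder inflected forms
clitics = ["+c1", "+c2"]             # placeholder clitics

expanded = pregenerate(forms, clitics)
growth = len(expanded) / len(forms)  # equals 1 + len(clitics) in this model
print(len(expanded), growth)
```

In this simplified model the data grows by a factor of one plus the number of clitics, so a more than tenfold increase would correspond to more than nine clitic endings per form.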

Sample word: xyzzy NR
Will the + operator work in this case: ish +N,A

go Vt
goes VRI
ing +Vt
gå +V

schaaps NL (sheep-)

Next meeting

Next Tuesday (28.11.) at the usual time.

TODO:

  • continue to improve hyphenation (Sjur and Thomas)
  • provide new batch of hyphenated data (Sjur)
  • send clitic problem via e-mail to Polderland (Sjur)
  • send first round of PLX data to Polderland (Tomi, Børre)
  • make complete PLX data set (Tomi)
  • get language codes to work with Mac Office 2004 (and check Mac Office 2007) (Polderland)
  • deliver Alpha versions, including mklex + hyphen script (Polderland)
  • try to find proper compiler version for Adobe InDesign (old version will probably do)