meeting_2007-01-16

Meeting with Polderland 16.1.2007

Participants:

  • Peter Beinema
  • Sjur Moshagen

Agenda

  • Since last time
  • Possible issues
  • Next meeting

Since last time

Polderland:

  • no response from MS on language codes for Mac yet, will poll them again
  • mklex: beta, awaiting test results

Divvun:

PLX conversion is progressing well again: all major POSes and the closed POSes have been converted. Børre and Tomi are now working on compounding. We hope to have a first installment of a compounding lexicon by the end of this week.

Possible issues

The big lexicon file (25+ Gb) contains large portions of words starting with:

  • - (hyphen; 0x2d) (3 Gb)
  • e (0x65; 8 Gb)

Could it be earon- or something similar?

It is eahpá-, which corresponds to un- in English. It used to be prefixed to all nouns, verbs and adjectives, adding roughly 50% to the lexicon size (closed POSes and names were excluded). I have now removed it from the word list generation; it will be added later as a PLX prefix when we deliver the PLX word list.

Name "prefixes"

Some nouns are common as prefixes to names, mainly words for North, South, etc. Even though these are regular nouns, when prefixing proper nouns they should be capitalized, and there needs to be a hyphen between the prefix and the second part. Example (spelling error -> correct form):

  • davvi-Norgga -> Davvi-Norgga (= North(ern) Norway)

That is:

name-prefix + name => upper case + hyphen

Thus, to handle these cases correctly, we need to distinguish names from other nouns, so that the uppercased and hyphenated prefixes are directed only to the correct second parts.
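The rule above can be sketched in a few lines of Python. This is only an illustration of the intended behaviour, not how the PLX lexicon or the speller implements it. The sets NAME_PREFIXES and PROPER_NOUNS and the function correct_name_compound are hypothetical names, and the entries are limited to the example from these notes.

```python
# Hypothetical data: only the example from the notes is included.
# In the real lexicon these classes would come from the PLX word list.
NAME_PREFIXES = {"davvi"}
PROPER_NOUNS = {"Norgga"}


def correct_name_compound(word):
    """Return the corrected form if `word` is name-prefix + proper noun.

    Returns None when the word does not match the rule, so ordinary
    noun compounds are left untouched.
    """
    lowered = word.lower()
    for prefix in NAME_PREFIXES:
        if not lowered.startswith(prefix):
            continue
        # Drop the prefix and any hyphen the writer already typed.
        rest = word[len(prefix):].lstrip("-")
        if not rest:
            continue
        candidate = rest[0].upper() + rest[1:]
        # Only names trigger upper case + hyphen.
        if candidate in PROPER_NOUNS:
            return prefix.capitalize() + "-" + candidate
    return None


print(correct_name_compound("davvi-Norgga"))
```

The lookup in PROPER_NOUNS is the point of the sketch: without a way to tell names from other nouns, the uppercased prefix would attach to every noun.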

Hyphen as prefix

In coordinated compounds with a common first part (YX and YZ => YX and -Z, e.g. Norwegian bussbillett og -sjåfør, "bus ticket and bus driver"), it is customary to replace the repeated common part with a hyphen. We could reduce the size of the noun lexicon by 50% if we could specify the hyphen as a prefix instead of, as now, pregenerating every noun in a version with a prefixed hyphen.
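To make the size argument concrete, here is a minimal Python sketch comparing the two approaches. It assumes a toy noun list taken from the example above. The names nouns, pregenerated and accepts are hypothetical, and the sketch does not reflect how PLX actually declares prefixes.

```python
# Toy noun list (hypothetical); the real noun lexicon is far larger.
nouns = {"billett", "sjåfør"}

# Current approach: every noun is also stored with a leading hyphen,
# which doubles the number of entries.
pregenerated = nouns | {"-" + noun for noun in nouns}


def accepts(word, lexicon):
    """Accept a noun, optionally preceded by a hyphen treated as a prefix."""
    stem = word[1:] if word.startswith("-") else word
    return stem != "" and stem in lexicon


print(len(nouns), len(pregenerated))
print(accepts("-sjåfør", nouns))
```

With the hyphen handled as a prefix, the stored list stays at its original size, which is half of the pregenerated one.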

Next meeting

Next Tuesday (23.1.) at the usual time.

TODO:

  • get back on linguistic issue regarding proper nouns vs. common nouns
  • get back on linguistic issue re. hyphen as prefix
  • check if North Sámi hyphenation can be disabled when processing Lule Sámi (PLD)
  • make complete PLX data set (Tomi)
    • progressing
  • get language codes to work with Mac Office 2004 (and check Mac Office 2007) (Polderland)
  • deliver mklex + hyphen script
  • try to find the proper compiler version for Adobe InDesign (an old version will probably do)
  • try to get an answer from other sources to the question about language codes in MS Office for Mac (Sjur)
  • investigate the initial "e" group of words (8 Gb) (Sjur)
    • done