Meeting_2007-05-29
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 29.05.2007
- Time: 09.30 Norw. time
- Place: Internet
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 12: 50.
Present: Børre, Maaren, Per-Eric, Sjur, Steinar, Thomas, Tomi, Trond
Absent: Saara
Agenda accepted as is.
2. Updated task status since last meeting
Boerre
- add sma texts to the corpus repository
- run all known spelling errors in the prooftest corpus through the speller
- add extraction of all known spelling errors in the regular corpus (not the
prooftest corpus), and check that they are properly corrected - update and fix our documentation and infrastructure as Steinar finds
problem areas - low priority - study the Hunspell formalism in detail - after beta release
- test the typos.txt list, and check that all entries are properly corrected
- add info to front page (incl. download links)
- write separate page with detailed info (incl. download links)
- a separate page for the beta speller, with installation instructions, etc.
- a separate page for the beta speller, with installation instructions, etc.
- translate press release, web pages
- contact Davvi Girji / Mikal Aase
- include improved numbers in beta release speller
- test download and installation of beta release candidate on all platforms
- test deinstallation of beta release candidate on all platforms
- linguistic testing of speller beta release candidate
- update Win installer package with latest speller lexicons
- write and prepare README file
- fix bugs!
Maaren
- lexicalise actio compounds
- Manually mark speller test documents for typos
Per-Eric
- expand the smj typos list
- add missing smj words
Saara
- improve cgi-bin scripts
- add new features to the paradigm generator
- paradigm generator for Lule Sámi
- add new features to the paradigm generator
- add new XSL/XML headers for proofing test docs
- compilation of verb lists
- read the manual for graphical corpus interface and try to add files with Lars.
- fix bugs!
Sjur
- finish press release for the beta
- run all known spelling errors in the corpus through the speller
- document the AppleScript testing tool
- integrate regression self tests with the make file
- improve speller test bench
- integrate the ccat speller testing options in the make file
- fix internet setup for Per-Eric's satelite modem
- look over the Bugzilla status mails
- contact Davvi Girji / Mikal Aase
- ask Xerox for a commercial lisense for the xfst tools on the G5
- check with Sámi publishing houses whether support for CS2 is still needed
- send Beta RC Win and Mac installers to Polderland for testing
-
ń not considered word char in smj, not corrected to ŋ
- forward list of PR recipients to SD
- update Mac installer package with latest speller lexicons
- write and prepare README file
- fix bugs!
Steinar
- test actio compounds before public beta release
- test download and installation on all platforms, following our instructions
- test deinstallation
- Beta testing: Align manually (shorter texts)
- Manually mark speller test texts for typos (making them into gold standards),
add the texts to the orig/catalogue - Complete the semantic sets in sme-dis.rle
- missing lists
- fix bugs!
Thomas
- work with compounding
- Lack of lowering before hyphen: Twol rewrite.
- translate beta release docs to sme and smj
- implement –k ending in twol
- Test Actio compounds in speller beta release candidate
- linguistic testing of speller beta release candidate
-
smj: öä not accepted, only øæ (except for lexicalised names)
- fix bugs!
Tomi
- add compounding restrictions to the PLX convertion
- make PLX conversion test sample; add conversion testing to the make file
- improve prefix and middle-noun PLX conversion
- integrate the ccat speller testing options in the Makefile
- fix clitics in beta release sme speller
- first part of multiword expressions not accepted
- fix bugs!
Trond
- Work on the web corpus issues
- Postpone these tasks to after the beta:
- update the smj proper noun lexicon, and refine the morphological
analysis, cf. the propernoun-smj-lex.txt - Go through the Num bugs
- update the smj proper noun lexicon, and refine the morphological
- fix bugs!.
3. Documentation
The open documentation issues fall into these three categories:
- Beta documentation for testers
- Documentation for the online corpora
- General documentation improvement after Steinar's test (for open-source
release)
TODO:
- write form to request corpus user account (Børre, Sjur, Trond)
- delayed till after the beta release
- delayed till after the beta release
- document how to apply for access to closed corpus, and details on the corpus
and its use in general (Børre, Sjur, Trond)- delayed till after the beta release
- delayed till after the beta release
- correct and improve it based on feedback from Steinar ( Børre)
- low priority
- low priority
- beta documentation (see separate beta section below)
4. Corpus gathering
TODO:
-
sme texts: no new additions, fix corpus errors during this month
( Børre, Trond, Saara) - missing nob parallel texts should be added if such holes are found
( Børre, Trond) - Go through the list of missing or errouneous nob texts, based upon
Saara's perfect list (Børre, Trond) - add sma texts to the corpus repository (Børre)
- contact Davvi Girji / Mikal Aase ( Børre, Sjur)
5. Corpus infrastructure
Nothing?
6. Infrastructure
TODO:
- update and fix our documentation and infrastructure as Steinar finds
problem areas (Børre) - fix internet setup for Per-Eric's satelite modem (Sjur, Børre)
- this influences iChat, SEE sharing, and ARD connetions
- this influences iChat, SEE sharing, and ARD connetions
- Translate the paradigm pages to Sámi, fix all the se links
( Børre, Trond)
7. Linguistics
North Sámi
TODO:
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
( Maaren)- possibly turn on free compounding as part of the PLX conversions (ie free
compounding in the spellers, but not in the analyzers/transducers)
- possibly turn on free compounding as part of the PLX conversions (ie free
- fix stuorra-oslolaš lower case o ( Sjur, Thomas, Trond)
- postponed till after the public beta
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
( Thomas, Trond) - write two-level rules to handle -ge/-k clitic alternation (Thomas)
8. Name lexicon infrastructure
Decisions made in Tromsø can be found in this meeting memo.
TODO:
- fix bugs in lexc2xml; add comments to the log element (Saara)
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: ( Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara) - convert propernoun-($lang)-lex.txt to a derived file from common xml files
( Sjur, Tomi, Saara) - implement data synchronisation between risten.no and
the cvs repo, and possibly other servers (ie the G5 as an alternative server to the public risten.no - it might be faster and better suited than the official one; also local installations could be treated the same way) - start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
(e.g. @type=secondary) (Thomas, Maaren, linguists) - merge placenames which are errouneously in different entries: e.g. Helsinki,
Helsingfors, Helsset (linguists) - publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
( linguists)
9. Spellers
OOo speller(s)
TODO after the MS Office Beta is delivered:
- add Hunspell data generation to the lexc2xspell (Tomi - after the
PLX data generation is finished) - study the Hunspell formalism in detail (Børre, Sjur, Tomi)
Testing
Spelling Error Markup
TODO:
- Manually mark test texts for typos (making them into gold standards)
( Steinar) - Set up ways of adding meta-information (source info, used in testing or not,
added to lexicon or not) - Conduct tests on new beta versions on the basis of the unspoiled gold standard
documents (whoever has time), and fill in data from testing (the testers: who?) - include the files already publically tested into the prooftest cataloge
( Steinar, Trond) - test each version before beta release
Testing tools
TODO:
- document the AppleScript testing tool (Sjur)
- improve speller test bench (Sjur)
- integrate the ccat speller testing options in the Makefile (Sjur, Tomi)
Regression tests
TODO:
- add extraction of all known spelling errors in the corpus (not the
prooftest corpus), and check that they are properly corrected ( Børre, Sjur) - test the typos.txt list, and check that all entries are properly corrected
( Børre, Sjur) - consider how to do a regression self-test, ie, how to test the full
wordlist (Børre, Sjur)- extract all the base forms in the lexicon, and run them through the speller
- extract all SUB-marked entries, and run them through the lexicon
- integrate these in the make file (Sjur)
- extract all the base forms in the lexicon, and run them through the speller
Localisation
We need to translate the info added to our front page (and a separate page)
TODO:
- translate beta release docs to sme ( Thomas)
- translate beta release docs to smj ( Thomas)
Lexicon conversion to the PLX format
TODO:
- install larger disks, new RAM on the G5 when they arrive (Børre)
- ask for mklex for Linux (victorio) from Polderland (Sjur)
- ask Xerox for a commercial lisense for the xfst tools on the G5 (Sjur)
- add compounding restrictions to the PLX convertion (Tomi)
Compounding restrictions
How to include compounding restriction comment tags in the transducers:
giv0ri:giv'ri ALBMI ; !+SgNomCmp +SgGenCmp +PlGenCmp => (using a perl script or similar) +SgNomCmp+SgGenCmp+PlGenCmpgiv0ri:giv'ri ALBMI ; !+SgNomCmp +SgGenCmp +PlGenCmp
TODO:
- improve prefix conversion to PLX (Tomi)
- improve middle noun conversion to PLX (Tomi)
- improve noun + adjective PLX conversion: ( Tomi)
- compounding stems - how do we generate them? Using the java client?
+SgNomCmp+Cmpnd = sáme–, should give the correct compounding stem, shouldn't it? We want to optionally go from: sáme- NLI to sáme NL: - NLI (->) NL, which means we should be able to extract correct compounding stems using xfst methods only. - compounding tags - we need to obey them when making the transducers.
Suggestion - see above.
- compounding stems - how do we generate them? Using the java client?
- make conversion test sample; add conversion testing to the make file
( Tomi)- to regression test / QA the PLX conversion.
- to regression test / QA the PLX conversion.
- improve number conversion (Børre, Tomi)
- ask for larger disk for the web server (Trond, Børre)
Errors in latest speller build
TODO after the final public beta:
- first part of multiword expressions not accepted (Tomi)
- compound restrictions/rules not yet integrated in speller fsts (but the tags
necessary to do this can now be included in the transducer) (Tomi) - suomasápmelaš vs suopmasápmelaš - now both are accepted (which is an
improvement - earlier only suopmasápmelaš was accepted). What is correct? Only the first one. The second one should be banned by the compound restrictions. -
smj: öä not accepted, only øæ (except for lexicalised names)
( Thomas) - ŋ and ñ and ń in smj ( Sjur, Polderland)
Public Beta release
Schedule:
DEADLINE/RELEASE: Tuesday May 29.
TEST before public beta release:
- Actio compounds (Thomas, Steinar)
- download and installation on all platforms, following our instructions
( Steinar, Børre) - deinstallation (Steinar, Børre)
- linguistic testing (typos, at least one of the documents marked up by
Steinar) (Thomas, Børre)
TODO:
- fix clitics (Tomi)
- include improved numbers (Børre)
- finish press release (Sjur)
- add info to front page (incl. download links) (Børre)
- download page made, only needs to add the speller beta when it is ready.
- download page made, only needs to add the speller beta when it is ready.
- write separate page with detailed info (incl. download links) (Børre)
- a separate page for the beta speller, with installation instructions, etc.
- a separate page for the beta speller, with installation instructions, etc.
- translate press release, web pages (Børre, Thomas, whoever)
- forward list of PR recipients to Berit Karen Paulsen
( Sjur)- update: the persons to contact are: Pål Hivand or
John Markus Kuhmunen
- update: the persons to contact are: Pål Hivand or
- update installer packages with latest speller lexicons (Børre (Win),
Sjur (Mac)) - README file: ( Børre, Sjur)
- list of known issues
- how to give us feedback
- what to give us feedback about (including: send in your user dictionary after
a month or so) - inform about the rights to the source, as well as the availability of the
speller - needs to be written in all languages
- make a PDF of the readme, to avoid encoding problems; text versions should be
included too
- list of known issues
10. Other
Summer vacation
When are we taking it? Please fill in the table below:
| Name | Starting | Ending |
|---|---|---|
| Børre | x|x | |
| Maaren | x|x | |
| Per-Eric | x|x | |
| Saara | x|x | |
| Sjur | x|x | |
| Steinar | x|x | |
| Thomas | 9.7. | 12.8. |
| Tomi | 9.7. | 5.8. |
| Trond | 25.6. |x |
Divvun people also need to send the dates to Julie Eira.
Corpus contracts
TODO:
- publish corpus contracts and project infra as open-source on NoDaLi-sta
( Sjur)- delayed until the public beta is out the door
Bug fixing
41 open Divvun/Disamb bugs, and 23 risten.no bugs
TODO:
- look over the Bugzilla status mails (Børre)
11. Next meeting, closing
The next meeting is 4.6.2007, 10: 30 Norwegian time.
The meeting was closed at 11: 56.
Appendix - task lists for the next week
Boerre
- add sma texts to the corpus repository
- run all known spelling errors in the prooftest corpus through the speller
- add extraction of all known spelling errors in the regular corpus (not the
prooftest corpus), and check that they are properly corrected - update and fix our documentation and infrastructure as Steinar finds
problem areas - low priority - study the Hunspell formalism in detail - after beta release
- test the typos.txt list, and check that all entries are properly corrected
- add info to front page (incl. download links)
- write separate page with detailed info (incl. download links)
- a separate page for the beta speller, with installation instructions, etc.
- a separate page for the beta speller, with installation instructions, etc.
- translate press release, web pages
- contact Davvi Girji / Mikal Aase
- include improved numbers in beta release speller
- test download and installation of beta release candidate on all platforms
- test deinstallation of beta release candidate on all platforms
- linguistic testing of speller beta release candidate
- update Win installer package with latest speller lexicons
- write and prepare README file
- fix bugs!
Maaren
- lexicalise actio compounds
- Manually mark speller test documents for typos
Per-Eric
- expand the smj typos list
- add missing smj words
Saara
- improve cgi-bin scripts
- add new features to the paradigm generator
- paradigm generator for Lule Sámi
- add new features to the paradigm generator
- add new XSL/XML headers for proofing test docs
- compilation of verb lists
- read the manual for graphical corpus interface and try to add files with Lars.
- fix bugs!
Sjur
- finish press release for the beta
- run all known spelling errors in the corpus through the speller
- document the AppleScript testing tool
- integrate regression self tests with the make file
- improve speller test bench
- integrate the ccat speller testing options in the make file
- fix internet setup for Per-Eric's satelite modem
- look over the Bugzilla status mails
- contact Davvi Girji / Mikal Aase
- ask Xerox for a commercial lisense for the xfst tools on the G5
- check with Sámi publishing houses whether support for CS2 is still needed
- send Beta RC Win and Mac installers to Polderland for testing
-
ń not considered word char in smj, not corrected to ŋ
- forward list of PR recipients to SD
- update Mac installer package with latest speller lexicons
- write and prepare README file
- fix bugs!
Steinar
- test actio compounds before public beta release
- test download and installation on all platforms, following our instructions
- test deinstallation
- Beta testing: Align manually (shorter texts)
- Manually mark speller test texts for typos (making them into gold standards),
add the texts to the orig/catalogue - Complete the semantic sets in sme-dis.rle
- missing lists
- fix bugs!
Thomas
- work with compounding
- Lack of lowering before hyphen: Twol rewrite.
- translate beta release docs to sme and smj
- implement –k ending in twol
- Test Actio compounds in speller beta release candidate
- linguistic testing of speller beta release candidate
-
smj: öä not accepted, only øæ (except for lexicalised names)
- fix bugs!
Tomi
- add compounding restrictions to the PLX convertion
- make PLX conversion test sample; add conversion testing to the make file
- improve prefix and middle-noun PLX conversion
- integrate the ccat speller testing options in the Makefile
- fix clitics in beta release sme speller
- first part of multiword expressions not accepted
- fix bugs!
Trond
- Work on the web corpus issues
- Postpone these tasks to after the beta:
- update the smj proper noun lexicon, and refine the morphological
analysis, cf. the propernoun-smj-lex.txt - Go through the Num bugs
- update the smj proper noun lexicon, and refine the morphological
- fix bugs!.

