Conversion scripts

The conversion scripts are located in gt/script. They are of two types: Perl scripts (*.pl) and xfst scripts. The xfst scripts are compiled: the source files are named filename.regex and the binary files filename.fst.

The scripts have different functions. Some scripts convert input text to the internal format used by the program, whereas other scripts convert the output of the program into a format suitable for output.

Note that the unix utility iconv contains ready-made conversion routines for many code tables. The syntax is as follows:

$ iconv --from-code=ISO-8859-1 --to-code=UTF-8 < old_file > new_file

The available code tables are listed with iconv --list. This does not help in converting text to our internal format, but in the future it may be used for conversion to UTF-8.
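As a small illustration of what such a conversion does, the sketch below sends a single Latin 1 byte through iconv and prints the resulting bytes in hexadecimal. The function name latin1_to_utf8_hex is made up for this example, and -f and -t are the short forms of the --from-code and --to-code options shown above.

```shell
# Convert one ISO-8859-1 byte, given as an octal code, to UTF-8 and
# print the resulting bytes as hex digits without spaces.
latin1_to_utf8_hex() {
    # $1: octal code of the byte, e.g. 341 (hex E1, the letter á)
    printf "\\$1" | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

# In UTF-8 the letter á is stored as the two bytes c3 a1.
latin1_to_utf8_hex 341
echo
```

The single byte in the input becomes two bytes in the output, which is why files must be converted rather than just relabelled.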

Naming the scripts

The scripts are named "sourceform-targetform.scripttype". For example, the Perl script converting Latin 6 input to the internal 7-bit digraph system (á, c1, d1, n1, s1, t1, z1) is called latin6-7bit.pl.

At the moment there are scripts for converting from ws2, Latin 6 and mac. The mac format is here called "linmac" (mac as observed on Linux), since mac files are translated to something else when they are moved to Linux. That "something else" is taken as the starting point for the conversion script.

Scripts converting input text to and from internal digraphs ("7bit")

Perl scripts

The Perl scripts contain conversion lines of the format s/\273/t1/g. This line converts a t-stroke (ŧ) to t1. The code position of t-stroke in the code table Latin 6 (used by Statens Kartverk, among others) is hexadecimal BB. The scripts write the character as an octal escape, and the octal value of BB is 273.
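The hexadecimal to octal step can be checked at the command line. The following is a minimal sketch; the helper names hex_to_octal and make_rule are hypothetical and are not part of gt/script.

```shell
# Turn a hexadecimal code position into the octal digits used in the rules.
hex_to_octal() {
    printf '%o\n' "0x$1"
}

# Build a substitution rule of the form s/\OCTAL/DIGRAPH/g.
make_rule() {
    # $1: hex code position, $2: replacement digraph
    printf 's/\\%s/%s/g\n' "$(hex_to_octal "$1")" "$2"
}

hex_to_octal BB      # prints 273
make_rule BB t1      # prints s/\273/t1/g
```

Starting from a code table that lists hexadecimal positions, this gives the line to put in the Perl script for each letter.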

Note that there are two different scripts, utf8-7bit.pl and utf8.pl. The former converts from utf8 to 7bit. The latter is an all-in-one script that converts from several formats (mac saved as utf8, text written on Win9x saved as utf8, etc.) to 7bit. Testing is needed to see whether this is a relevant partition. In any case, utf8-7bit.pl works when the input has not been corrupted, i.e. when it is real utf8.

xfst scripts

The <encoding>-7bit.regex files convert from the given encoding to the internal format.

The 7bit-<encoding>.regex files convert from the internal format to the given encoding.

Compiling .regex files

To make use of the .regex files you may have to compile them to .fst files. Go to the script directory and have a look at the .regex and .fst files. If the .regex file is older than the .fst file with the same name, you can use the .fst file right away and do not need to compile. If the .fst file is older or does not exist, you must compile it. To do that, type the following command while in the script directory:

make all
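The age comparison described above can also be left to the shell. The sketch below is only an illustration: needs_compile is a hypothetical helper, and it relies on the -nt ("newer than") file test, which is available in bash and most other shells.

```shell
# Return success (0) when the .fst must be (re)compiled:
# either it does not exist, or the .regex source is newer than it.
needs_compile() {
    regex=$1
    fst=${regex%.regex}.fst
    [ ! -e "$fst" ] || [ "$regex" -nt "$fst" ]
}

# Report the state of every .regex file in the current directory.
for regex in *.regex; do
    [ -e "$regex" ] || continue   # no .regex files here
    if needs_compile "$regex"; then
        echo "$regex: compile needed (run make all)"
    else
        echo "$regex: .fst is up to date"
    fi
done
```

Running make all is always safe; the check only shows whether it is necessary.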

Using the resulting .fst files for North Sámi

To convert from a given encoding to the internal format, go to the script directory and type the following command:

cat <encoding>-filename | lookup -flags mbTT -f <encoding>-7bit.fst
        

It will convert a file from the given encoding to the internal format.

To analyze a file in a given encoding, go to the gt/sme directory. To analyze a file in the ws2 (also known as levi or WinSam2) encoding, type the command:

cat <filename> | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -f ws2-sme | less
        

When this command is executed, the input file is first tokenized, then converted to the internal format and analyzed, and the output is given in the same encoding as the input file.

Note
XXX But is this a good idea? We must evaluate this. It is hard to see why anyone would want the result converted back to ws2 on a Linux terminal. Tests on input are needed here. (TT)

The lookup script file ws2-sme has the following content (the file format is documented in Beesley/Karttunen, p. 442):

sme sme.fst
fws2    ../script/ws2-7bit.fst
tws2    ../script/7bit-ws2.fst

fws2 sme tws2
        

This file converts the input from the ws2 encoding to the internal format. The input will then be analyzed with the sme.fst file and the result is converted back to the ws2 format.

The other <encoding>-sme files follow the same pattern.
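The pattern can be written out as a small generator. This is a sketch under an assumption: the f<encoding> and t<encoding> labels simply copy the naming in the ws2 example, and the function make_lookup_script is hypothetical, not a script in gt/script.

```shell
# Print a lookup script for the encoding given in $1, modelled on ws2-sme.
# ASSUMPTION: the other <encoding>-sme files use the same f/t label naming.
make_lookup_script() {
    enc=$1
    printf 'sme sme.fst\n'
    printf 'f%s    ../script/%s-7bit.fst\n' "$enc" "$enc"
    printf 't%s    ../script/7bit-%s.fst\n' "$enc" "$enc"
    printf '\n'
    # Apply the three transducers in order: from encoding, analyze, to encoding.
    printf 'f%s sme t%s\n' "$enc" "$enc"
}

make_lookup_script ws2
```

Called with ws2 it prints the file content shown above; the output for any other encoding should be compared with the actual <encoding>-sme file before it is used.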

The case conversion scripts

Initial capital letter

The most important case conversion scripts are case.regex (caseconv.fst). They differ from language to language and are located in the language-specific directories. They form an integrated part of the Makefiles, and the resulting parsers are able to recognise initial capital letters.

Letters in all caps

There are also scripts that allow for words written in all caps, called allcaps.regex. With these scripts, "Duodji" is accepted, as is "DUODJI", but "DuoDji" is not. They are also located in the src directories (so far only for sme) and are integrated in the Makefile. The resulting allcaps.fst is not compiled together with sme.fst into a single transducer, however, as that would result in too large a network. Instead, it is kept separate in the sme/bin directory, and when needed it may be invoked with the following command (assuming you are in gt/sme):

... | lookup -flags mbTT -f src/cap-sme | ...

Note that the lookup script file is located in sme/src, but the binary allcaps.fst that the cap-sme file refers to is located in sme/bin.
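The capitalisation patterns that are let through can be illustrated outside xfst. The following is a rough sketch, not the allcaps.regex implementation: case_ok is a hypothetical helper, it assumes that plain lower case is accepted as well, and it is only reliable for ASCII letters because the behaviour of tr on other letters depends on the locale.

```shell
# Succeed when $1 is all lower case, has an initial capital only,
# or is all caps; fail on mixed case such as "DuoDji".
case_ok() {
    word=$1
    lower=$(printf '%s' "$word" | tr '[:upper:]' '[:lower:]')
    upper=$(printf '%s' "$word" | tr '[:lower:]' '[:upper:]')
    first=${upper%"${upper#?}"}          # first character, upper-cased
    capitalized=$first${lower#?}         # e.g. "Duodji"
    [ "$word" = "$lower" ] || [ "$word" = "$capitalized" ] || [ "$word" = "$upper" ]
}

for w in Duodji DUODJI DuoDji; do
    if case_ok "$w"; then echo "$w: accepted"; else echo "$w: rejected"; fi
done
```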

The spellrelax scripts

South and Lule Saami have scripts that allow for different practices for writing ï (as ï or i) and for the mix of Norwegian and Swedish vowel letters. These are xfst scripts, integrated in the Makefiles of sma and smj.

The scripts converting 7bit to html

Børre?

Or should this be documented on the web interface page?

Scripts converting from "alien" fileformats to 7bit

pdf to 7bit converters

The script pdfto7bit.pl converts pdf files to 7bit. It is used like this:

pdfto7bit.pl [option] <filename>

The options allowed are:

  • -e: output the even pages
  • -o: output the odd pages

To use it, the gt/script directory must be in your path. Type this at the command prompt:

PATH="$HOME/[path to the gt directory]/gt/script:$PATH"
      

After this you can type "pdfto7bit.pl" at the command prompt to use it. Typical uses are shown below:

  • To analyze a pdf file, go to the gt/sme directory and type: pdfto7bit.pl <filename.pdf> | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT sme.fst | less. The more advanced uses documented in the sme-manual can also be used.
  • for pdffile in [directory of pdf files]/*.pdf; do pdfto7bit.pl $pdffile > [directory of text files]/`basename $pdffile .pdf`.txt; done. This command takes a batch of pdf files, converts them to text files and saves them in a given directory. The expression `basename $pdffile .pdf`.txt ensures that a pdf file named foo.pdf is saved as foo.txt.
  • Some pdf documents have Sámi and Norwegian text on alternating pages. The options -e and -o are there to handle this. If the Sámi text is on the even pages of such a document, type the following at the command prompt:
    pdfto7bit.pl -e <name of offending file>
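The batch conversion in the list above can be wrapped in a function that also copes with file names containing spaces. This is a sketch only: batch_convert is a hypothetical name, and the converter is passed in as a parameter so that the loop can be tried out with any command that takes a file name and writes to standard output.

```shell
# Convert every .pdf in directory $1 into a .txt file in directory $2,
# using the command named in $3 (in real use: pdfto7bit.pl).
batch_convert() {
    src=$1
    dst=$2
    converter=$3
    for pdffile in "$src"/*.pdf; do
        [ -e "$pdffile" ] || continue            # directory had no pdf files
        name=$(basename "$pdffile" .pdf)         # foo.pdf -> foo
        "$converter" "$pdffile" > "$dst/$name.txt"
    done
}
```

With gt/script in the path it would be called as batch_convert [directory of pdf files] [directory of text files] pdfto7bit.pl.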

Last modified: $Date: 2012-11-22 13:41:43 +0100 (Thu, 22 Nov 2012) $, by $Author: trond $

by Trond Trosterud, Børre Gaup