Corpus infrastructure


This document explains how to convert files in the corpus repositories to xml, and how to extract text from the converted documents.

Converting the corpus to xml

To convert files in our repository, you first have to check out our tools and do a basic setup.

The corpus is stored using svn. Check out the free part and, if you have access to it, the restricted part.

To make the content of the corpus easy to use, the original files, which are stored in various formats, have to be converted to a common xml format. This is done with the script convert2xml.

To convert all the original files in a repository to xml, open a terminal, go to the working copy, and issue the command:

convert2xml orig

If you want to convert a specific file, issue the command:

convert2xml <filename>

During conversion, progress is shown as lines looking like this: "...|......". Here, each symbol represents one file. A dot indicates a (formally) successful conversion, and a pipe symbol indicates a flawed conversion. Note that the conversion will be flawed if you are offline while converting, because the conversion script fetches the dtd from the net (we should consider making this more robust).
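
If the progress lines have been saved, for example in a log, the outcome can be tallied with a few lines of Python. This is only a minimal sketch, assuming the lines contain the two symbols described above; the function name summarize_progress is hypothetical and not part of our tools.

```python
from typing import Tuple


def summarize_progress(line: str) -> Tuple[int, int]:
    """Return (successful, flawed) counts for one progress line.

    Assumes the convention described above: "." marks a (formally)
    successful conversion and "|" marks a flawed one. Any other
    character is ignored.
    """
    successful = line.count(".")
    flawed = line.count("|")
    return successful, flawed


ok, bad = summarize_progress("...|......")
print(f"{ok} converted, {bad} flawed")
```

A non-zero flawed count is a hint to check the network connection first, since offline runs produce flawed conversions.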

Extracting text from the corpus

The converted xml files are found in the converted/ directory. To get all North Sami text, issue the command:

ccat -a -r -l sme converted

To get other languages, replace sme with e.g. sma (South Sami) or smj (Lule Sami). The options available for ccat are listed with the command ccat -h.
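
When text from several languages is needed, the ccat call can be scripted. The Python sketch below builds the command given above for a language code and can optionally run it. It assumes that ccat is on the PATH and writes the extracted text to standard output; the helper names ccat_command and extract_text are hypothetical and not part of our tools.

```python
import subprocess
from typing import List


def ccat_command(lang: str, directory: str = "converted") -> List[str]:
    """Build the ccat call described above for one language code."""
    return ["ccat", "-a", "-r", "-l", lang, directory]


def extract_text(lang: str, directory: str = "converted") -> str:
    """Run ccat and return what it prints.

    Assumption: ccat is installed and writes the text to standard output.
    """
    result = subprocess.run(
        ccat_command(lang, directory),
        capture_output=True,
        text=True,
        check=True,  # raise an error if ccat exits with a failure status
    )
    return result.stdout


# Show the commands for the three language codes named above
# without actually running them.
for code in ("sme", "sma", "smj"):
    print(" ".join(ccat_command(code)))
```

Calling extract_text("sme") from the working copy would then return the North Sami text as one string, provided the assumptions above hold.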

Old links (kept for now)