bpng — Poliqarp corpus builder

Synopsis

bpng { -h | --help | -v | --version }

bpng [option...] base-name [xml-root-dir...]

Description

bpng builds a binary corpus in the Poliqarp format from XML sources in the following formats:

IPI PAN variant of the XCES format;
NKJP variant of the TEI format.

Options

-h, --help

Display help and exit.

-v, --version

Output version information and exit.

-c, --continue

Continue a partially successful build, or add more files to an existing corpus.

-j, --parallel

Use multiple threads of execution.

Runtime behaviour with respect to parallelism can be controlled through several environment variables. Please refer to the OpenMP API specification for details.
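A rough sketch of a typical invocation follows. The base name mycorpus and the source directories xml/ and xml-extra/ are hypothetical, and the configuration file mycorpus.bp.conf is assumed to exist already. OMP_NUM_THREADS is a standard OpenMP environment variable; how bpng responds to it is assumed here, not confirmed by this manual.

```shell
# Hypothetical names: "mycorpus" is the base name, xml/ and xml-extra/ are source roots.
# OMP_NUM_THREADS is defined by the OpenMP specification; it caps the thread count.
export OMP_NUM_THREADS=4

if command -v bpng >/dev/null 2>&1; then
    # Initial build, using multiple threads of execution.
    bpng --parallel mycorpus xml/
    # Later: add files from another directory to the same corpus.
    bpng --continue mycorpus xml-extra/
else
    echo "bpng not found in PATH" >&2
fi
```

The guard around the two calls only keeps the sketch harmless on systems where bpng is not installed.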

Source format

Document files

Each document consists of two files: a header file (typically named header.xml) and a text with morphosyntactic annotations (typically named morph.xml or ann_morphosyntax.xml). Files of separate documents need to reside in separate directories. Gzip-compressed files (with a .gz suffix) will be decompressed on the fly.
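The layout described above could look like the following sketch. The directory names doc-0001 and doc-0002 are hypothetical, the files are empty placeholders, and it is assumed (not stated in this manual) that bpng discovers documents by walking the root directory.

```shell
# Hypothetical layout: one directory per document under a common root.
root=$(mktemp -d)
mkdir -p "$root/doc-0001" "$root/doc-0002"

# Placeholder files only; real ones hold the header and the annotated text.
touch "$root/doc-0001/header.xml" "$root/doc-0001/ann_morphosyntax.xml"
touch "$root/doc-0002/header.xml" "$root/doc-0002/morph.xml"

# Compress one annotation file; files with a .gz suffix are decompressed on the fly.
gzip "$root/doc-0002/morph.xml"

# List the resulting tree (paths relative to the root).
(cd "$root" && find . -type f | sort)
```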

Morphosyntax

Text with morphosyntactic annotations should either:

follow the IPI PAN variant of the XCES format:
- http://korpus.pl/download/frek.dtd/xcesAnaIPI.dtd
- http://korpus.pl/download/frek.dtd/xheaderIPI.elt
or follow the NKJP variant of the TEI format.

Header

No particular header format is required. The way header information is converted to the binary format can be customized in the configuration file.

Configuration file

The base-name.bp.conf file is used to customize the corpus build process. The file consists of sections, each introduced by a ‘[section]’ header and followed by ‘keyword = setting’ entries. Empty lines are ignored, and lines starting with # are comments.

`[locale]`

locale = locale-name

Specifies the corpus language and possibly other regional preferences. Currently, only string collation is affected by this setting.

A locale name is typically of the form language_territory.codeset where language is an ISO 639 language code, territory is an ISO 3166 country code, and codeset is a character set or encoding identifier like ISO-8859-1 or UTF-8.

On Unix systems, you can use the ‘locale -a’ command to list all the available locales.

Windows systems use a different locale naming convention. However, bpng is able to translate the usual locale name forms to the Windows-specific ones.

On Unix systems, the selected locale is required to support the UTF-8 encoding. The ‘.UTF-8’ suffix may be omitted from the locale name.

Both this section and this entry are required.
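As a minimal sketch, the following checks for a Polish UTF-8 locale on a Unix system and writes a [locale] section to a hypothetical mycorpus.bp.conf. The choice of pl_PL is only an example, and the spelling reported by ‘locale -a’ varies between systems.

```shell
# Check whether a Polish UTF-8 locale is installed (Unix only).
# The exact spelling varies by system (pl_PL.UTF-8, pl_PL.utf8, ...).
if locale -a 2>/dev/null | grep -qi '^pl_PL\.utf-\{0,1\}8$'; then
    echo "pl_PL UTF-8 locale available"
else
    echo "pl_PL UTF-8 locale not found" >&2
fi

# Create the (hypothetical) configuration file with the required section.
cat > mycorpus.bp.conf <<'EOF'
# Corpus language; the '.UTF-8' suffix may be omitted on Unix.
[locale]
locale = pl_PL
EOF
```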

`[filenames]`

Each entry is in the form ‘file-type = file-names’, where file-names is a whitespace-separated list of file names.

header = file-names: Specifies the possible file names of header files. The default is ‘header.xml’.
morphosyntax = file-names: Specifies the possible file names of texts with morphosyntactic annotations. The default is ‘ann_morphosyntax.xml morph.xml’.
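For illustration, this sketch appends a [filenames] section to a hypothetical mycorpus.bp.conf. The extra header name metadata.xml is invented for the example. The morphosyntax entry simply restates the documented default.

```shell
# Append a [filenames] section to the (hypothetical) configuration file.
cat >> mycorpus.bp.conf <<'EOF'

# Accept an extra, hypothetical header file name besides the default one.
[filenames]
header = header.xml metadata.xml
morphosyntax = ann_morphosyntax.xml morph.xml
EOF
```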

`[xmlns]`

Set up namespaces for XPath 1.0 expressions.

prefix = uri: Bind prefix to the namespace uri.

By default:

the ‘tei’ prefix is bound to http://www.tei-c.org/ns/1.0;
the ‘nkjp’ prefix is bound to http://www.nkjp.pl/ns/1.0;
the ‘poliqarp’ prefix is bound to http://poliqarp.sourceforge.net/ns/2009.

Note that it is not possible to declare a default (i.e., a prefix-less) namespace for XPath 1.0.
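A sketch of an additional binding, appended to a hypothetical mycorpus.bp.conf, might look as follows. The ‘dc’ prefix and the Dublin Core namespace are only an example; nothing in this manual says that corpus headers use it. Because no default namespace can be declared, elements in that namespace would have to be addressed with the prefix, for example as dc:title.

```shell
# Append an [xmlns] section to the (hypothetical) configuration file.
cat >> mycorpus.bp.conf <<'EOF'

# Bind an additional prefix for use in XPath expressions.
[xmlns]
dc = http://purl.org/dc/elements/1.1/
EOF
```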

`[meta]`

Multiple [meta] sections are allowed. Each one describes a metadata key.

name = name

Specifies the name of the key.

This entry is required.

type = string

Specifies that any string is a possible value for the key. This is the default.

type = date

Specifies that dates are possible values for the key.

type = enum

Specifies that the set of possible values for the key is a fixed set of strings.

values = values

Specifies the set of possible values for the key.

multiple = true-or-false

Specifies whether a document can have more than one value for the key. The default is false.

required = true-or-false

Specifies whether each document is required to have a value for the key. The default is false.

path = xpath-expression

Specifies where to look up metadata values for the key in the header file. xpath-expression is an XPath 1.0 expression.

More than one entry is allowed. For each document, a path is inspected only if no values were found along any of the previously defined paths.

At least one entry is required.
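The entries above can be combined as in the following sketch, which appends two [meta] sections to a hypothetical mycorpus.bp.conf. The key names and XPath expressions are assumptions: they presume a TEI header with the usual titleStmt, sourceDesc and publicationStmt elements, and that the literal values true and false are accepted. How bpng turns a matched node into a value, and which date notations it accepts, is not specified here.

```shell
# Append two [meta] sections to the (hypothetical) configuration file.
cat >> mycorpus.bp.conf <<'EOF'

# Document title: the second path is inspected only if the first yields no value.
[meta]
name = title
type = string
required = true
path = //tei:titleStmt/tei:title
path = //tei:sourceDesc//tei:title

# Publication date: optional, at most one value per document.
[meta]
name = published
type = date
multiple = false
path = //tei:publicationStmt/tei:date
EOF
```

Listing the more specific path first keeps the fallback path from being consulted for documents that already have a title statement.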

Poliqarp format

Corpus configuration file

*.cfg

Corpus definition file

*.cdf: The only supported binary format version is 2.

Binary files created by bpng

*.poliqarp.corpus.image: a sequence of segments
*.poliqarp.chunk.image: a sequence of document ranges
*.poliqarp.subchunk.image, *.poliqarp.subchunk.offset, *.poliqarp.subchunk.item.*: a dictionary of possible subdocument types (e.g., paragraphs, sentences) and sequences of subdocument ranges
*.poliqarp.orth.image, *.poliqarp.orth.index.alpha, *.poliqarp.orth.index.atergo, *.poliqarp.orth.offset: a dictionary of possible orthographic forms
*.poliqarp.tag.image, *.poliqarp.tag.offset: a dictionary of possible morphosyntactic tags
*.poliqarp.base1.image, *.poliqarp.base1.offset: a dictionary of possible disambiguated base forms
*.poliqarp.base2.image, *.poliqarp.base2.offset: a dictionary of possible ambiguous base forms
*.poliqarp.interp1.image, *.poliqarp.interp1.offset: a dictionary of possible disambiguated interpretations
*.poliqarp.interp2.image, *.poliqarp.interp2.offset: a dictionary of possible ambiguous interpretations
*.meta.cfg, *.poliqarp.meta-key.image, *.poliqarp.meta-key.offset: a dictionary of possible metadata keys
*.poliqarp.meta-value.image, *.poliqarp.meta-value.offset, *.poliqarp.meta.image: a dictionary of possible metadata key-value pairs and a sequence of key-value pairs

Binary files created by bpindexer

*.poliqarp.rindex.*: See bpindexer(1) for details.

Bugs, limitations, missing features

bzip2 on-the-fly decompression is not supported.

A corpus cannot contain more than 2.1G segments or more than 2.1G metadata entries.

Parsing XML documents with many xml:id attributes can be very slow. This is due to limitations of libxml2, the underlying XML library. The source distribution of Poliqarp contains a patch for libxml2 to work around this problem.