bpng
bpng — Poliqarp corpus builder
bpng { -h | --help | -v | --version }
bpng [option ...] base-name [xml-root-dir ...]
bpng builds a binary corpus in the Poliqarp format from XML sources in the following formats:
IPI PAN variant of the XCES format;
NKJP variant of the TEI format.
-h, --help
Display help and exit.
-v, --version
Output version information and exit.
-c, --continue
Continue a partially successful build, or add more files to an existing corpus.
-j, --parallel
Use multiple threads of execution.
Details of runtime behaviour with respect to parallelism can be controlled by several environment variables. Please refer to the OpenMP API specification for details.
Each document consists of two files: a header file (typically named header.xml) and a text with morphosyntactic annotations (typically named morph.xml or ann_morphosyntax.xml). Files of separate documents need to reside in separate directories.
Gzip-compressed files (with a .gz suffix) will be decompressed on the fly.
Text with morphosyntactic annotations should either:
follow the IPI PAN variant of the XCES format;
or follow the NKJP variant of the TEI format.
The base-name.bp.conf file is used to customize the corpus build process. The file consists of sections, each led by a ‘[section]’ header and followed by ‘keyword = setting’ entries.
Empty lines and lines starting with # are comments.
[locale]
locale = locale-name
Specifies the corpus language and possible other regional preferences. Currently, only the string collation is affected by this setting.
A locale name is typically of the form language_territory.codeset, where language is an ISO 639 language code, territory is an ISO 3166 country code, and codeset is a character set or encoding identifier like ISO-8859-1 or UTF-8.
On Unix systems, you can use the ‘locale -a’ command to list all the available locales.
Windows systems use a different locale naming convention. However, bpng is able to translate the usual locale name forms to the Windows-specific ones.
On Unix systems, the selected locale is required to support the UTF-8 encoding. It is allowed to omit the ‘.UTF-8’ suffix from the locale name.
This section and this entry are required.
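As an illustration, a minimal configuration file for a Polish-language corpus might look like the sketch below. The base name "mycorpus" is hypothetical, and pl_PL is assumed to be one of the locales available on the build machine.

```ini
# mycorpus.bp.conf (hypothetical base name)
[locale]
# The ".UTF-8" suffix may be omitted; it is spelled out here for clarity.
locale = pl_PL.UTF-8
```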
[filenames]
Each entry is in the form ‘file-type = file-names’. file-names is a whitespace-separated list of file names.
header = file-names
Specifies the possible file names of header files. The default is ‘header.xml’.
morphosyntax = file-names
Specifies the possible file names of texts with morphosyntactic annotations. The default is ‘ann_morphosyntax.xml morph.xml’.
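A [filenames] section that overrides the header names while keeping the default annotation names could be sketched as follows. The file name hdr.xml is a hypothetical example, not a name bpng knows about by default.

```ini
[filenames]
# Accept either header.xml or the hypothetical hdr.xml as the header file.
header = header.xml hdr.xml
# Same as the documented default, written out explicitly.
morphosyntax = ann_morphosyntax.xml morph.xml
```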
[xmlns]
Sets up namespaces for XPath 1.0 expressions.
prefix = uri
Bind prefix to the namespace uri.
By default:
the ‘tei’ prefix is bound to http://www.tei-c.org/ns/1.0;
the ‘nkjp’ prefix is bound to http://www.nkjp.pl/ns/1.0;
the ‘poliqarp’ prefix is bound to http://poliqarp.sourceforge.net/ns/2009.
Note that it is not possible to declare a default (i.e., a prefix-less) namespace for XPath 1.0.
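For example, an additional prefix could be bound as in the sketch below. The choice of the ‘dc’ prefix and the Dublin Core namespace is only an illustration; whether the corpus headers actually use that namespace is an assumption.

```ini
[xmlns]
# Hypothetical extra binding, usable in path entries as dc:...
dc = http://purl.org/dc/elements/1.1/
```

Because a default namespace cannot be declared, elements in the TEI namespace still have to be written with an explicit prefix (e.g., tei:title) in XPath expressions.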
[meta]
Multiple [meta] sections are allowed. Each one describes a metadata key.
name = name
Specifies the name of the key.
This entry is required.
type = string
Specifies that any string is a possible value for the key. This is the default.
type = date
Specifies that dates are possible values for the key.
type = enum
Specifies that the set of possible values for the key is a fixed set of strings.
values = values
Specifies the set of possible values for the key.
multiple = true-or-false
Specifies whether a document can have more than one value for the key. The default is false.
required = true-or-false
Specifies whether each document is required to have a value for the key. The default is false.
path = xpath-expression
Specifies where to look up metadata values for the key in the header file. xpath-expression is an XPath 1.0 expression.
More than one entry is allowed. For each document, a path is inspected only if no values were found along all previously defined paths.
At least one entry is required.
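The entries above can be combined as in the following sketch, which declares two metadata keys. The key names and XPath expressions are hypothetical and depend on how the header files are actually structured; the literal spelling ‘true’ for true-or-false and the date format expected for type = date are assumptions not specified here.

```ini
[meta]
name = title
type = string
# Assumed spelling of a true value.
required = true
# Inspected first.
path = //tei:fileDesc/tei:titleStmt/tei:title
# Inspected only if the path above yields no value for the document.
path = //tei:sourceDesc//tei:title

[meta]
name = published
type = date
path = //tei:sourceDesc//tei:date/@when
```

Listing a second path entry acts as a fallback, which is useful when headers from different sources record the same information in different places.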
*.poliqarp.corpus.image
a sequence of segments
*.poliqarp.chunk.image
a sequence of document ranges
*.poliqarp.subchunk.image, *.poliqarp.subchunk.offset, *.poliqarp.subchunk.item.*
a dictionary of possible subdocument types (e.g., paragraphs, sentences) and sequences of subdocument ranges
*.poliqarp.orth.image, *.poliqarp.orth.index.alpha, *.poliqarp.orth.index.atergo, *.poliqarp.orth.offset
a dictionary of possible orthographic forms
*.poliqarp.tag.image, *.poliqarp.tag.offset
a dictionary of possible morphosyntactic tags
*.poliqarp.base1.image, *.poliqarp.base1.offset
a dictionary of possible disambiguated base forms
*.poliqarp.base2.image, *.poliqarp.base2.offset
a dictionary of possible ambiguous base forms
*.poliqarp.interp1.image, *.poliqarp.interp1.offset
a dictionary of possible disambiguated interpretations
*.poliqarp.interp2.image, *.poliqarp.interp2.offset
a dictionary of possible ambiguous interpretations
*.meta.cfg, *.poliqarp.meta-key.image, *.poliqarp.meta-key.offset
a dictionary of possible metadata keys
*.poliqarp.meta-value.image, *.poliqarp.meta-value.offset, *.poliqarp.meta.image
a dictionary of possible metadata key-value pairs and a sequence of key-value pairs
bzip2 on-the-fly decompression is not supported.
A corpus cannot contain more than 2.1G segments or more than 2.1G metadata entries.
Parsing XML documents with many xml:id attributes can be very slow. This is due to limitations of libxml2, the underlying XML library. The source distribution of Poliqarp contains a patch for libxml2 to work around this problem.