What is Poliqarp?
Poliqarp is a utility for searching large corpora.
Its features are:
- Support for tagged corpora
- The searched collection can contain not only raw text, but also information about the words and texts that constitute it (grammatical forms of words; structure of the texts; various meta-information about the texts such as authorship and date of writing).
- Expressive query language
- Poliqarp's query language is based on regular expressions and allows you to search not only for a given word or sequence of words, but also, for example, for:
- an adjective followed by a noun
- five nouns in a row
- five, six, or seven nouns in a row
- a given word occurring close, but not necessarily next, to another given word
- words starting with 'z' that occur in texts published in the 19th century
- sentences longer than 100 words
- ...and many more
- Support for positional tagsets
- The tags assigned to words can have an internal structure, and this structure may be incorporated in queries. For instance, nouns might have gender, number or case, verbs might have aspect, and so on. This is especially useful with languages that are rich in inflection, such as Polish (in fact, Poliqarp was originally developed for, and is used within, a Polish corpus project: the IPI PAN Corpus).
- Does not depend on a particular tagset
- Support for Unicode
- You can create corpora of texts written in almost any language in its native script — be it English, Polish, Japanese or Thai — as long as they are encoded in the UTF-8 format.
- Support for ambiguities
- Tags of a word are not necessarily unique: a word can sometimes be interpreted in several ways (and thus have several tags assigned to it). Poliqarp handles such cases and lets you specify whether your query must match any of the possible interpretations or all of them. Few, if any, other concordancers have this ability.
- Multi-platform
- Poliqarp is written in Java and portable C, and is thus available for Windows and most Unix-like systems, including Linux, *BSD and Solaris. Currently, it supports only little-endian architectures, but work is underway to make it endian-neutral.
- Efficient
- It is hard to estimate the average time of searching a corpus, since it depends heavily on the structure of the query. However, simple queries (for a word or phrase) take a few seconds even on corpora containing more than a hundred million words (in terms of raw text, that is several gigabytes including tags and metadata). More complex queries take longer to execute, but even then, you get the results as soon as they are found, so you don't have to wait long.
- Free
- Poliqarp is free/open source software, available under the terms of the GNU General Public License.
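
To make the ideas of positional tags, ambiguous words, and token-sequence queries more concrete, here is a toy Python model. It is only a sketch: it is not Poliqarp's implementation and does not use Poliqarp's query syntax. The tag strings (such as "subst:pl"), the Token class, and all function names are hypothetical and were chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    orth: str    # surface form of the word
    tags: tuple  # one positional tag per possible interpretation


def parse_tag(tag):
    """Split a positional tag such as 'subst:pl' into its fields."""
    return tag.split(":")


def field_is(index, value):
    """Build a predicate that checks one field of a single interpretation."""
    def predicate(tag):
        fields = parse_tag(tag)
        return index < len(fields) and fields[index] == value
    return predicate


def pos_is(wanted):
    """Predicate on the first field, used here as the part of speech."""
    return field_is(0, wanted)


def matches(token, predicate, mode="any"):
    """Check a token against a predicate over its interpretations.

    mode="any": at least one interpretation must satisfy the predicate.
    mode="all": every interpretation must satisfy it.
    """
    results = [predicate(tag) for tag in token.tags]
    return any(results) if mode == "any" else all(results)


def find_sequence(tokens, predicates, mode="any"):
    """Return start positions where consecutive tokens match the predicates."""
    n = len(predicates)
    return [
        start
        for start in range(len(tokens) - n + 1)
        if all(matches(tokens[start + k], predicates[k], mode) for k in range(n))
    ]


# A tiny hand-made "corpus"; the tags are invented for this example.
tokens = [
    Token("fast", ("adj", "adv")),           # ambiguous: adjective or adverb
    Token("cars", ("subst:pl",)),            # unambiguous plural noun
    Token("race", ("subst:sg", "verb:pl")),  # ambiguous: noun or verb
]

adjective_then_noun = [pos_is("adj"), pos_is("subst")]

if __name__ == "__main__":
    print(find_sequence(tokens, adjective_then_noun, mode="any"))  # [0]
    print(find_sequence(tokens, adjective_then_noun, mode="all"))  # []
    print(matches(tokens[2], field_is(1, "pl"), mode="any"))       # True
```

With mode="any", "fast cars" counts as an adjective followed by a noun because one reading of "fast" is an adjective. With mode="all" it does not, because "fast" also has an adverb reading. The last call queries a field inside the tag (the second position, used here for number) rather than the tag as a whole.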
System requirements
To use Poliqarp, you will need:
- a PC running Windows 98, ME, 2000, XP or 2003, or a little-endian machine running a modern Unix-like system such as GNU/Linux
- Java Runtime Environment (JRE), at least version 5.0
- 128 MB of RAM (the more, the better)
- at least a 200MHz CPU
Please note that these requirements are approximate and will vary according to the size of the corpora you use Poliqarp with.
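
If you are unsure whether a machine meets the little-endian requirement listed above, one quick way to check is the following sketch. It assumes a Python 3 interpreter happens to be available on that machine; Python is not a Poliqarp requirement, and the function name is made up for this example.

```python
import struct
import sys


def is_little_endian():
    """Return True if this machine stores the least significant byte first."""
    # Pack the integer 1 as a 16-bit value in native byte order ("=").
    # On a little-endian machine the first byte of the result is 0x01.
    return struct.pack("=H", 1)[0] == 1


if __name__ == "__main__":
    # sys.byteorder reports the same information as a string.
    print("byte order reported by Python:", sys.byteorder)
    print("little-endian:", is_little_endian())
```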
In addition, to build Poliqarp from source, you will need:
- A Unix-like environment. This means either a Unix-like system, or Windows with MinGW and MSYS installed (Cygwin might also work, but hasn't been tested).
- GNU make, preferably version 3.80. Other versions of make will most likely not work.
- A decent C compiler. GCC is fine (versions 2.95 and 3.x have been tested).
- Working lex and yacc (GNU flex and GNU bison are fine).
- The Expat library to build the corpus builder.
- Sun's Java JDK 5.0 or newer to build the GUI.
License
Poliqarp is copyright © 2004-2008 by Instytut Podstaw Informatyki Polskiej Akademii Nauk (IPI PAN; Institute of Computer Science, Polish Academy of Sciences; cf. www.ipipan.waw.pl). All rights reserved.
It may be distributed and/or modified under the terms of the GNU General Public License version 2 as published by the Free Software Foundation.