Package: wnpp
Severity: wishlist

* Package name    : lemur
  Version         : 4.7
  Upstream Author : The Lemur Project
* URL             : http://www.lemurproject.org/
* License         : MIT/X
  Programming Lang: C, C++
  Description     : Toolkit for Language Modeling and Information Retrieval

The Lemur Toolkit is designed to facilitate research in language modeling
and information retrieval. Lemur supports a wide range of industrial and
research language applications such as ad-hoc retrieval, site-search, and
text mining. The toolkit supports indexing of large-scale text databases,
the construction of simple language models for documents, queries, or
subcollections, and the implementation of retrieval systems based on
language models as well as a variety of other retrieval models. The system
is written in the C and C++ languages.

Below is a summary listing of the features found within the Lemur Toolkit:

* Sophisticated structured query languages (using InQuery and Indri)
* Support for XML and structured document retrieval
* Used commonly with a wide range of research test collections
  (e.g., TREC CDs 1-5, wt10g, RCV1, gov, gov2)
* Index your web pages with an "out-of-the-box" site search capability
* Interactive interfaces for Windows, Linux, and Web
* Distributed information retrieval and document clustering applications
* Cross-platform, fast and modular code written in C++
* C++, Java and C# APIs
* In use since 2002 by a large and growing user community

Indexing features:

* Multiple indexing methods for small, medium and large-scale (terabyte)
  collections
* Built-in support for English, Chinese and Arabic text
* Porter and Krovetz word stemming
* Incremental indexing
* Out-of-the-box indexing support for TREC Text, TREC Web, plain text,
  HTML, XML, PDF, MBox, Microsoft Word, and Microsoft PowerPoint
* Indexes inline and offset text annotations (e.g., part-of-speech and
  named entities)
* Indexes document attributes

Retrieval features:

* Supports major language modeling approaches such as Indri and
  KL-divergence, as well as vector space, tf.idf, Okapi and InQuery
* Relevance- and pseudo-relevance feedback
* Wildcard term expansion (using Indri)
* Passage and XML element retrieval
* Cross-lingual retrieval
* Smoothing via Dirichlet priors and Markov chains
* Supports arbitrary document priors (e.g., Page Rank, URL depth)

-----------------------------------------------------------------------------

I'll start working on the packaging myself today. My work will likely
appear somewhere on http://non-gnu.uvt.nl/.

Bye,
Joost