Search over multiple indexes

2011-11-28 Thread Valeriy Felberg
Hello,

I'm trying to implement automatic document classification and store
the classified attributes in an additional field of the Solr document.
The search then goes against that field, e.g.
q=classified_category:xyz. The document classification is currently
implemented as an UpdateRequestProcessor and works quite well. The
only problem: after each change to the classification algorithm, every
document has to be re-indexed, which, of course, makes tests and
experimentation difficult and ties up resources (other than Solr) for
several hours.
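
Below is a minimal sketch of what such a processor can look like; the
class name, the classify() helper and the target field handling are
only illustrative, not the actual implementation:

    import java.io.IOException;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    public class ClassifyingProcessor extends UpdateRequestProcessor {

        public ClassifyingProcessor(UpdateRequestProcessor next) {
            super(next);
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            // classify() stands in for whatever algorithm derives the
            // category from the document content
            doc.addField("classified_category", classify(doc));
            super.processAdd(cmd); // pass the document down the chain
        }

        private String classify(SolrInputDocument doc) {
            return "xyz"; // placeholder
        }
    }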

So, my idea would be to store the classified attributes in a meta-index
and search over the main and meta indexes simultaneously. For example:
the main index has fields like color, and the meta index has
classified_category. The query "q=classified_category:xyz AND
color:black" would then be split across the main and meta indexes. This
way, the classification could run within Solr over the main index and
store the classified fields in the meta index, so that only Solr
resources are tied up.

Has anybody already done something like that? It's a little bit like
sharding, but different in that each shard would process its own part
of the query and live in the same Solr instance.

Regards,
Valeriy


Re: A tool for frequent re-indexing...

2012-04-06 Thread Valeriy Felberg
I've implemented something like what is described in
https://issues.apache.org/jira/browse/SOLR-3246. The idea is to add an
update request processor at the end of the update chain in the core
you want to copy. The processor converts the SolrInputDocument to XML
(there is a utility method for doing this) and dumps the XML into a
file, which can be fed into Solr again with curl. If you have many
documents, you will probably want to distribute the XML files across
different directories, using some common prefix of the id field.
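
A rough sketch of such a dump processor, assuming the SolrJ utility
ClientUtils.toXML() and a string-valued id field (file layout and
class name are only illustrative):

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.Writer;

    import org.apache.solr.client.solrj.util.ClientUtils;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    public class XmlDumpProcessor extends UpdateRequestProcessor {

        public XmlDumpProcessor(UpdateRequestProcessor next) {
            super(next);
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            String id = (String) doc.getFieldValue("id");
            // one file per document; a real version would batch documents
            // and distribute the files into directories by id prefix
            Writer out = new FileWriter("/tmp/dump/" + id + ".xml");
            try {
                out.write("<add>");
                out.write(ClientUtils.toXML(doc));
                out.write("</add>");
            } finally {
                out.close();
            }
            super.processAdd(cmd);
        }
    }

The dumped files can then be fed into the target core along the lines of:

    curl 'http://localhost:8983/solr/update?commit=true' \
         -H 'Content-type: text/xml' --data-binary @doc.xml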

On Fri, Apr 6, 2012 at 5:18 AM, Ahmet Arslan  wrote:
>> I am considering writing a small tool that would read from one Solr
>> core and write to another as a means of quickly re-indexing data. I
>> have a large-ish set (hundreds of thousands) of documents that I've
>> already parsed with Tika, and I keep changing bits and pieces in the
>> schema and config to try new things often. Instead of having to go
>> through the process of re-indexing from docs (and some DBs), I
>> thought it might be much faster to just read from one core and write
>> into a new core with the new schema, analysers and/or settings.
>>
>> I was wondering if anyone else has done anything similar already? It
>> would be handy if I could use this sort of thing to spin off another
>> core, write to it, and then swap the two cores, discarding the older
>> one.
>
> You might find these relevant:
>
> https://issues.apache.org/jira/browse/SOLR-3246
>
> http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor
>
>


Re: Lexical analysis tools for German language data

2012-04-12 Thread Valeriy Felberg
If you want the query "jacke" to match a document containing the word
"windjacke" or "kinderjacke", you could use a custom update processor.
This processor could search the indexed text for words matching the
pattern ".*jacke" and inject the word "jacke" into an additional field
which you can search against. You would need a whole list of possible
suffixes, of course. It would slow down the update process, but you
wouldn't need to split words during search.
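
A minimal sketch of that idea, assuming the text lives in a field
called "text" and the injected words go into "base_words" (both names
made up), with a tiny hard-coded suffix list:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    public class SuffixInjectingProcessor extends UpdateRequestProcessor {

        // illustrative list; a real one would hold the few thousand
        // most common German base words
        private static final List<String> BASE_WORDS =
                Arrays.asList("jacke", "hose", "schuh");

        public SuffixInjectingProcessor(UpdateRequestProcessor next) {
            super(next);
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            String text = (String) doc.getFieldValue("text");
            if (text != null) {
                for (String word : text.toLowerCase().split("\\W+")) {
                    for (String base : BASE_WORDS) {
                        // "windjacke" ends with "jacke" but is not
                        // "jacke" itself
                        if (word.endsWith(base) && !word.equals(base)) {
                            doc.addField("base_words", base);
                        }
                    }
                }
            }
            super.processAdd(cmd);
        }
    }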

Best,
Valeriy

On Thu, Apr 12, 2012 at 12:39 PM, Paul Libbrecht  wrote:
>
> Michael,
>
> I've been on this list and the Lucene list for several years and have
> not found this yet.
> It's been one of the "neglected topics", to my taste.
>
> There is a CompoundAnalyzer, but it requires the compounds to be
> dictionary-based, as you indicate.
>
> I am convinced there's a way to build de-compounding word lists
> efficiently from a broad corpus, but I have never seen it (and the
> experts at DFKI whom I asked also told me they didn't know of one).
>
> paul
>
> On 12 Apr 2012, at 11:52, Michael Ludwig wrote:
>
>> Given an input of "Windjacke" (probably "wind jacket" in English), I'd
>> like the code that prepares the data for the index (tokenizer etc.) to
>> understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
>> would include the "Windjacke" document in its result set.
>>
>> It appears to me that such an analysis requires a dictionary-backed
>> approach, which doesn't have to be perfect at all; a list of the 2000
>> most common words would probably do the job and fulfil a criterion of
>> reasonable usefulness.
>>
>> Do you know of any implementation techniques or working implementations
>> to do this kind of lexical analysis for German language data? (Or other
>> languages, for that matter?) What are they, and where can I find them?
>>
>> I'm sure there is something out there (commercial or free), because
>> I've seen lots of engines grokking German and the way it builds words.
>>
>> Failing that, what are the proper terms to refer to these techniques,
>> so one can search for them more successfully?
>>
>> Michael
>


Join backport to 3.x

2012-05-10 Thread Valeriy Felberg
Hi,

I've applied the patch from
https://issues.apache.org/jira/browse/SOLR-2604 to Solr 3.5. It works,
but it noticeably slows down the query time. Has someone already
solved that problem?
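
For reference, the join syntax that the backported parser accepts looks
like this (field names invented for the example):

    q={!join from=parent_id to=id}color:black

i.e. run color:black, collect the parent_id values of the matching
documents, and return the documents whose id field holds one of those
values.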

Cheers,
Valeriy