Re: Diversifying Search Results - Custom Collector

2012-08-20 Thread Mikhail Khludnev
Hello,

I've got the problem description below. Can you explain the expected user
experience, and/or solution approach before diving into the algorithm
design?

Thanks

On Sat, Aug 18, 2012 at 2:50 AM, Karthick Duraisamy Soundararaj <
karthick.soundara...@gmail.com> wrote:

> My problem is that when there are a lot of documents representing products,
> products from the same manufacturer seem to appear in close proximity in the
> results and therefore, it doesn't provide brand diversity. When you search
> for sofas, you get sofas from manufacturer A dominating the first page
> while the sofas from manufacturer B dominate the second page, etc. The
> issue here is that a manufacturer tends to describe the different sofas he
> produces the same way and therefore there is very little difference
> between the documents representing two sofas.
>



-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics


 


Upgrade solr 3.4 to solr 3.6.1 without rebuilding the existing index?

2012-08-20 Thread Dominique Bejean

Hi,

I think the answer is yes, but I need to check.

Is it possible to upgrade from solr 3.4 to solr 3.6.1 without rebuilding
the existing index?


Thank you.

Dominique


Re: how to retrieve total token count per collection/index

2012-08-20 Thread tech.vronk

On 09.08.2012 18:02, Robert Muir wrote:

On Thu, Aug 9, 2012 at 10:20 AM, tech.vronk  wrote:

Hello,

I wonder how to figure out the total token count in a collection (per
index), i.e. the size of a corpus/collection measured in tokens.


You want to use this statistic, which tells you number of tokens for
an indexed field:
http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/index/Terms.html#getSumTotalTermFreq%28%29
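
For reference, a minimal sketch of reading that statistic with the Lucene
4.0 APIs (the index path and field name are placeholders):

    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.store.FSDirectory;

    public class TokenCount {
      public static void main(String[] args) throws Exception {
        IndexReader reader =
            DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
        Terms terms = MultiFields.getTerms(reader, "text"); // placeholder field
        // sum of per-term frequencies = total tokens indexed for the field;
        // returns -1 if the field does not store frequencies, null if absent
        long total = (terms == null) ? 0 : terms.getSumTotalTermFreq();
        System.out.println("total tokens: " + total);
        reader.close();
      }
    }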



just to say:
thank you, this seems to work well!

matej


RE: Get results only from the last hour

2012-08-20 Thread Markus Jelsma
Date queries are described here: http://wiki.apache.org/solr/SolrQuerySyntax

You must first make sure your dates end up in a Date fieldType and are in the 
proper format.
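
Once the timestamp lands in a date field, a filter query does it; a sketch,
assuming the field is named created_date (the name is illustrative, the
date-math syntax is standard):

    fq=created_date:[NOW-1HOUR TO NOW]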

 
-Original message-
> From:Dotan Cohen 
> Sent: Mon 20-Aug-2012 13:57
> To: solr-user@lucene.apache.org
> Subject: Get results only from the last hour
> 
> Hi all, new Solr user. I am retrieving records which look something like this:
> 
>   created_at: 1246152402
>   [three more field values whose element names were stripped by the
>    archive: someUser, 1034, 34859]
> 
> 
> 
> How can I limit my search to only those records created in the past 60
> minutes? Note that the created_at field is a Unix timestamp. I have
> gone over the Common Query Parameters [1] page but could not find the
> answer there nor by casual googling.
> 
> Thanks.
> 
> [1] http://wiki.apache.org/solr/CommonQueryParameters
> 
> -- 
> Dotan Cohen
> 
> http://gibberish.co.il
> http://what-is-what.com
> 


Re: Get results only from the last hour

2012-08-20 Thread Dotan Cohen
On Mon, Aug 20, 2012 at 3:00 PM, Markus Jelsma
 wrote:
> Date queries are described here: http://wiki.apache.org/solr/SolrQuerySyntax
>

Terrific, thank you!


> You must first make sure your dates end up in a Date fieldType and are in the 
> proper format.
>

Thanks.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: scanned pdf with solr cell

2012-08-20 Thread Michael Della Bitta
It's pretty easy to accidentally run into the AWT stuff if you're
doing anything that involves image processing, which I would expect a
generic RTF parser might do.

Michael Della Bitta


Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Sun, Aug 19, 2012 at 10:29 PM, Lance Norskog  wrote:
> The backstory here is that Tika uses a library that for some crazy
> reason is inside the Java AWT graphics toolkit. (I think the RTF
> parser?)
>
> On Wed, Aug 15, 2012 at 5:57 AM, Ahmet Arslan  wrote:
>>> You can try passing
>>> -Djava.awt.headless=true as one of the arguments
>>> when you start Jetty to see if you can get this to go away
>>> with no ill
>>> effects.
>>
>> I started Jetty using 'java -Djava.awt.headless=true -jar start.jar' and
>> successfully indexed two PDF files. That icon didn't appear :) Thanks!
>
>
>
> --
> Lance Norskog
> goks...@gmail.com


Re: How to index multivalued field tokens by their attached metadata?

2012-08-20 Thread Fuu
After pondering it for a while, I decided to take the advice and write the
processing as a separate program. It will probably be easier to pre-format
the data with a scripting language anyway.

Thank you for taking the time to reply. :)

- Fuu



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-index-multivalued-field-tokens-by-their-attached-metadata-tp4001627p4002163.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Diversifying Search Results - Custom Collector

2012-08-20 Thread Karthick Duraisamy Soundararaj
Hello Mikhail,
Thank you for the reply. In terms of user
experience, I want to spread out the products from the same brand farther
from each other, *at least* in the first 50-100 results we display. I am
thinking about two different approaches as a solution.

  1. For the first few results, display one top-scoring
product per manufacturer (for a given field, display the top-scoring
results of the unique field values for the first N matches). This N could
be either a percentage relative to total matches or a configurable absolute
value.
  2. Enforce a penalty on the score for the results
that have duplicate field values. The penalty can be enforced in such a way
that the results with higher scores will not be affected, as against the
ones with lower scores.

Both of the solutions can be implemented while sorting the documents with
TopFieldCollector / TopScoreDocCollector.
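
For what it's worth, a client-side sketch of approach 2 (the Hit class,
the field name "brand", and the penalty factor are illustrative, not an
actual Lucene collector):

    import java.util.*;

    class Hit {
        final int docId; final String brand; float score;
        Hit(int docId, String brand, float score) {
            this.docId = docId; this.brand = brand; this.score = score;
        }
    }

    class BrandDiversifier {
        // Multiply the k-th repeat of a brand by penalty^k (0 < penalty < 1),
        // so the top hit per brand keeps its score and later duplicates sink.
        static List<Hit> rerank(List<Hit> hitsByScoreDesc, float penalty) {
            Map<String, Integer> seen = new HashMap<String, Integer>();
            for (Hit h : hitsByScoreDesc) {
                Integer c = seen.get(h.brand);
                int k = (c == null) ? 0 : c;
                seen.put(h.brand, k + 1);
                h.score *= (float) Math.pow(penalty, k);
            }
            Collections.sort(hitsByScoreDesc, new Comparator<Hit>() {
                public int compare(Hit a, Hit b) {
                    return Float.compare(b.score, a.score);
                }
            });
            return hitsByScoreDesc;
        }
    }

Doing the same inside a custom collector would mean looking up the brand
value per collected doc (e.g. via the FieldCache), but the arithmetic is
the same.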

Does this answer your question?  Please let me know if you have any more
questions.

Thanks,
Karthick

On Mon, Aug 20, 2012 at 3:26 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Hello,
>
> I've got the problem description below. Can you explain the expected user
> experience, and/or solution approach before diving into the algorithm
> design?
>
> Thanks
>
>
> On Sat, Aug 18, 2012 at 2:50 AM, Karthick Duraisamy Soundararaj <
> karthick.soundara...@gmail.com> wrote:
>
>> My problem is that when there are a lot of documents representing
>> products, products from the same manufacturer seem to appear in close
>> proximity in the results and therefore, it doesn't provide brand
>> diversity. When you search for sofas, you get sofas from manufacturer A
>> dominating the first page while the sofas from manufacturer B dominate
>> the second page, etc. The issue here is that a manufacturer tends to
>> describe the different sofas he produces the same way and therefore
>> there is very little difference between the documents representing two
>> sofas.
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Tech Lead
> Grid Dynamics
>
> 
>  
>
>


Re: Upgrade solr 3.4 to solr 3.6.1 without rebuilding the existing index?

2012-08-20 Thread Jack Krupansky
At the Lucene level the index should be 100% compatible, but I don't know 
with 100% certainty whether there may be subtle changes in any field type 
analyzers or token filters, such as in the example schema. You might want to 
read SOLR-2519 and see whether your fields and field types may be impacted 
in a way that might be incompatible between the 3.6 index-time analyzers and 
your 3.4 field type analyzers. But if you stick with your own field types, 
you should be okay.


-- Jack Krupansky

-Original Message- 
From: Dominique Bejean

Sent: Monday, August 20, 2012 5:13 AM
To: solr-user@lucene.apache.org
Subject: Upgrade solr 3.4 to solr 3.6.1 without rebuilding the existing
index?


Hi,

I think the answer is yes, but I need to check.

Is it possible to upgrade from solr 3.4 to solr 3.6.1 without rebuilding
the existing index?

Thank you.

Dominique 



solr always finds all documents

2012-08-20 Thread robert rottermann

Hi there,
I am new to Solr and all of this. Besides, I am a Java noob.

What I am doing:
I want to do full-text retrieval on office documents. The metadata of
these documents is maintained in PostgreSQL.

So the only information I need to get out of Solr is a document ID.

My problem now is that my index seems to be built badly.
(Nearly) whatever I look up returns all documents.

I would be very glad if somebody could give me an idea of what I should change.

thanks
Robert


What I am using is the sample configuration that comes with solr 3.6.
I removed all the fields and added the following (the <field .../> tags were
stripped by the mail archive; six field definitions remain, one with
required="true" and five with required="false"):

   [six <field name="..." type="..." required="true|false"/> definitions]

   <uniqueKey>docid</uniqueKey>





Re: solr always finds all documents

2012-08-20 Thread Sven Maurmann
Dear Robert,

could you give me a little more information about your setup? For example, the
complete solrconfig.xml and
the complete schema.xml would definitely help.

Best,

Sven

-- 
kippdata informationstechnologie GmbH
Sven Maurmann   Tel: 0228 98549 -12
Bornheimer Str. 33a Fax: 0228 98549 -50
D-53111 Bonn
sven.maurm...@kippdata.de

HRB 8018 Amtsgericht Bonn / USt.-IdNr. DE 196 457 417
Managing Directors: Dr. Thomas Höfer, Rainer Jung, Sven Maurmann




On 20.08.2012 at 16:39, robert rottermann wrote:

> Hi there,
> I am new to Solr and all of this. Besides, I am a Java noob.
> 
> What I am doing:
> I want to do full-text retrieval on office documents. The metadata of these
> documents is maintained in PostgreSQL.
> So the only information I need to get out of Solr is a document ID.
> 
> My problem now is that my index seems to be built badly.
> (Nearly) whatever I look up returns all documents.
> 
> I would be very glad if somebody could give me an idea of what I should change.
> 
> thanks
> Robert
> 
> 
> What I am using is the sample configuration that comes with solr 3.6.
> I removed all the fields and added the following (tags stripped by the
> mail archive):
> 
>    [six <field .../> definitions, one required="true", five required="false"]
> 
>    <uniqueKey>docid</uniqueKey>
> 



Re: Upgrade solr 3.4 to solr 3.6.1 without rebuilding the existing index?

2012-08-20 Thread Erick Erickson
The CHANGES.txt file (make sure to look at the Lucene one as well
as Solr's) will have, for each new version, a section about
"upgrading from" that should answer this for you...

Best
Erick

On Mon, Aug 20, 2012 at 3:13 AM, Dominique Bejean
 wrote:
> Hi,
>
> I think the answer is yes, but I need to check.
>
> Is it possible to upgrade from solr 3.4 to solr 3.6.1 without rebuilding the
> existing index?
>
> Thank you.
>
> Dominique


Re: Diversifying Search Results - Custom Collector

2012-08-20 Thread Karthick Duraisamy Soundararaj
Tanguy,
  Your idea is perfect for cases where there are too many
documents, with 80-90% of documents having the same value for a particular
field. As an example, your idea is ideal for, let's say, 10 documents in
total like this:

 doc1 :  Kellog's 
 doc2 :  Kellog's 
 doc3 :  Kellog's 
 doc4 :  Kellog's 
 doc5 :  Kellog's 
 doc6 :  Kellog's 
 doc7 :  Kellog's 
 doc8 :  Nestle 
 doc9 :  Kellog's 
 doc10 :  Kellog's 

But what I have is:
 doc1 :  Maggi 
 doc2 :  Maggi  
 doc3 :  M&M's 
 doc4 :  M&M's 
 doc5 :  Hershey's 
 doc6 :  Hershey's 
 doc7 :  Nestle 
 doc8 :  Nestle 
 doc9 :  Kellog's 
 doc10 :  Kellog's 


Thanks,
Karthick

On Mon, Aug 20, 2012 at 12:01 PM, Tanguy Moal  wrote:

> Hello,
>
> I don't know if this could help, but if I understood your issue, you have
> a lot of documents with the same or very close scores. Moreover, I think you
> get your matches in merchant order (more or less) because they must be
> indexed in that very same order, so Solr returns documents of the same score
> in insertion order (although there is no contract specifying this).
>
> You could work around that issue by:
> 1/ Turning off tf/idf, because you're searching in documents with little
> text where only the match counts, but frequencies obviously aren't helping.
> 2/ Adding a random number to each document at index time, and boosting on
> that random value at query time; this will shuffle your results, and that's
> probably the simplest thing to do.
>
> Hope this helps,
>
> Tanguy
>
> 2012/8/20 Karthick Duraisamy Soundararaj 
>
>> Hello Mikhail,
>> Thank you for the reply. In terms of user
>> experience, I want to spread out the products from the same brand farther
>> from each other, *at least* in the first 50-100 results we display. I am
>> thinking about two different approaches as a solution.
>>
>>   1. For the first few results, display one top-scoring
>> product per manufacturer (for a given field, display the top-scoring
>> results of the unique field values for the first N matches). This N could
>> be either a percentage relative to total matches or a configurable absolute
>> value.
>>   2. Enforce a penalty on the score for the results
>> that have duplicate field values. The penalty can be enforced in such a way
>> that the results with higher scores will not be affected, as against the
>> ones with lower scores.
>>
>> Both of the solutions can be implemented while sorting the documents with
>> TopFieldCollector / TopScoreDocCollector.
>>
>> Does this answer your question?  Please let me know if you have any more
>> questions.
>>
>> Thanks,
>> Karthick
>>
>> On Mon, Aug 20, 2012 at 3:26 AM, Mikhail Khludnev <
>> mkhlud...@griddynamics.com> wrote:
>>
>>> Hello,
>>>
>>> I've got the problem description below. Can you explain the expected
>>> user experience, and/or solution approach before diving into the algorithm
>>> design?
>>>
>>> Thanks
>>>
>>>
>>> On Sat, Aug 18, 2012 at 2:50 AM, Karthick Duraisamy Soundararaj <
>>> karthick.soundara...@gmail.com> wrote:
>>>
 My problem is that when there are a lot of documents representing
 products, products from the same manufacturer seem to appear in close
 proximity in the results and therefore, it doesn't provide brand
 diversity. When you search for sofas, you get sofas from manufacturer A
 dominating the first page while the sofas from manufacturer B dominate
 the second page, etc. The issue here is that a manufacturer tends to
 describe the different sofas he produces the same way and therefore
 there is very little difference between the documents representing two
 sofas.

>>>
>>>
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>> Tech Lead
>>> Grid Dynamics
>>>
>>> 
>>>  
>>>
>>>
>>
>>
>


Re: solr always finds all documents

2012-08-20 Thread Jack Krupansky

How are you ingesting the office documents? SolrCell, or some other method?

Do you have CopyFields? What fields are you querying on?

What does your "text" field type look like?

-- Jack Krupansky
-Original Message- 
From: robert rottermann

Sent: Monday, August 20, 2012 10:39 AM
To: solr-user@lucene.apache.org
Cc: robert rottermann
Subject: solr always finds all documents

Hi there,
I am new to Solr and all of this. Besides, I am a Java noob.

What I am doing:
I want to do full-text retrieval on office documents. The metadata of
these documents is maintained in PostgreSQL.
So the only information I need to get out of Solr is a document ID.

My problem now is that my index seems to be built badly.
(Nearly) whatever I look up returns all documents.

I would be very glad if somebody could give me an idea of what I should change.

thanks
Robert


What I am using is the sample configuration that comes with solr 3.6.
I removed all the fields and added the following (the field definitions
were stripped by the mail archive):

   <uniqueKey>docid</uniqueKey>




Solr Custom Filter Factory - How to pass parameters?

2012-08-20 Thread ksu wildcats
We are using Solr and are in the process of adding a custom filter factory to
handle the processing of words/tokens to suit our needs.

Here is what our custom filter factory does:
1) Reads the tokens, does some analysis, and writes the result of the analysis
to a database.

We are using embedded Solr with multi-core (a separate core for each index).

We have the custom filter factory configured in schema.xml.

The problem we are running into is that we are not able to pass parameters to
our custom filter factory.
We need to be able to pass some additional information (index-specific, and
different for each index) to our custom filter factory.

Can anyone please tell if this is possible with Solr, or do we need to switch
back to using the Lucene APIs?

Thanks
-K



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Custom-Filter-Factory-How-to-pass-parameters-tp4002217.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Custom Filter Factory - How to pass parameters?

2012-08-20 Thread Jack Krupansky

First, the obvious question: What kind of information? Be specific.

Second, you can pass parameters to your filter factory in your field type 
definitions. You could have separate schemas or separate field types for the 
different indexes. Is there anything this doesn't cover?


You can also provide an "update processor" that could supply whatever 
parameters you want.


-- Jack Krupansky

-Original Message- 
From: ksu wildcats

Sent: Monday, August 20, 2012 1:19 PM
To: solr-user@lucene.apache.org
Subject: Solr Custom Filter Factory - How to pass parameters?

We are using Solr and are in the process of adding a custom filter factory to
handle the processing of words/tokens to suit our needs.

Here is what our custom filter factory does:
1) Reads the tokens, does some analysis, and writes the result of the analysis
to a database.

We are using embedded Solr with multi-core (a separate core for each index).

We have the custom filter factory configured in schema.xml.

The problem we are running into is that we are not able to pass parameters to
our custom filter factory.
We need to be able to pass some additional information (index-specific, and
different for each index) to our custom filter factory.

Can anyone please tell if this is possible with Solr, or do we need to switch
back to using the Lucene APIs?

Thanks
-K



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Custom-Filter-Factory-How-to-pass-parameters-tp4002217.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Solr Custom Filter Factory - How to pass parameters?

2012-08-20 Thread ksu wildcats
Thanks Jack.

The information I want to pass is the "databasename" into which the analyzed
data needs to be inserted.

As I was saying earlier, the setup we have is:
1) we use an embedded Solr server with multiple cores, embedded into our webapp
2) we support one index for each client - each client has a separate database
(RDBMS) and a separate index (core)
3) we dynamically create the config files when a client request comes into our
service for the first time.
   The config files (schema.xml) are separate but the content is identical for
all cores.

The custom filter factory we want to add to the chain of filters in schema.xml
will process tokens and write them to the client's database.
I am trying to figure out a way to retrieve the database name based on the
information coming in the request from the client.

I hope this is clear.

Regarding your suggestion on the ability to pass parameters in field type
definitions: can you please point me to documentation or an example of how to
retrieve these parameter values from within the filter factory?

Also, I am not familiar with "update processors". Any link to additional
information on how to provide an "update processor" would be greatly helpful.

   




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Custom-Filter-Factory-How-to-pass-parameters-tp4002217p4002231.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Solr Custom Filter Factory - How to pass parameters?

2012-08-20 Thread Markus Jelsma


 
 
-Original message-
> From:ksu wildcats 
> Sent: Mon 20-Aug-2012 20:28
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Custom Filter Factory - How to pass parameters?
> 
> Thanks Jack.
> 
> The information I want to pass is the "databasename" into which the analyzed
> data needs to be inserted.
> 
> As I was saying earlier, the setup we have is:
> 1) we use an embedded Solr server with multiple cores, embedded into our webapp
> 2) we support one index for each client - each client has a separate database
> (RDBMS) and a separate index (core)
> 3) we dynamically create the config files when a client request comes into our
> service for the first time.
>    The config files (schema.xml) are separate but the content is identical for
> all cores.
> 
> The custom filter factory we want to add to the chain of filters in schema.xml
> will process tokens and write them to the client's database.
> I am trying to figure out a way to retrieve the database name based on the
> information coming in the request from the client.
> 
> I hope this is clear.
> 
> Regarding your suggestion on the ability to pass parameters in field type
> definitions: can you please point me to documentation or an example of how to
> retrieve these parameter values from within the filter factory?

You extend a TokenFilterFactory:
http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/util/TokenFilterFactory.html

which extends AbstractAnalysisFactory:
http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/util/AbstractAnalysisFactory.html

Use the get() method to get the parameters defined in the XML. Check how the
StopFilterFactory retrieves its parameters:

http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/StopFilterFactory.java?view=markup
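
A minimal sketch against the 4.0 API linked above; the factory name and the
dbName parameter are made up for illustration:

    import java.util.Map;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.util.TokenFilterFactory;

    public class DbWritingFilterFactory extends TokenFilterFactory {
        private String dbName;

        @Override
        public void init(Map<String, String> args) {
            super.init(args);
            // reads dbName="..." from the <filter/> element in schema.xml
            dbName = args.get("dbName");
            if (dbName == null) {
                throw new IllegalArgumentException("dbName parameter is required");
            }
        }

        @Override
        public TokenStream create(TokenStream input) {
            // wrap `input` in your custom filter here, passing dbName along;
            // returning it unchanged keeps this sketch self-contained
            return input;
        }
    }

Since each of your cores has its own schema.xml, each core can declare its
own value, e.g. <filter class="com.example.DbWritingFilterFactory"
dbName="client1_db"/>.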

> 
> Also, I am not familiar with "update processors". Any link to additional
> information on how to provide an "update processor" would be greatly helpful.
> 
>
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-Custom-Filter-Factory-How-to-pass-parameters-tp4002217p4002231.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 


Re: Upgrade solr 3.4 to solr 3.6.1 without rebuilding the existing index?

2012-08-20 Thread Dominique Bejean

Thank you to both of you.


On 20/08/12 17:28, Erick Erickson wrote:

The CHANGES.txt file (make sure to look at the Lucene one as well
as Solr's) will have, for each new version, a section about
"upgrading from" that should answer this for you...

Best
Erick

On Mon, Aug 20, 2012 at 3:13 AM, Dominique Bejean
 wrote:

Hi,

I think the answer is yes, but I need to check.

Is it possible to upgrade from solr 3.4 to solr 3.6.1 without rebuilding the
existing index?

Thank you.

Dominique




Re: Diversifying Search Results - Custom Collector

2012-08-20 Thread Mikhail Khludnev
Hello,

I don't believe your task can be solved by playing with the scoring/collector
or by shuffling.
For me it's absolutely a Grouping use case (though I don't really know this
feature well).

> Grouping cannot solve the problem because I don't want to limit the number
of results shown based on the grouping field.

I'm not really getting it. Why can't you set the limit to 11 and just show
labels like "[+] show 6 results.." or, if you have 11, "[+] show more than 10
.."?

If the problem is constructing the search result page, I can suggest
submitting a search request with rows=0&facet.field=BRAND; then your
algorithm can choose the number of items needed per brand and submit
rows=X&fq=BRAND:Y, which gives you arbitrary sizes for "groups".
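
Something like this, as an illustrative request sequence (the field and
values are examples):

    q=sofas&rows=0&facet=true&facet.field=brand
    q=sofas&rows=3&fq=brand:"Manufacturer A"
    q=sofas&rows=1&fq=brand:"Manufacturer B"

The first call returns only the per-brand counts; the follow-up calls fetch
however many items you decided to show for each brand.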

Will this work for you?

On Mon, Aug 20, 2012 at 8:28 PM, Karthick Duraisamy Soundararaj <
d.s.karth...@gmail.com> wrote:

> Tanguy,
>   Your idea is perfect for cases where there are too many
> documents, with 80-90% of documents having the same value for a particular
> field. As an example, your idea is ideal for, let's say, 10 documents in
> total like this:
>
>  doc1 :  Kellog's 
>  doc2 :  Kellog's 
>  doc3 :  Kellog's 
>  doc4 :  Kellog's 
>  doc5 :  Kellog's 
>  doc6 :  Kellog's 
>  doc7 :  Kellog's 
>  doc8 :  Nestle 
>  doc9 :  Kellog's 
>  doc10 :  Kellog's 
>
> But what I have is:
>  doc1 :  Maggi 
>  doc2 :  Maggi  
>  doc3 :  M&M's 
>  doc4 :  M&M's 
>  doc5 :  Hershey's 
>  doc6 :  Hershey's 
>  doc7 :  Nestle 
>  doc8 :  Nestle 
>  doc9 :  Kellog's 
>  doc10 :  Kellog's 
>
>
> Thanks,
> Karthick
>
> On Mon, Aug 20, 2012 at 12:01 PM, Tanguy Moal wrote:
>
>> Hello,
>>
>> I don't know if this could help, but if I understood your issue, you have
>> a lot of documents with the same or very close scores. Moreover, I think you
>> get your matches in merchant order (more or less) because they must be
>> indexed in that very same order, so Solr returns documents of the same score
>> in insertion order (although there is no contract specifying this).
>>
>> You could work around that issue by:
>> 1/ Turning off tf/idf, because you're searching in documents with little
>> text where only the match counts, but frequencies obviously aren't helping.
>> 2/ Adding a random number to each document at index time, and boosting on
>> that random value at query time; this will shuffle your results, and that's
>> probably the simplest thing to do.
>>
>> Hope this helps,
>>
>> Tanguy
>>
>> 2012/8/20 Karthick Duraisamy Soundararaj 
>>
>>> Hello Mikhail,
>>> Thank you for the reply. In terms of user
>>> experience, I want to spread out the products from the same brand farther
>>> from each other, *at least* in the first 50-100 results we display. I am
>>> thinking about two different approaches as a solution.
>>>
>>>   1. For the first few results, display one top-scoring
>>> product per manufacturer (for a given field, display the top-scoring
>>> results of the unique field values for the first N matches). This N could
>>> be either a percentage relative to total matches or a configurable absolute
>>> value.
>>>   2. Enforce a penalty on the score for the results
>>> that have duplicate field values. The penalty can be enforced in such a way
>>> that the results with higher scores will not be affected, as against the
>>> ones with lower scores.
>>>
>>> Both of the solutions can be implemented while sorting the documents
>>> with TopFieldCollector / TopScoreDocCollector.
>>>
>>> Does this answer your question?  Please let me know if you have any more
>>> questions.
>>>
>>> Thanks,
>>> Karthick
>>>
>>> On Mon, Aug 20, 2012 at 3:26 AM, Mikhail Khludnev <
>>> mkhlud...@griddynamics.com> wrote:
>>>
 Hello,

 I've got the problem description below. Can you explain the expected
 user experience, and/or solution approach before diving into the algorithm
 design?

 Thanks


 On Sat, Aug 18, 2012 at 2:50 AM, Karthick Duraisamy Soundararaj <
 karthick.soundara...@gmail.com> wrote:

> My problem is that when there are a lot of documents representing
> products, products from the same manufacturer seem to appear in close
> proximity in the results and therefore, it doesn't provide brand
> diversity. When you search for sofas, you get sofas from manufacturer A
> dominating the first page while the sofas from manufacturer B dominate
> the second page, etc. The issue here is that a manufacturer tends to
> describe the different sofas he produces the same way and therefore
> there is very little difference between the documents representing two
> sofas.
>



 --
 Sincerely yours
 Mikhail Khludnev
 Tech Lead
 Grid Dynamics

 
  


>>>
>>>
>>
>
>


-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics


 


RE: Solr Custom Filter Factory - How to pass parameters?

2012-08-20 Thread ksu wildcats
Thanks Markus.
Links are helpful. I will give it a try and see if that solves my problem.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Custom-Filter-Factory-How-to-pass-parameters-tp4002217p4002248.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Near Real Time + Facets + Hierarchical Faceting (Pivot Table) with Date Range: huge data set

2012-08-20 Thread Fuad Efendi
NRT does not work because the index updates hundreds of times per second vs.
a "cache" warm-up time of a few minutes… and we are in a loop…

> allowing you to query
> your huge index in ms.

Solr also allows querying in ms. What is the difference? No one can sort
1,000,000 terms in descending "counts" order faster than the current Solr
implementation, and FieldCache & UnInvertedField can't be used together
with NRT… the cache is discarded a few times per second!

- Fuad
http://www.tokenizer.ca




On 12-08-14 8:17 AM, "Nagendra Nagarajayya"
 wrote:

>You should try realtime NRT, available with Apache Solr 4.0 with
>RankingAlgorithm 1.4.4, which allows faceting in realtime.
>
>RankingAlgorithm 1.4.4 also provides an age feature that allows you to
>retrieve the most recently changed docs in realtime, allowing you to query
>your huge index in ms.
>
>You can get more information and also download from here:
>
>http://solr-ra.tgels.org
>
>Regards
>
>- Nagendra Nagarajayya
>http://solr-ra.tgels.org
>http://rankingalgorithm.tgels.org
>
>ps. Note: Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external
>implementation
>
>
>On 8/13/2012 11:38 AM, Fuad Efendi wrote:
>> SOLR-4.0
>>
>> I am trying to implement this; funny idea to share:
>>
>> 1. http://wiki.apache.org/solr/HierarchicalFaceting
>> unfortunately it does not support date ranges. However, a workaround: use
>> the "String" type instead of "*_tdt" and define fields such as
>> published_hour
>> published_day
>> published_week
>> …
>>
>> Of course you will need to stick with a timezone; but you can add an
>> index (or indexes) for each timezone. And most importantly, "string"
>> facets are much faster than "Date Trie" ranges.
>>
>>
>>
>> 2. Our index is over 100 million documents (from social networks) and
>> rapidly grows (millions a day); cache warm-up takes a few minutes;
>> Near-Real-Time does not work with faceting.
>>
>> However… another workaround: we can have a Daily Core (optimized at
>> midnight), plus a Current Core (only today's data, optimized), plus a
>> Last Hour Core (near real time).
>>
>> The "Last Hour Data" is small enough that we can use facets with the
>> Near Real Time feature.
>>
>> The service layer will accumulate search results from the three layers;
>> it will be near real time.
>>
>>
>>
>> Any thoughts? Thanks,
>>
>>
>>
>>
>




Shingle and PositionFilterFactory question

2012-08-20 Thread Carrie Coy
I am trying to use shingles and a position filter to make a query for
"foot print", for example, match either "foot print" or "footprint".
From the docs: using the PositionFilter in combination makes it
possible to make all shingles synonyms of each other.


I've configured my analyzer like this:

<filter class="solr.ShingleFilterFactory" maxShingleSize="2"
        outputUnigrams="true" outputUnigramsIfNoShingles="false"
        tokenSeparator=""/>
<filter class="solr.PositionFilterFactory"/>

user query:  "foot print"

Without PositionFilterFactory, parsed query: +(((title:foot)
(title:print))~2) (title:"(foot footprint) print")


With PositionFilterFactory, parsed query: +(((title:foot)
(title:print))~2) ()


Why, when I add PositionFilterFactory into the mix, is the "footprint"
shingle omitted?


Output of analysis:

WT:

  text   raw_bytes         start  end  position  type
  foot   [66 6f 6f 74]     0      4    1         word
  print  [70 72 69 6e 74]  5      10   2         word

SF:

  text       raw_bytes                     start  end  positionLength  type     position
  foot       [66 6f 6f 74]                 0      4    1               word     1
  footprint  [66 6f 6f 74 70 72 69 6e 74]  0      10   2               shingle  1
  print      [70 72 69 6e 74]              5      10   1               word     2



Thanks,
Carrie Coy









UnInvertedField limitations

2012-08-20 Thread Fuad Efendi

Hi All,


I have a problem… (Yonik, please!) help me, what are the term count limits? I
possibly have 256,000,000 different terms in a field… or 16,000,000? Can I
temporarily disable the feature?

Thanks!


2012-08-20 16:20:19,262 ERROR [solr.core.SolrCore] - [pool-1-thread-1] - :
org.apache.solr.common.SolrException: Too many values for UnInvertedField
faceting on field enrich_keywords_string_mv
    at org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:179)
    at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:668)
    at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:326)
    at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:423)
    at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:206)
    at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:85)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)




-- 
Fuad Efendi
http://www.tokenizer.ca





Grammar for ComplexPhraseQueryParser

2012-08-20 Thread vempap
Hello,

   Does anyone have the grammar file (.jj file) for the complex phrase query
parser? The patch from https://issues.apache.org/jira/browse/SOLR-1604 does
not include the grammar file.

Thanks,
Phani.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Grammar-for-ComplexPhraseQueryParser-tp4002263.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Diversifying Search Results - Custom Collector

2012-08-20 Thread Karthick Duraisamy Soundararaj
Hi Mikhail,
  You are correct. "[+] show 6 results.." would work, but
it wouldn't suit my requirements. This is a question of user experience,
right?

Imagine the product manager comes to you and says: I don't want to see
"[+] show 6 results..", and I want the results to be diverse but shown
like any other search results.

I think grouping does this by two-pass collection. In the first pass it
figures out all the groups, and in the second pass it collects the results
into these groups.


Thanks,
Karthick

On Mon, Aug 20, 2012 at 3:24 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Hello,
>
> I don't believe your task can be solved by playing with the scoring/collector
> or by shuffling.
> For me it's absolutely a Grouping use case (though I don't really know this
> feature well).
>
> > Grouping cannot solve the problem because I don't want to limit the
> number of results shown based on the grouping field.
>
> I'm not really getting it. Why can't you set the limit to 11 and just show
> labels like "[+] show 6 results.." or, if you have 11, "[+] show more than 10
> .."?
>
> If the problem is constructing the search result page, I can suggest
> submitting a search request with rows=0&facet.field=BRAND; then your
> algorithm can choose the number of items needed per brand and submit
> rows=X&fq=BRAND:Y, which gives you arbitrary sizes for "groups".
>
> Will this work for you?
>
>
> On Mon, Aug 20, 2012 at 8:28 PM, Karthick Duraisamy Soundararaj <
> d.s.karth...@gmail.com> wrote:
>
>> Tanguy,
>>   Your idea is perfect for cases where there are too many
>> documents, with 80-90% of documents having the same value for a particular
>> field. As an example, your idea is ideal for, let's say, 10 documents in
>> total like this:
>>
>>  doc1 :  Kellog's 
>>  doc2 :  Kellog's 
>>  doc3 :  Kellog's 
>>  doc4 :  Kellog's 
>>  doc5 :  Kellog's 
>>  doc6 :  Kellog's 
>>  doc7 :  Kellog's 
>>  doc8 :  Nestle 
>>  doc9 :  Kellog's 
>>  doc10 :  Kellog's 
>>
>> But what I have is:
>>  doc1 :  Maggi 
>>  doc2 :  Maggi  
>>  doc3 :  M&M's 
>>  doc4 :  M&M's 
>>  doc5 :  Hershey's 
>>  doc6 :  Hershey's 
>>  doc7 :  Nestle 
>>  doc8 :  Nestle 
>>  doc9 :  Kellog's 
>>  doc10 :  Kellog's 
>>
>>
>> Thanks,
>> Karthick
>>
>> On Mon, Aug 20, 2012 at 12:01 PM, Tanguy Moal wrote:
>>
>>> Hello,
>>>
>>> I don't know if this could help, but if I understood your issue, you
>>> have a lot of documents with the same or very close scores. Moreover, I
>>> think you get your matches in merchant order (more or less) because they
>>> must be indexed in that very same order, so Solr returns documents of the
>>> same score in insertion order (although there is no contract specifying this).
>>>
>>> You could work around that issue by:
>>> 1/ Turning off tf/idf, because you're searching in documents with little
>>> text where only the match counts, but frequencies obviously aren't helping.
>>> 2/ Adding a random number to each document at index time, and boosting on
>>> that random value at query time; this will shuffle your results, and that's
>>> probably the simplest thing to do.
>>>
>>> Hope this helps,
>>>
>>> Tanguy
>>>
>>> 2012/8/20 Karthick Duraisamy Soundararaj 
>>>
 Hello Mikhail,
 Thank you for the reply. In terms of user
 experience, I want to spread out the products from the same brand farther
 from each other, *at least* in the first 50-100 results we display. I am
 thinking about two different approaches as a solution.

   1. For the first few results, display one top-scoring
 product per manufacturer (for a given field, display the top-scoring
 results of the unique field values for the first N matches). This N could
 be either a percentage relative to total matches or a configurable absolute
 value.
   2. Enforce a penalty on the score for the
 results that have duplicate field values. The penalty can be enforced in
 such a way that the results with higher scores will not be affected, as
 against the ones with lower scores.

 Both of the solutions can be implemented while sorting the documents
 with TopFieldCollector / TopScoreDocCollector.

 Does this answer your question?  Please let me know if you have any
 more questions.

 Thanks,
 Karthick

 On Mon, Aug 20, 2012 at 3:26 AM, Mikhail Khludnev <
 mkhlud...@griddynamics.com> wrote:

> Hello,
>
> I've got the problem description below. Can you explain the expected
> user experience, and/or solution approach before diving into the algorithm
> design?
>
> Thanks
>
>
> On Sat, Aug 18, 2012 at 2:50 AM, Karthick Duraisamy Soundararaj <
> karthick.soundara...@gmail.com> wrote:
>
>> My problem is that when there are a lot of documents representing
>> products, products from the same manufacturer seem to appear in close
>> proximity in the results and therefore, it doesn't provide brand
>> diversity.

Many fields versus join

2012-08-20 Thread Steven Livingstone Pérez
Hi folks. I read some posts in the past about this subject but nothing that
definitively answers my question.
I am trying to understand the trade-off when you use a large number of fields
(not sure what a quantitative value of "large" is in Solr .. say 200 fields)
versus a join - and even a multi-value join.
The reason being, I have a document that has a set of core fields and then a
load of metadata that is a repeating structure:
D1: F1 F2 F3 F4 F5 ... (S1a S1b S1c) (S2a S2b S2c) ...
I'm not sure whether to create a load of fields up to SNx and a single document,
or to have multiple documents with each SNx in a separate document with a
parent id that points to a parent document (or a multivalue metadata pointer
field).
I hope that comes across reasonably well - please ask if not. Oh, if anyone
knows of any quantitative studies on Solr fields/documents, I'd love to see the
hard stats to improve my knowledge.
Loving Solr.
Cheers,
/Steven



Re: Switch from Sphinx to Solr - some basics please

2012-08-20 Thread Lance Norskog
"I have for example jobs form country A, jobs from country B and so on until
100 countries. I need to have for each country an separate index, because if
someone search for jobs in country A I need to query only the index for
country A. How to solve this problem?"

Ah! Will the text be in different languages? You probably want a
separate index for each language base. Solr/Lucene has very good
facilities for text in different languages.

In general, this will be a learning experience. After you do the
deployment, you will discover problems in search design, deployment
design, scaling architecture, and operations tools. You should plan to
do two Solr deployments.

On Wed, Aug 15, 2012 at 8:27 AM, Walter Underwood  wrote:
> These do require some Sphinx knowledge. I could answer them on StackOverflow 
> because I converted Chegg from Sphinx to Solr this year.
>
> As I said there, read about Solr cores. They are independent search 
> configurations and indexes within one Solr server: 
> http://wiki.apache.org/solr/CoreAdmin
>
> For your jobs example, I would use filter queries to limit the search to a 
> single country. Filter them to country:us or country:de or country:fr and you 
> will only get result from that country.
>
> Solr does not use the term "rotate" for indexes. You can delete with a query, 
> so you could delete all the jobs for one country, reindex those, then commit.
>
> Separate cores are best when you have different kinds of data. At Chegg, we 
> search books and college courses. Those are in different cores and have very 
> different schemas.
>
> wunder
>
> On Aug 15, 2012, at 5:11 AM, nnikolay wrote:
>
>> Hi iorixxx, thanks for the reply.
>>
>> Well, you don't need Sphinx knowledge to answer my questions.
>>
>> I have written down what I want:
>>
>> 1. I need to have 2 separate indexes. On Stack Overflow I got the answer
>> that I need to start 2 cores, for example. How many cores can I run in
>> Solr? I have, for example, over 100 different indexes that should be seen
>> as separate data. These indexes should be reindexed at different times and
>> their data should not be mixed with each other.
>>
>> You need to understand the following situation:
>>
>> I have for example jobs from country A, jobs from country B and so on until
>> 100 countries. I need to have a separate index for each country, because if
>> someone searches for jobs in country A I need to query only the index for
>> country A. How to solve this problem?
>>
>> How to do this? Is there a good tutorial? It is explained very badly in the
>> Solr wiki.
>>
>> 2. When I get new data, for example: should I rotate the whole index
>> again, or can I add the new rows and delete the old rows? What is your
>> suggestion?
>>
>> Thanks
>> Nik
>>
>>
>>
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Switch-from-Sphinx-to-Solr-some-basics-please-tp4001234p4001379.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>
> --
> Walter Underwood
> wun...@wunderwood.org
>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Atomic Multicore Operations - E.G. Move Docs

2012-08-20 Thread Nicholas Ball

hi lance,

how would that work? generation is essentially versioning right?
i also don't see why you need to use zk to do this as it's all on a single
machine, was hoping for a simpler solution :)
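
for reference, a bare SolrJ sketch of the commit/rollback suggestion quoted
below (core URLs are placeholders; note it is still not atomic - a reader can
see the window between the two commits, which is exactly the gap being
discussed):

    import java.util.List;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class MoveDocs {
        public static void move(List<SolrInputDocument> docs, List<String> ids)
                throws Exception {
            SolrServer src = new HttpSolrServer("http://localhost:8983/solr/core0");
            SolrServer dst = new HttpSolrServer("http://localhost:8983/solr/core1");
            try {
                dst.add(docs);       // copy the docs to the target core
                src.deleteById(ids); // remove them from the source core
                dst.commit();
                // a reader between these two commits sees the docs in both cores
                src.commit();
            } catch (Exception e) {
                dst.rollback();
                src.rollback();
                throw e;
            }
        }
    }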

On Sun, 19 Aug 2012 19:26:41 -0700, Lance Norskog 
wrote:
> I would use generation numbers on documents, and communicate a global
> generation number in ZK.
> 
> On Thu, Aug 16, 2012 at 2:22 AM, Nicholas Ball
>  wrote:
>>
>> I've been close to implementing a 2PC protocol before for something
else,
>> however for this it's not needed.
>> As the move operation will be done on a single node which has both the
>> cores, this could be done differently. Just not entirely sure how to do
>> it.
>>
>> When a commit is done at the moment, the core must get locked somehow,
it
>> is at this point where we should lock the other core too if a move
>> operation is being executed.
>>
>> Nick
>>
>> On Thu, 16 Aug 2012 10:32:10 +0800, Li Li  wrote:
>>>
>>
http://zookeeper.apache.org/doc/r3.3.6/recipes.html#sc_recipes_twoPhasedCommit
>>>
>>> On Thu, Aug 16, 2012 at 7:41 AM, Nicholas Ball
>>>  wrote:

 Haven't managed to find a good way to do this yet. Does anyone have
any
 ideas on how I could implement this feature?
 Really need to move docs across from one core to another atomically.

 Many thanks,
 Nicholas

 On Mon, 02 Jul 2012 04:37:12 -0600, Nicholas Ball
  wrote:
> That could work, but then how do you ensure commit is called on the
>> two
> cores at the exact same time?
>
> Cheers,
> Nicholas
>
> On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog

> wrote:
>> Index all documents to both cores, but do not call commit until
both
>> report that indexing worked. If one of the cores throws an
exception,
>> call roll back on both cores.
>>
>> On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball
>>  wrote:
>>>
>>> Hey all,
>>>
>>> Trying to figure out the best way to perform atomic operation
across
>>> multiple cores on the same solr instance i.e. a multi-core
 environment.
>>>
>>> An example would be to move a set of docs from one core onto
another
> core
>>> and ensure that a softcommit is done as the exact same time. If
one
> were
>>> to
>>> fail so would the other.
>>> Obviously this would probably require some customization but
wanted
>> to
>>> know what the best way to tackle this would be and where should I
be
>>> looking in the source.
>>>
>>> Many thanks for the help in advance,
>>> Nicholas a.k.a. incunix


Re: UnInvertedField limitations

2012-08-20 Thread Jack Krupansky
It appears that there is a hard limit of 24 bits, or 16M, on the number of
bytes used to reference the terms in a single field of a single document. It
takes 1, 2, 3, 4, or 5 bytes to reference a term. If each took 4 bytes, that
would allow 16M/4 or 4 million unique terms - per document. Do you have such
large documents? This appears to be a hard limit based on packing 24 bits
into a Java int.


You can try facet.method=enum, but that may be too slow.
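
For example (facet.method is a standard parameter; the field is the one from
your log):

    facet=true&facet.field=enrich_keywords_string_mv&facet.method=enum

facet.method=enum iterates the terms and intersects filterCache entries
instead of un-inverting the field, so it sidesteps this limit at the cost of
speed.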

What release of Solr are you running?

-- Jack Krupansky

-Original Message- 
From: Fuad Efendi

Sent: Monday, August 20, 2012 4:34 PM
To: Solr-User@lucene.apache.org
Subject: UnInvertedField limitations

Hi All,


I have a problem… (Yonik, please!) help me, what are the term count limits? I
possibly have 256,000,000 different terms in a field… or 16,000,000?

Thanks!


2012-08-20 16:20:19,262 ERROR [solr.core.SolrCore] - [pool-1-thread-1] - :
org.apache.solr.common.SolrException: Too many values for UnInvertedField
faceting on field enrich_keywords_string_mv
    at org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:179)
    at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:668)
    at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:326)
    at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:423)
    at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:206)
    at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:85)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)




--
Fuad Efendi
http://www.tokenizer.ca





Re: UnInvertedField limitations

2012-08-20 Thread Lance Norskog
Is this required by your application? Is there any way to reduce the
number of terms?

A workaround is to use shards. If your terms follow Zipf's Law, each
shard will have fewer than the complete number of terms. For N shards,
each shard will have ~1/N of the singleton terms. For 2-count terms,
between 1/N and 2/N of the shards will have that term.

Now I'm interested but not mathematically capable: what is the general
probabilistic formula for splitting Zipf's Law across shards?
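
A rough sketch, assuming documents land on shards uniformly at random: a term
occurring in c documents is present on a given shard with probability
1 - (1 - 1/N)^c. For c = 1 that is exactly 1/N, and for c = 2 it is
2/N - 1/N^2, matching the estimates above; summing that expression over all
terms gives the expected number of distinct terms per shard.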

On Mon, Aug 20, 2012 at 3:51 PM, Jack Krupansky  wrote:
> It appears that there is a hard limit of 24 bits, or 16M, on the number of
> bytes used to reference the terms in a single field of a single document. It
> takes 1, 2, 3, 4, or 5 bytes to reference a term. If each took 4 bytes, that
> would allow 16M/4 or 4 million unique terms - per document. Do you have such
> large documents? This appears to be a hard limit based on packing 24 bits
> into a Java int.
>
> You can try facet.method=enum, but that may be too slow.
>
> What release of Solr are you running?
>
> -- Jack Krupansky
>
> -Original Message- From: Fuad Efendi
> Sent: Monday, August 20, 2012 4:34 PM
> To: Solr-User@lucene.apache.org
> Subject: UnInvertedField limitations
>
>
> Hi All,
>
>
> I have a problem… (Yonik, please!) help me, what are the term count limits? I
> possibly have 256,000,000 different terms in a field… or 16,000,000?
>
> Thanks!
>
>
> 2012-08-20 16:20:19,262 ERROR [solr.core.SolrCore] - [pool-1-thread-1] - :
> org.apache.solr.common.SolrException: Too many values for UnInvertedField
> faceting on field enrich_keywords_string_mv
>     at org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:179)
>     at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:668)
>     at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:326)
>     at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:423)
>     at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:206)
>     at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:85)
>     at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)
>
>
>
>
> --
> Fuad Efendi
> http://www.tokenizer.ca
>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Atomic Multicore Operations - E.G. Move Docs

2012-08-20 Thread Lance Norskog
Yes, by generations I meant versioning. The problem is that you have
to have a central holder of the current generation number. ZK does
this very well. It is a distributed synchronized file system for very
small files. If you have a more natural place to store the current
generation number, that's fine also.

On Mon, Aug 20, 2012 at 2:47 PM, Nicholas Ball
 wrote:
>
> hi lance,
>
> how would that work? generation is essentially versioning right?
> i also don't see why you need to use zk to do this as it's all on a single
> machine, was hoping for a simpler solution :)
>
> On Sun, 19 Aug 2012 19:26:41 -0700, Lance Norskog 
> wrote:
>> I would use generation numbers on documents, and communicate a global
>> generation number in ZK.
>>
>> On Thu, Aug 16, 2012 at 2:22 AM, Nicholas Ball
>>  wrote:
>>>
>>> I've been close to implementing a 2PC protocol before for something
> else,
>>> however for this it's not needed.
>>> As the move operation will be done on a single node which has both the
>>> cores, this could be done differently. Just not entirely sure how to do
>>> it.
>>>
>>> When a commit is done at the moment, the core must get locked somehow,
> it
>>> is at this point where we should lock the other core too if a move
>>> operation is being executed.
>>>
>>> Nick
>>>
>>> On Thu, 16 Aug 2012 10:32:10 +0800, Li Li  wrote:

>>>
> http://zookeeper.apache.org/doc/r3.3.6/recipes.html#sc_recipes_twoPhasedCommit

 On Thu, Aug 16, 2012 at 7:41 AM, Nicholas Ball
  wrote:
>
> Haven't managed to find a good way to do this yet. Does anyone have
> any
> ideas on how I could implement this feature?
> Really need to move docs across from one core to another atomically.
>
> Many thanks,
> Nicholas
>
> On Mon, 02 Jul 2012 04:37:12 -0600, Nicholas Ball
>  wrote:
>> That could work, but then how do you ensure commit is called on the
>>> two
>> cores at the exact same time?
>>
>> Cheers,
>> Nicholas
>>
>> On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog
> 
>> wrote:
>>> Index all documents to both cores, but do not call commit until
> both
>>> report that indexing worked. If one of the cores throws an
> exception,
>>> call roll back on both cores.
>>>
>>> On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball
>>>  wrote:

 Hey all,

 Trying to figure out the best way to perform atomic operation
> across
 multiple cores on the same solr instance i.e. a multi-core
> environment.

 An example would be to move a set of docs from one core onto
> another
>> core
 and ensure that a softcommit is done as the exact same time. If
> one
>> were
 to
 fail so would the other.
 Obviously this would probably require some customization but
> wanted
>>> to
 know what the best way to tackle this would be and where should I
> be
 looking in the source.

 Many thanks for the help in advance,
 Nicholas a.k.a. incunix



-- 
Lance Norskog
goks...@gmail.com


Re: Many fields versus join

2012-08-20 Thread Erick Erickson
Join works best with a small number of unique values. Unfortunately,
people often want to join on <uniqueKey>, which is by definition
unique per document.

The usual advice is to first try to flatten your data as much as possible.
There's also some ongoing work on "block joins" that you may want to
look at the JIRA for, explicitly for parent/child relationships, but I confess
I haven't a real clue what the details are...

Best
Erick

On Mon, Aug 20, 2012 at 2:56 PM, Steven Livingstone Pérez
 wrote:
> Hi folks. I read some posts in the past about this subject but nothing that
> definitively answers my question.
> I am trying to understand the trade-off when you use a large number of fields
> (not sure what a quantitative value of "large" is in Solr .. say 200 fields)
> versus a join - and even a multi-value join.
> The reason being, I have a document that has a set of core fields and then a
> load of metadata that is a repeating structure:
> D1: F1 F2 F3 F4 F5 ... (S1a S1b S1c) (S2a S2b S2c) ...
> I'm not sure whether to create a load of fields up to SNx and a single
> document, or to have multiple documents with each SNx in a separate document
> with a parent id that points to a parent document (or a multivalue metadata
> pointer field).
> I hope that comes across reasonably well - please ask if not. Oh, if anyone
> knows of any quantitative studies on Solr fields/documents, I'd love to see
> the hard stats to improve my knowledge.
> Loving Solr.
> Cheers,
> /Steven


Solr Score threshold 'reasonably', independent of results returned

2012-08-20 Thread Ramzi Alqrainy
Usually, search results are sorted by their score (how well the document
matched the query), but it is common to need to support the sorting of
supplied data too.
Boosting affects the scores of matching documents in order to affect ranking
in score-sorted search results. Providing a boost value, whether at the
document or field level, is optional.
When the results are returned with scores, we want to be able to only "keep"
results that are above some score (i.e. results of a certain quality only).
Is it possible to do this when the returned subset could be anything?
I ask because it seems like on some queries a score of, say, 0.008 results
in a decent match, whereas on other queries a higher score results in
a poor match.
I have written pseudo-code to achieve what I said.
Note: I have attached my code as a screenshot:

http://lucene.472066.n3.nabble.com/file/n4002312/Screen_Shot_2012-08-21_at_5.30.38_AM.png
 

https://issues.apache.org/jira/browse/SOLR-3747
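
For reference, a minimal client-side sketch of the relative-cutoff idea (not
the attached code; the Hit class and the ratio are illustrative). Keeping hits
that score at least some fraction of the top hit sidesteps the fact that
absolute scores are not comparable across queries:

    import java.util.ArrayList;
    import java.util.List;

    public class ScoreThreshold {
        public static class Hit {
            public final String id; public final float score;
            public Hit(String id, float score) { this.id = id; this.score = score; }
        }

        // hits must be sorted by score descending, as Solr returns them
        public static List<Hit> keepAbove(List<Hit> hits, float ratio) {
            List<Hit> kept = new ArrayList<Hit>();
            if (hits.isEmpty()) return kept;
            // e.g. ratio = 0.5 keeps hits scoring at least half the top score
            float cutoff = hits.get(0).score * ratio;
            for (Hit h : hits) {
                if (h.score >= cutoff) kept.add(h);
            }
            return kept;
        }
    }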



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Score-threshold-reasonably-independent-of-results-returned-tp4002312.html
Sent from the Solr - User mailing list archive at Nabble.com.


mergeindex: what happens if there is deletion during index merging

2012-08-20 Thread Yandong Yao
Hi guys,

From http://wiki.apache.org/solr/MergingSolrIndexes, it says: 'Using
"srcCore", care is taken to ensure that the merged index is not corrupted
even if writes are happening in parallel on the source index'.

What does this mean? If there are deletion requests during merging, will the
deletions be processed correctly after merging finishes?

1)
e.g.: I have an existing core 'core0', and I want to merge core 'core1' and
'core2' to core 'core0', so I will use
http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&srcCore=core1&srcCore=core2
,

If, while the merge happens, core0, core1, and core2 receive deletion
requests to delete some old documents, will the final core 'core0' contain
all content from 'core1' and 'core2', with all documents matching the
deletion criteria deleted?

2)
And if core0, core1, and core2 are processing deletion requests when, at the
same time, a core merge request comes in, what will happen then? Will the
merge request block until the deletions finish on all cores?

Thanks very much in advance!

Regards,
Yandong


RE: Solr Custom Filter Factory - How to pass parameters?

2012-08-20 Thread ksu wildcats
Thanks for your help. I was able to get it working using the parameters
from the field type definitions in the config files.

I am now stuck on the next step.
Can you please tell if there is a way to identify/intercept the last token that
gets added to the index (across all documents)?

Here is my scenario:
1) I have a custom implementation in the "incrementToken" method in CustomFilter
2) I am trying to collect all tokens from all documents, do some analysis on
those tokens, and then write the result to a database.
3) I have the results saved in memory and am writing them to the database after
the last token is parsed:
if (!input.incrementToken()) {
 // custom logic that writes the in-memory data to the database
}
4) I noticed that this approach resulted in too many DB calls (one per
document)
5) To avoid too many calls to the database, I tried to batch results from
multiple documents and then write them all at once to the database, but what I
couldn't figure out is how to determine when to flush the results from the
CustomFilter to the database.

Is there any method in the FilterFactory or Filter class that I can use to know
that indexing is complete?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Custom-Filter-Factory-How-to-pass-parameters-tp4002217p4002323.html
Sent from the Solr - User mailing list archive at Nabble.com.