Re: Solr Scoring

2012-04-13 Thread Li Li
another way is to use payloads: http://wiki.apache.org/solr/Payloads
the advantage of payloads is that you only need one field, and the frq
file stays smaller than with two fields. but the disadvantage is that payloads are
stored in the prx file, so I am not sure which one is faster. maybe you can try them
both.
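
For the archive, a minimal sketch of a payload-capable field type, assuming
Solr 3.x; DelimitedPayloadTokenFilterFactory attaches whatever follows the
delimiter to each token as a payload (actually using payloads at query time
still requires custom query or similarity code):

  <fieldType name="payloads" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- a token like "edge|2.5" is indexed as "edge" with float payload 2.5 -->
      <filter class="solr.DelimitedPayloadTokenFilterFactory"
              delimiter="|" encoder="float"/>
    </analyzer>
  </fieldType>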

On Fri, Apr 13, 2012 at 8:04 AM, Erick Erickson wrote:

> GAH! I had my head in "make this happen in one field" when I wrote my
> response, without being explicit. Of course Walter's solution is pretty
> much the standard way to deal with this.
>
> Best
> Erick
>
> On Thu, Apr 12, 2012 at 5:38 PM, Walter Underwood 
> wrote:
> > It is easy. Create two fields, text_exact and text_stem. Don't use the
> stemmer in the first chain, do use the stemmer in the second. Give the
> text_exact a bigger weight than text_stem.
> >
> > wunder
> >
> > On Apr 12, 2012, at 4:34 PM, Erick Erickson wrote:
> >
> >> No, I don't think there's an OOB way to make this happen. It's
> >> a recurring theme, "make exact matches score higher than
> >> stemmed matches".
> >>
> >> Best
> >> Erick
> >>
> >> On Thu, Apr 12, 2012 at 5:18 AM, Kissue Kissue 
> wrote:
> >>> Hi,
> >>>
> >>> I have a field in my index called itemDesc which i am applying
> >>> EnglishMinimalStemFilterFactory to. So if i index a value to this field
> >>> containing "Edges", the EnglishMinimalStemFilterFactory applies
> stemming
> >>> and "Edges" becomes "Edge". Now when i search for "Edges", documents
> with
> >>> "Edge" score better than documents with the actual search word -
> "Edges".
> >>> Is there a way i can make documents with the actual search word in this
> >>> case "Edges" score better than document with "Edge"?
> >>>
> >>> I am using Solr 3.5. My field definition is shown below:
> >>>
> >>>  positionIncrementGap="100">
> >>>  
> >>>
> >>>>>> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
> >>>  >>>ignoreCase="true"
> >>>words="stopwords_en.txt"
> >>>enablePositionIncrements="true"
> >>> 
> >>>
> >>>
> >>>  
> >>>  
> >>>
> >>> synonyms="synonyms.txt"
> >>> ignoreCase="true" expand="true"/>
> >>> >>>ignoreCase="true"
> >>>words="stopwords_en.txt"
> >>>enablePositionIncrements="true"
> >>>/>
> >>>
> >>>
> >>> >>> protected="protwords.txt"/>
> >>>
> >>>  
> >>>
> >>>
> >>> Thanks.
> >
> >
> >
> >
> >
>


Re: EmbeddedSolrServer and StreamingUpdateSolrServer

2012-04-13 Thread Mikhail Khludnev
Did I get it right that you have two separate processes (different apps) accessing
the same Lucene Directory simultaneously? In that case I suggest reading
about the locking mechanism; I'm not really experienced in it.
You showed the logs from the StreamingUpdateSolrServer failure, that part is clear. Can
you show the logs from the EmbeddedSolrServer commit, which is supposed to be successful?
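
For reference, the lock type is configured in solrconfig.xml; a minimal
sketch, assuming Solr 3.x (note that two Solr instances writing into the same
index directory is unsupported and risks corruption no matter which lock type
is chosen):

  <indexDefaults>
    <!-- "native" uses OS-level file locks on the index directory -->
    <lockType>native</lockType>
  </indexDefaults>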

On Fri, Apr 13, 2012 at 9:34 AM, pcrao  wrote:

> Hi Shawn,
>
> Thanks for sharing your opinion.
>
> Mikhail Khludnev, what do you think of Shawn's opinion?
>
> Thanks,
> PC Rao.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/EmbeddedSolrServer-and-StreamingUpdateSolrServer-tp3889073p3907223.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
ge...@yandex.ru


 


Re: How to read SOLR cache statistics?

2012-04-13 Thread Li Li
http://wiki.apache.org/solr/SolrCaching

On Fri, Apr 13, 2012 at 2:30 PM, Kashif Khan  wrote:

> Can anyone explain what the following parameters mean in the SOLR cache
> statistics?
>
> *name*:  queryResultCache
> *class*:  org.apache.solr.search.LRUCache
> *version*:  1.0
> *description*:  LRU Cache(maxSize=512, initialSize=512)
> *stats*:  lookups : 98
> *hits *: 59
> *hitratio *: 0.60
> *inserts *: 41
> *evictions *: 0
> *size *: 41
> *warmupTime *: 0
> *cumulative_lookups *: 98
> *cumulative_hits *: 59
> *cumulative_hitratio *: 0.60
> *cumulative_inserts *: 39
> *cumulative_evictions *: 0
>
> AND also this
>
>
> *name*:  fieldValueCache
> *class*:  org.apache.solr.search.FastLRUCache
> *version*:  1.0
> *description*:  Concurrent LRU Cache(maxSize=10000, initialSize=10,
> minSize=9000, acceptableSize=9500, cleanupThread=false)
> *stats*:  *lookups *: 8
> *hits *: 4
> *hitratio *: 0.50
> *inserts *: 2
> *evictions *: 0
> *size *: 2
> *warmupTime *: 0
> *cumulative_lookups *: 8
> *cumulative_hits *: 4
> *cumulative_hitratio *: 0.50
> *cumulative_inserts *: 2
> *cumulative_evictions *: 0
> *item_ABC *:
>
> {field=ABC,memSize=340592,tindexSize=1192,time=1360,phase1=1344,nTerms=7373,bigTerms=1,termInstances=11513,uses=4}
> *item_BCD *:
>
> {field=BCD,memSize=341248,tindexSize=1952,time=1688,phase1=1688,nTerms=8075,bigTerms=0,termInstances=13510,uses=2}
>
> Without understanding these terms I cannot configure the server for better
> cache
> usage. The point is that searches are very slow. These stats were taken just
> after the server was restarted. I just want to understand what these terms
> actually mean.
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-read-SOLR-cache-statistics-tp3907294p3907294.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


AW: Lexical analysis tools for German language data

2012-04-13 Thread Michael Ludwig
> Von: Tomas Zerolo

> > > There can be transformations or inflections, like the "s" in
> > > "Weinachtsbaum" (Weinachten/Baum).
> >
> > I remember from my linguistics studies that the terminus technicus
> > for these is "Fugenmorphem" (interstitial or joint morpheme) [...]
> 
> IANAL (I am not a linguist -- pun intended ;) but I've always read
> that as a genitive. Any pointers?

Admittedly, that's what you'd think, and despite linguistics telling me
otherwise I'd maintain there's some truth in it. For this case, however,
consider: "die Weihnacht" declines like "die Nacht", so:

nom. die Weihnacht
gen. der Weihnacht
dat. der Weihnacht
akk. die Weihnacht

As you can see, there's no "s" to be found anywhere, not even in the
genitive. But my gut feeling, like yours, is that this should indicate
genitive, and I would make a point of well-argued gut feeling being at
least as relevant as formalist analysis.

Michael


Re: two structures in solr

2012-04-13 Thread tkoomzaaskz
Thank you very much Erick for your reply!

So should it go something like the following:

http://lucene.472066.n3.nabble.com/file/n3907393/solr_index.png 
sorry for an ugly drawing ;)

In this example, the index will have 13 columns: 6 for project, 6 for
contractor and one to define the type. Is that right?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/two-structures-in-solr-tp3905143p3907393.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Boost differences in two environments for same query and config

2012-04-13 Thread Kerwin
Hi Erick,

Thanks for your suggestions.
I did an optimize on the remote installation, this time with the
same number of documents, but I still face the same issue, as seen from
the debug output below:

9.950362E-4 = (MATCH) sum of:
        9.950362E-4 = (MATCH) weight(RECORD_TYPE:info in 35916), product of:
                9.950362E-4 = queryWeight(RECORD_TYPE:info), product of:
                        1.0 = idf(docFreq=58891, maxDocs=8181811)
                        9.950362E-4 = queryNorm
                1.0 = (MATCH) fieldWeight(RECORD_TYPE:info in 35916), product of:
                        1.0 = tf(termFreq(RECORD_TYPE:info)=1)
                        1.0 = idf(docFreq=58891, maxDocs=8181811)
                        1.0 = fieldNorm(field=RECORD_TYPE, doc=35916)
        0.0 = (MATCH) product of:
                1.0945399 = (MATCH) sum of:
                        0.99503624 = (MATCH) weight(CD:ee123^1000.0 in 35916), product of:
                                0.99503624 = queryWeight(CD:ee123^1000.0), product of:
                                        1000.0 = boost
                                        1.0 = idf(docFreq=1, maxDocs=8181811)
                                        9.950362E-4 = queryNorm
                                1.0 = (MATCH) fieldWeight(CD:ee123 in 35916), product of:
                                        1.0 = tf(termFreq(CD:ee123)=1)
                                        1.0 = idf(docFreq=1, maxDocs=8181811)
                                        1.0 = fieldNorm(field=CD, doc=35916)
                        0.09950362 = (MATCH)
                        ConstantScoreQuery(QueryWrapperFilter(CD:ee123 CD:ee123c CD:ee123c.
                        CD:ee123dc CD:ee123e CD:ee123e. CD:ee123en CD:ee123fx CD:ee123g
                        CD:ee123g.1 CD:ee123g1 CD:ee123ee123 CD:ee123l.1 CD:ee123l1 CD:ee123ll
                        CD:ee123lr CD:ee123m.z CD:ee123mg CD:ee123mz CD:ee123na CD:ee123nx
                        CD:ee123ol CD:ee123op CD:ee123p CD:ee123p.1 CD:ee123p1 CD:ee123pn
                        CD:ee123r.1 CD:ee123r1 CD:ee123s CD:ee123s.z CD:ee123sm CD:ee123sn
                        CD:ee123sp CD:ee123ss CD:ee123sz)), product of:
                                        100.0 = boost
                                        9.950362E-4 = queryNorm
                0.0 = coord(2/3)


So I got the conf folder from the remote server location and replaced
my local conf folder with it, to see if the indexes were formed
differently, but my local installation continues to work. I would expect
to see the same behaviour as on the remote installation, but it did not
happen. (The only difference is that the remote installation has
cores while my local installation has no cores.)
Anything else I could try?
Thanks for your help.

On 4/11/12, Erick Erickson  wrote:
> Well, you're matching a different number of records, so I have to assume
> your indexes are different on the two machines.
>
> Here is one case where doing an optimize might make sense, that'll purge
> the data associated with any deleted records from the index which should
> make comparisons better.
>
> Additionally, you have to ensure that your request handler is identical
> on both, have you made any changes to solrconfig.xml?
>
> About the coord (2/3), I'm pretty clueless. But also ensure that your
> parsed query is identical on both, which is an additional check on
> whether you've changed something on one server and not the
> other.
>
> Best
> Erick
>
> On Wed, Apr 11, 2012 at 8:19 AM, Kerwin  wrote:
>> Hi All,
>>
>> I am firing the following Solr query against installations on two
>> environments one on my local Windows machine and the other on Unix
>> (Remote).
>>
>> RECORD_TYPE:info AND (NAME:ee123* OR CD:ee123^1000 OR CD:ee123*^100)
>>
>> There are no differences in the DataImportHandler configuration ,
>> Schema and Solrconfig for both these installations.
>> The correct expected result is given by the local installation of Solr
>> which also gives scores as expected for the boosts.
>>
>> CORRECT/Expected:
>> Debug query output for local installation:
>>
>> 10.822258 = (MATCH) sum of:
>>0.002170282 = (MATCH) weight(RECORD_TYPE:info in 35916), product
>> of:
>>3.65739E-4 = queryWeight(RECORD_TYPE:info), product of:
>>5.933964 = idf(docFreq=58891, maxDocs=8181811)
>>6.1634855E-5 = queryNorm
>>5.933964 = (MATCH) fieldWeight(RECORD_TYPE:info in 35916),
>> product of:
>>1.0 = tf(termFreq(RECORD_TYPE:info)=1)
>>5.933964 = idf(docFreq=58891, maxDocs=8181811)
>>1.0 = fieldNorm(field=RECORD_TYPE, doc=35916)
>>10.820087 = (MATCH) product of:
>>16.230131 = (MATCH) sum of:
>>16.223969 = (MATCH) weight(CD:ee123^1000.0 in
>> 35916), product of:
>>0.81 = queryWeight(CD:ee123^1000.0),
>> product of:
>>1000.0 = boost
>>16.224277 = idf(docFreq=1,
>

Re: Solr Scoring

2012-04-13 Thread Kissue Kissue
Thanks a lot. I had already implemented Walter's solution and was wondering
if this was the right way to deal with it. This has now given me the
confidence to go with the solution.

Many thanks.
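
For readers of the archive, a hedged sketch of that two-field setup (the type
names and boosts are illustrative, not a fixed recipe):

  <!-- text_exact: no stemmer in its chain; text_stem: stemmer included -->
  <field name="text_exact" type="text_ws" indexed="true" stored="false"/>
  <field name="text_stem"  type="text_en" indexed="true" stored="false"/>
  <copyField source="itemDesc" dest="text_exact"/>
  <copyField source="itemDesc" dest="text_stem"/>

queried through (e)dismax with the exact field weighted higher, e.g.

  qf=text_exact^2.0 text_stem^0.5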

On Fri, Apr 13, 2012 at 1:04 AM, Erick Erickson wrote:

> GAH! I had my head in "make this happen in one field" when I wrote my
> response, without being explicit. Of course Walter's solution is pretty
> much the standard way to deal with this.
>
> Best
> Erick
>
> On Thu, Apr 12, 2012 at 5:38 PM, Walter Underwood 
> wrote:
> > It is easy. Create two fields, text_exact and text_stem. Don't use the
> stemmer in the first chain, do use the stemmer in the second. Give the
> text_exact a bigger weight than text_stem.
> >
> > wunder
> >
> > On Apr 12, 2012, at 4:34 PM, Erick Erickson wrote:
> >
> >> No, I don't think there's an OOB way to make this happen. It's
> >> a recurring theme, "make exact matches score higher than
> >> stemmed matches".
> >>
> >> Best
> >> Erick
> >>
> >> On Thu, Apr 12, 2012 at 5:18 AM, Kissue Kissue 
> wrote:
> >>> Hi,
> >>>
> >>> I have a field in my index called itemDesc which i am applying
> >>> EnglishMinimalStemFilterFactory to. So if i index a value to this field
> >>> containing "Edges", the EnglishMinimalStemFilterFactory applies
> stemming
> >>> and "Edges" becomes "Edge". Now when i search for "Edges", documents
> with
> >>> "Edge" score better than documents with the actual search word -
> "Edges".
> >>> Is there a way i can make documents with the actual search word in this
> >>> case "Edges" score better than document with "Edge"?
> >>>
> >>> I am using Solr 3.5. My field definition is shown below:
> >>>
> >>>  positionIncrementGap="100">
> >>>  
> >>>
> >>>>>> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
> >>>  >>>ignoreCase="true"
> >>>words="stopwords_en.txt"
> >>>enablePositionIncrements="true"
> >>> 
> >>>
> >>>
> >>>  
> >>>  
> >>>
> >>> synonyms="synonyms.txt"
> >>> ignoreCase="true" expand="true"/>
> >>> >>>ignoreCase="true"
> >>>words="stopwords_en.txt"
> >>>enablePositionIncrements="true"
> >>>/>
> >>>
> >>>
> >>> >>> protected="protwords.txt"/>
> >>>
> >>>  
> >>>
> >>>
> >>> Thanks.
> >
> >
> >
> >
> >
>


Re: Facets involving multiple fields

2012-04-13 Thread Marc SCHNEIDER
Hi,

Thanks for your answer.
Yes it works in this case when I know the facet name (Computer). What
if I want to automatically compute all facets?
facet.query=keyword:* short_title:* doesn't work, right?

Marc.
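
For the archive, Erick's suggestion spelled out as request parameters (a
sketch; it relies on the default OR operator, so a document matching either
field is counted once):

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true
      &facet.query=keywords:computer+short_title:computer

As Marc notes, this requires knowing the term up front; it does not enumerate
all values across the two fields.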

On Thu, Apr 12, 2012 at 2:08 PM, Erick Erickson  wrote:
> facet.query=keywords:computer short_title:computer
> seems like what you're asking for.
>
> On Thu, Apr 12, 2012 at 3:19 AM, Marc SCHNEIDER
>  wrote:
>> Hi,
>>
>> Thanks for your answer.
>>> Let's say I have two fields: 'keywords' and 'short_title'.
>> For these fields I'd like to make a faceted search : if 'Computer' is
>> stored in at least one of these fields for a document I'd like to get
>> it added in my results.
>> doc1 => keywords : 'Computer' / short_title : 'Computer'
>> doc2 => keywords : 'Computer'
>> doc3 => short_title : 'Computer'
>>
>> In this case I'd like to have : Computer (3)
>>
>> I don't see how to solve this with facet.query.
>>
>> Thanks,
>> Marc.
>>
>> On Wed, Apr 11, 2012 at 5:13 PM, Erick Erickson  
>> wrote:
>>> Have you considered facet.query? You can specify an arbitrary query
>>> to facet on which might do what you want. Otherwise, I'm not sure what
>>> you mean by "faceted search using two fields". How should these fields
>>> be combined into a single facet? What that means practically is not at
>>> all obvious from your problem statement.
>>>
>>> Best
>>> Erick
>>>
>>> On Tue, Apr 10, 2012 at 8:55 AM, Marc SCHNEIDER
>>>  wrote:
 Hi,

 I'd like to make a faceted search using two fields. I want to have a
 single result and not a result by field (like when using
 facet.field=f1,facet.field=f2).
 I don't want to use a copy field either because I want it to be
 dynamic at search time.
 As far as I know this is not possible for Solr 3.x...
 But I saw a new parameter named "group.facet" for Solr4. Could that
 solve my problem? If yes could somebody give me an example?

 Thanks,
 Marc.


Re: How to read SOLR cache statistics?

2012-04-13 Thread Kashif Khan
Hi Li Li,

I have been through that WIKI before but it does not explain what
*evictions*, *inserts*, *cumulative_inserts*, *cumulative_evictions*,
*hitratio* and the rest mean. These terms are foreign to me. What does the following
line mean?

*item_ABC :
{field=ABC,memSize=340592,tindexSize=1192,time=1360,phase1=1344,nTerms=7373,bigTerms=1,termInstances=11513,uses=4}
*

I want that kind of explanation. I have read the wiki and the comments in
the solrconfig.xml file about all these things, but it does not say how to read the
stats, which is very *important!!!*.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-read-SOLR-cache-statistics-tp3907294p3907633.html
Sent from the Solr - User mailing list archive at Nabble.com.

Issues with language based indexing

2012-04-13 Thread JGar
Hello,

I am new to Solr. It is returning some docs in my search for the string "Acciones y
Valores". When I go and search for the same words in the given doc
manually, I cannot find those words. Please help me understand on what basis the doc is
found in the search.

Thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Issues-with-language-based-indexing-tp3907601p3907601.html
Sent from the Solr - User mailing list archive at Nabble.com.

Realtime /get versus SearchHandler

2012-04-13 Thread Benson Margulies
A discussion over on the dev list led me to expect that the by-id
field retrievals in a SolrCloud query would come through the get
handler. In fact, I've seen them turn up in my search component in the
search handler that is configured with my custom QT. (I have a
'prepare' method that sets ShardParams.QT to my QT to get my
processing involved in the first of the two queries.) Did I overthink
this?


Re: Trouble handling Unit symbol

2012-04-13 Thread Erick Erickson
Please review:
http://wiki.apache.org/solr/UsingMailingLists

Especially the bit about adding &debugQuery=on
and showing the results. You're asking people
to guess at solutions without providing much
in the way of context.

You might try looking at your index with Luke to
see what's actually in your index, or perhaps
TermsComponent


Best
Erick

On Fri, Apr 13, 2012 at 2:29 AM, Rajani Maski  wrote:
> Hi All,
>
>   I tried to index with UTF-8 encoding but the issue is still not fixed.
> Please see my inputs below.
>
> *Indexed XML:*
> 
> 
>  
>    0.100
>    µ
>  
> 
>
> *Search Query - * BODY:µ
>
> numfound : 0 results obtained.
>
> *What can be the reason for this? How do I need to form the search query so
> that the above document is found?*
>
>
> Thanks & Regards
>
> Regards
> Rajani
>
>
>
> 2012/4/2 Rajani Maski 
>
>> Thank you for the reply.
>>
>>
>>
>> On Sat, Mar 31, 2012 at 3:38 AM, Chris Hostetter > > wrote:
>>
>>>
>>> : We have data having such symbols like :  µ
>>> : Indexed data has  -    Dose:"0 µL"
>>> : Now , when  it is searched as  - Dose:"0 µL"
>>>        ...
>>> : Query Q value observed  : S257:"0 µL/injection"
>>>
>>> First off: your "when searched as" example does not match up to your
>>> "Query Q" observed value (ie: field queries, extra "/injection" text at
>>> the end) suggesting that you maybe cut/paste something you didn't mean to
>>> -- so take the rest of this advice with a grain of salt.
>>>
>>> If i ignore your "when it is searched as" example and focus entirely on
>>> what you say you've indexed the data as, and the Q value you are using (in
>>> what looks like the echoParams output) then the first thing that jumps out
>>> at me is that it looks like your servlet container (or perhaps your web
>>> browser if that's where you tested this) is not dealing with the unicode
>>> correctly -- because although i see a "µ" in the first three lines i
>>> quoted above (UTF8: 0xC2 0xB5), in your value observed i'm seeing it
>>> preceded by a "Â" (UTF8: 0xC3 0x82) ... suggesting that perhaps the "µ"
>>> did not get URL encoded properly when the request was made to your servlet
>>> container?
>>>
>>> In particular, you might want to take a look at...
>>>
>>>
>>> https://wiki.apache.org/solr/FAQ#Why_don.27t_International_Characters_Work.3F
>>> http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config
>>> The example/exampledocs/test_utf8.sh script included with solr
>>>
>>>
>>>
>>>
>>> -Hoss
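
For the archive, the Tomcat fix referenced above is a one-attribute change in
server.xml; a sketch (port and protocol are illustrative):

  <!-- without URIEncoding="UTF-8", Tomcat decodes query-string bytes as
       ISO-8859-1, which is exactly how µ (0xC2 0xB5) turns into "µ" -->
  <Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8"/>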
>>
>>
>>


Re: two structures in solr

2012-04-13 Thread Erick Erickson
bq: Is that right?

I don't know, does it work? You'll probably want an
additional field for unique id (just named "id" in the example)
that should be disjoint between your types.
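
A hedged sketch of what that could look like in schema.xml (field names
illustrative):

  <field name="id"       type="string" indexed="true" stored="true" required="true"/>
  <field name="doc_type" type="string" indexed="true" stored="true"/>

with ids like "project-123" / "contractor-456" kept disjoint between the two
types, and a filter such as fq=doc_type:project applied at query time.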

Best
Erick

On Fri, Apr 13, 2012 at 3:41 AM, tkoomzaaskz  wrote:
> Thank you very much Erick for your reply!
>
> So should it go something like the following:
>
> http://lucene.472066.n3.nabble.com/file/n3907393/solr_index.png
> sorry for an ugly drawing ;)
>
> In this example, the index will have 13 columns: 6 for project, 6 for
> contractor and one to define the type. Is that right?
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/two-structures-in-solr-tp3905143p3907393.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Boost differences in two environments for same query and config

2012-04-13 Thread Erick Erickson
Well, next thing I'd do is just copy your entire Solr home
directory to the remote machine and try that. If that gives
identical results on both, then try moving just your
Solr home's /data directory to the remote machine.

I suspect that you've done something different between the two
machines that's leading to this, but haven't a clue what.

If you copy your entire Solr installation over and _still_ get
this kind of thing, we're into whether the JVM or op system
are somehow changing things, which would surprise me a lot.

Best
Erick

On Fri, Apr 13, 2012 at 4:24 AM, Kerwin  wrote:
> Hi Erick,
>
> Thanks for your suggestions.
> I did an optimize on the remote installation and this time with the
> same number of documents but still face the same issue as seen from
> the debug output below:
>
> 9.950362E-4 = (MATCH) sum of:
>        9.950362E-4 = (MATCH) weight(RECORD_TYPE:info in 35916), product of:
>                9.950362E-4 = queryWeight(RECORD_TYPE:info), product of:
>                        1.0 = idf(docFreq=58891, maxDocs=8181811)
>                        9.950362E-4 = queryNorm
>                1.0 = (MATCH) fieldWeight(RECORD_TYPE:info in 35916), product 
> of:
>                        1.0 = tf(termFreq(RECORD_TYPE:info)=1)
>                        1.0 = idf(docFreq=58891, maxDocs=8181811)
>                        1.0 = fieldNorm(field=RECORD_TYPE, doc=35916)
>        0.0 = (MATCH) product of:
>                1.0945399 = (MATCH) sum of:
>                        0.99503624 = (MATCH) weight(CD:ee123^1000.0 in 35916), 
> product of:
>                                0.99503624 = queryWeight(CD:ee123^1000.0), 
> product of:
>                                        1000.0 = boost
>                                        1.0 = idf(docFreq=1, maxDocs=8181811)
>                                        9.950362E-4 = queryNorm
>                                1.0 = (MATCH) fieldWeight(CD:ee123 in 35916), 
> product of:
>                                        1.0 = tf(termFreq(CD:ee123)=1)
>                                        1.0 = idf(docFreq=1, maxDocs=8181811)
>                                        1.0 = fieldNorm(field=CD, doc=35916)
>                                0.09950362 = (MATCH)
> ConstantScoreQuery(QueryWrapperFilter(CD:ee123 CD:ee123c CD:ee123c.
> CD:ee123dc CD:ee123e CD:ee123e. CD:ee123en CD:ee123fx CD:ee123g
> CD:ee123g.1 CD:ee123g1 CD:ee123ee123 CD:ee123l.1 CD:ee123l1 CD:ee123ll
> CD:ee123lr CD:ee123m.z CD:ee123mg CD:ee123mz CD:ee123na CD:ee123nx
> CD:ee123ol CD:ee123op CD:ee123p CD:ee123p.1 CD:ee123p1 CD:ee123pn
> CD:ee123r.1 CD:ee123r1 CD:ee123s CD:ee123s.z CD:ee123sm CD:ee123sn
> CD:ee123sp CD:ee123ss CD:ee123sz)), product of:
>                                        100.0 = boost
>                                        9.950362E-4 = queryNorm
>                0.0 = coord(2/3)
>
>
> So I got the conf folder from the remote server location and replaced
> my local conf folder with this one to see if the indexes were formed
> differently but my local installation continues to work. I would expect
> to see the same behaviour as on the remote installation but it did not
> happen. (The only difference on the remote installation is that there
> are cores while my local installation has no cores).
> Anything else I could try?
> Thanks for your help.
>
> On 4/11/12, Erick Erickson  wrote:
>> Well, you're matching a different number of records, so I have to assume
>> your indexes are different on the two machines.
>>
>> Here is one case where doing an optimize might make sense, that'll purge
>> the data associated with any deleted records from the index which should
>> make comparisons better.
>>
>> Additionally, you have to ensure that your request handler is identical
>> on both, have you made any changes to solrconfig.xml?
>>
>> About the coord (2/3), I'm pretty clueless. But also ensure that your
>> parsed query is identical on both, which is an additional check on
>> whether you've changed something on one server and not the
>> other.
>>
>> Best
>> Erick
>>
>> On Wed, Apr 11, 2012 at 8:19 AM, Kerwin  wrote:
>>> Hi All,
>>>
>>> I am firing the following Solr query against installations on two
>>> environments one on my local Windows machine and the other on Unix
>>> (Remote).
>>>
>>> RECORD_TYPE:info AND (NAME:ee123* OR CD:ee123^1000 OR CD:ee123*^100)
>>>
>>> There are no differences in the DataImportHandler configuration ,
>>> Schema and Solrconfig for both these installations.
>>> The correct expected result is given by the local installation of Solr
>>> which also gives scores as expected for the boosts.
>>>
>>> CORRECT/Expected:
>>> Debug query output for local installation:
>>>
>>> 10.822258 = (MATCH) sum of:
>>>        0.002170282 = (MATCH) weight(RECORD_TYPE:info in 35916), product
>>> of:
>>>                3.65739E-4 = queryWeight(RECORD_TYPE:info), product of:
>>>                        5.933964 = idf(docFreq=58891, maxDocs=8181811)
>>>                        6.1634855E-5 = queryNorm
>>>         

Re: Trouble handling Unit symbol

2012-04-13 Thread Rajani Maski
Fine. Thank you. I will look at it.


On Fri, Apr 13, 2012 at 5:21 PM, Erick Erickson wrote:

> Please review:
> http://wiki.apache.org/solr/UsingMailingLists
>
> Especially the bit about adding &debugQuery=on
> and showing the results. You're asking people
> to guess at solutions without providing much
> in the way of context.
>
> You might try looking at your index with Luke to
> see what's actually in your index, or perhaps
> TermsComponent
>
>
> Best
> Erick
>
> On Fri, Apr 13, 2012 at 2:29 AM, Rajani Maski 
> wrote:
> > Hi All,
> >
> >   I tried to index with UTF-8 encoding but the issue is still not fixed.
> > Please see my inputs below.
> >
> > *Indexed XML:*
> > 
> > 
> >  
> >0.100
> >µ
> >  
> > 
> >
> > *Search Query - * BODY:µ
> >
> > numfound : 0 results obtained.
> >
> > *What can be the reason for this? How do I need to form the search query so
> > that the above document is found?*
> >
> >
> > Thanks & Regards
> >
> > Regards
> > Rajani
> >
> >
> >
> > 2012/4/2 Rajani Maski 
> >
> >> Thank you for the reply.
> >>
> >>
> >>
> >> On Sat, Mar 31, 2012 at 3:38 AM, Chris Hostetter <
> hossman_luc...@fucit.org
> >> > wrote:
> >>
> >>>
> >>> : We have data having such symbols like :  µ
> >>> : Indexed data has  -Dose:"0 µL"
> >>> : Now , when  it is searched as  - Dose:"0 µL"
> >>>...
> >>> : Query Q value observed  : S257:"0 µL/injection"
> >>>
> >>> First off: your "when searched as" example does not match up to your
> >>> "Query Q" observed value (ie: field queries, extra "/injection" text at
> >>> the end) suggesting that you maybe cut/paste something you didn't mean
> to
> >>> -- so take the rest of this advice with a grain of salt.
> >>>
> >>> If i ignore your "when it is searched as" example and focus entirely on
> >>> what you say you've indexed the data as, and the Q value you are using
> (in
> >>> what looks like the echoParams output) then the first thing that jumps
> out
> >>> at me is that it looks like your servlet container (or perhaps your web
> >>> browser if that's where you tested this) is not dealing with the
> unicode
> >>> correctly -- because although i see a "µ" in the first three lines i
> >>> quoted above (UTF8: 0xC2 0xB5), in your value observed i'm seeing it
> >>> preceded by a "Â" (UTF8: 0xC3 0x82) ... suggesting that perhaps the
> "µ"
> >>> did not get URL encoded properly when the request was made to your
> servlet
> >>> container?
> >>>
> >>> In particular, you might want to take a look at...
> >>>
> >>>
> >>>
> https://wiki.apache.org/solr/FAQ#Why_don.27t_International_Characters_Work.3F
> >>> http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config
> >>> The example/exampledocs/test_utf8.sh script included with solr
> >>>
> >>>
> >>>
> >>>
> >>> -Hoss
> >>
> >>
> >>
>


Re: Facets involving multiple fields

2012-04-13 Thread Erick Erickson
Nope. Information about your higher level use-case
would probably be a good thing, this is starting to
smell like an "XY" problem.

Best
Erick

On Fri, Apr 13, 2012 at 5:48 AM, Marc SCHNEIDER
 wrote:
> Hi,
>
> Thanks for your answer.
> Yes it works in this case when I know the facet name (Computer). What
> if I want to automatically compute all facets?
> facet.query=keyword:* short_title:* doesn't work, right?
>
> Marc.
>
> On Thu, Apr 12, 2012 at 2:08 PM, Erick Erickson  
> wrote:
>> facet.query=keywords:computer short_title:computer
>> seems like what you're asking for.
>>
>> On Thu, Apr 12, 2012 at 3:19 AM, Marc SCHNEIDER
>>  wrote:
>>> Hi,
>>>
>>> Thanks for your answer.
>>> Let's say I have two fields: 'keywords' and 'short_title'.
>>> For these fields I'd like to make a faceted search : if 'Computer' is
>>> stored in at least one of these fields for a document I'd like to get
>>> it added in my results.
>>> doc1 => keywords : 'Computer' / short_title : 'Computer'
>>> doc2 => keywords : 'Computer'
>>> doc3 => short_title : 'Computer'
>>>
>>> In this case I'd like to have : Computer (3)
>>>
>>> I don't see how to solve this with facet.query.
>>>
>>> Thanks,
>>> Marc.
>>>
>>> On Wed, Apr 11, 2012 at 5:13 PM, Erick Erickson  
>>> wrote:
 Have you considered facet.query? You can specify an arbitrary query
 to facet on which might do what you want. Otherwise, I'm not sure what
 you mean by "faceted search using two fields". How should these fields
 be combined into a single facet? What that means practically is not at
 all obvious from your problem statement.

 Best
 Erick

 On Tue, Apr 10, 2012 at 8:55 AM, Marc SCHNEIDER
  wrote:
> Hi,
>
> I'd like to make a faceted search using two fields. I want to have a
> single result and not a result by field (like when using
> facet.field=f1,facet.field=f2).
> I don't want to use a copy field either because I want it to be
> dynamic at search time.
> As far as I know this is not possible for Solr 3.x...
> But I saw a new parameter named "group.facet" for Solr4. Could that
> solve my problem? If yes could somebody give me an example?
>
> Thanks,
> Marc.


Solr data export to CSV File

2012-04-13 Thread Pavnesh
Hi Team,

 

Many thanks to you guys who have developed such a nice product.

I have one question regarding Solr: I have approximately 36 million documents in my Solr
index and I want to export all the data to a CSV file, but I have found nothing on
how to do this, so please help me on this topic.

 

 

Regards

Pavnesh

 



Re: How to read SOLR cache statistics?

2012-04-13 Thread Erick Erickson
Well, the place to start is here:
*stats*:  lookups : 98
*hits *: 59
*hitratio *: 0.60
*inserts *: 41
*evictions *: 0
*size *: 41

the important bits are hitratio and evictions.
Caches only really start to "show their stuff"
when the hit ratio is quite high. That's
the percentage of requests that are satisfied
by entries already in the cache. You want
this number to be as high as possible, +0.90.

evictions are the number of entries that have been
removed from the cache. The pre-configured
number is usually 512, so when the 513th entry
is inserted in the cache, some are removed
to make room and tallied in the evictions
section.

Do note that some of the caches (documentCache
in particular) will rarely have a huge hit ratio due
to its nature, ditto with queryResultCache so you
can temporarily ignore those.
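
For reference, these statistics map onto the cache definitions in the
<query> section of solrconfig.xml; a minimal sketch with purely illustrative
values:

  <!-- size bounds the cache (evictions start beyond it); autowarmCount is
       how many entries get copied into a new searcher's cache, which is
       what warmupTime measures -->
  <queryResultCache class="solr.LRUCache"
                    size="512" initialSize="512" autowarmCount="128"/>
  <filterCache class="solr.FastLRUCache"
               size="512" initialSize="512" autowarmCount="128"/>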

Best
Erick

On Fri, Apr 13, 2012 at 6:28 AM, Kashif Khan  wrote:
> Hi Li Li,
>
> I have been through that WIKI before but it does not explain what
> *evictions*, *inserts*, *cumulative_inserts*, *cumulative_evictions*,
> *hitratio* and the rest mean. These terms are foreign to me. What does the following
> line mean?
>
> *item_ABC :
> {field=ABC,memSize=340592,tindexSize=1192,time=1360,phase1=1344,nTerms=7373,bigTerms=1,termInstances=11513,uses=4}
> *
>
> I want that kind of explanation. I have read the wiki and the comments in
> the solrconfig.xml file about all these things, but it does not say how to read the
> stats, which is very *important!!!*.
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-read-SOLR-cache-statistics-tp3907294p3907633.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: performance impact using string or float when querying ranges

2012-04-13 Thread Erick Erickson
Well, I guess my first question is whether using strings
is "fast enough", in which case there's little reason to
make your life more complex.

But yes, range queries will be significantly faster with
any of the Trie types than with strings. Trie types are
all numeric types.


Best
Erick

On Fri, Apr 13, 2012 at 3:49 AM, crive  wrote:
> Hi All,
> is there a big difference in terms of performances when querying a range
> like [50.0 TO *] on a string field compared to a float field?
>
> At the moment I am using a dynamic field of type string to map some values
> coming from our database and their type can vary depending on the context
> (float/integer/string); it easier to use a dynamic field other than having
> to create a bespoke field for each type of value.
>
> Marco


Re: Issues with language based indexing

2012-04-13 Thread Erick Erickson
Please review:
http://wiki.apache.org/solr/UsingMailingLists

there's so little information to go on here that I
really can't say anything that isn't a guess.

At a minimum we need the raw input, the
fieldType definitions from your schema,
the results of adding &debugQuery=on
to your URL

Best
Erick

On Fri, Apr 13, 2012 at 6:04 AM, JGar  wrote:
> Hello,
>
> I am new to Solr. It is returning some docs in my search for the string "Acciones y
> Valores". When I go and search for the same words in the given doc
> manually, I cannot find those words. Please help me understand on what basis the doc is
> found in the search.
>
> Thanks
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Issues-with-language-based-indexing-tp3907601p3907601.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr data export to CSV File

2012-04-13 Thread Erick Erickson
Does this help?

http://wiki.apache.org/solr/CSVResponseWriter

Best
Erick

On Fri, Apr 13, 2012 at 7:59 AM, Pavnesh
 wrote:
> Hi Team,
>
>
>
> Many thanks to you guys who have developed such a nice product.
>
> I have one question regarding Solr: I have approximately 36 million documents in my Solr
> index and I want to export all the data to a CSV file, but I have found nothing on
> how to do this, so please help me on this topic.
>
>
>
>
>
> Regards
>
> Pavnesh
>
>
>


RE: Realtime /get versus SearchHandler

2012-04-13 Thread Darren Govoni

Yes

--- Original Message ---
On 4/13/2012 06:25 AM Benson Margulies wrote: A discussion over on the dev
list led me to expect that the by-id
field retrievals in a SolrCloud query would come through the get
handler. In fact, I've seen them turn up in my search component in the
search handler that is configured with my custom QT. (I have a
'prepare' method that sets ShardParams.QT to my QT to get my
processing involved in the first of the two queries.) Did I overthink
this?




RE: Solr data export to CSV File

2012-04-13 Thread Ben McCarthy
A combination of the CSV response writer and SolrJ to page through all of the
results, sending each line to something like Apache Commons FileUtils:

  FileUtils.writeStringToFile(new File("output.csv"),
      outputLine + System.getProperty("line.separator"), true);

Would be quite quick to knock up in Java.
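
A hedged sketch of that approach, assuming SolrJ 3.x and commons-io on the
classpath (URL, field names and page size are illustrative; note that
start-offset paging gets slower the deeper you go, so for 36 million rows the
CSVResponseWriter may be the more practical route):

  import java.io.File;
  import org.apache.commons.io.FileUtils;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.SolrDocumentList;

  public class CsvExport {
      public static void main(String[] args) throws Exception {
          CommonsHttpSolrServer solr =
              new CommonsHttpSolrServer("http://localhost:8983/solr");
          File out = new File("output.csv");
          String sep = System.getProperty("line.separator");
          int rows = 1000;                        // page size
          long fetched = 0, total = Long.MAX_VALUE;
          for (int start = 0; fetched < total; start += rows) {
              SolrQuery q = new SolrQuery("*:*").setStart(start).setRows(rows);
              SolrDocumentList docs = solr.query(q).getResults();
              total = docs.getNumFound();         // known after the first page
              for (SolrDocument doc : docs) {
                  // assumes "id" and "name" fields; adjust to your schema
                  String line = doc.getFieldValue("id") + "," + doc.getFieldValue("name");
                  FileUtils.writeStringToFile(out, line + sep, true);
                  fetched++;
              }
          }
      }
  }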

Thanks
Ben

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 13 April 2012 13:28
To: solr-user@lucene.apache.org
Subject: Re: Solr data export to CSV File

Does this help?

http://wiki.apache.org/solr/CSVResponseWriter

Best
Erick

On Fri, Apr 13, 2012 at 7:59 AM, Pavnesh  
wrote:
> Hi Team,
>
>
>
> Many thanks to you guys who have developed such a nice product.
>
> I have one question regarding Solr: I have approximately 36 million documents in my
> Solr index and I want to export all the data to a CSV file, but I have found
> nothing on how to do this, so please help me on this topic.
>
>
>
>
>
> Regards
>
> Pavnesh
>
>
>




This e-mail is sent on behalf of Trader Media Group Limited, Registered Office: 
Auto Trader House, Cutbush Park Industrial Estate, Danehill, Lower Earley, 
Reading, Berkshire, RG6 4UT(Registered in England No. 4768833). This email and 
any files transmitted with it are confidential and may be legally privileged, 
and intended solely for the use of the individual or entity to whom they are 
addressed. If you have received this email in error please notify the sender. 
This email message has been swept for the presence of computer viruses. 



Re: searching across multiple fields using edismax - am i setting this up right?

2012-04-13 Thread geeky2
thank you for the response.

it seems to be working well ;)

1) i tried your suggestion about removing the qt parameter - 

*somecore/partItemNoSearch*&q=dishwasher&debugQuery=on&rows=10

but this results in a 404 error message - is there some configuration i am
missing to support this short-hand syntax for specifying the requestHandler
in the url ?



2) ok - good suggestion.



3) yes it looks like it IS searching across all three (3) fields.

i noticed that for the itemNo field, it reduced the search string from
dishwasher to dishwash - is this because of stemming on the field type used
for the itemNo field?

<str name="rawquerystring">dishwasher</str>
<str name="querystring">dishwasher</str>
<str name="parsedquery">+DisjunctionMaxQuery((brand:dishwasher^0.5 |
*itemNo:dishwash* | productType:dishwasher^0.8))</str>
<str name="parsedquery_toString">+(brand:dishwasher^0.5 | itemNo:dishwash |
productType:dishwasher^0.8)</str>





--
View this message in context: 
http://lucene.472066.n3.nabble.com/searching-across-multiple-fields-using-edismax-am-i-setting-this-up-right-tp3906334p3907875.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: searching across multiple fields using edismax - am i setting this up right?

2012-04-13 Thread Erick Erickson
as to 1) you have to define your request handler with
a leading /, as in name= "/partItemNoSearch". Don't
forget to restart your server.
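
For the archive, a minimal sketch of such a handler in solrconfig.xml,
assuming Solr 3.x (the qf weights are taken from the debug output earlier in
the thread; the rest is illustrative):

  <!-- a name starting with "/" lets you request /somecore/partItemNoSearch
       directly instead of passing qt=partItemNoSearch to /select -->
  <requestHandler name="/partItemNoSearch" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="qf">itemNo brand^0.5 productType^0.8</str>
    </lst>
  </requestHandler>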

3) Of course. The input terms MUST be run through
the associated analysis chain to have any hope of
matching correctly.

Best
Erick

On Fri, Apr 13, 2012 at 8:36 AM, geeky2  wrote:
> thank you for the response.
>
> it seems to be working well ;)
>
> 1) i tried your suggestion about removing the qt parameter -
>
> *somecore/partItemNoSearch*&q=dishwasher&debugQuery=on&rows=10
>
> but this results in a 404 error message - is there some configuration i am
> missing to support this short-hand syntax for specifying the requestHandler
> in the url ?
>
>
>
> 2) ok - good suggestion.
>
>
>
> 3) yes it looks like it IS searching across all three (3) fields.
>
> i noticed that for the itemNo field, it reduced the search string from
> dishwasher to dishwash - is this because of stemming on the field type used
> for the itemNo field?
>
> <str name="rawquerystring">dishwasher</str>
> <str name="querystring">dishwasher</str>
> <str name="parsedquery">+DisjunctionMaxQuery((brand:dishwasher^0.5 |
> *itemNo:dishwash* | productType:dishwasher^0.8))</str>
> <str name="parsedquery_toString">+(brand:dishwasher^0.5 | itemNo:dishwash |
> productType:dishwasher^0.8)</str>
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/searching-across-multiple-fields-using-edismax-am-i-setting-this-up-right-tp3906334p3907875.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Errors during indexing

2012-04-13 Thread Ben McCarthy
Hello

We have just switched to Solr4 as we needed the ability to return geodist() 
along with our results.

I use a simple multithreaded java app and solr to ingest the data.  We keep 
seeing the following:

13-Apr-2012 15:50:10 org.apache.solr.common.SolrException log
SEVERE: null:org.apache.solr.common.SolrException: Error handling 'status' 
action
at 
org.apache.solr.handler.admin.CoreAdminHandler.handleStatusAction(CoreAdminHandler.java:546)
at 
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:156)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at 
org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:359)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:175)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.FileNotFoundException: /usr/solr4/data/index/_2jb.fnm (No 
such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
at 
org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:219)
at 
org.apache.lucene.codecs.lucene40.Lucene40FieldInfosReader.read(Lucene40FieldInfosReader.java:47)
at 
org.apache.lucene.index.SegmentInfo.loadFieldInfos(SegmentInfo.java:201)
at 
org.apache.lucene.index.SegmentInfo.getFieldInfos(SegmentInfo.java:227)
at org.apache.lucene.index.SegmentInfo.files(SegmentInfo.java:415)
at org.apache.lucene.index.SegmentInfos.files(SegmentInfos.java:756)
at 
org.apache.lucene.index.StandardDirectoryReader$ReaderCommit.<init>(StandardDirectoryReader.java:369)
at 
org.apache.lucene.index.StandardDirectoryReader.getIndexCommit(StandardDirectoryReader.java:354)
at 
org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(LukeRequestHandler.java:558)
at 
org.apache.solr.handler.admin.CoreAdminHandler.getCoreStatus(CoreAdminHandler.java:816)
at 
org.apache.solr.handler.admin.CoreAdminHandler.handleStatusAction(CoreAdminHandler.java:537)
... 16 more


This seems to happen when we're using the new admin tool. I'm checking on the
autocommit handler.

Has anyone seen anything similar?

Thanks
Ben




This e-mail is sent on behalf of Trader Media Group Limited, Registered Office: 
Auto Trader House, Cutbush Park Industrial Estate, Danehill, Lower Earley, 
Reading, Berkshire, RG6 4UT(Registered in England No. 4768833). This email and 
any files transmitted with it are confidential and may be legally privileged, 
and intended solely for the use of the individual or entity to whom they are 
addressed. If you have received this email in error please notify the sender. 
This email message has been swept for the presence of computer viruses. 



RE: solr 3.5 taking long to index

2012-04-13 Thread Rohit
Hi Shawn,

Thanks for the information, let me give this a try, since this is a live box I 
will try it during the weekend and update you.

Regards,
Rohit
Mobile: +91-9901768202
About Me: http://about.me/rohitg


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: 13 April 2012 11:01
To: solr-user@lucene.apache.org
Subject: Re: solr 3.5 taking long to index

On 4/12/2012 8:42 PM, Rohit wrote:
> The machine has a total ram of around 46GB. My Biggest concern is Solr index 
> time gradually increasing and then the commit stops because of timeouts, out 
> commit rate is very high, but I am not able to find the root cause of the 
> issue.

For good performance, Solr relies on the OS having enough free RAM to keep 
critical portions of the index in the disk cache.  Some numbers that I have 
collected from your information so far are listed below.  
Please let me know if I've got any of this wrong:

46GB total RAM
36GB RAM allocated to Solr
300GB total index size

This leaves only 10GB of RAM free to cache 300GB of index, assuming that this 
server is dedicated to Solr.  The critical portions of your index are very 
likely considerably larger than 10GB, which causes constant reading from the 
disk for queries and updates.  With a high commit rate and a relatively low 
mergeFactor of 10, your index will be doing a lot of merging during updates, 
and some of those merges are likely to be quite large, further complicating the 
I/O situation.

Another thing that can lead to increasing index update times is cache warming, 
also greatly affected by high I/O levels.  If you visit the 
/solr/corename/admin/stats.jsp#cache URL, you can see the warmupTime for each 
cache in milliseconds.

Adding more memory to the server would probably help things.  You'll want to 
carefully check all the server and Solr statistics you can to make sure that 
memory is the root of problem, before you actually spend the money.  At the 
server level, look for things like a high iowait CPU percentage.  For Solr, you 
can turn the logging level up to INFO in the admin interface as well as turn on 
the infostream in solrconfig.xml for extensive debugging.

I hope this is helpful.  If not, I can try to come up with more specific things 
you can look at.

Thanks,
Shawn




Solr is not extracting the CDATA part of xml

2012-04-13 Thread srini
I am trying to use a method suggested on the Solr forum to remove the CDATA
part of the xml, but it is not working. The result shows the whole xml content instead of
just the CDATA part.

schema.xml

  


  
  


mappings.txt
"" => ""

my xml content


 





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908317.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr is not extracting the CDATA part of xml

2012-04-13 Thread srini
Not sure why the CDATA part did not get interpreted. This is how the xml content
looks. I added quotes just to present the exact xml content.
""

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908341.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: performance impact using string or float when querying ranges

2012-04-13 Thread Yonik Seeley
On Fri, Apr 13, 2012 at 8:11 AM, Erick Erickson  wrote:
> Well, I guess my first question is whether using strings
> is "fast enough", in which case there's little reason to
> make your life more complex.
>
> But yes, range queries will be significantly faster with
> any of the Trie types than with strings.

To elaborate on this point a bit... range queries on strings will be
the same speed as a numeric field with precisionStep=0.
You need a precisionStep > 0 (so the number will be indexed in
multiple parts) to speed up range queries on numeric fields.  (See
"int" vs "tint" in the solr schema).

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10




 Trie types are
> all numeric types.
>
>
> Best
> Erick
>
> On Fri, Apr 13, 2012 at 3:49 AM, crive  wrote:
>> Hi All,
>> is there a big difference in terms of performances when querying a range
>> like [50.0 TO *] on a string field compared to a float field?
>>
>> At the moment I am using a dynamic field of type string to map some values
>> coming from our database and their type can vary depending on the context
>> (float/integer/string); it easier to use a dynamic field other than having
>> to create a bespoke field for each type of value.
>>
>> Marco


mergePolicy element format change in 3.6 vs 3.5?

2012-04-13 Thread Peter Wolanin
Trying to maintain the Drupal integration module across multiple versions
of 3.x, we've gotten a bug report suggesting that Solr 3.6 needs this
change to solrconfig:

-  <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
+  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy"/>


I don't see this mentioned in the release notes - is the second format
useable with 3.5, 3.4, etc?

-- 
Peter M. Wolanin, Ph.D.  : Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com : 781-313-8322

"Get a free, hosted Drupal 7 site: http://www.drupalgardens.com";


RE: mergePolicy element format change in 3.6 vs 3.5?

2012-04-13 Thread Michael Ryan
It looks like the first format was removed in 3.6 as part of 
https://issues.apache.org/jira/browse/SOLR-1052. The second format works in all 
3.x versions.

-Michael

-Original Message-
From: Peter Wolanin [mailto:peter.wola...@acquia.com] 
Sent: Friday, April 13, 2012 12:32 PM
To: solr-user@lucene.apache.org
Subject: mergePolicy element format change in 3.6 vs 3.5?

Trying to maintain the Drupal integration module across multiple versions
of 3.x, we've gotten a bug report suggesting that Solr 3.6 needs this
change to solrconfig:

-  <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
+  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy"/>


I don't see this mentioned in the release notes - is the second format
useable with 3.5, 3.4, etc?


Re: mergePolicy element format change in 3.6 vs 3.5?

2012-04-13 Thread Peter Wolanin
Ok, thanks for the info.  As long as the second one works, we can just use
that.

I just verified that it works for 3.5 at least.

-Peter

On Fri, Apr 13, 2012 at 1:12 PM, Michael Ryan  wrote:

> It looks like the first format was removed in 3.6 as part of
> https://issues.apache.org/jira/browse/SOLR-1052. The second format works
> in all 3.x versions.
>
> -Michael
>
> -Original Message-
> From: Peter Wolanin [mailto:peter.wola...@acquia.com]
> Sent: Friday, April 13, 2012 12:32 PM
> To: solr-user@lucene.apache.org
> Subject: mergePolicy element format change in 3.6 vs 3.5?
>
> Trying to maintain the Drupal integration module across multiple versions
> of 3.x, we've gotten a bug report suggesting that Solr 3.6 needs this
> change to solrconfig:
>
> -  <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
> +  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy"/>
>
>
> I don't see this mentioned in the release notes - is the second format
> useable with 3.5, 3.4, etc?
>



-- 
Peter M. Wolanin, Ph.D.  : Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com : 781-313-8322

"Get a free, hosted Drupal 7 site: http://www.drupalgardens.com";


Re: Solr is not extracting the CDATA part of xml

2012-04-13 Thread Erick Erickson
Solr does not index arbitrary XML content. There is an XML
form of a solr document that can be sent to Solr, but it is
a specific form of XML.

An example of the XML you're trying to index and what you mean
by "not working" would be helpful.

Best
Erick

On Fri, Apr 13, 2012 at 11:50 AM, srini  wrote:
> not sure why CDATA part did not get interpreted. this is how xml content
> looks like. I added quotes just to present the exact content xml content.
>
> ""
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908341.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr is not extracting the CDATA part of xml

2012-04-13 Thread srini
Erick,

Thanks for your reply. When you say Solr does not index arbitrary xml
documents: below is the way my xml document looks, which is sitting
in oracle. Could you suggest the best way of indexing it? Which method should I
follow? Should I use XPathEntityProcessor?

 
http://www.w3.org/2001/XMLSchema-instance";
xmlns="someurl" xmlns:csp="someurl.xsd" xsi:schemaLocation="somelocation
jar: id="002" message-type="create">

 
  100  
  115
  
 
 

Thanks in Advance
Erick Erickson wrote
> 
> Solr does not index arbitrary XML content. There is an XML
> form of a solr document that can be sent to Solr, but it is
> a specific form of XML.
> 
> An example of the XML you're trying to index and what you mean
> by "not working" would be helpful.
> 
> Best
> Erick
> 
> On Fri, Apr 13, 2012 at 11:50 AM, srini  wrote:
>> not sure why CDATA part did not get interpreted. this is how xml content
>> looks like. I added quotes just to present the exact content xml content.
>>
>> ""
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908341.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908791.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr is not extracting the CDATA part of xml

2012-04-13 Thread Erick Erickson
Right, that will not work at all for direct transmission to
Solr.

You could write a Java program that parses this and sends
it to Solr via SolrJ.

Personally I haven't connected a database to Solr with
XPathEntityProcessor in the mix, but I believe I've seen
messages go by with this configuration. You might want
to search the mail archive...
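
For the archive, a hedged sketch of that DIH configuration, assuming Solr 3.x
and the XML sitting in a CLOB/VARCHAR column (all names, the query and the
driver URL are illustrative):

  <dataConfig>
    <dataSource name="db" driver="oracle.jdbc.OracleDriver"
                url="jdbc:oracle:thin:@//dbhost:1521/SID" user="u" password="p"/>
    <!-- FieldReaderDataSource lets a child entity parse a column of the parent row -->
    <dataSource name="fld" type="FieldReaderDataSource"/>
    <document>
      <entity name="row" dataSource="db" query="SELECT id, xml_doc FROM docs">
        <entity name="xml" dataSource="fld" dataField="row.xml_doc"
                processor="XPathEntityProcessor" forEach="/message">
          <field column="price" xpath="/message/price"/>
        </entity>
      </entity>
    </document>
  </dataConfig>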

Best
Erick

On Fri, Apr 13, 2012 at 3:13 PM, srini  wrote:
> Erick,
>
> Thanks for your reply. When you say Solr does not index arbitrary xml
> documents: below is the way my xml document looks, which is sitting
> in oracle. Could you suggest the best way of indexing it? Which method should I
> follow? Should I use XPathEntityProcessor?
>
> 
> http://www.w3.org/2001/XMLSchema-instance";
> xmlns="someurl" xmlns:csp="someurl.xsd" xsi:schemaLocation="somelocation
> jar: id="002" message-type="create">
> 
>     
>      100
>      115
>      
>
>  
>
> Thanks in Advance
> Erick Erickson wrote
>>
>> Solr does not index arbitrary XML content. There is an XML
>> form of a solr document that can be sent to Solr, but it is
>> a specific form of XML.
>>
>> An example of the XML you're trying to index and what you mean
>> by "not working" would be helpful.
>>
>> Best
>> Erick
>>
>> On Fri, Apr 13, 2012 at 11:50 AM, srini  wrote:
>>> not sure why CDATA part did not get interpreted. this is how xml content
>>> looks like. I added quotes just to present the exact content xml content.
>>>
>>> ""
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908341.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908791.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr is not extracting the CDATA part of xml

2012-04-13 Thread Alexander Aristov
Hi

This is not solr format. You must re-format your XML into solr XML. you may
find examples on solr wiki or in solr examples dir.

Best Regards
Alexander Aristov


On 13 April 2012 23:13, srini  wrote:

> Erick,
>
> Thanks for your reply. When you say Solr does not index arbitrary XML
> documents: below is what my XML document, which is sitting in Oracle,
> looks like. Could you suggest the best way of indexing it? Which method should
> I
> follow? Should I use XPathEntityProcessor?
>
> 
> http://www.w3.org/2001/XMLSchema-instance";
> xmlns="someurl" xmlns:csp="someurl.xsd" xsi:schemaLocation="somelocation
> jar: id="002" message-type="create">
> 
> 
>  100
>  115
>  
>
>  
>
> Thanks in Advance
> Erick Erickson wrote
> >
> > Solr does not index arbitrary XML content. There is an XML
> > form of a solr document that can be sent to Solr, but it is
> > a specific form of XML.
> >
> > An example of the XML you're trying to index and what you mean
> > by "not working" would be helpful.
> >
> > Best
> > Erick
> >
> > On Fri, Apr 13, 2012 at 11:50 AM, srini  wrote:
> >> not sure why CDATA part did not get interpreted. this is how xml content
> >> looks like. I added quotes just to present the exact content xml
> content.
> >>
> >> ""
> >>
> >> --
> >> View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908341.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908791.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr is not extracting the CDATA part of xml

2012-04-13 Thread srini
Thanks again for the quick reply. I'm a little curious about the procedure you
suggested. I had thought of the same approach: writing a Java program to fetch
the XML records from the DB, parse the content, and hand them to Solr for
indexing.

But what if my database content changes? Should I re-run my Java program to
fetch the XML and add it to Solr for re-indexing?

The format of my XML does not match the Solr example XML formats. Any
suggestions here?

When I import XML records from Oracle, add them to Solr, and search for a
word, Solr displays the whole XML doc that contains that word. What is wrong
with this procedure? (I do see my search word in the content of the XML; the
only bad part is that it displays the whole doc instead of the CDATA part.)
Please suggest if there is a better way of doing this task other than SolrJ.

Thanks in Advance
Srini





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908825.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Boosting StandardQuery scores with a "subquery"?

2012-04-13 Thread Chris Hostetter

: I'm having some trouble wrapping my head around boosting StandardQueries.
: It looks like the function: query(subquery, default)
:  is what I want, but the
: examples seem to focus on just returning a score (e.g. product of popularity
: and the score of the subquery). I assume my difficulty stems from the fact
: that I'd like to retrieve highlighting from one query, but impact score and
: 'relevance' by a different (sub)query.

if your primary concern is just having highlighting on some words, while 
lots of other words contribute to the score, then you should take a look at 
the hl.q param introduced in Solr 3.5...

http://wiki.apache.org/solr/HighlightingParameters#hl.q

That lets you completely separate the two if you'd like.

You can even use local param syntax to reduce duplication...

  q={!v=$qq}
  qq=content:(roi "return on investment" "return investment"~5)
  hl.q={!v=$qq}
  fq=extension:(pdf doc)
  boost=keywords:(financial investment profit loss) 
title:(financial investment profit loss) 
url:(investment investor relations phoenix)

...should work, I think.
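
The same parameter set expressed through SolrJ, for reference (an untested
sketch; only the content/extension fields and the qq value carry over from
the example above, the URL is a placeholder):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class HlqExample {
    public static void main(String[] args) throws Exception {
        SolrQuery q = new SolrQuery();
        q.set("q", "{!v=$qq}");        // main query via local param indirection
        q.set("qq", "content:(roi \"return on investment\" \"return investment\"~5)");
        q.set("hl", "true");
        q.set("hl.q", "{!v=$qq}");     // highlight on the same words
        q.addFilterQuery("extension:(pdf doc)");
        System.out.println(new CommonsHttpSolrServer("http://localhost:8983/solr")
                .query(q).getHighlighting());
    }
}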

-Hoss


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-13 Thread Jan Høydahl
Hi,

For a web crawl+search like this you will probably need a lot of additional Big 
Data crunching, so a Hadoop based solution is wise.

In addition to those products mentioned we also now have Amazon's own 
CloudSearch: http://aws.amazon.com/cloudsearch/ It's new and not as cool as Solr 
(not even Lucene based), but it gives you the elasticity you request, I guess. If 
you run your Hadoop cluster in EC2 already it would be quite efficient to 
batch-load the crawled and processed data into a "SearchDomain" in the same 
availability zone. However, both cost and features may prohibit this as a 
realistic choice for you.

It would be cool to explore a Hadoop/HDFS + SolrCloud integration. SolrCloud 
would not build the indexes, but would pull pre-built indexes from HDFS down to 
local disk every time it's told to. Or perhaps the SolrCloud nodes could be 
part of the Hadoop cluster, taking responsibility for the Reduce phase that 
builds the indexes?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 13. apr. 2012, at 04:23, Otis Gospodnetic wrote:

> Hello Ali,
> 
>> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
> 
>> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
>> crawled + indexed every *4 weeks, *with a search latency of less than 0.5
>> seconds.
> 
> 
> That's fine.  Whether it's doable with any tech will depend on how much 
> hardware you give it, among other things.
> 
>> Needless to mention, the search index needs to scale to 5Billion pages. It
>> is also possible that I might need to store multiple indexes -- one for
>> crawled content, and one for ancillary data that is also very large. Each
>> of these indices would likely require a logically distributed and
>> replicated index.
> 
> 
> Yup, OK.
> 
>> However, I would like for such a system to be homogenous with the Hadoop
>> infrastructure that is already installed on the cluster (for the crawl). In
>> other words, I would much prefer if the replication and distribution of the
>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
>> using another scalability framework (such as SolrCloud). In addition, it
>> would be ideal if this environment was flexible enough to be dynamically
>> scaled based on the size requirements of the index and the search traffic
>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
>> enough to automatically provision additional processing power into the
>> cluster without requiring server re-starts).
> 
> 
> There is no such thing just yet.
> There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
> automatically index HBase content, but that was either not completed or not 
> committed into HBase.
> 
>> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
>> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
>> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
>> mature enough and would be the right architectural choice to go along with
>> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
>> above.
> 
> 
> Here is a summary on all of them:
> * Search on HBase - I assume you are referring to the same thing I mentioned 
> above.  Not ready.
> * Solandra - uses Cassandra+Solr, plus DataStax now has a different 
> (commercial) offering that combines search and Cassandra.  Looks good.
> * Lily - data stored in HBase cluster gets indexed to a separate Solr 
> instance(s)  on the side.  Not really integrated the way you want it to be.
> * ElasticSearch - solid at this point, the most dynamic solution today, can 
> scale well (we are working on a many-B documents index and hundreds of 
> nodes with ElasticSearch right now), etc.  But again, not integrated with 
> Hadoop the way you want it.
> * IndexTank - has some technical weaknesses, not integrated with Hadoop, not 
> sure about its future considering LinkedIn uses Zoie and Sensei already.
> * And there is SolrCloud, which is coming soon and will be solid, but is 
> again not integrated.
> 
> If I were you and I had to pick today - I'd pick ElasticSearch if I were 
> completely open.  If I had Solr bias I'd give SolrCloud a try first.
> 
>> Lastly, how much hardware (assuming a medium sized EC2 instance) would you
>> estimate my needing with this setup, for regular web-data (HTML text) at
>> this scale?
> 
> I don't know off the top of my head, but I'm guessing several hundred for 
> serving search requests.
> 
> HTH,
> 
> Otis
> --
> Search Analytics - http://sematext.com/search-analytics/index.html
> 
> Scalable Performance Monitoring - http://sematext.com/spm/index.html
> 
> 
>> Any architectural guidance would be greatly appreciated. The more details
>> provided, the wider my grin :).
>> 
>> Many many thanks in advance.
>> 
>> Thanks,
>> Safdar
>> 



Re: Post Sorting hook before the doc slicing.

2012-04-13 Thread Chris Hostetter

: Basically, I need to find item X in the result set and return say N items
: before and N items after.
: 
: < - N items -- Item X --- N items >
...
: So I might be wrong, but it looks like the only way would be to create a
: custom SolrIndexSearcher which will find the offset and create the related
: docslice. That slicing part doesn't seem to be well factored that I can
: see, so it seems to imply copy/pasting a significant chunk off the code. Am
: I looking at the wrong place ?

trying to do this as a hook into the SolrIndexSearcher would definitely be 
complicated ... largely because of how matches are collected.

the most straight forward way i can think of to get the data you want is 
to consider what you are sorting on, and use that as a range filter, ie...

1) do your search, and filter on id:X
2) look at the values X has in the fields you are sorting on
3) search again, this time filter on those fields, asking for the first N 
docs with values greater than whatever id:X has
4) search again, this time reverse your sort, and reverse your filters 
(docs with values less than whatever id:X has) and get the first N docs.


...even if your sort is "score" you can use the frange parser to filter 
(not usually recommended for score, but possible)
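
Strung together with SolrJ, the steps might look roughly like this (an
untested sketch; "price" as the sort field and the anchor id "X" are
stand-ins for whatever you actually sort and key on):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class WindowAroundDoc {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        int n = 5; // window size on each side

        // 1) fetch doc X and read the value it sorts on
        SolrQuery anchor = new SolrQuery("*:*");
        anchor.addFilterQuery("id:X");
        SolrDocument docX = server.query(anchor).getResults().get(0);
        Object px = docX.getFieldValue("price");

        // 2) first N docs with a strictly greater sort value
        SolrQuery after = new SolrQuery("*:*");
        after.addFilterQuery("price:{" + px + " TO *}");
        after.setSortField("price", SolrQuery.ORDER.asc);
        after.setRows(n);

        // 3) reverse the sort and the range filter for the N docs before X
        SolrQuery before = new SolrQuery("*:*");
        before.addFilterQuery("price:{* TO " + px + "}");
        before.setSortField("price", SolrQuery.ORDER.desc);
        before.setRows(n);

        System.out.println(server.query(before).getResults());
        System.out.println(server.query(after).getResults());
    }
}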



-Hoss


Re: Can I discover what part of a score is attributable to a subquery?

2012-04-13 Thread John Chee
On Fri, Apr 13, 2012 at 2:40 PM, Benson Margulies  wrote:
> Given a query including a subquery, is there any way for me to learn
> that subquery's contribution to the overall document score?
>
> I can provide 'why on earth would anyone ...' if someone wants to know.

Have you tried debugQuery=true?
http://wiki.apache.org/solr/CommonQueryParameters#debugQuery The
'explain' field of the result explains the scoring of each document.
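
A small SolrJ sketch of pulling those explanations out programmatically
(untested; the URL and query string are placeholders):

import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ExplainDump {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("apache solr");
        q.set("debugQuery", "true");
        QueryResponse rsp = server.query(q);
        // explain map is keyed by each document's uniqueKey value
        for (Map.Entry<String, String> e : rsp.getExplainMap().entrySet()) {
            System.out.println(e.getKey() + "\n" + e.getValue());
        }
    }
}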


Re: two structures in solr

2012-04-13 Thread Chris Hostetter

: I need to store *two big structures* in SOLR: projects and contractors.
: Contractors will search for available projects and project owners will
: search for contractors who would do it for them.

http://wiki.apache.org/solr/MultipleIndexes

: that *I want to have two structures*. I guess running two parallel solr
: instances is not the idea. I took a look at

there's nothing wrong with it, the real question is whether you ever need 
to do things with both sets of documents at once.

if contractors only ever search for projects, and project owners only ever 
search for contractors, and no one ever searches for a mix of projects and 
contractors at the same time, then i would just suggest using multiple 
SolrCores...

http://wiki.apache.org/solr/MultipleIndexes#MultiCore
http://wiki.apache.org/solr/CoreAdmin
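
In SolrJ terms the split is then trivial; a minimal untested sketch, with
the core names and queries made up:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class TwoCores {
    public static void main(String[] args) throws Exception {
        // one client per core, one core per document type
        SolrServer projects = new CommonsHttpSolrServer("http://localhost:8983/solr/projects");
        SolrServer contractors = new CommonsHttpSolrServer("http://localhost:8983/solr/contractors");
        System.out.println(projects.query(new SolrQuery("web design")).getResults().getNumFound());
        System.out.println(contractors.query(new SolrQuery("java")).getResults().getNumFound());
    }
}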


-Hoss


Re: term frequency outweighs exact phrase match

2012-04-13 Thread alxsss
Hello Hoss,

Here are the explain outputs for the two docs


0.021646015 = (MATCH) sum of:
  0.021646015 = (MATCH) sum of:
0.02141003 = (MATCH) max plus 0.01 times others of:
  2.84194E-4 = (MATCH) weight(content:apache^0.5 in 3578), product of:
0.0029881175 = queryWeight(content:apache^0.5), product of:
  0.5 = boost
  4.3554416 = idf(docFreq=126092, maxDocs=3613605)
  0.0013721307 = queryNorm
0.09510804 = (MATCH) fieldWeight(content:apache in 3578), product of:
  2.236068 = tf(termFreq(content:apache)=5)
  4.3554416 = idf(docFreq=126092, maxDocs=3613605)
  0.009765625 = fieldNorm(field=content, doc=3578)
  0.021407187 = (MATCH) weight(title:apache^1.2 in 3578), product of:
0.01371095 = queryWeight(title:apache^1.2), product of:
  1.2 = boost
  8.327043 = idf(docFreq=2375, maxDocs=3613605)
  0.0013721307 = queryNorm
1.5613205 = (MATCH) fieldWeight(title:apache in 3578), product of:
  1.0 = tf(termFreq(title:apache)=1)
  8.327043 = idf(docFreq=2375, maxDocs=3613605)
  0.1875 = fieldNorm(field=title, doc=3578)
2.359865E-4 = (MATCH) max plus 0.01 times others of:
  2.359865E-4 = (MATCH) weight(content:solr^0.5 in 3578), product of:
0.004071705 = queryWeight(content:solr^0.5), product of:
  0.5 = boost
  5.9348645 = idf(docFreq=25986, maxDocs=3613605)
  0.0013721307 = queryNorm
0.05795766 = (MATCH) fieldWeight(content:solr in 3578), product of:
  1.0 = tf(termFreq(content:solr)=1)
  5.9348645 = idf(docFreq=25986, maxDocs=3613605)
  0.009765625 = fieldNorm(field=content, doc=3578)

0.021465056 = (MATCH) sum of:
  1.8154096E-4 = (MATCH) sum of:
6.354771E-5 = (MATCH) max plus 0.01 times others of:
  6.354771E-5 = (MATCH) weight(content:apache^0.5 in 638040), product of:
0.0029881175 = queryWeight(content:apache^0.5), product of:
  0.5 = boost
  4.3554416 = idf(docFreq=126092, maxDocs=3613605)
  0.0013721307 = queryNorm
0.021266805 = (MATCH) fieldWeight(content:apache in 638040), product of:
  1.0 = tf(termFreq(content:apache)=1)
  4.3554416 = idf(docFreq=126092, maxDocs=3613605)
  0.0048828125 = fieldNorm(field=content, doc=638040)
1.1799325E-4 = (MATCH) max plus 0.01 times others of:
  1.1799325E-4 = (MATCH) weight(content:solr^0.5 in 638040), product of:
0.004071705 = queryWeight(content:solr^0.5), product of:
  0.5 = boost
  5.9348645 = idf(docFreq=25986, maxDocs=3613605)
  0.0013721307 = queryNorm
0.02897883 = (MATCH) fieldWeight(content:solr in 638040), product of:
  1.0 = tf(termFreq(content:solr)=1)
  5.9348645 = idf(docFreq=25986, maxDocs=3613605)
  0.0048828125 = fieldNorm(field=content, doc=638040)
  0.021283515 = (MATCH) weight(content:"apache solr"~1^30.0 in 638040), product 
of:
0.42358932 = queryWeight(content:"apache solr"~1^30.0), product of:
  30.0 = boost
  10.290306 = idf(content: apache=126092 solr=25986)
  0.0013721307 = queryNorm
0.050245635 = fieldWeight(content:"apache solr" in 638040), product of:
  1.0 = tf(phraseFreq=1.0)
  10.290306 = idf(content: apache=126092 solr=25986)
  0.0048828125 = fieldNorm(field=content, doc=638040)



Although the second doc has an exact match, it is placed after the first one, 
which does not have an exact match.

I use the following request handler



edismax
explicit
0.01
host^30  content^0.5 title^1.2 anchor^1.2
content^30
url,id, site ,title
2<-1 5<-2 6<90%
1
true
*:*
content
0
165
title
0
url
regex
true
true
5
true
site
true


 spellcheck




and the query is as follows 

http://localhost:8983/solr/select/?q=apache 
solr&version=2.2&start=0&rows=10&indent=on&qt=search&debugQuery=true

Thanks.
Alex.


-Original Message-
From: Chris Hostetter 
To: solr-user 
Sent: Thu, Apr 12, 2012 7:43 pm
Subject: Re: term frequency outweighs exact phrase match



: I use solr 3.5 with edismax. I have the following issue with phrase 
: search. For example if I have three documents with content like
: 
: 1.apache apache
: 2. solr solr
: 3.apache solr
: 
: then search for apache solr displays documents in the order 1, 2, 3 
: instead of 3, 2, 1 because term frequency in the first and second 
: documents is higher than in the third document. We want results be 
: displayed in the order as 3,2,1 since the third document has exact 
: match.

you need to give us a lot more info, like what other data is in the 
various fields for those documents, exactly what your query URL looks 
like, and what debugQuery=true gives you back in terms of score 
explanations for each document, because if that sample content is the only 
thing you've got indexed (even if it's in multiple fields), then documents 
#1 and #2 shouldn't even match your query using the mm you've specified...

: 2<-1 5<-2 6<90%

...because 

Re: Can I discover what part of a score is attributable to a subquery?

2012-04-13 Thread Benson Margulies
On Fri, Apr 13, 2012 at 6:43 PM, John Chee  wrote:
> On Fri, Apr 13, 2012 at 2:40 PM, Benson Margulies  
> wrote:
>> Given a query including a subquery, is there any way for me to learn
>> that subquery's contribution to the overall document score?

I need this number to be available in a SearchComponent that runs
after QueryComponent.


>>
>> I can provide 'why on earth would anyone ...' if someone wants to know.
>
> Have you tried debugQuery=true?
> http://wiki.apache.org/solr/CommonQueryParameters#debugQuery The
> 'explain' field of the result explains the scoring of each document.


Re: Can I discover what part of a score is attributable to a subquery?

2012-04-13 Thread Chris Hostetter

: Given a query including a subquery, is there any way for me to learn
: that subquery's contribution to the overall document score?

You have to just execute the subquery itself ... doc collection 
and score calculation don't keep track of the subscores.

You could do this using functions in the "fl", but since you mentioned 
wanting this in a SearchComponent, just pass the "subquery" to 
SolrIndexSearcher using a DocSet filter of the current page (ie: make your 
own DocSet based on the current DocList)
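
Roughly, inside the component, something like this untested sketch (the
names are illustrative; "subQuery" is whatever Query you parsed for the
subquery, and error handling is omitted):

import java.io.IOException;
import org.apache.lucene.search.Query;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.DocSet;
import org.apache.solr.search.HashDocSet;
import org.apache.solr.search.SolrIndexSearcher;

class SubScoreHelper {
    static DocList scoreSubquery(ResponseBuilder rb, Query subQuery) throws IOException {
        DocList page = rb.getResults().docList;

        // build a DocSet holding just the internal ids on the current page
        int[] ids = new int[page.size()];
        DocIterator it = page.iterator();
        for (int i = 0; it.hasNext(); i++) ids[i] = it.nextDoc();
        DocSet pageSet = new HashDocSet(ids, 0, ids.length);

        // run the subquery restricted to those docs, asking for scores
        SolrIndexSearcher searcher = rb.req.getSearcher();
        return searcher.getDocList(subQuery, pageSet, null, 0,
                page.size(), SolrIndexSearcher.GET_SCORES);
    }
}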


-Hoss


Re: Solr is not extracting the CDATA part of xml

2012-04-13 Thread Lance Norskog
This all comes from a database? Here is what you want.

The DataImportHandler includes a toolkit for doing full and
incremental loading from databases.

Read this first:
http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/DIHQuickStart

Then these:
http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/DataImportHandlerFaq
http://lucidworks.lucidimagination.com/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler

After you try the procedure in QuickStart and read the other two, if
you still have questions please ask.
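
(On the re-indexing worry: once the DIH is set up, a full rebuild is
typically triggered with /dataimport?command=full-import, and changed rows
can be picked up with /dataimport?command=delta-import, assuming the
handler is registered at the usual /dataimport path.)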

Cheers!

On Fri, Apr 13, 2012 at 12:34 PM, srini  wrote:
> Thanks again for the quick reply. I'm a little curious about the procedure you
> suggested. I had thought of the same approach: writing a Java program to fetch
> the XML records from the DB, parse the content, and hand them to Solr for
> indexing.
>
> But what if my database content changes? Should I re-run my Java program to
> fetch the XML and add it to Solr for re-indexing?
>
> The format of my XML does not match the Solr example XML formats. Any
> suggestions here?
>
> When I import XML records from Oracle, add them to Solr, and search for a
> word, Solr displays the whole XML doc that contains that word. What is wrong
> with this procedure? (I do see my search word in the content of the XML; the
> only bad part is that it displays the whole doc instead of the CDATA part.)
> Please suggest if there is a better way of doing this task other than SolrJ.
>
> Thanks in Advance
> Srini
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908825.html
> Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Lance Norskog
goks...@gmail.com


Re: Can I discover what part of a score is attributable to a subquery?

2012-04-13 Thread Benson Margulies
On Fri, Apr 13, 2012 at 7:07 PM, Chris Hostetter
 wrote:
>
> : Given a query including a subquery, is there any way for me to learn
> : that subquery's contribution to the overall document score?
>
> You have to just execute the subquery itself ... doc collection
> and score calculation don't keep track of the subscores.
>
> You could do this using functions in the "fl", but since you mentioned
> wanting this in a SearchComponent, just pass the "subquery" to
> SolrIndexSearcher using a DocSet filter of the current page (ie: make your
> own DocSet based on the current DocList)

I get it. Some fairly intricate dancing then can ensue with SolrCloud. Thanks.

>
>
> -Hoss


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-13 Thread Ali S Kureishy
Thanks Otis.

I really appreciate the details offered here. This was very helpful
information.

I'm going to go through Solandra and ElasticSearch and see if those make
sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's
two recommendations for SolrCloud so far), so I will give that a shot when
it is available. However, do you know when SolrCloud IS expected to be
available?

Thanks again!

Warm regards,
Safdar



On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hello Ali,
>
> > I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
>
> > using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
> > crawled + indexed every *4 weeks, *with a search latency of less than 0.5
> > seconds.
>
>
> That's fine.  Whether it's doable with any tech will depend on how much
> hardware you give it, among other things.
>
> > Needless to mention, the search index needs to scale to 5Billion pages.
> It
> > is also possible that I might need to store multiple indexes -- one for
> > crawled content, and one for ancillary data that is also very large. Each
> > of these indices would likely require a logically distributed and
> > replicated index.
>
>
> Yup, OK.
>
> > However, I would like for such a system to be homogenous with the Hadoop
> > infrastructure that is already installed on the cluster (for the crawl).
> In
> > other words, I would much prefer if the replication and distribution of
> the
> > Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
> > using another scalability framework (such as SolrCloud). In addition, it
> > would be ideal if this environment was flexible enough to be dynamically
> > scaled based on the size requirements of the index and the search traffic
> > at the time (i.e. if it is deployed on an Amazon cluster, it should be
> easy
> > enough to automatically provision additional processing power into the
> > cluster without requiring server re-starts).
>
>
> There is no such thing just yet.
> There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to
> automatically index HBase content, but that was either not completed or not
> committed into HBase.
>
> > However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
> > be ideal for this scenario. I've heard mention of Solr-on-HBase,
> Solandra,
> > Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these
> is
> > mature enough and would be the right architectural choice to go along
> with
> > a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling
> aspects
> > above.
>
>
> Here is a summary on all of them:
> * Search on HBase - I assume you are referring to the same thing I
> mentioned above.  Not ready.
> * Solandra - uses Cassandra+Solr, plus DataStax now has a different
> (commercial) offering that combines search and Cassandra.  Looks good.
> * Lily - data stored in HBase cluster gets indexed to a separate Solr
> instance(s)  on the side.  Not really integrated the way you want it to be.
> * ElasticSearch - solid at this point, the most dynamic solution today,
> can scale well (we are working on a many-B documents index and hundreds
> of nodes with ElasticSearch right now), etc.  But again, not integrated
> with Hadoop the way you want it.
> * IndexTank - has some technical weaknesses, not integrated with Hadoop,
> not sure about its future considering LinkedIn uses Zoie and Sensei already.
> * And there is SolrCloud, which is coming soon and will be solid, but is
> again not integrated.
>
> If I were you and I had to pick today - I'd pick ElasticSearch if I were
> completely open.  If I had Solr bias I'd give SolrCloud a try first.
>
> > Lastly, how much hardware (assuming a medium sized EC2 instance) would
> you
> > estimate my needing with this setup, for regular web-data (HTML text) at
> > this scale?
>
> I don't know off the top of my head, but I'm guessing several hundred
> for serving search requests.
>
> HTH,
>
> Otis
> --
> Search Analytics - http://sematext.com/search-analytics/index.html
>
> Scalable Performance Monitoring - http://sematext.com/spm/index.html
>
>
> > Any architectural guidance would be greatly appreciated. The more details
> > provided, the wider my grin :).
> >
> > Many many thanks in advance.
> >
> > Thanks,
> > Safdar
> >
>


dynamic analyzer based on condition

2012-04-13 Thread srinir
Hi,

I want to pick different analyzers for the same field for different
languages. I can determine the language from a different field. I would have
different fieldTypes defined in my schema.xml, such as text_en, text_de,
text_fr, etc., where I specify which analyzer and filters to use at
indexing and query time.


but I would like to pick the field type dynamically, e.g.:

if lang == "en", use text_en

else if lang == "de", use text_de

...

Can I achieve this somehow? If this approach cannot be done, then I can just
create one field for every language.
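
For reference, a minimal SolrJ sketch of that one-field-per-language
fallback (untested; the field names desc_en/desc_de and the id are made up,
and each desc_* field would be declared in schema.xml with its matching
text_* type):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class LanguageRouting {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        String lang = "en";                  // taken from the language field
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "42");
        doc.addField("lang", lang);
        // route the text into the language-specific field at index time
        doc.addField("desc_" + lang, "some English text");
        server.add(doc);
        server.commit();
    }
}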

Thanks
Srini

--
View this message in context: 
http://lucene.472066.n3.nabble.com/dynamic-analyzer-based-on-condition-tp3909345p3909345.html
Sent from the Solr - User mailing list archive at Nabble.com.


remoteLink that change it's text

2012-04-13 Thread Marcelo Carvalho Fernandes
Hi!

I have the following gsp code...


   <g:remoteLink
      id="${i}"
      update="[success:'what-to-put-here',failure:'error']"
      on404="alert('not found');">
      Select this product
   </g:remoteLink>


How can I have each remoteLink change its "Select this product" text to
what "addaction" renders?
The problem I'm facing is that I don't know what to put in
'what-to-put-here' in order to achieve that.

Of course, I'm new to gsp tags. Any idea?

Thanks in advance,


Marcelo Carvalho Fernandes
+55 21 8272-7970
+55 21 2205-2786


Re: remoteLink that change it's text

2012-04-13 Thread Marcelo Carvalho Fernandes
Sorry! Wrong list!


Marcelo Carvalho Fernandes
+55 21 8272-7970
+55 21 2205-2786


On Fri, Apr 13, 2012 at 10:54 PM, Marcelo Carvalho Fernandes <
mcf2...@gmail.com> wrote:

> Hi!
>
> I have the following gsp code...
>
> 
>
> <g:remoteLink
>   id="${i}"
>   update="[success:'what-to-put-here',failure:'error']"
>   on404="alert('not found');">
>   Select this product
> </g:remoteLink>
> 
>
> How can I have each remoteLink change its "Select this product" text to
> what "addaction" renders?
> The problem I'm facing is that I don't know what to put in
> 'what-to-put-here' in order to achieve that.
>
> Of course, I'm new to gsp tags. Any idea?
>
> Thanks in advance,
>
> 
> Marcelo Carvalho Fernandes
> +55 21 8272-7970
> +55 21 2205-2786
>


Category the result search

2012-04-13 Thread hadi
hi

I am new to Solr. I crawled about 1000 news sites with Nutch and I use Solr
to browse the results, but I want to categorize the sites into categories
like sports news, politics, science, etc.
I know I have to use Solr faceting, but I do not know how to implement this
in Solr, or at least how to make Solr aware of my category fields.
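
One way to get started, as an untested sketch: add a string field (say
"category", an assumed name) to your Solr schema, have your indexing code
fill it per site, and then facet on it:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class CategoryFacets {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("category");   // e.g. sport, politics, science
        System.out.println(server.query(q).getFacetFields());
    }
}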

thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Category-the-result-search-tp3909710p3909710.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can I discover what part of a score is attributable to a subquery?

2012-04-13 Thread Paul Libbrecht
Benson,
In mid-2009, I had such a question answered with a nifty bitwise score 
manipulation, and a little precision loss. For each result I could pick the 
language of a multilingual match.
If interested, I can dig.
Paul
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


Benson Margulies  a écrit :

Given a query including a subquery, is there any way for me to learn
that subquery's contribution to the overall document score?

I can provide 'why on earth would anyone ...' if someone wants to know.