Re: Problem in faceting

2011-02-04 Thread Grijesh

Change the default operator from "OR" to "AND" by using the q.op request parameter, or by setting it in schema.xml.
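
For example (a sketch, assuming the standard query parser and a default search
field): per request,

  q=water treatment plant&q.op=AND

or globally in schema.xml:

  <solrQueryParser defaultOperator="AND"/>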

-
Thanx:
Grijesh
http://lucidimagination.com
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-in-faceting-tp2422182p248.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Facet Query

2011-02-04 Thread Grijesh

No, the facet.query and fq parameters work with any type of query. When you
search with facet.query=city:mumbai, it will return a facet count like:

<lst name="facet_queries">
  <int name="city:mumbai">3</int>
</lst>

facet.query is for faceting against a particular query.

If you want the results themselves restricted to that query, then you have to use fq=city:mumbai
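
For example, in the same request (a sketch; the field and value are from this
thread):

  q=plants&facet=true&facet.query=city:mumbai   -> adds a count for city:mumbai,
                                                   results unchanged
  q=plants&fq=city:mumbai                       -> restricts the result set itself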

-
Thanx:
Grijesh
http://lucidimagination.com
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Facet-Query-tp2422212p2422267.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Problem in faceting

2011-02-04 Thread Bagesh Sharma

But I want the results exactly as the above query is returning them. There is no
problem with the results it returns.

Problem detail

I have implemented search for my company, in which the user can type any query
into the search box. Now when a user searches for "water treatment plant", the
results come back according to the above query, matching documents that contain
the words "water" or "treatment" or "plant" or the phrase "water treatment
plant". All these search results are correct and fulfill my requirements. Along
with these results I am faceting over cities for display. Currently all cities
are displayed if they belong to a record matching any of the words "water",
"treatment", "plant" or "water treatment plant". But now my requirement is to
keep the result set as it is, but facet over only those cities for which the
complete text "water treatment plant" matches.

Is it possible to do this with a single query to Solr? Please suggest. Thanks a
lot for your response.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-in-faceting-tp2422182p2422353.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr faceting on score

2011-02-04 Thread Bagesh Sharma

Hi friends, is it possible to do faceting over score? I want results from
facets which have a higher score. Please suggest.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-faceting-on-score-tp2422076p2422076.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Problem in faceting

2011-02-04 Thread Grijesh

Try Solr's new LocalParams; maybe that will help with your requirement.

http://wiki.apache.org/solr/LocalParams
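
For example, LocalParams can change the query parser options for just the facet
query (a sketch; the terms are from this thread, and df is assumed to be your
default text field):

  facet.query={!q.op=AND df=text}water treatment plant

That would give a single count of documents matching all three terms, without
changing the main result set.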

-
Thanx:
Grijesh
http://lucidimagination.com
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-in-faceting-tp2422182p2422534.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Problem in faceting

2011-02-04 Thread Pierre GOSSE
Using a facet query like

facet.query=+water +treatment +plant

... should give a count of 0 to documents not having all three terms. This could
do the trick, if I understand how this parameter works.



RE: Problem in faceting

2011-02-04 Thread Grijesh

facet.query=+water +treatment +plant will not return the city facets that are
needed by the poster.
It will only give the count of documents matching the query
facet.query=+water +treatment +plant

-
Thanx:
Grijesh
http://lucidimagination.com
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-in-faceting-tp2422182p2422881.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SOLR 1.4 and Lucene 3.0.3 index problem

2011-02-04 Thread Churchill Nanje Mambe
thanks Dominique
 I am on Windows... how do I do this on a Windows 7 machine? I have
NetBeans with the SVN and Ant plugins
 regards

Mambe Churchill Nanje
237 33011349,
AfroVisioN Founder, President,CEO
http://www.afrovisiongroup.com | http://mambenanje.blogspot.com
skypeID: mambenanje
www.twitter.com/mambenanje



On Fri, Feb 4, 2011 at 8:10 AM, Dominique Bejean
wrote:

> Hi,
>
> I would not try to change the lucene version in Solr 1.4.1 from 2.9.x to
> 3.0.x.
>
> As Koji said, the best solution is to get the branch_3x or the trunk and
> build it. You need svn and ant.
>
> 1. Create a working directory
>
> $ mkdir ~/solr
>
> 2. Get the source
>
> $ cd ~/solr
>
> $ svn co http://svn.apache.org/repos/asf/lucene/dev/trunk
> or
>
> $ svn co http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x
>
> 3. build
>
> $ cd ~/solr/modules
> $ ant compile
> $ cd ~/solr/lucene
> $ ant dist
> $ cd ~/solr/solr
> $ ant dist
>
> Dominique
>
> Le 02/02/11 12:47, Churchill Nanje Mambe a écrit :
>
>  thanks guys
>>  I will try the trunk
>>
>> as for unpacking the war and changing the lucene... I am not an expert and
>> this may get complicated for me; maybe over time
>> when I am comfortable
>>
>> Mambe Churchill Nanje
>> 237 33011349,
>> AfroVisioN Founder, President,CEO
>> http://www.afrovisiongroup.com | http://mambenanje.blogspot.com
>> skypeID: mambenanje
>> www.twitter.com/mambenanje
>>
>>
>>
>> On Wed, Feb 2, 2011 at 8:03 AM, Grijesh  wrote:
>>
>>  You can extract the solr.war using java's jar -xvf solr.war  command
>>>
>>> change the lucene-2.9.jar with your lucene-3.0.3.jar in WEB-INF/lib
>>> directory
>>>
>>> then use jar -cvf solr.war * to pack the war again
>>>
>>> deploy that war and hopefully it will work
>>>
>>> -
>>> Thanx:
>>> Grijesh
>>> --
>>> View this message in context:
>>>
>>> http://lucene.472066.n3.nabble.com/SOLR-1-4-and-Lucene-3-0-3-index-problem-tp2396605p2403542.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>


Re: DataImportHandler usage with RDF database

2011-02-04 Thread Otis Gospodnetic
Hi Lewis,
 
> I am very interested in DataImportHandler. I have data stored in an RDF db and
> wish to use this data to boost query results via Solr. I wish to keep this data
> stored in the db as I have a web app which directly maintains this db. Is it
> possible to use a DataImportHandler to read RDF data from the db in memory

I don't think DIH can read from a triple store today.  It can read from an
RDBMS, RSS/Atom feeds, URLs, mail servers, maybe others...
Maybe what you should be looking at is ManifoldCF instead, although I don't
think it can fetch data from triple stores today either.

> without sending an index commit to Solr. As far as I can see DataImportHandler
> currently supports full and delta imports which mean I would be indexing.

I don't follow what you mean by this and how it relates to the first part.

> So far I have yet to find a requestHandler which is able to read then store
> data in memory, then use this data elsewhere prior to returning documents via
> queryResponseWriter.


I think you are talking about a custom SearchComponent that reads some data from
somewhere (e.g. your triple store) and then uses it at search time for
something.  This sounds doable, although you didn't provide details.  For
example, we (Sematext) have implemented custom SearchComponents for e-commerce
customers where frequently-changing information about product availability was
fetched from external stores and applied to search results.
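
A skeleton of such a component might look like this (a sketch against the
SearchComponent API; the class name and the external-fetch logic are
illustrative, and depending on your Solr version a few more SolrInfoMBean
methods may need implementing):

  import java.io.IOException;
  import org.apache.solr.handler.component.ResponseBuilder;
  import org.apache.solr.handler.component.SearchComponent;

  public class ExternalDataComponent extends SearchComponent {
    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
      // e.g. fetch frequently-changing data from the external store
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
      // e.g. apply the fetched data to the search results being built
      rb.rsp.add("external-data", "...");
    }

    @Override
    public String getDescription() { return "fetches external data at search time"; }
    @Override
    public String getSourceId() { return "$Id$"; }
    @Override
    public String getSource() { return "$URL$"; }
    @Override
    public String getVersion() { return "1.0"; }
  }

It would then be registered in solrconfig.xml and added to a request handler's
components list.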

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


RE: Problem in faceting

2011-02-04 Thread Pierre GOSSE
Yes, I see I didn't understand that facet.query parameter.

Have you considered submitting two queries? One for results with q.op=OR, one
for faceting with q.op=AND?


-----Original Message-----
From: Grijesh [mailto:pintu.grij...@gmail.com]
Sent: Friday, February 4, 2011 10:42 AM
To: solr-user@lucene.apache.org
Subject: RE: Problem in faceting


facet.query=+water +treatment +plant will not return the city facets that are
needed by the poster.
It will only give the count of documents matching the query
facet.query=+water +treatment +plant

-
Thanx:
Grijesh
http://lucidimagination.com
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-in-faceting-tp2422182p2422881.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: geodist and spatial search

2011-02-04 Thread Eric Grobler
Hi Grant,

Thanks for the tip
This seems to work:

q=*:*
fq={!func}geodist()
sfield=store
pt=49.45031,11.077721

fq={!bbox}
sfield=store
pt=49.45031,11.077721
d=40

fl=store
sort=geodist() asc
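
Put together as one request, that would look something like this (a sketch,
assuming the example server and port):

  http://localhost:8983/solr/select?q=*:*&sfield=store&pt=49.45031,11.077721&d=40&fq={!bbox}&sort=geodist()%20asc&fl=store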


On Thu, Feb 3, 2011 at 7:46 PM, Grant Ingersoll  wrote:

> Use a filter query?  See the {!geofilt} stuff on the wiki page.  That gives
> you your filter to restrict down your result set, then you can sort by exact
> distance to get your sort of just those docs that make it through the
> filter.
>
>
> On Feb 3, 2011, at 10:24 AM, Eric Grobler wrote:
>
> > Hi Erick,
> >
> > Thanks I saw that example, but I am trying to sort by distance AND
> specify
> > the max distance in 1 query.
> >
> > The reason is:
> > running bbox on 2 million documents with a 20km distance takes only
> 200ms.
> > Sorting 2 million documents by distance takes over 1.5 seconds!
> >
> > So it will be much faster for solr to first filter the 20km documents and
> > then to sort them.
> >
> > Regards
> > Ericz
> >
> > On Thu, Feb 3, 2011 at 1:27 PM, Erick Erickson  >wrote:
> >
> >> Further down that very page ...
> >>
> >> Here's an example of sorting by distance ascending:
> >>
> >>  -
> >>
> >>  ...&q=*:*&sfield=store&pt=45.15,-93.85&sort=geodist()
> >> asc<
> >>
> http://localhost:8983/solr/select?wt=json&indent=true&fl=name,store&q=*:*&sfield=store&pt=45.15,-93.85&sort=geodist()%20asc
> >>>
> >>
> >>
> >>
> >>
> >> The key is just the &sort=geodist(), I'm pretty sure that's independent
> of
> >> the bbox, but
> >> I could be wrong.
> >>
> >> Best
> >> Erick
> >>
> >> On Wed, Feb 2, 2011 at 11:18 AM, Eric Grobler <
> impalah...@googlemail.com
> >>> wrote:
> >>
> >>> Hi
> >>>
> >>> In http://wiki.apache.org/solr/SpatialSearch
> >>> there is an example of a bbox filter and a geodist function.
> >>>
> >>> Is it possible to do a bbox filter and sort by distance - combine the
> >> two?
> >>>
> >>> Thanks
> >>> Ericz
> >>>
> >>
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


Re: What is the best protocol for data transfer rate HTTP or RMI?

2011-02-04 Thread Otis Gospodnetic
Gustavo,

I haven't used RMI in 5 years, but last time I used it I remember it being 
problematic - this is in the context of Lucene-based search involving some 40 
different shards/servers, high query rates, and some 2 billion documents, if I 
remember correctly.  I remember us wanting to get away from RMI to something 
simpler, less problematic, more HTTP-like.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Gustavo Maia 
> To: solr-user@lucene.apache.org
> Sent: Thu, February 3, 2011 1:05:16 PM
> Subject: What is the best protocol for data transfer rate HTTP or RMI?
> 
> Hello,
> 
> 
> 
> I am doing a comparative study between Lucene and Solr and  wish to obtain
> more concrete data on the data transfer using the lucene  RemoteSearch that
> uses RMI and data transfer of SOLR that uses the HTTP  protocol.
> 
> 
> 
> 
> Gustavo Maia
> 


Re: value for maxFieldLength

2011-02-04 Thread Otis Gospodnetic
Lewis,

A large maxFieldLength may not necessarily result in OOM - it depends on the
-Xmx you are using, the number of documents being processed concurrently, and
such. So the first thing I'd look at would be my machine's RAM, then the -Xmx I
can afford, and based on that set maxFieldLength.
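
The setting itself lives in the <indexDefaults>/<mainIndex> section of
solrconfig.xml; for example, to effectively remove the cap (memory permitting;
this is Integer.MAX_VALUE):

  <maxFieldLength>2147483647</maxFieldLength>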

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: "McGibbney, Lewis John" 
> To: "solr-user@lucene.apache.org" 
> Sent: Wed, February 2, 2011 10:20:58 AM
> Subject: value for maxFieldLength
> 
> Hello list,
> 
> I am aware that setting the value of maxFieldLength in solrconfig.xml too high
> may/will result in out-of-mem errors. I wish to provide content extraction on a
> number of pdf documents which are large, by large I mean 8-11MB (occasionally
> more), and I am also not sure how many terms reside in each field when it is
> indexed. My question is therefore what is a sensible number to set this value
> to in order to include the majority/all terms within documents of this size.
> 
> Thank you
> 
> Lewis
> 
> 
> Glasgow Caledonian University is a registered Scottish charity, number SC021474
>
> Winner: Times Higher Education's Widening Participation Initiative of the Year
> 2009 and Herald Society's Education Initiative of the Year 2009.
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
>
> Winner: Times Higher Education's Outstanding Support for Early Career
> Researchers of the Year 2010, GCU as a lead with Universities Scotland partners.
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html
> 


Re: Using terms and N-gram

2011-02-04 Thread Otis Gospodnetic
Hi,

The main difference is that CommonGrams will take 2 adjacent words and put them 
together, while NGram* stuff will take a single word and chop it up in 
sequences 
of one or more characters/letters.
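
A rough illustration (assuming "the" is listed in the common-words file, and an
NGram filter configured with minGramSize=2, maxGramSize=2):

  CommonGramsFilter:  "the quick brown"  ->  the, the_quick, quick, brown
  NGramFilter:        "quick"            ->  qu, ui, ic, ck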

If you are stuck with auto-complete stuff, consider 
http://sematext.com/products/autocomplete/index.html

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: openvictor Open 
> To: solr-user@lucene.apache.org
> Sent: Thu, February 3, 2011 10:15:47 AM
> Subject: Re: Using terms and N-gram
> 
> Thank you, I will do that and hopefully it will be handy!
>
> But can someone explain to me the difference between CommonGramsFilterFactory
> and NGramFilterFactory? (Maybe the solution is there)
> 
> Thank you all,
> best  regards
> 
> 2011/2/3 Grijesh 
> 
> >
> >  Use analysis.jsp to see what happening at index time and query time  with
> > your
> > input data.You can use highlighting to see if match  found.
> >
> > -
> > Thanx:
> > Grijesh
> > http://lucidimagination.com
> > --
> > View this message in  context:
> > 
>http://lucene.472066.n3.nabble.com/Using-terms-and-N-gram-tp2410938p2411244.html
> >  Sent from the Solr - User mailing list archive at Nabble.com.
> >
> 


Re: phrase, individual term, prefix, fuzzy and stemming search

2011-02-04 Thread Otis Gospodnetic
Hi,

I'll admit I didn't read your email closely, but the first part makes me think
that ngrams, which I don't think you mentioned, might be handy for you here,
allowing for misspellings without the implementation complexity.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: cyang2010 
> To: solr-user@lucene.apache.org
> Sent: Mon, January 31, 2011 5:22:19 PM
> Subject: phrase, individual term, prefix, fuzzy and stemming search
> 
> 
> My current project has the requirement to support search when user inputs  any
> number of terms across a few index fields (movie title, actor,  director).
> 
> In order to maximize result, I plan to support all those  searches listed in
> the subject, phrase, individual term, prefix, fuzzy and  stemming.  Of
> course, score relevance in the right order is also  important.
> 
> I have considered using dismax query.  However, it does  not support prefix
> query.  I am not sure if it supports fuzzy query, my  guess is does not.
> 
> Therefore, i still need to use standard query.For example, if someone
> searches "deim moer" (typo for demi moore), i compare  the phrase and terms
> with each searchable fields (title, actor,  director):
> 
> 
> title_display: "deim moer"~30 actors: "deim moer"~30  directors: "deim
> moer"~30<--  OR
> 
> title_display:  deim<-- OR
> actors: deim 
> directors: deim 
> 
> title_display: deim*   <-- OR
> actors: deim* 
> directors:  deim* 
> 
> title_display: deim~0.6   <-- OR
> actors: deim~0.6 
> directors: deim~0.6 
> 
> title_display: moer<--  OR
> actors: moer 
> directors: moer 
> 
> title_display: moer*<-- OR
> actors: moer* 
> directors: moer* 
> 
> title_display:  moer~0.6<-- OR
> actors: moer~0.6 
> directors:  moer~0.6
> 
> The solr relevance score is sum for all those OR.  In that  way, i can make
> sure relevance score are in order.  For example, for the  exact match ("deim
> moer"), it will match phrase, term, prefix and fuzzy query  all at the same
> time.   Therefore, it will score higher than some input  text only matchs
> term, or prefix or fuzzy. At the same time, i  can apply boost to a
> particular search field if requirement  needs.
> 
> 
> Does it sound right to you?  Is there better ways to  achieve the same thing? 
> My concern is my query is not going to perform,  since it tries to do too
> much.  But isn't that what people want to get  (maximize result) when they
> just type in a few search words?
> 
> Another  question is that:  Can i combine the result of two query together? 
> For  example, first i query phrase and term match, next I query for  prefix
> match.  Can I just append the result for prefix match to that  for
> phrase/term match?   I thought two queries have different  queryNorm,
> therefore, the score is not comparable to each other so as to  combine.  Is
> it correct?
> 
> 
> Thanks.  love to hear what your  thought is.
> 
> 
> -- 
> View this message in context: 
>http://lucene.472066.n3.nabble.com/phrase-inidividual-term-prefix-fuzzy-and-stemming-search-tp239p239.html
>
> Sent  from the Solr - User mailing list archive at Nabble.com.
> 


Re: Solr Indexing Performance

2011-02-04 Thread Otis Gospodnetic
Hi,

2 GB for ramBufferSize is probably too much and not needed, but you could 
increase it from default 32 MB to something like 128 MB or even 512 MB, if you 
really have that much data where that would make a difference (you mention only 
49 PDF files).  I'd leave mergeFactor at 10 for now.  The slowness (if there is 
slowness - how long is it taking?) could be from:
* slow DB
* suboptimal SQL
* PDF content extraction
* indexing itself
* ...

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Tomás Fernández Löbbe 
> To: solr-user@lucene.apache.org
> Sent: Mon, January 31, 2011 10:13:32 AM
> Subject: Re: Solr Indexing Performance
> 
> Well, I would say that the best way to be sure is to benchmark  different
> configurations.
> As far as I know, it's usually not recommended  such a big RAM Buffer size,
> default is 32 MB and probably won't get any  improvements using more than 128
> MB.
> The same with the mergeFactor, I know  that a larger merge factor it's better
> for indexing, but 50 sounds like a  lot. Anyway, as I said before, the best
> thing to do is benchmark different  configurations and see which one works
> better for you.
> 
> Have you tried  assigning less memory to the JVM? That would leave more
> memory available to  the OS.
> 
> Tomás
> 
> On Sun, Jan 30, 2011 at 1:54 AM, Darx Oman  wrote:
> 
> >  Hi guys
> >
> >
> >
> > I'm running a solr instance  (trunk)  in my dev. Server to test my
> > configuration.  I'm  doing a DIH full import to index 49 PDF files with
> > their
> >  corresponding database records.  Both the PDF files and database are  local
> > in the server.
> >
> > *Server : *
> >
> > ·  Windows 2008 R2
> >
> > ·  MS SQL server 2008 R2
> >
> > · 16  core processor
> >
> > · 16 GB  ram
> >
> > *Tomcat (7.0.5) : *
> >
> > ·  Set JAVA_OPTS = %JAVA_OPTS%  -Xms1024M   -Xmx8192M
> >
> > *Solrconfig:*
> >
> > ·  Main index configurations
> > <ramBufferSizeMB>2048</ramBufferSizeMB>
> > <mergeFactor>50</mergeFactor>
> >
> > *DIH  configuration:*
> >
> > · 2 data sources  defined  jdbcDataSource and BinFileDataSource
> >
> > ·  One main entity with 3 sub entities
> >
> > · Total schema  fields are 8, three of which are text type and
> >  multivalued.
> >
> > *My DIH import Status Messages:*
> >
> > · Total Requests made to DataSource = 99
> >
> > · Total Rows Fetched = 2124
> >
> > · Total Documents Processed = 49
> >
> > · Time Taken = 0:2:3:880
> >
> > Is this time reasonable or can it be improved?
> >
>


Re: Detect Out of Memory Errors

2011-02-04 Thread Otis Gospodnetic
Hi,

There are external tools that one can use to watch Java processes, listen for 
errors, and restart processes if they die - monit, daemontools, and some 
Java-specific ones.
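
A minimal monit sketch (the pidfile, scripts, port, and address are all
assumptions; an unresponsive HTTP port is a common symptom once the JVM has hit
an OOM):

  check process tomcat with pidfile /var/run/tomcat.pid
    start program = "/etc/init.d/tomcat start"
    stop program  = "/etc/init.d/tomcat stop"
    if failed port 8080 protocol http then restart
    alert admin@example.com

Pairing this with the JVM flag -XX:+HeapDumpOnOutOfMemoryError also leaves you a
heap dump to inspect afterwards.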

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: saureen 
> To: solr-user@lucene.apache.org
> Sent: Thu, January 27, 2011 9:41:56 AM
> Subject: Detect Out of Memory Errors
> 
> 
> Hi,
> 
> is ther a way by which i could detect the out of memory errors in  solr so
> that i could implement some functionality such as restarting the  tomcat or
> alert me via email whenever such error is detected.?
> -- 
> View  this message in context: 
>http://lucene.472066.n3.nabble.com/Detect-Out-of-Memory-Errors-tp2362872p2362872.html
>
> Sent  from the Solr - User mailing list archive at Nabble.com.
> 


Re: Performance optimization of Proximity/Wildcard searches

2011-02-04 Thread Otis Gospodnetic
Salman,

I only skimmed your email, but wanted to say that this part sounds a little 
suspicious:

> Our warm up script currently  executes all distinct queries in our logs
> having count > 5. It was run  yesterday (with all the indexing update every

It sounds like this will make warmup take a long time, assuming you have
more than a handful of distinct queries in your logs.
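
(For reference, Solr's built-in warming hook in solrconfig.xml looks like this,
a sketch with an illustrative query; keeping this list short is what avoids the
long warmups described above:

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">water treatment plant</str></lst>
    </arr>
  </listener>
)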

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Salman Akram 
> To: solr-user@lucene.apache.org; t...@statsbiblioteket.dk
> Sent: Tue, January 25, 2011 6:32:48 AM
> Subject: Re: Performance optimization of Proximity/Wildcard searches
> 
> By warmed index you only mean warming the SOLR cache or OS cache? As I  said
> our index is updated every hour so I am not sure how much SOLR cache  would
> be helpful but OS cache should still be helpful, right?
> 
> I  haven't compared the results with a proper script but from manual  testing
> here are some of the observations.
> 
> 'Recent' queries which are  in cache of course return immediately (only if
> they are exactly same - even  if they took 3-4 mins first time). I will need
> to test how many recent  queries stay in cache but still this would work only
> for very common queries.  User can run different queries and I want at least
> them to be at 'acceptable'  level (5-10 secs) even if not very fast.
> 
> Our warm up script currently  executes all distinct queries in our logs
> having count > 5. It was run  yesterday (with all the indexing update every
> hour after that) and today when  I executed some of the same queries again
> their time seemed a little less  (around 15-20%), I am not sure if this means
> anything. However, still their  time is not acceptable.
> 
> What do you think is the best way to compare  results? First run all the warm
> up queries and then execute same randomly and  compare?
> 
> We are using Windows server, would it make a big difference if  we move to
> Linux? Our load is not high but some queries are really  complex.
> 
> Also I was hoping to move to SSD in last after trying out all  software
> options. Is that an agreed fact that on large indexes (which don't  fit in
> RAM) proximity/wildcard/phrase queries (on common words) would be slow  and
> it can be only improved by cache warm up and better hardware? Otherwise  with
> an index of around 150GB such queries will take more than a  min?
> 
> If that's the case I know this question is very subjective but if a  single
> query takes 2 min on SAS 10K RPM what would its approx time be on a  good SSD
> (everything else same)?
> 
> Thanks!
> 
> 
> On Tue, Jan 25,  2011 at 3:44 PM, Toke Eskildsen 
wrote:
> 
> >  On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:
> > > Cache  warming is a good option too but the index get updated every hour
> >  so
> > > not sure how much would that help.
> >
> > What is the  time difference between queries with a warmed index and a
> > cold one? If  the warmed index performs satisfactory, then one answer is
> > to upgrade  your underlying storage. As always for IO-caused performance
> > problem in  Lucene/Solr-land, SSD is the answer.
> >
> >
> 
> 
> -- 
> Regards,
> 
> Salman Akram
> 


Re: Performance optimization of Proximity/Wildcard searches

2011-02-04 Thread Otis Gospodnetic
Hi,


> Sharding is an  option too but that too comes with limitations so want to
> keep that as a last  resort but I think there must be other things coz 150GB
> is not too big for  one drive/server with 32GB Ram.

Hmm what makes you think 32 GB is enough for your 150 GB index?
It depends on queries and distribution of matching documents, for example.  
What's yours like?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Salman Akram 
> To: solr-user@lucene.apache.org
> Sent: Tue, January 25, 2011 4:20:34 AM
> Subject: Performance optimization of Proximity/Wildcard searches
> 
> Hi,
> 
> I am facing performance issues in three types of queries (and  their
> combination). Some of the queries take more than 2-3 mins. Index size  is
> around 150GB.
> 
> 
>- Wildcard
>-  Proximity
>- Phrases (with common words)
> 
> I know CommonGrams and  Stop words are a good way to resolve such issues but
> they don't fulfill our  functional requirements (Common Grams seem to have
> issues with phrase  proximity, stop words have issues with exact match etc).
> 
> Sharding is an  option too but that too comes with limitations so want to
> keep that as a last  resort but I think there must be other things coz 150GB
> is not too big for  one drive/server with 32GB Ram.
> 
> Cache warming is a good option too but  the index get updated every hour so
> not sure how much would that  help.
> 
> What are the other main tips that can help in performance  optimization of
> the above queries?
> 
> Thanks
> 
> -- 
> Regards,
> 
> Salman Akram
> 


Re: Highlighting with/without Term Vectors

2011-02-04 Thread Otis Gospodnetic
Salman,

It also depends on the size of your documents.  Re-analyzing 20 fields of 500 
bytes each will be a lot faster than re-analyzing 20 fields with 50 KB each.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Grant Ingersoll 
> To: solr-user@lucene.apache.org
> Sent: Wed, January 26, 2011 10:44:09 AM
> Subject: Re: Highlighting with/without Term Vectors
> 
> 
> On Jan 24, 2011, at 2:42 PM, Salman Akram wrote:
> 
> > Hi,
> > 
> > Does anyone have any benchmarks how much highlighting speeds up with  Term
> > Vectors (compared to without it)? e.g. if highlighting on 20  documents take
> > 1 sec with Term Vectors any idea how long it will take  without them?
> > 
> > I need to know since the index used for  highlighting has a TVF file of
> > around 450GB (approx 65% of total index  size) so I am trying to see whether
> > the decreasing the index size by  dropping TVF would be more helpful for
> > performance (less RAM, should be  good for I/O too I guess) or keeping it is
> > still better?
> > 
> > I know the best way is try it out but indexing takes a very long time  so
> > trying to see whether its even worthy or not.
> 
> 
> Try testing  on a smaller set.  In general, you are saving the process of 
>re-analyzing  the content, so, to some extent it is going to be dependent on 
>how 
>fast your  analyzer chain is.  At the size you are at, I don't know if storing 
>TVs is  worth it.


Re: Solr for finding similar word between two documents

2011-02-04 Thread Otis Gospodnetic
Rohan,

You can do that with Lucene's tokenizers, to get the individual tokens/words,
and a HashMap whose keys are those words/tokens from the first document.  You
can then tokenize the second doc and check each of its words against the HashMap.
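
A minimal sketch of that approach (assuming a Lucene 3.x-style analysis API;
the field name and analyzer choice are illustrative):

    import java.io.StringReader;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class CommonWords {
      // Run the analyzer over the text and collect the distinct tokens it emits.
      static Set<String> tokenize(Analyzer analyzer, String text) throws Exception {
        Set<String> words = new HashSet<String>();
        TokenStream ts = analyzer.tokenStream("contents", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          words.add(term.toString());
        }
        ts.end();
        ts.close();
        return words;
      }

      public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        Set<String> first = tokenize(analyzer, "text of the first document ...");
        Set<String> overlap = tokenize(analyzer, "text of the second document ...");
        overlap.retainAll(first); // keep only the words present in both documents
        System.out.println(overlap);
      }
    }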

Our Key Phrase Extractor ( 
http://sematext.com/products/key-phrase-extractor/index.html ) includes similar 
functionality that works with 2 corpora (or 2 pieces of text or 2 language 
models) and gets you the "overlap".  I think it also takes into consideration 
term frequencies, which can be handy.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: rohan rai 
> To: solr-user@lucene.apache.org
> Sent: Thu, February 3, 2011 2:35:39 PM
> Subject: Re: Solr for finding similar word between two documents
> 
> Let's say I have a document (file) which is large and contains words inside it.
>
> And the 2nd document is also a text file.
>
> The problem is to find all those words in the 2nd document which are present in
> the first document,
> when both of the files are large enough.
> 
> Regards
> Rohan
> 
> On Fri, Feb  4, 2011 at 1:01 AM, openvictor Open wrote:
> 
> > Rohan: what you want to do can be done with quite little effort if your
> > document has a limited size (up to some MB) with common and basic
> > structures like HashMap.
> >
> > Do you have any  additional information on your problem so that we can give
> > you more  useful inputs ?
> >
> > 2011/2/3 Gora Mohanty 
> >
> > >  On Thu, Feb 3, 2011 at 11:32 PM, rohan rai  wrote:
> >  > > Is there a way to use solr and get similar words between two  document
> > > > (files).
> > > [...]
> > >
> > > This is *way* too vague to make any sense out of. Could you elaborate,
> > > as I could have sworn that what you seem to want is the essential
> > > function of a search engine.
> > >
> > > Regards,
> >  > Gora
> > >
> >
> 


Re: Highlighting with/without Term Vectors

2011-02-04 Thread Salman Akram
Basically Term Vectors are only on one main field, i.e. Contents. The average
size of each document would be a few KBs, but there are around 130 million
documents, so what do you suggest now?

On Fri, Feb 4, 2011 at 5:24 PM, Otis Gospodnetic  wrote:

> Salman,
>
> It also depends on the size of your documents.  Re-analyzing 20 fields of
> 500
> bytes each will be a lot faster than re-analyzing 20 fields with 50 KB
> each.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Grant Ingersoll 
> > To: solr-user@lucene.apache.org
> > Sent: Wed, January 26, 2011 10:44:09 AM
> > Subject: Re: Highlighting with/without Term Vectors
> >
> >
> > On Jan 24, 2011, at 2:42 PM, Salman Akram wrote:
> >
> > > Hi,
> > >
> > > Does anyone have any benchmarks how much highlighting speeds up with
>  Term
> > > Vectors (compared to without it)? e.g. if highlighting on 20  documents
> take
> > > 1 sec with Term Vectors any idea how long it will take  without them?
> > >
> > > I need to know since the index used for  highlighting has a TVF file of
> > > around 450GB (approx 65% of total index  size) so I am trying to see
> whether
> > > the decreasing the index size by  dropping TVF would be more helpful
> for
> > > performance (less RAM, should be  good for I/O too I guess) or keeping
> it is
> > > still better?
> > >
> > > I know the best way is try it out but indexing takes a very long time
>  so
> > > trying to see whether its even worthy or not.
> >
> >
> > Try testing  on a smaller set.  In general, you are saving the process of
> >re-analyzing  the content, so, to some extent it is going to be dependent
> on how
> >fast your  analyzer chain is.  At the size you are at, I don't know if
> storing
> >TVs is  worth it.
>



-- 
Regards,

Salman Akram


Re: Problem in faceting

2011-02-04 Thread Bagesh Sharma

Sending two separate queries is an approach, but I think it may affect the
performance of Solr, because every new search would issue two queries to Solr.
For this reason I was thinking of doing it with a single query. I am going to
implement it with two queries now, but if anything useful turns up in the
future, please suggest it. Thanks for the suggestion
-- 
Thanks and Regards
   Bagesh Sharma

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-in-faceting-tp2422182p2424104.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Facet Query

2011-02-04 Thread Bagesh Sharma

yes it works fine ... thanks
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Facet-Query-tp2422212p2424155.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Index Not Matching

2011-02-04 Thread Esclusa, Will
Hello Grijesh,

The URL below returns a 404 with the following error:

The requested resource (/select/) is not available.



-Original Message-
From: Grijesh [mailto:pintu.grij...@gmail.com] 
Sent: Friday, February 04, 2011 12:17 AM
To: solr-user@lucene.apache.org
Subject: RE: Index Not Matching


http://localhost:8080/select/?q=*:* will return all records from solr

-
Thanx:
Grijesh
http://lucidimagination.com
-- 
View this message in context:
http://lucene.472066.n3.nabble.com/Index-Not-Matching-tp2417612p2421560.
html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Index Not Matching

2011-02-04 Thread Stefan Matheis
try http://localhost:8080/solr/select?q=*:* or while using solr's
default port http://localhost:8983/solr/select?q=*:*

On Fri, Feb 4, 2011 at 2:50 PM, Esclusa, Will
 wrote:
> Hello Grijesh,
>
> The URL below returns a 404 with the following error:
>
> The requested resource (/select/) is not available.
>
>
>
> -Original Message-
> From: Grijesh [mailto:pintu.grij...@gmail.com]
> Sent: Friday, February 04, 2011 12:17 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Index Not Matching
>
>
> http://localhost:8080/select/?q=*:* will return all records from solr
>
> -
> Thanx:
> Grijesh
> http://lucidimagination.com
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Index-Not-Matching-tp2417612p2421560.
> html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Use Parallel Search

2011-02-04 Thread Gustavo Maia
Hello,

I am not using Nutch.

Let me explain more about how I use Lucene.
Lucene has the RemoteSearchable class, which a server machine uses to publish
its index:

RemoteSearchable remote = new RemoteSearchable(parallelSearcher);
Naming.rebind("//" + LocalIP + "/" + artPortMap.getNick(), remote);


On the client it is only necessary to do a lookup based on the IP of the
machine. On each machine we use parallel search, which allows me to search in
parallel using different processors and different HDs. So with 6 HDs and a
machine with more than 6 processors, the search is perfectly parallel.

The example below shows how to get a reference to the server machine:
 Searchable ts = (Searchable) Naming.lookup("//" + ip + ":" + port + "/" + name);

All my documents are in XML format. I have a pre-processing step that converts
HTML, DOC and PDF documents to XML.

The searches do not use facets, because that is not possible with Lucene. That's
one reason I'm studying SOLR. Today I need to use FACET :). I use queries with
sorting, filtering, multiple fields...

With this architecture I have an index of 18 fragments scattered over the 18 HDs
of three machines, each index fragment with a size of 10GB, which gives me a
total index size of 180GB.

But I'm afraid because the index will be multiplied by 10, going from 180GB to
1800GB. Is Apache SOLR better suited for this new index size, or can I continue
using Lucene and just add more machines?
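
For reference, Solr's built-in distributed search fans a query out over plain
HTTP using the shards parameter; with one Solr core per HD it would look
something like this (host names and core paths are hypothetical):

  http://host1:8983/solr/select?q=water&shards=host1:8983/solr,host2:8983/solr,host3:8983/solr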




2011/2/4 Ganesh 

> I am having similar kind of problem. I need to scale out. Could you explain
> how you have done distributed indexing and search using Lucene.
>
> Regards
> Ganesh
>
> - Original Message -
> From: "Gustavo Maia" 
> To: 
> Sent: Thursday, February 03, 2011 11:36 PM
> Subject: Use Parallel Search
>
>
> > Hello,
> >
> > Let me give a brief description of my scenario.
> > Today I am only using Lucene 2.9.3. I have an index of 30 million
> documents
> > distributed on three machines and each machine with 6 hds (15k rmp).
> > The server queries the search index using the remote class search. And
> each
> > machine is made to search using the parallel search (search
> simultaneously
> > in 6 hds).
> > So during the search are simulating using the three machines and 18 hds,
> > returning me to a very good response time.
> >
> >
> > Today I am studying the SOLR and am interested in knowing more about the
> > searches and use of distributed parallel search on the same machine. What
> > would be the best scenario using SOLR that is better than I already am
> using
> > today only with lucene?
> >  Note: I need to have installed on each machine 6 SOLR instantiate from
> my
> > server? One for each hd? Or would some other alternative way for me to
> use
> > the 6 hds without having 6 instances of SORL server?
> >
> >  Another question would be if the SOLR would have some limiting size
> index
> > for Hard drive? It would be interesting not index too big because when
> the
> > index increased the longer the search.
> >
> > Thanks for everything.
> >
> >
> > Gustavo Maia
> >
> Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
> Download Now! http://messenger.yahoo.com/download.php
>


Re: What is the best protocol for data transfer rate HTTP or RMI?

2011-02-04 Thread Gustavo Maia
Hi Otis,


Hello,

  You have many documents, 2 billion. Could you explain to me how yours is set
up?

Mine is defined as follows, using Lucene.
I have 3 machines, each machine with 6 HDs. Each HD holds an index fragment of
10GB. So I have 3 search servers. Each server uses the Lucene parallel search
class over its 6 HDs and publishes itself using the RemoteSearchable class.
My client connects to these three machines using RMI. Everything is done with
Lucene, using the classes it provides.

Please explain how you did the distribution of your index. How many HDs do you
use per machine? What is the maximum index size you put on one HD? Are you
using SOLR or Lucene? How many SOLR server instances do you have on each
machine?

Sorry for so many questions.

Gustavo Maia



2011/2/4 Otis Gospodnetic 

> Gustavo,
>
> I haven't used RMI in 5 years, but last time I used it I remember it being
> problematic - this is in the context of Lucene-based search involving some
> 40
> different shards/servers, high query rates, and some 2 billion documents,
> if I
> remember correctly.  I remember us wanting to get away from RMI to
> something
> simpler, less problematic, more HTTP-like.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Gustavo Maia 
> > To: solr-user@lucene.apache.org
> > Sent: Thu, February 3, 2011 1:05:16 PM
> > Subject: What is the best protocol for data transfer rate HTTP or RMI?
> >
> > Hello,
> >
> >
> >
> > I am doing a comparative study between Lucene and Solr and  wish to
> obtain
> > more concrete data on the data transfer using the lucene  RemoteSearch
> that
> > uses RMI and data transfer of SOLR that uses the HTTP  protocol.
> >
> >
> >
> >
> > Gustavo Maia
> >
>


Re: What is the best protocol for data transfer rate HTTP or RMI?

2011-02-04 Thread Mattmann, Chris A (388J)
Hi Guys,

It depends on what properties you're trying to maximize. I've done several 
studies of this over the years:

http://sunset.usc.edu/~mattmann/pubs/MSST2006.pdf
http://sunset.usc.edu/~mattmann/pubs/IWICSS07.pdf
http://sunset.usc.edu/~mattmann/pubs/icse-shark08.pdf

And if you're really bored, and have time, this one:

http://sunset.usc.edu/~mattmann/Dissertation.pdf

It would be nice to see how Lucene/Solr as an application that induces 
distribution scenarios affects the underlying data transfer, similar to the 
approaches described in the above papers.

HTH!

Cheers,
Chris

On Feb 4, 2011, at 3:32 AM, Otis Gospodnetic wrote:

> Gustavo,
> 
> I haven't used RMI in 5 years, but last time I used it I remember it being 
> problematic - this is in the context of Lucene-based search involving some 40 
> different shards/servers, high query rates, and some 2 billion documents, if 
> I 
> remember correctly.  I remember us wanting to get away from RMI to something 
> simpler, less problematic, more HTTP-like.
> 
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 
> 
> 
> - Original Message 
>> From: Gustavo Maia 
>> To: solr-user@lucene.apache.org
>> Sent: Thu, February 3, 2011 1:05:16 PM
>> Subject: What is the best protocol for data transfer rate HTTP or RMI?
>> 
>> Hello,
>> 
>> 
>> 
>> I am doing a comparative study between Lucene and Solr and  wish to obtain
>> more concrete data on the data transfer using the lucene  RemoteSearch that
>> uses RMI and data transfer of SOLR that uses the HTTP  protocol.
>> 
>> 
>> 
>> 
>> Gustavo Maia
>> 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



prices

2011-02-04 Thread Dennis Gearon
Using solr 1.4.

I have a price in my schema. Currently it's a tfloat. Somewhere along the way 
from php, json, solr, and back, extra zeroes are getting truncated along with 
the decimal point for even dollar amounts.

So I have two questions, neither of which seemed to be findable with google.

A/ Any way to keep both zeroes going into a float field? (In the analyzer, with
XML output, the values are shown with 1 zero)
B/ Can strings be used in range queries like a float and work well for prices?


 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



RE: DataImportHandler usage with RDF database

2011-02-04 Thread McGibbney, Lewis John
Hi Otis... thanks for your thoughts.

>I don't think DIH can read from a triple store today. It can read from an RDBMS,
>RSS/Atom feeds, URLs, mail servers, maybe others...
>Maybe what you should be looking at is ManifoldCF instead, although I don't
>think it can fetch data from triple stores today either.

Ok, well, a way I can work around this (for the time being) is to pull data from
URLs instead.

>> without sending an index commit to Solr. As far as I can see DataImportHandler
>> currently supports full and delta imports which mean I would be indexing.
>>

> I don't follow what you mean by this and how it relates to the first part.

Well as you mentioned below, I'm talking about a custom SearchComponent that 
reads some data from
somewhere (URL for the time being) and then uses it at search time for
something. I have no need to index this data, I merely require it at search 
time.

>> So far I have yet to find a requestHandler which is able to read then store
>> data in memory, then use this data elsewhere prior to returning documents via
>> queryResponseWriter.


>I think you are talking about a custom SearchComponent that reads some data from
>somewhere (e.g. your triple store) and then uses it at search time for
>something.  This sounds doable, although you didn't provide details.  For
>example, we (Sematext) have implemented custom SearchComponents for e-commerce
>customers where frequently-changing information about product availability was
>fetched from external stores and applied to search results.

I have web-based files, and the idea is to specify the URLs to the
SearchComponent, which can then use the data within them at search time. Did
your plug-in adhere to the general requestHandler design? Can you provide any
resource from which I can get started with this?

thank you
Lewis

Glasgow Caledonian University is a registered Scottish charity, number SC021474

Winner: Times Higher Education’s Widening Participation Initiative of the Year 
2009 and Herald Society’s Education Initiative of the Year 2009.
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html

Winner: Times Higher Education’s Outstanding Support for Early Career 
Researchers of the Year 2010, GCU as a lead with Universities Scotland partners.
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html


Re: prices

2011-02-04 Thread Yonik Seeley
On Fri, Feb 4, 2011 at 12:56 PM, Dennis Gearon  wrote:
> Using solr 1.4.
>
> I have a price in my schema. Currently it's a tfloat. Somewhere along the way
> from php, json, solr, and back, extra zeroes are getting truncated along with
> the decimal point for even dollar amounts.
>
> So I have two questions, neither of which seemed to be findable with google.
>
> A/ Any way to keep both zeroes going into a float field? (In the analyzer, with
> XML output, the values are shown with 1 zero)
> B/ Can strings be used in range queries like a float and work well for prices?

You could do a copyField into a stored string field and use the tfloat
(or tint and store cents)
for range queries, searching, etc, and the string field just for display.
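
In schema.xml that could look like this (field names are illustrative):

  <field name="price" type="tfloat" indexed="true" stored="false"/>
  <field name="price_display" type="string" indexed="false" stored="true"/>
  <copyField source="price" dest="price_display"/>

Since copyField copies the raw input value before analysis, the string field
keeps "2.00" exactly as it was sent in.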

-Yonik
http://lucidimagination.com




>
>  Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a 
> better
> idea to learn from others’ mistakes, so you do not have to make them yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>


Re: HTTP ERROR 400 undefined field: *

2011-02-04 Thread Jed Glazner

Sorry for the lack of details.

It's all clear in my head.. :)

We checked out the head revision from the 3.x branch a few weeks ago 
(https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/). We 
picked up r1058326.


We upgraded from a previous checkout (r960098). I am using our 
customized schema.xml and the solrconfig.xml from the old revision with 
the new checkout.


After upgrading I just copied the data folders from each core into the 
new checkout (hoping I wouldn't have to re-index the content, as this 
takes days).  Everything seems to work fine, except that now I can't get 
the score to return.


The stack trace is attached.  I also saw this warning in the logs; I'm not
sure exactly what it's talking about:


Feb 3, 2011 8:14:10 PM org.apache.solr.core.Config getLuceneVersion
WARNING: the luceneMatchVersion is not specified, defaulting to 
LUCENE_24 emulation. You should at some point declare and reindex to at 
least 3.0, because 2.4 emulation is deprecated and will be removed in 
4.0. This parameter will be mandatory in 4.0.


Here is my request handler (the actual field names below are different from the
real ones, as I'm a little uncomfortable publishing how our company's search
service works to the world):



<requestHandler name="standard" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="defType">edismax</str>
    <bool name="tv">true</bool>
    <str name="qf">field_a^2 field_b^2 field_c^4</str>
    <str name="pf">field_d^10</str>
    <float name="tie">0.1</float>
  </lst>
  <arr name="last-components">
    <str>tvComponent</str>
  </arr>
</requestHandler>



Anyway, hopefully this is enough info; let me know if you need more.

Jed.





On 02/03/2011 10:29 PM, Chris Hostetter wrote:

: I was working on an checkout of the 3.x branch from about 6 months ago.
: Everything was working pretty well, but we decided that we should update and
: get what was at the head.  However after upgrading, I am now getting this

FWIW: please be specific.  "head" of what? the 3x branch? or trunk?  what
revision in svn does that corrispond to? (the "svnversion" command will
tell you)

: HTTP ERROR 400 undefined field: *
:
: If I clear the fl parameter (default is set to *, score) then it works fine
: with one big problem, no score data.  If I try and set fl=score I get the same
: error except it says undefined field: score?!
:
: This works great in the older version, what changed?  I've googled for about
: an hour now and I can't seem to find anything.

i can't reproduce this using either trunk (r1067044) or 3x (r1067045)

all of these queries work just fine...

http://localhost:8983/solr/select/?q=*
http://localhost:8983/solr/select/?q=solr&fl=*,score
http://localhost:8983/solr/select/?q=solr&fl=score
http://localhost:8983/solr/select/?q=solr

...you'll have to provide us with a *lot* more details to help understand
why you might be getting an error (like: what your configs look like, what
the request looks like, what the full stack trace of your error is in the
logs, etc...)




-Hoss


 844 Feb 3, 2011 8:16:58 PM org.apache.solr.core.SolrCore execute
 845 INFO: [music] webapp=/solr path=/select params={explainOther=&fl=*,score&indent=on&start=0&q=test&hl.fl=&qt=standard&wt=standard&fq=&version=2.2&rows=10} hits=2201 status=400 QTime=143
 846 Feb 3, 2011 8:17:00 PM org.apache.solr.core.SolrCore execute
 847 INFO: [rovi] webapp=/solr path=/replication params={command=indexversion&wt=javabin} status=0 QTime=0
 848 Feb 3, 2011 8:17:00 PM org.apache.solr.core.SolrCore execute
 849 INFO: [rovi] webapp=/solr path=/replication params={command=filelist&wt=javabin&indexversion=1277332208072} status=0 QTime=0
 850 Feb 3, 2011 8:17:00 PM org.apache.solr.core.SolrCore execute
 851 INFO: [rovi] webapp=/solr path=/replication params={command=indexversion&wt=javabin} status=0 QTime=0
 852 Feb 3, 2011 8:17:09 PM org.apache.solr.common.SolrException log
 853 SEVERE: org.apache.solr.common.SolrException: undefined field: score
 854   at org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:142)
 855   at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
 856   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 857   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1357)
 858   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
 859   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
 860   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 861   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
 862   at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 863   at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
 864   at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 865   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
 866   at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 867   at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:1

RE: prices

2011-02-04 Thread Jonathan Rochkind
Your prices are just dollars and cents? For actual queries, you might consider
an int type rather than a float type.  Multiply by a hundred to put a price in
the index, then multiply the values in your queries by a hundred before putting
them in the query.  Same for range faceting: just divide by 100 before
displaying anything you get back.

Fixed-precision values like prices aren't really floats and don't really need
floats, and floats sometimes do weird things, as you've noticed.

Alternately, if your problem is simply that you want to display "2.0" as "2.00"
rather than "2" or "2.0", that is something for you to take care of in your PHP
app that does the display. PHP will have some function for formatting numbers
that lets you say with what precision you want to display them.

There is no way to keep two trailing zeroes 'in' a float field, because "2.0"
or "2." is the same value as "2.00" or "2.000", so they've all got the same
internal representation in the float field. There is no way I know of to tell
Solr what precision to render floats with in its responses.


From: ysee...@gmail.com [ysee...@gmail.com] On Behalf Of Yonik Seeley 
[yo...@lucidimagination.com]
Sent: Friday, February 04, 2011 1:49 PM
To: solr-user@lucene.apache.org
Subject: Re: prices

On Fri, Feb 4, 2011 at 12:56 PM, Dennis Gearon  wrote:
> Using solr 1.4.
>
> I have a price in my schema. Currently it's a tfloat. Somewhere along the way
> from php, json, solr, and back, extra zeroes are getting truncated along with
> the decimal point for even dollar amounts.
>
> So I have two questions, neither of which seemed to be findable with google.
>
> A/ Any way to keep both zeroes going into a float field? (In the analyzer, with
> XML output, the values are shown with 1 zero)
> B/ Can strings be used in range queries like a float and work well for prices?

You could do a copyField into a stored string field and use the tfloat
(or tint and store cents)
for range queries, searching, etc, and the string field just for display.

-Yonik
http://lucidimagination.com




>
>  Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a 
> better
> idea to learn from others’ mistakes, so you do not have to make them yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>


Re: Using terms and N-gram

2011-02-04 Thread openvictor Open
Hi Otis,

That's good, I finally made it work. As for Sematext, I'm afraid I am too poor
to consider that solution :) (I am doing this for fun)
Thank you anyway!

2011/2/4 Otis Gospodnetic 

> Hi,
>
> The main difference is that CommonGrams will take 2 adjacent words and put
> them
> together, while NGram* stuff will take a single word and chop it up in
> sequences
> of one or more characters/letters.
>
> If you are stuck with auto-complete stuff, consider
> http://sematext.com/products/autocomplete/index.html
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: openvictor Open 
> > To: solr-user@lucene.apache.org
> > Sent: Thu, February 3, 2011 10:15:47 AM
> > Subject: Re: Using terms and N-gram
> >
> > Thank you, I will do that and hopefuly it will be handy !
> >
> > But can someone  explain me difference between CommonGramFIlterFactory et
> > NGramFilterFactory ?  ( Maybe the solution is there)
> >
> > Thank you all,
> > best  regards
> >
> > 2011/2/3 Grijesh 
> >
> > >
> > >  Use analysis.jsp to see what happening at index time and query time
>  with
> > > your
> > > input data.You can use highlighting to see if match  found.
> > >
> > > -
> > > Thanx:
> > > Grijesh
> > > http://lucidimagination.com
> > > --
> > > View this message in  context:
> > >
> >
> http://lucene.472066.n3.nabble.com/Using-terms-and-N-gram-tp2410938p2411244.html
> > >  Sent from the Solr - User mailing list archive at Nabble.com.
> > >
> >
>


Re: prices

2011-02-04 Thread Dennis Gearon
That's a good idea, Yonik. So, fields that aren't stored don't get displayed, 
so 
the float field in the schema never gets seen by the user. Good, I like it.

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Yonik Seeley 
To: solr-user@lucene.apache.org
Sent: Fri, February 4, 2011 10:49:42 AM
Subject: Re: prices

On Fri, Feb 4, 2011 at 12:56 PM, Dennis Gearon  wrote:
> Using solr 1.4.
>
> I have a price in my schema. Currently it's a tfloat. Somewhere along the way
> from php, json, solr, and back, extra zeroes are getting truncated along with
> the decimal point for even dollar amounts.
>
> So I have two questions, neither of which seemed to be findable with google.
>
> A/ Any way to keep both zeroes going into a float field? (In the analyzer, with
> XML output, the values are shown with 1 zero)
> B/ Can strings be used in range queries like a float and work well for prices?

You could do a copyField into a stored string field and use the tfloat
(or tint and store cents)
for range queries, searching, etc, and the string field just for display.

-Yonik
http://lucidimagination.com




>
>  Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a 
>better
> idea to learn from others’ mistakes, so you do not have to make them yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>



Re: Performance optimization of Proximity/Wildcard searches

2011-02-04 Thread Salman Akram
I know, so we are not really using it for regular warm-ups (in any case the
index is updated on an hourly basis). I just tried it a few times to compare
results. The issue is I am not even sure warming up is useful with such
regular updates.
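
For reference, warm-up queries can also be wired into solrconfig.xml itself
with a newSearcher listener (the queries below are placeholders), so each
hourly commit re-warms the new searcher before it serves traffic:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- replace with frequent queries mined from your logs -->
    <lst><str name="q">some frequent query</str><str name="rows">10</str></lst>
    <lst><str name="q">another frequent query</str></lst>
  </arr>
</listener>

Whether this pays off with hourly updates is exactly the open question here;
it only helps if the warmed entries survive long enough to be hit.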



On Fri, Feb 4, 2011 at 5:16 PM, Otis Gospodnetic  wrote:

> Salman,
>
> I only skimmed your email, but wanted to say that this part sounds a little
> suspicious:
>
> > Our warm up script currently  executes all distinct queries in our logs
> > having count > 5. It was run  yesterday (with all the indexing update
> every
>
> It sounds like this will make warmup take a long time, assuming you have
> more than a handful of distinct queries in your logs.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Salman Akram 
> > To: solr-user@lucene.apache.org; t...@statsbiblioteket.dk
> > Sent: Tue, January 25, 2011 6:32:48 AM
> > Subject: Re: Performance optimization of Proximity/Wildcard searches
> >
> > By warmed index do you only mean warming the SOLR cache or the OS cache? As I
> > said, our index is updated every hour so I am not sure how much the SOLR
> > cache would be helpful, but the OS cache should still be helpful, right?
> >
> > I haven't compared the results with a proper script, but from manual testing
> > here are some of the observations.
> >
> > 'Recent' queries which are in cache of course return immediately (only if
> > they are exactly the same - even if they took 3-4 mins the first time). I
> > will need to test how many recent queries stay in cache, but still this would
> > work only for very common queries. Users can run different queries and I want
> > at least those to be at an 'acceptable' level (5-10 secs) even if not very
> > fast.
> >
> > Our warm up script currently executes all distinct queries in our logs
> > having count > 5. It was run yesterday (with all the indexing updates every
> > hour after that) and today when I executed some of the same queries again
> > their time seemed a little less (around 15-20%); I am not sure if this means
> > anything. However, their time is still not acceptable.
> >
> > What do you think is the best way to compare results? First run all the warm
> > up queries and then execute the same ones randomly and compare?
> >
> > We are using a Windows server; would it make a big difference if we moved to
> > Linux? Our load is not high but some queries are really complex.
> >
> > Also I was hoping to move to SSD last, after trying out all software
> > options. Is it an agreed fact that on large indexes (which don't fit in
> > RAM) proximity/wildcard/phrase queries (on common words) will be slow, and
> > can only be improved by cache warm-up and better hardware? Otherwise, with
> > an index of around 150GB, will such queries take more than a min?
> >
> > If that's the case, I know this question is very subjective, but if a
> > single query takes 2 min on SAS 10K RPM, what would its approx time be on a
> > good SSD (everything else the same)?
> >
> > Thanks!
> >
> >
> > On Tue, Jan 25,  2011 at 3:44 PM, Toke Eskildsen
> wrote:
> >
> > > On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:
> > > > Cache warming is a good option too but the index gets updated every hour
> > > > so not sure how much that would help.
> > >
> > > What is the time difference between queries with a warmed index and a
> > > cold one? If the warmed index performs satisfactorily, then one answer is
> > > to upgrade your underlying storage. As always for IO-caused performance
> > > problems in Lucene/Solr-land, SSD is the answer.
> > >
> > >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


Re: Performance optimization of Proximity/Wildcard searches

2011-02-04 Thread Salman Akram
Well, I assume many people out there have indexes larger than 100GB, and
normally you will not have more than 32GB or 64GB of RAM!

As I mentioned, the queries are mostly phrase, proximity, wildcard, and
combinations of these.

What exactly do you mean by distribution of documents? In this index our
documents are no more than a few hundred KBs on average (file system size)
and there are around 14 million documents. 80% of the index size is taken up
by the positions file. I am not sure if this is what you asked?

On Fri, Feb 4, 2011 at 5:19 PM, Otis Gospodnetic  wrote:

> Hi,
>
>
> > Sharding is an option too but that too comes with limitations, so I want to
> > keep that as a last resort, but I think there must be other things, because
> > 150GB is not too big for one drive/server with 32GB RAM.
>
> Hmm what makes you think 32 GB is enough for your 150 GB index?
> It depends on queries and distribution of matching documents, for example.
> What's yours like?
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Salman Akram 
> > To: solr-user@lucene.apache.org
> > Sent: Tue, January 25, 2011 4:20:34 AM
> > Subject: Performance optimization of Proximity/Wildcard searches
> >
> > Hi,
> >
> > I am facing performance issues with three types of queries (and their
> > combinations). Some of the queries take more than 2-3 mins. Index size is
> > around 150GB.
> >
> >    - Wildcard
> >    - Proximity
> >    - Phrases (with common words)
> >
> > I know CommonGrams and stop words are a good way to resolve such issues, but
> > they don't fulfill our functional requirements (CommonGrams seem to have
> > issues with phrase proximity, stop words have issues with exact match, etc.).
> >
> > Sharding is an option too but that too comes with limitations, so I want to
> > keep that as a last resort, but I think there must be other things, because
> > 150GB is not too big for one drive/server with 32GB RAM.
> >
> > Cache warming is a good option too but the index gets updated every hour, so
> > I am not sure how much that would help.
> >
> > What are the other main tips that can help with performance optimization of
> > the above queries?
> >
> > Thanks
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


NullPointerException on queries to new 3rd core

2011-02-04 Thread Alex Thurlow
I just moved to a multi core solr instance a few weeks ago, and it's 
been working great.  I'm trying to add a 3rd core and I can't query 
against it though.


I'm running 1.4.1 (and tried 1.4.0) with the spatial search plugin.

This is the section in solr.xml






I've removed the index dir and completely rebuilt all three cores from 
scratch.  I can query the old ones, but any query against the new one 
gives me this error:


HTTP ERROR: 500
null

java.lang.NullPointerException
at org.apache.solr.request.XMLWriter.writePrim(XMLWriter.java:761)
at org.apache.solr.request.XMLWriter.writeStr(XMLWriter.java:619)
at org.apache.solr.schema.TextField.write(TextField.java:45)
at org.apache.solr.schema.SchemaField.write(SchemaField.java:108)
at org.apache.solr.request.XMLWriter.writeDoc(XMLWriter.java:311)
at org.apache.solr.request.XMLWriter$3.writeDocs(XMLWriter.java:483)
at org.apache.solr.request.XMLWriter.writeDocuments(XMLWriter.java:420)
at org.apache.solr.request.XMLWriter.writeDocList(XMLWriter.java:457)
at org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:520)
at org.apache.solr.request.XMLWriter.writeResponse(XMLWriter.java:130)
at 
org.apache.solr.request.XMLResponseWriter.write(XMLResponseWriter.java:34)
at 
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:325)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at 
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)

at org.mortbay.jetty.Server.handle(Server.java:285)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)

at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at 
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)


I'm not finding any reason why this should be happening.


Re: phrase, individual term, prefix, fuzzy and stemming search

2011-02-04 Thread Jay Hill
You mentioned that dismax does not support wildcards, but edismax does. Not
sure if dismax would have solved your other problems, or whether you just
had to shift gears because of the wildcard issue, but you might want to have
a look at edismax.

-Jay
http://www.lucidimagination.com
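
As a sketch (reusing the field names from the question below, with
illustrative boosts), an edismax request can pass wildcard and fuzzy syntax
straight through while still searching several fields:

defType=edismax
q=deim~0.6 moer*
qf=title_display^2.0 actors directors

That keeps the query string close to what the user typed, instead of
hand-building the long OR query described below.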


On Mon, Jan 31, 2011 at 2:22 PM, cyang2010  wrote:

>
> My current project has the requirement to support search when a user inputs
> any number of terms across a few index fields (movie title, actor, director).
>
> In order to maximize results, I plan to support all the searches listed in
> the subject: phrase, individual term, prefix, fuzzy and stemming.  Of
> course, relevance scoring in the right order is also important.
>
> I have considered using the dismax query.  However, it does not support prefix
> queries.  I am not sure if it supports fuzzy queries; my guess is it does not.
>
> Therefore, I still need to use the standard query parser.  For example, if
> someone searches "deim moer" (a typo for "demi moore"), I compare the phrase
> and terms with each searchable field (title, actor, director):
>
>
> title_display: "deim moer"~30 actors: "deim moer"~30 directors: "deim
> moer"~30<--  OR
>
> title_display: deim<-- OR
> actors: deim
> directors: deim
>
> title_display: deim*   <-- OR
> actors: deim*
> directors: deim*
>
> title_display: deim~0.6   <-- OR
> actors: deim~0.6
> directors: deim~0.6
>
> title_display: moer<-- OR
> actors: moer
> directors: moer
>
> title_display: moer*   <-- OR
> actors: moer*
> directors: moer*
>
> title_display: moer~0.6<-- OR
> actors: moer~0.6
> directors: moer~0.6
>
> The Solr relevance score is the sum over all those ORs.  That way, I can make
> sure the relevance scores are in order.  For example, an exact match ("deim
> moer") will match the phrase, term, prefix and fuzzy queries all at the same
> time.  Therefore, it will score higher than input text that only matches a
> term, prefix or fuzzy query. At the same time, I can apply a boost to a
> particular search field if the requirements call for it.
>
>
> Does this sound right to you?  Are there better ways to achieve the same
> thing?
> My concern is that my query is not going to perform well, since it tries to do
> too much.  But isn't that what people want (maximized results) when they
> just type in a few search words?
>
> Another question: can I combine the results of two queries?
> For example, first I query for phrase and term matches, next I query for
> prefix matches.  Can I just append the results for the prefix match to those
> for the phrase/term match?  I thought the two queries have different
> queryNorms, so the scores are not comparable to each other and cannot be
> combined.  Is that correct?
>
>
> Thanks.  I'd love to hear what your thoughts are.
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/phrase-inidividual-term-prefix-fuzzy-and-stemming-search-tp239p239.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


WordDelimiterFilterFactory

2011-02-04 Thread John kim
If I use WordDelimiterFilterFactory during indexing and at query time,
will a search for "cls500" find "cls 500" and "cls500x"?  If so, will
it find and score exact matches higher?  If not, how do you get exact
matches to display first?


Re: WordDelimiterFilterFactory

2011-02-04 Thread Jay Hill
You can always try something like this out in the analysis.jsp page,
accessible from the Solr Admin home. Check out that page and see how it
allows you to enter text to represent what was indexed, and text for a
query. You can then see if there are matches. Very handy to see how the
various filters in a field type act on text. Make sure to check "verbose
output" for both index and query.

For this specific issue, yes, a query for "cls500" will match both of those
examples.

To get the exact match to score higher:
- create a text field (or a custom type that uses the
WordDelimiterFilterFactory) (let's name the field "foo")
- create a string field  (let's name it "foo_string")
- create a "copyField" with the source being "foo" and the dest being
"foo_string".
- use dismax (or edismax) to search both of those fields

http://localhost:8983/solr/select/?q=cls500&defType=edismax&qf=foo+foo_string

This should score the string field higher, but you could also add a boost to
it to make sure:

http://localhost:8983/solr/select/?q=cls500&defType=edismax&qf=foo+foo_string^4.0

-Jay
http://lucidimagination.com
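
A minimal schema.xml sketch of that setup, assuming an existing "text" field
type whose analyzer includes WordDelimiterFilterFactory:

<field name="foo" type="text" indexed="true" stored="true"/>
<!-- exact-match copy; must be indexed so it can be listed in qf -->
<field name="foo_string" type="string" indexed="true" stored="false"/>
<copyField source="foo" dest="foo_string"/>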


On Fri, Feb 4, 2011 at 4:25 PM, John kim  wrote:

> If I use WordDelimiterFilterFactory during indexing and at query time,
> will a search for "cls500" find "cls 500" and "cls500x"?  If so, will
> it find and score exact matches higher?  If not, how do you get exact
> matches to display first?
>


Re: What is the best protocol for data transfer rate HTTP or RMI?

2011-02-04 Thread Otis Gospodnetic
Hi Gustavo,

I think none of the answers I could give you would be valuable to you now, 
because they would be from circa 2007 or 2008.  We didn't use Solr, just Lucene.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Gustavo Maia 
> To: solr-user@lucene.apache.org
> Sent: Fri, February 4, 2011 10:15:09 AM
> Subject: Re: What is the best protocol for data transfer rate HTTP or RMI?
> 
> Hi Otis,
>
> Hello,
>
>   You have many documents, 2 billion.  Could you explain to me how yours is
> set up?
>
> Mine is defined as follows, but using Lucene.
> I have 3 machines, each machine with 6 HDs. Each HD holds an index fragment
> of 10GB, so I have 3 search servers.  Each server uses the Lucene class
> ParallelSearch over its 6 HDs and is published using the class
> RemoteSearch.
>   My client connects to these three machines using RMI. Everything is done in
> Lucene, using the classes it provides.
>
>    Please explain how you did the distribution of your index. How many HDs do
> you use per machine? What is the maximum index size you use per HD? Are
> you using SOLR or Lucene? How many instances of the SOLR server do you have on
> each machine?
>
> Sorry for so many questions.
>
> Gustavo Maia
> 
> 
> 
> 2011/2/4 Otis Gospodnetic 
> 
> >  Gustavo,
> >
> > I haven't used RMI in 5 years, but last time I used it I remember it being
> > problematic - this is in the context of Lucene-based search involving some 40
> > different shards/servers, high query rates, and some 2 billion documents,
> > if I remember correctly.  I remember us wanting to get away from RMI to
> > something simpler, less problematic, more HTTP-like.
> >
> >  Otis
> > 
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene  ecosystem search :: http://search-lucene.com/
> >
> >
> >
> > - Original  Message 
> > > From: Gustavo Maia 
> > > To: solr-user@lucene.apache.org
> >  > Sent: Thu, February 3, 2011 1:05:16 PM
> > > Subject: What is the  best protocol for data transfer rate HTTP or RMI?
> > >
> > >  Hello,
> > >
> > >
> > >
> > > I am doing a comparative study between Lucene and Solr and wish to obtain
> > > more concrete data on the data transfer using the Lucene RemoteSearch that
> > > uses RMI and the data transfer of SOLR that uses the HTTP protocol.
> > >
> > >
> >  >
> > >
> > > Gustavo Maia
> > >
> >
> 
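
For comparison, the HTTP-side counterpart of the RMI ParallelSearch/RemoteSearch
setup described above is Solr's distributed search, where one node fans the
request out to the others via the shards parameter (hostnames below are made
up):

http://server1:8983/solr/select?q=*:*&shards=server1:8983/solr,server2:8983/solr,server3:8983/solr

The receiving node plays the aggregator role that the RMI client plays in the
Lucene setup, and all shard traffic goes over plain HTTP.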


Re: Highlighting with/without Term Vectors

2011-02-04 Thread Otis Gospodnetic
Hi Salman,

Ah, so in the end you *did* have TV enabled on one of your fields! :) (I think 
this was a problem we were trying to solve a few weeks ago here)

How many docs you have in the index doesn't matter here - only the N
docs/fields that you need to display on a page of N results need to be
reanalyzed for highlighting purposes. So follow Grant's advice: make a small
index without TV, and compare highlighting speed with and without TV.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
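
For context, term vectors are a per-field setting in schema.xml, so the
comparison amounts to indexing the same sample with and without the three
attributes below (the field mirrors the "Contents" field mentioned; the type
name is an assumption):

<field name="contents" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

Dropping termVectors/termPositions/termOffsets and reindexing gives the
TV-less index to benchmark highlighting against.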



- Original Message 
> From: Salman Akram 
> To: solr-user@lucene.apache.org
> Sent: Fri, February 4, 2011 8:03:06 AM
> Subject: Re: Highlighting with/without Term Vectors
> 
> Basically Term Vectors are only on one main field, i.e. Contents. The average
> size of each document would be a few KBs, but there are around 130 million
> documents, so what do you suggest now?
> 
> On Fri, Feb 4, 2011 at 5:24 PM, Otis  Gospodnetic  >  wrote:
> 
> > Salman,
> >
> > It also depends on the size of your documents.  Re-analyzing 20 fields of 500
> > bytes each will be a lot faster than re-analyzing 20 fields of 50 KB each.
> >
> > Otis
> > 
> > Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> >
> >
> > - Original  Message 
> > > From: Grant Ingersoll 
> > > To: solr-user@lucene.apache.org
> >  > Sent: Wed, January 26, 2011 10:44:09 AM
> > > Subject: Re:  Highlighting with/without Term Vectors
> > >
> > >
> > > On  Jan 24, 2011, at 2:42 PM, Salman Akram wrote:
> > >
> > > >  Hi,
> > > >
> > > > Does anyone have any benchmarks for how much highlighting speeds up with
> > > > Term Vectors (compared to without them)? e.g. if highlighting on 20
> > > > documents takes 1 sec with Term Vectors, any idea how long it will take
> > > > without them?
> > > >
> > > > I need to know since the index used for highlighting has a TVF file of
> > > > around 450GB (approx 65% of total index size), so I am trying to see
> > > > whether decreasing the index size by dropping the TVF would be more
> > > > helpful for performance (less RAM, should be good for I/O too I guess)
> > > > or whether keeping it is still better?
> > > >
> > > > I know the best way is to try it out, but indexing takes a very long
> > > > time, so I am trying to see whether it's even worth it or not.
> >  >
> > >
> > > Try testing  on a smaller set.  In  general, you are saving the process of
> > >re-analyzing  the  content, so, to some extent it is going to be dependent
> > on how
> >  >fast your  analyzer chain is.  At the size you are at, I don't  know if
> > storing
> > >TVs is  worth  it.
> >
> 
> 
> 
> -- 
> Regards,
> 
> Salman Akram
> 


Re: Performance optimization of Proximity/Wildcard searches

2011-02-04 Thread Otis Gospodnetic
Salman,

Warming up may be useful if your caches are getting decent hit ratios. Plus,
you are warming up the OS cache when you warm up.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Salman Akram 
> To: solr-user@lucene.apache.org
> Sent: Fri, February 4, 2011 3:33:41 PM
> Subject: Re: Performance optimization of Proximity/Wildcard searches
> 
> I know, so we are not really using it for regular warm-ups (in any case the
> index is updated on an hourly basis). I just tried it a few times to compare
> results. The issue is I am not even sure warming up is useful with such
> regular updates.
> 
> 
> 
> On Fri, Feb 4, 2011 at 5:16 PM, Otis  Gospodnetic  >  wrote:
> 
> > Salman,
> >
> > I only skimmed your email, but wanted  to say that this part sounds a little
> > suspicious:
> >
> > >  Our warm up script currently  executes all distinct queries in our  logs
> > > having count > 5. It was run  yesterday (with all the  indexing update
> > every
> >
> > It sounds like this will make  warmup take a long time, assuming you
> > have
> > more than a  handful distinct queries in your logs.
> >
> > Otis
> > 
> >  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem  search :: http://search-lucene.com/
> >
> >
> >
> > - Original  Message 
> > > From: Salman Akram 
> >  > To: solr-user@lucene.apache.org; t...@statsbiblioteket.dk
> > >  Sent: Tue, January 25, 2011 6:32:48 AM
> > > Subject: Re: Performance  optimization of Proximity/Wildcard searches
> > >
> > > By warmed  index you only mean warming the SOLR cache or OS cache? As I
> >   said
> > > our index is updated every hour so I am not sure how much SOLR  cache
> >  would
> > > be helpful but OS cache should still be  helpful, right?
> > >
> > > I  haven't compared the results  with a proper script but from manual
> >  testing
> > > here are  some of the observations.
> > >
> > > 'Recent' queries which  are  in cache of course return immediately (only
> > if
> > >  they are exactly same - even  if they took 3-4 mins first time). I  will
> > need
> > > to test how many recent  queries stay in  cache but still this would work
> > only
> > > for very common  queries.  User can run different queries and I want at
> >  least
> > > them to be at 'acceptable'  level (5-10 secs) even if  not very fast.
> > >
> > > Our warm up script currently   executes all distinct queries in our logs
> > > having count > 5. It  was run  yesterday (with all the indexing update
> > every
> > >  hour after that) and today when  I executed some of the same  queries
> > again
> > > their time seemed a little less  (around  15-20%), I am not sure if this
> > means
> > > anything. However,  still their  time is not acceptable.
> > >
> > > What do you  think is the best way to compare  results? First run all the
> >  warm
> > > up queries and then execute same randomly and   compare?
> > >
> > > We are using Windows server, would it make a  big difference if  we move
> > to
> > > Linux? Our load is not  high but some queries are really  complex.
> > >
> > > Also I  was hoping to move to SSD in last after trying out all  software
> >  > options. Is that an agreed fact that on large indexes (which don't   fit
> > in
> > > RAM) proximity/wildcard/phrase queries (on common  words) would be slow
> >  and
> > > it can be only improved by  cache warm up and better hardware? Otherwise
> >  with
> > > an  index of around 150GB such queries will take more than a  min?
> >  >
> > > If that's the case I know this question is very subjective but  if a
> >  single
> > > query takes 2 min on SAS 10K RPM what  would its approx time be on a  good
> > SSD
> > > (everything  else same)?
> > >
> > > Thanks!
> > >
> > >
> >  > On Tue, Jan 25,  2011 at 3:44 PM, Toke Eskildsen
> > wrote:
> >  >
> > > >  On Tue, 2011-01-25 at 10:20 +0100, Salman Akram  wrote:
> > > > > Cache  warming is a good option too but the  index get updated every
> > hour
> > > >  so
> > >  > > not sure how much would that help.
> > > >
> > > >  What is the  time difference between queries with a warmed index and  a
> > > > cold one? If  the warmed index performs satisfactory,  then one answer
> > is
> > > > to upgrade  your underlying  storage. As always for IO-caused
> > performance
> > > > problem  in  Lucene/Solr-land, SSD is the answer.
> > > >
> > >  >
> > >
> > >
> > > --
> > > Regards,
> >  >
> > > Salman Akram
> > >
> >
> 
> 
> 
> -- 
> Regards,
> 
> Salman Akram
> 


Re: Performance optimization of Proximity/Wildcard searches

2011-02-04 Thread Otis Gospodnetic
Heh, I'm not sure if this is valid thinking. :)

By *matching* doc distribution I meant: what proportion of your millions of
documents actually ever gets matched, and then how many of those make it to
the UI. If you have 1000 queries in a day and they all end up matching only 3
of your docs, the system will need less RAM than a system where 1000 queries
match 5 different docs.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Salman Akram 
> To: solr-user@lucene.apache.org
> Sent: Fri, February 4, 2011 3:38:55 PM
> Subject: Re: Performance optimization of Proximity/Wildcard searches
> 
> Well, I assume many people out there have indexes larger than 100GB, and
> normally you will not have more than 32GB or 64GB of RAM!
>
> As I mentioned, the queries are mostly phrase, proximity, wildcard, and
> combinations of these.
>
> What exactly do you mean by distribution of documents? In this index our
> documents are no more than a few hundred KBs on average (file system size)
> and there are around 14 million documents. 80% of the index size is taken up
> by the positions file. I am not sure if this is what you asked?
> 
> On Fri, Feb 4, 2011 at 5:19 PM, Otis Gospodnetic  >  wrote:
> 
> > Hi,
> >
> >
> > > Sharding is an  option  too but that too comes with limitations so want to
> > > keep that as a  last  resort but I think there must be other things coz
> >  150GB
> > > is not too big for  one drive/server with 32GB  Ram.
> >
> > Hmm what makes you think 32 GB is enough for your 150  GB index?
> > It depends on queries and distribution of matching documents,  for example.
> > What's yours like?
> >
> > Otis
> >  
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem  search :: http://search-lucene.com/
> >
> >
> >
> > - Original  Message 
> > > From: Salman Akram 
> >  > To: solr-user@lucene.apache.org
> >  > Sent: Tue, January 25, 2011 4:20:34 AM
> > > Subject: Performance  optimization of Proximity/Wildcard searches
> > >
> > >  Hi,
> > >
> > > I am facing performance issues in three types of  queries (and  their
> > > combination). Some of the queries take  more than 2-3 mins. Index size  is
> > > around 150GB.
> >  >
> > >
> > >- Wildcard
> > > -  Proximity
> > >- Phrases (with common  words)
> > >
> > > I know CommonGrams and  Stop words are a  good way to resolve such issues
> > but
> > > they don't fulfill  our  functional requirements (Common Grams seem to
> > have
> >  > issues with phrase  proximity, stop words have issues with exact  match
> > etc).
> > >
> > > Sharding is an  option too  but that too comes with limitations so want to
> > > keep that as a  last  resort but I think there must be other things coz
> >  150GB
> > > is not too big for  one drive/server with 32GB  Ram.
> > >
> > > Cache warming is a good option too but  the  index get updated every hour
> > so
> > > not sure how much would  that  help.
> > >
> > > What are the other main tips that can  help in performance  optimization
> > of
> > > the above  queries?
> > >
> > > Thanks
> > >
> > > --
> >  > Regards,
> > >
> > > Salman Akram
> >  >
> >
> 
> 
> 
> -- 
> Regards,
> 
> Salman Akram
> 


Re: geodist and spacial search

2011-02-04 Thread Bill Bell
Why not just:

q=*:*
fq={!bbox}
sfield=store
pt=49.45031,11.077721
d=40
fl=store
sort=geodist() asc


http://localhost:8983/solr/select?q=*:*&sfield=store&pt=49.45031,11.077721&d=40&fq={!bbox}&sort=geodist%28%29%20asc

That will sort, and filter up to 40km.

No need for the:

fq={!func}geodist()
sfield=store
pt=49.45031,11.077721


Bill
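
If the filter should be an exact radius rather than a bounding box, the same
request works with the {!geofilt} parser mentioned in the quoted thread below,
at somewhat higher cost than bbox:

q=*:*
fq={!geofilt}
sfield=store
pt=49.45031,11.077721
d=40
fl=store
sort=geodist() asc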




On 2/4/11 4:30 AM, "Eric Grobler"  wrote:

>Hi Grant,
>
>Thanks for the tip
>This seems to work:
>
>q=*:*
>fq={!func}geodist()
>sfield=store
>pt=49.45031,11.077721
>
>fq={!bbox}
>sfield=store
>pt=49.45031,11.077721
>d=40
>
>fl=store
>sort=geodist() asc
>
>
>On Thu, Feb 3, 2011 at 7:46 PM, Grant Ingersoll 
>wrote:
>
>> Use a filter query?  See the {!geofilt} stuff on the wiki page.  That gives
>> you your filter to restrict down your result set, then you can sort by exact
>> distance to get your sort of just those docs that make it through the
>> filter.
>>
>>
>> On Feb 3, 2011, at 10:24 AM, Eric Grobler wrote:
>>
>> > Hi Erick,
>> >
>> > Thanks, I saw that example, but I am trying to sort by distance AND specify
>> > the max distance in 1 query.
>> >
>> > The reason is:
>> > running bbox on 2 million documents with a 20km distance takes only 200ms.
>> > Sorting 2 million documents by distance takes over 1.5 seconds!
>> >
>> > So it will be much faster for Solr to first filter the 20km documents and
>> > then to sort them.
>> >
>> > Regards
>> > Ericz
>> >
>> > On Thu, Feb 3, 2011 at 1:27 PM, Erick Erickson
>>> >wrote:
>> >
>> >> Further down that very page ...
>> >>
>> >> Here's an example of sorting by distance ascending:
>> >>
>> >>   ...&q=*:*&sfield=store&pt=45.15,-93.85&sort=geodist() asc
>> >>
>> >> http://localhost:8983/solr/select?wt=json&indent=true&fl=name,store&q=*:*&sfield=store&pt=45.15,-93.85&sort=geodist()%20asc
>> >>
>> >>
>> >>
>> >>
>> >> The key is just the &sort=geodist(); I'm pretty sure that's independent of
>> >> the bbox, but I could be wrong.
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Wed, Feb 2, 2011 at 11:18 AM, Eric Grobler <
>> impalah...@googlemail.com
>> >>> wrote:
>> >>
>> >>> Hi
>> >>>
>> >>> In http://wiki.apache.org/solr/SpatialSearch
>> >>> there is an example of a bbox filter and a geodist function.
>> >>>
>> >>> Is it possible to do a bbox filter and sort by distance - combine
>>the
>> >> two?
>> >>>
>> >>> Thanks
>> >>> Ericz
>> >>>
>> >>
>>
>> --
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem docs using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>




UIMA Error

2011-02-04 Thread Darx Oman
Hi guys,
I'm trying to use the UIMA contrib, but I get the following error:

...
INFO: [] webapp=/solr path=/select
params={clean=false&commit=true&command=status&qt=/dataimport} status=0
QTime=0
05/02/2011 10:54:53 ص
org.apache.solr.uima.processor.UIMAUpdateRequestProcessor processText
INFO: Analazying text
05/02/2011 10:54:53 ص
org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
INFO: setting cat_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
05/02/2011 10:54:53 ص
org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
INFO: setting keyword_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
05/02/2011 10:54:53 ص
org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
INFO: setting concept_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
05/02/2011 10:54:53 ص
org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
INFO: setting entities_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
05/02/2011 10:54:53 ص
org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
INFO: setting lang_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
05/02/2011 10:54:53 ص
org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
INFO: setting oc_licenseID : g6h9zamsdtwhb93nc247ecrs
05/02/2011 10:54:53 ص WhitespaceTokenizer initialize
INFO: "Whitespace tokenizer successfully initialized"
05/02/2011 10:54:56 ص org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select
params={clean=false&commit=true&command=status&qt=/dataimport} status=0
QTime=0
05/02/2011 10:54:57 ص WhitespaceTokenizer typeSystemInit
INFO: "Whitespace tokenizer typesystem initialized"
05/02/2011 10:54:57 ص WhitespaceTokenizer process
INFO: "Whitespace tokenizer starts processing"
05/02/2011 10:54:57 ص WhitespaceTokenizer process
INFO: "Whitespace tokenizer finished processing"
05/02/2011 10:54:57 ص
org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl
callAnalysisComponentProcess(405)
SEVERE: Exception occurred
org.apache.uima.analysis_engine.AnalysisEngineProcessException
 at
org.apache.uima.annotator.calais.OpenCalaisAnnotator.process(OpenCalaisAnnotator.java:206)
 at
org.apache.uima.analysis_component.CasAnnotator_ImplBase.process(CasAnnotator_ImplBase.java:56)
 at
org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
 at
org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
 at
org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
 at
org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.(ASB_impl.java:409)
 at
org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342)
 at
org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267)
 at
org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
 at
org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:280)
 at
org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processText(UIMAUpdateRequestProcessor.java:122)
 at
org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:69)
 at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:75)
 at
org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:291)
 at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:626)
 at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:266)
 at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:185)
 at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:335)
 at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
 at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:374)
Caused by: java.net.UnknownHostException: api.opencalais.com
 at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:177)
 at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
 at java.net.Socket.connect(Socket.java:529)
 at java.net.Socket.connect(Socket.java:478)
 at sun.net.NetworkClient.doConnect(NetworkClient.java:163)
 at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
 at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
 at sun.net.www.http.HttpClient.(HttpClient.java:233)
 at sun.net.www.http.HttpClient.New(HttpClient.java:306)
 at sun.net.www.http.HttpClient.New(HttpClient.java:323)
 at
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:975)
 at
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:916)
 at
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:841)
 at
sun.net.www.protocol.http.HttpURLConnection.getOutputStr