Re: Distributed search component.

2011-05-13 Thread Rok Rejc
I am still fighting (after a month of doing other things) with the first
part of the problem. Any ideas?

Many thanks,
Rok

On Mon, Apr 4, 2011 at 9:06 AM, Rok Rejc  wrote:

> Hi all,
>
> I am trying to create a distributed search component in Solr, which is quite
> difficult (at least for me, because I am new to Solr and Java). Anyway, I
> have looked into the Solr source (FacetComponent, TermsComponent, ...) and created
> my own search component (it extends SearchComponent), but I still have two
> questions (for now):
>
> 1.) In the prepare method I have the following code:
>
> String shards = params.get(ShardParams.SHARDS);
> if (shards != null) {
>   List<String> lst = StrUtils.splitSmart(shards, ",", true);
>   rb.shards = lst.toArray(new String[lst.size()]);
>   rb.isDistrib = true;
> }
>
> If I remove the "rb.isDistrib = true;" line, the distributed methods are not
> called. But to set isDistrib my code must be in the
> "org.apache.solr.handler.component" package (because the field is not visible from
> outside). Is this the correct procedure/behaviour/design?
>
> 2.) Functions (process, distributedProcess, handleResponses...) are all
> called properly. I can read partial responses in handleResponses, but I
> don't know how to build the "final" response. I see that, for example,
> TermsComponent has a helper in the ResponseBuilder which collects all the
> terms. Is this the only way (editing the ResponseBuilder source), or can I
> achieve that without editing Solr's source?
>
> Many thanks,
>
> Rok
>
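A common pattern for question 2, which avoids editing ResponseBuilder, is to keep per-request state in the request context map instead, since component instances are shared across requests. A rough sketch (not an official recipe; the "mycomp" response key and the merge step are placeholders), with both methods living in the SearchComponent subclass:

  @Override
  public void handleResponses(ResponseBuilder rb, ShardRequest sreq) {
    // accumulate per-shard sections under a private key in the
    // per-request context (the component itself is a shared singleton)
    Map<Object, Object> ctx = rb.req.getContext();
    NamedList<Object> merged = (NamedList<Object>) ctx.get("mycomp.merged");
    if (merged == null) {
      merged = new NamedList<Object>();
      ctx.put("mycomp.merged", merged);
    }
    for (ShardResponse srsp : sreq.responses) {
      NamedList<?> partial =
          (NamedList<?>) srsp.getSolrResponse().getResponse().get("mycomp");
      // ... combine 'partial' into 'merged' here ...
    }
  }

  @Override
  public void finishStage(ResponseBuilder rb) {
    // emit the merged section once a late stage is reached
    if (rb.stage == ResponseBuilder.STAGE_GET_FIELDS) {
      rb.rsp.add("mycomp", rb.req.getContext().get("mycomp.merged"));
    }
  }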


Re: Facet Count Based on Dates

2011-05-13 Thread Jasneet Sabharwal

Hey Otis,

FieldCollapsing is again a feature of Solr 4.0; is anything possible using
the default features of Solr 3.1? By the way, how can I apply these patches to
Solr 3.1?

On 13-05-2011 12:10, Otis Gospodnetic wrote:

Jasneet,

Like in http://wiki.apache.org/solr/FieldCollapsing ?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 

From: Jasneet Sabharwal
To: solr-user@lucene.apache.org
Sent: Thu, May 12, 2011 10:14:46 AM
Subject: Re: Facet Count Based on Dates

Or is it possible to use a Group By query in Solr 3.1 like we do in SQL?
On 12-05-2011 19:37, Jasneet Sabharwal wrote:

Is it possible to use the features of 3.1 by default for my query?
On 12-05-2011 13:38, Grijesh wrote:

You can apply the patch for hierarchical faceting on Solr 3.1

-
Thanx:
Grijesh
www.gettinhahead.co.in
--
View this message in context:
http://lucene.472066.n3.nabble.com/Facet-Count-Based-on-Dates-tp2922371p2930924.html
Sent from the Solr - User mailing list archive at Nabble.com.





--
Regards

Jasneet Sabharwal
Software Developer
NextGen Invent Corporation
+91-9871228582





--
Regards

Jasneet Sabharwal
Software Developer
NextGen Invent Corporation
+91-9871228582



Results with and without whitespace (soccer club and soccerclub)

2011-05-13 Thread roySolr
Hello,

My index looks like this:

Soccer club
Football club
etc.

Now I want a user to be able to search for "soccer club" and "soccerclub".
"Soccer club" works, but without the whitespace it's not a match. How can I
fix this? What should my configuration look like? Is there a filter or something?


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Results-with-and-without-whitspace-soccer-club-and-soccerclub-tp2934742p2934742.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH entity threads (multithreading)

2011-05-13 Thread Jamroz Marcin
I am using Solr 3.1, but tried it with the 4.0 beta too.


Does it depend on the batchSize argument? I also have table relationships;
I tried without them, same effect.

Is there a full-featured example of how to use this threads parameter?
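For reference, the parameter goes on the entity element itself. A minimal sketch, assuming a JDBC root entity (the entity name, data source, and query are made up):

  <entity name="item" threads="4" dataSource="db"
          query="select id, name from item">
    <field column="id" name="id"/>
    <field column="name" name="name"/>
  </entity>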



Re: Results with and without whitespace (soccer club and soccerclub)

2011-05-13 Thread Paul Libbrecht
Roy,

I believe the way to do that is to use a compound-words analyzer.
The issue: you need to know the decompositions in advance.
Compound words are pretty common in German, for example, and I'd wish research
efforts maintained a compound-words corpus, but I have not seen one yet.

paul


On 13 May 2011 at 10:28, roySolr wrote:

> Hello,
> 
> My index looks like this:
> 
> Soccer club
> Football club
> etc.
> 
> Now I want a user to be able to search for "soccer club" and "soccerclub".
> "Soccer club" works, but without the whitespace it's not a match. How can I
> fix this? What should my configuration look like? Is there a filter or
> something?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Results-with-and-without-whitspace-soccer-club-and-soccerclub-tp2934742p2934742.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Field collapsing classloading issues

2011-05-13 Thread karan singh

I applied the Solr field collapsing patch to Solr 3.1.0.

I'm not really sure about what to add in solrconfig.xml. Right now I've added
the following:

   true
   queryComponent

On running this, I get the following error:

SEVERE: org.apache.solr.common.SolrException: Error loading class 'solr.CollapseComponent'

Any ideas?

Re: DIH help request: nested xml entities and xpath

2011-05-13 Thread Gora Mohanty
On Fri, May 13, 2011 at 10:18 AM, Ashique  wrote:
> Hi All,
>
> I am a Java/J2EE programmer and very new to Solr. I would like to index a
> table in a PostgreSQL database in Solr, then search the records from a
> GUI (JSP page) and show the results in tabular form. Could anyone help
> me out with simple sample code.
[...]

This is too broad a question. Please start out by looking
at the extensive Solr documentation:
* Complete list: http://wiki.apache.org/solr/FrontPage
* Initial tutorial: http://lucene.apache.org/solr/tutorial.html
  It is a good idea to first ensure that you are able to get
  this working.
* If you are using Java, this should be of interest:
  http://wiki.apache.org/solr/SolJava
* For easy data import from a database, you could consider
  using the DataImportHandler:
  http://wiki.apache.org/solr/DataImportHandler

You can ask here if you run into issues while trying these out.

Regards,
Gora


RE: Document match with no highlight

2011-05-13 Thread Pierre GOSSE
In WordDelimiterFilter the parameters catenateNumbers, catenateWords, and
catenateAll are set to 1. These parameters add overlapping tokens, which could
explain why you hit the bug described in the JIRA issue I mentioned.

As I understand WordDelimiterFilter:
"0176 R3 1.5 TO" should be tokenized with the token "R3" overlapping with "R" and
"3", and "15" overlapping with "1" and "5".

These parameters are set to 0 for the query analyzer, and setting them to 1 there
would not fix your problem unless you searched for "R3 1.5".

I think you have to either:
 - set these parameters to 0 in the index analyzer, but then your query won't match anymore,
 - wait for the fix to be released in a new Solr version,
 - use Solr trunk,
 - or backport the modifications to the lucene-highlighter version you use.

I did a backport for Solr 1.4.1 since I won't move to 3.0 for some time, so
please ask if you have questions about how to do this.

Pierre


-----Original Message-----
From: Phong Dais [mailto:phong.gd...@gmail.com]
Sent: Thursday, 12 May 2011 20:06
To: solr-user@lucene.apache.org
Subject: Re: Document match with no highlight

Hi,

I read the link provided and I'll need some time to digest what it is
saying.

Here's my "text" fieldtype.

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>
Also, I figured out what value of DOC_TEXT causes this issue to occur.
With a DOC_TEXT of (without the quotes):
"0176 R3 1.5 TO "

Searching for "3 1 15" returns a match with an "empty" highlight.
Searching for "3 1 15"~1 returns a match with a highlight.

Can anyone see anything that I'm missing?

Thanks,
P.


On Thu, May 12, 2011 at 12:27 PM, Pierre GOSSE wrote:

> > Since you're using the standard "text" field, this should NOT be your
> case.
>
> Sorry for the missing NOT in the previous phrase. Given what you said, you
> should not have the same issue, but still, it sounds very similar.
>
> Are you sure your fieldtype "text" has nothing special? A tokenizer or
> filter that could add some tokens to your indexed text but not to your query,
> for example a WordDelimiterFilter present in the index analyzer and not the query analyzer?
>
> Pierre
>
> -----Original Message-----
> From: Pierre GOSSE [mailto:pierre.go...@arisem.com]
> Sent: Thursday, 12 May 2011 18:21
> To: solr-user@lucene.apache.org
> Subject: RE: Document match with no highlight
>
> > In fact if I did "3 1 15"~1 I do get a snippet also.
>
> Strange, I had a very similar problem, but with overlapping tokens. Since
> you're using the standard "text" field, this should be your case.
>
> Maybe you could have a look at this issue, since it sounds very familiar to
> me:
> https://issues.apache.org/jira/browse/LUCENE-3087
>
> Pierre
>
> -----Original Message-----
> From: Phong Dais [mailto:phong.gd...@gmail.com]
> Sent: Thursday, 12 May 2011 17:26
> To: solr-user@lucene.apache.org
> Subject: Re: Document match with no highlight
>
> Hi,
>
> 
>
> The type "text" is the default one that came with the default Solr 1.4
> install without any modifications.
>
> If I remove the quotes I do get snippets.  In fact if I did "3 1 15"~1 I do
> get a snippet also.
>
> Hope that helps.
>
> P.
>
> On Thu, May 12, 2011 at 9:09 AM, Ahmet Arslan  wrote:
>
> > > URL:
> > > http://localhost:8983/solr/select?indent=on&version=2.2&q=DOC_TEXT%3A%223+1+15%22&fq=&start=0&rows=10&fl=DOC_TEXT%2Cscore&qt=standard&wt=standard&debugQuery=on&explainOther=&hl=on&hl.fl=DOC_TEXT&hl.maxAnalyzedChars=-1
> > >
> > > XML:
> > > <response>
> > > <lst name="responseHeader">
> > >   <int name="status">0</int>
> > >   <int name="QTime">19</int>
> > >   <lst name="params">
> > >     <str name="indent">on</str>
> > >     <str name="hl.fl">DOC_TEXT</str>
> > >     <str name="wt">standard</str>
> > >     <str name="hl.maxAnalyzedChars">-1</str>
> > >     <str name="hl">on</str>
> > >     <str name="rows">10</str>
> > >     <str name="version">2.2</str>
> > >     <str name="debugQuery">on</str>
> > >     <str name="fl">DOC_TEXT,score</str>
> > >     <str name="start">0</str>
> > >     <str name="q">DOC_TEXT:"3 1 15"</str>
> > >     <str name="qt">standard</str>
> > >   </lst>
> > > </lst>

How does Solr's MoreLikeThis component internally work to get results?

2011-05-13 Thread Gnanakumar
Hi,

I'm new to Apache Solr and am currently exploring/trying to make use of
MoreLikeThis as a search component (instead of the dedicated request handler).
I'm finding it difficult to understand clearly how this works internally to
get more-like-this results.

For example, I'm trying to search for the word java in a document field
named mytextcontentfield:

http://localhost/solr/core0/select/?q=mytextcontentfield:java&version=2.2&start=0&rows=10&indent=on&debugQuery=on&mlt=true&mlt.fl=mytextcontentfield

and I could see moreLikeThis in the XML response, with the unique keys of the
documents in the name attribute.

My question here is: how does Solr internally work/match to find
more-like-this documents based on the search keyword java? Any explanation
with a good example is appreciated.
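In broad strokes (this is Lucene's MoreLikeThis, which the Solr component wraps): it takes the terms of the matched document's fields named in mlt.fl, ranks them by a tf-idf style score, keeps the top "interesting" terms subject to thresholds such as mlt.mintf (minimum term frequency) and mlt.mindf (minimum document frequency), and runs those terms as a boosted boolean OR query; the hits of that query fill the moreLikeThis section. A sketch of the query with tuning parameters added (the values are illustrative; mlt.interestingTerms=details, which shows the chosen terms, works with the dedicated MoreLikeThisHandler):

  http://localhost/solr/core0/select/?q=mytextcontentfield:java&mlt=true&mlt.fl=mytextcontentfield&mlt.mintf=2&mlt.mindf=5&mlt.count=10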

Regards,
Gnanam



Re: Results with and without whitespace(soccer club and soccerclub)

2011-05-13 Thread Grijesh
What about SynonymFilterFactory?

-
Thanx: 
Grijesh 
www.gettinhahead.co.in 
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Results-with-and-without-whitespace-soccer-club-and-soccerclub-tp2934742p2934828.html
Sent from the Solr - User mailing list archive at Nabble.com.


Order of words in proximity search

2011-05-13 Thread Tor Henning Ueland
Hi,

The documentation does not (?) specify this, but it is still an interesting question.
Does the order of the words in a proximity search matter? And if it
does, is it possible to ignore the order?

I did not believe it did, but some tests against an ngram field do
give different results.

Examples:
"foo bar"~99 - 10 hits
"bar foo"~99 - 11 hits

-- 
Best regards
Tor Henning Ueland


Re: Document match with no highlight

2011-05-13 Thread Phong Dais
Pierre,

Merci beaucoup Pierre. :)

You saved me a lot of time and headache.

>As I understand WordDelimiterFilter:
> >"0176 R3 1.5 TO" should be tokenized with the token "R3" overlapping with "R"
> and "3", and "15" overlapping with "1" and "5"
>

>These parameters are set to 0 for the query analyzer, and setting them to 1 there
> would not fix your problem unless you searched for "R3 1.5".
>

You are correct.


>
> >I think you have to either:
> > - set these parameters to 0 in the index analyzer, but then your query won't match anymore,
> > - wait for the fix to be released in a new Solr version,
> > - use Solr trunk,
> > - or backport the modifications to the lucene-highlighter version you
> use.
>

For what I need, using 0 for the index should do the trick.  I did not want
my query to match.


> >I did a backport for Solr 1.4.1 since I won't move to 3.0 for some time,
> so please ask if you have questions about how to do this.
>

I don't anticipate the need for a backport, but is there any wiki out there
that outlines this process?

Regards,
Phong


>
> -----Original Message-----
> From: Phong Dais [mailto:phong.gd...@gmail.com]
> Sent: Thursday, 12 May 2011 20:06
> To: solr-user@lucene.apache.org
> Subject: Re: Document match with no highlight
>
> Hi,
>
> I read the link provided and I'll need some time to digest what it is
> saying.
>
> Here's my "text" fieldtype.
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>             protected="protwords.txt"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>             protected="protwords.txt"/>
>   </analyzer>
> </fieldType>
> Also, I figured out what value of DOC_TEXT causes this issue to occur.
> With a DOC_TEXT of (without the quotes):
> "0176 R3 1.5 TO "
>
> Searching for "3 1 15" returns a match with an "empty" highlight.
> Searching for "3 1 15"~1 returns a match with a highlight.
>
> Can anyone see anything that I'm missing?
>
> Thanks,
> P.
>
>
> On Thu, May 12, 2011 at 12:27 PM, Pierre GOSSE  >wrote:
>
> > > Since you're using the standard "text" field, this should NOT be your
> > case.
> >
> > Sorry for the missing NOT in the previous phrase. Given what you said, you
> > should not have the same issue, but still, it sounds very similar.
> >
> > Are you sure your fieldtype "text" has nothing special? A tokenizer or
> > filter that could add some tokens to your indexed text but not to your
> query,
> > for example a WordDelimiterFilter present in the index analyzer and not the query analyzer?
> >
> > Pierre
> >
> > -----Original Message-----
> > From: Pierre GOSSE [mailto:pierre.go...@arisem.com]
> > Sent: Thursday, 12 May 2011 18:21
> > To: solr-user@lucene.apache.org
> > Subject: RE: Document match with no highlight
> >
> > > In fact if I did "3 1 15"~1 I do get a snippet also.
> >
> > Strange, I had a very similar problem, but with overlapping tokens. Since
> > you're using the standard "text" field, this should be your case.
> >
> > Maybe you could have a look at this issue, since it sounds very familiar to
> > me:
> > https://issues.apache.org/jira/browse/LUCENE-3087
> >
> > Pierre
> >
> > -----Original Message-----
> > From: Phong Dais [mailto:phong.gd...@gmail.com]
> > Sent: Thursday, 12 May 2011 17:26
> > To: solr-user@lucene.apache.org
> > Subject: Re: Document match with no highlight
> >
> > Hi,
> >
> > 
> >
> > The type "text" is the default one that came with the default Solr 1.4
> > install without any modifications.
> >
> > If I remove the quotes I do get snippets.  In fact if I did "3 1 15"~1 I do
> > get a snippet also.
> >
> > Hope that helps.
> >
> > P.
> >
> > On Thu, May 12, 2011 at 9:09 AM, Ahmet Arslan  wrote:
> >
> > > > URL:
> > > > http://localhost:8983/solr/select?indent=on&version=2.2&q=DOC_TEXT%3A%223+1+15%22&fq=&start=0&rows=10&fl=DOC_TEXT%2Cscore&qt=standard&wt=standard&debugQuery=on&explainOther=&hl=on&hl.fl=DOC_TEXT&hl.maxAnalyzedChars=-1
> > > >
> > > > XML:
> > > > <response>
> > > > <lst name="responseHeader">
> > > >   <int name="status">0</int>
> > > >   <int name="QTime">19</int>
> > > >   <lst name="params">
> > > >     <str name="indent">on</str>
> > > >     <str name="hl.fl">DOC_TEXT</str>
> > > >     <str name="wt">standard</str>
> > > >     <str name="hl.maxAnalyzedChars">-1</str>
> > > >     <str name="hl">on</str>
> > > >     <str name="rows">10</str>
> > > >     <str name="version">2.2</str>
> > > >     <str name="debugQuery">on</str>
> > > >     <str name="fl">DOC_TEXT,score</str>
> > > >     <str name="start">0</str>
> > > >     <str name="q">DOC_TEXT:"3 1 15"</str>
> > > >     <str name="qt">standard</str>
> > > >   </lst>
> > > > </lst>



Re: Huge performance drop in distributed search w/ shards on the same server/container

2011-05-13 Thread Grant Ingersoll
Is that 10 different Tomcat instances or are you using multicore?  How are you 
testing?

On May 13, 2011, at 6:08 AM, Frederik Kraus wrote:

> Hi, 
> 
> I'm having some serious problems scaling the following setup:
> 
> 48 CPU / Tomcat / ...
> 
> localhost/shard1
> ...
> localhost/shard10
> 
> When using all 10 shards in the query, the req/s drop down to about 300
> without fully utilizing CPU (60% idle) or RAM (disk I/O is zero; everything
> fits into RAM).
>
> When only querying one shard I get about 5k-6k req/s.
> 
> Are there any known limits and/or work-arounds?
> 
> Thanks,
> 
> Fred.


Grant Ingersoll
Join the LUCENE REVOLUTION
Lucene & Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org



Re: Results with and without whitespace(soccer club and soccerclub)

2011-05-13 Thread roySolr
Hmm, it's about 10,000 terms. It's possible, but not the best solution, I
think.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Results-with-and-without-whitespace-soccer-club-and-soccerclub-tp2934742p2934888.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Results with and without whitespace(soccer club and soccerclub)

2011-05-13 Thread Paul Libbrecht

Hey guys, keep a bit of the thread!

Roy, 

I'm afraid it's no different with the CompoundAnalyzer: it's all in memory.
Have you tried?

I sure wish such compound analysis could be done with a Lucene-powered
dictionary! That would rock.
That would rock.

paul


On 13 May 2011 at 11:57, Grijesh wrote:

> What about SynonymFilterFactory?


On 13 May 2011 at 13:03, roySolr wrote:

> Hmm, it's about 10,000 terms. It's possible, but not the best solution, I
> think.



Re: Results with and without whitespace (soccer club and soccerclub)

2011-05-13 Thread Markus Jelsma
Yes
http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenationCompoundWordTokenFilterFactory.html
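For illustration, a minimal sketch of wiring a decompounding filter into a field type. This uses the dictionary-based sibling of the filter linked above (simpler, since it needs no hyphenation rules file); the field type name and dictionary file are assumptions:

  <fieldType name="text_decompound" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- compound-dict.txt is a hypothetical file listing known word parts,
           e.g. "soccer" and "club", so "soccerclub" is split into both at index time -->
      <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
              dictionary="compound-dict.txt" minWordSize="5"
              minSubwordSize="3" maxSubwordSize="15" onlyLongestMatch="false"/>
    </analyzer>
  </fieldType>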

> Roy,
> 
> I believe the way to do that is to use a compound-words analyzer.
> The issue: you need to know the decompositions in advance.
> Compound words are pretty common in German, for example, and I'd wish
> research efforts maintained a compound-words corpus, but I have not seen one
> yet.
> 
> paul
> 
> Le 13 mai 2011 à 10:28, roySolr a écrit :
> > Hello,
> > 
> > My index looks like this:
> > 
> > Soccer club
> > Football club
> > etc.
> > 
> > Now I want a user to be able to search for "soccer club" and "soccerclub".
> > "Soccer club" works, but without the whitespace it's not a match. How can I
> > fix this? What should my configuration look like? Is there a filter or
> > something?
> > 
> > 
> > 
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Results-with-and-without-whitspace-soccer-club-and-soccerclub-tp2934742p2934742.html
> > Sent from the Solr - User mailing list archive at Nabble.com.


Re: Huge performance drop in distributed search w/ shards on the same server/container

2011-05-13 Thread Frederik Kraus
One Tomcat with multicore. I have a list of about 2 million "real" queries that I'm
firing at the cluster with JMeter. The reason for splitting the index into rather
small parts is that the maximum response time of 1 second cannot be exceeded for
any of those queries.



On Friday, 13 May 2011 at 12:57, Grant Ingersoll wrote: 
> Is that 10 different Tomcat instances or are you using multicore? How are you 
> testing?
> 
> On May 13, 2011, at 6:08 AM, Frederik Kraus wrote:
> 
> > Hi, 
> > 
> > I'm having some serious problems scaling the following setup:
> > 
> > 48 CPU / Tomcat / ...
> > 
> > localhost/shard1
> > ...
> > localhost/shard10
> > 
> > When using all 10 shards in the query, the req/s drop down to about 300
> > without fully utilizing CPU (60% idle) or RAM (disk I/O is zero;
> > everything fits into RAM).
> >
> > When only querying one shard I get about 5k-6k req/s.
> > 
> > Are there any known limits and/or work-arounds?
> > 
> > Thanks,
> > 
> > Fred.
> 
> 
> Grant Ingersoll
> Join the LUCENE REVOLUTION
> Lucene & Solr User Conference
> May 25-26, San Francisco
> www.lucenerevolution.org
> 


Re: Changing the schema

2011-05-13 Thread Chamnap Chhorn
I wonder: if I add a new field to the schema, do I have to reindex?

If there's no need to reindex, can I just update schema.xml directly? After
that, should I restart the Tomcat service?

If there's no need to reindex, what about the existing documents? If I do a
query with the new field, does it cause errors?

On Fri, May 13, 2011 at 1:38 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Brian,
>
> Yes, you do need to reindex.  We've used Hadoop with Solr to speed up
> indexing
> by orders of magnitude for some of our customers.  Something to consider.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Brian Lamb 
> > To: solr-user@lucene.apache.org
> > Sent: Thu, May 12, 2011 11:53:27 AM
> > Subject: Changing the schema
> >
> > If I change the field type in my schema, do I need to rebuild the entire
> > index? I'm at a point now where it takes over a day to do a full import due
> > to the sheer size of my application, and I would prefer not having to
> > reindex just because I want to make a change somewhere.
> >
> > Thanks,
> >
> > Brian Lamb
> >
>



-- 
Chhorn Chamnap
http://chamnapchhorn.blogspot.com/


Re: Changing the schema

2011-05-13 Thread Stefan Matheis
Chamnap,

On Fri, May 13, 2011 at 2:59 PM, Chamnap Chhorn  wrote:
> I wonder what if I add new field in the schema, do i have to reindex?

If you're using that field within the DIH, then of course yes; but
normally/otherwise: no :)

On Fri, May 13, 2011 at 2:59 PM, Chamnap Chhorn  wrote:
> If no need to reindex, can i just update the schema.xml directly? After
> that, Should I restart the tomcat service?

At least, you have to reload the core config. I don't know exactly how
that works with Tomcat -- with the packaged Jetty, it's just
hitting /admin/cores (in multicore mode) or restarting the
Java process (in single-core mode).
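For reference, the CoreAdmin reload call is just an HTTP request of this shape (the core name is assumed):

  http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0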

On Fri, May 13, 2011 at 2:59 PM, Chamnap Chhorn  wrote:
> If no need to reindex, how about the existing documents? If I do a query
> with new field, does it cause errors?

AFAIK, the existing documents are simply not affected: they won't match
searches on the new field and are therefore missing from/excluded from the result.

Regards
Stefan


Re: Solr performance

2011-05-13 Thread javaxmlsoapdev
Alright. It turned out that defaultSearchField=title, where the title field is
of a custom fieldType "edgyText" (an edge n-gram analyzed type).

So if no value is passed in the "q" parameter, Solr picks up the default field,
which is title, of type "edgyText", and takes a very long time to return
results. Is there a way to dynamically IGNORE the default field (which is set
in schema.xml) when I only want to search the filter list on keys (e.g.
fl=keys)? The n-gram search is slowing things down extremely. Crazy clients
want a minimum word size of 1, which is kind of insane, but that's how it is.

Any idea?
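Two ways to sidestep the default field, sketched (the term "widget" is a stand-in; "keys" is the field from the post):

  q=keys:widget

  q=widget&defType=dismax&qf=keys

Qualifying the field in q bypasses defaultSearchField entirely; with dismax, the qf parameter decides per request which fields are searched.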

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-performance-tp2926836p2935175.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Faceting question

2011-05-13 Thread Mark

No mixup. I probably didn't explain myself correctly.

Suppose my document has fields "title", "description" and "foo". When I 
search I would like to search across "title" and "description". I then 
would like facet counts on "foo" for documents that matched the "title" 
field only. I.e., I would like the faceting behavior on "foo" to be 
exactly as if I had searched against only the "title" field.


Does that make sense?

On 5/12/11 11:30 PM, Otis Gospodnetic wrote:

Hi,

I think there is a bit of a mixup here.  Facets are not about which field a
match was on, but about what values hits have in one or more fields you facet
on.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 

From: Mark
To: solr-user@lucene.apache.org
Sent: Fri, May 13, 2011 1:19:10 AM
Subject: Faceting question

Is there any way to perform a search that searches across 2 fields yet only
gives me facet counts for documents matching 1 field?

For example

If I have fields A & B, I would like to match my
query across either of these two fields. I would then like facet counts for how
many documents matched in field A only.

Can this be accomplished? If not out of the box, what classes should I look into
to create this myself?

Thanks



Re: Support for huge data set?

2011-05-13 Thread Shawn Heisey
Our system, which I am not at liberty to disclose, consists of 55 
million documents, mostly photos and text, but video is starting to 
become prominent.  The entire archive is about 80 terabytes, but we only 
index a subset of the metadata, stored in a MySQL database, which is 
about 100GB or so in size.


The Solr index (version 1.4.1) consists of six large shards, each about 
16GB in size, plus a  seventh shard containing the most recent 7 days, 
which is usually less than 1 GB.  The entire system is replicated to 
slave servers.  Each virtual machine that houses a large shard has 9GB 
of RAM, and there are three large shards on each of the four physical 
hosts.  Each physical host is dual quad-core with 32GB of RAM, with a 
six drive SATA RAID10.  We went with virtualization (Xen) for cost reasons.


Performance is good.  If we could move to physical machines instead of 
virtualization, that would be optimal, but I think I'll have to settle 
for a RAM upgrade instead.


The main reason I stuck with distributed search is because of index 
rebuild time.  I can currently rebuild the entire index in 3-4 hours.  
It would take 5-6 times that if I had a single large index.



On 5/13/2011 12:37 AM, Otis Gospodnetic wrote:

With that many documents, I think GSA cost might be in millions of USD.  Don't
go there.

300M docs might be called medium these days.  Of course, if those documents
themselves are huge, then it's more resource intensive.  10 TB sounds like a lot
when it comes to search, but it's hard to tell what that represents (e.g. are
those docs with lots of photos in them?  Presentations very light on text?
Plain text documents with 300 words per page? etc.)

Anyhow, yes, Solr is a fine choice for this.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 

From: atreyu
To: solr-user@lucene.apache.org
Sent: Thu, May 12, 2011 12:59:28 PM
Subject: Support for huge data set?

Hi,

I have about 300 million docs (or 10TB data) which is doubling every 3
years, give or take.  The data mostly consists of Oracle records, webpage
files (HTML/XML, etc.) and office doc files.  There are b/t two and four
dozen concurrent users, typically.  The indexing server has > 27 GB of RAM,
but it still gets extremely taxed, and this will only get worse.

Would Solr be able to efficiently deal with a load of this  size?  I am
trying to avoid the heavy cost of GSA,  etc...

Thanks.


--
View this message in context:
http://lucene.472066.n3.nabble.com/Support-for-huge-data-set-tp2932652p2932652.html

Sent  from the Solr - User mailing list archive at Nabble.com.





When to use trie over standard

2011-05-13 Thread Mark
When should one use Trie fields over the standard fields? What are the
pros and cons of each?


Thanks


RE: When to use trie over standard

2011-05-13 Thread Jonathan Rochkind
Well, let's be clear about what we're talking about. The suggested numeric and 
date fields in the current Solr example schema are in fact ALL Trie based 
fields. 
http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/conf/schema.xml?view=markup

I don't think there is any downside to using a Trie-based field.  Trie based 
fields make for quicker range queries (or greater-than/less-than queries, the 
same thing), with no downside. 

The old non-trie fields in the example schema are all marked 
"should only be used for compatibility with existing indexes." So the 
'standard' fields are in fact all Trie-based fields now. 

Now, here's the thing though: using a Trie-based field, you still have a choice 
about the 'precision'.  The example schema includes trie-based fields with 
0-precision (int, float, long, double, date), and trie fields with greater than 
0 precision (tint, tfloat, tdouble, tlong, tdate).  You can also create your own 
field definitions for trie fields with the precision of your choice; you aren't 
limited to what's set for tint etc. in the example schema.xml. 

So the real question might be how to decide what precision on trie fields is 
appropriate for your data and use cases. I am not exactly sure of the answer to 
that, but it _probably_ won't matter that much for many data/use-cases. So one 
answer is, don't worry too much about it unless you are seeing performance 
problems on range queries. I know I've seen something written explaining what a 
trie field is and what the precision means in the context of Solr, but I can't 
seem to find it now. 
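For concreteness, the precision knob is the precisionStep attribute on the field type. These two lines show the shape used in the example schema: a step of 0 indexes a single precision level, while a step of 8 indexes extra precision levels that speed up range queries at the cost of a larger index:

  <fieldType name="int"  class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
  <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>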



From: Mark [static.void@gmail.com]
Sent: Friday, May 13, 2011 11:14 AM
To: solr-user@lucene.apache.org
Subject: When to use trie over standard

When should one use Trie fields over the standard fields? What aret the
pro's and con's of each?

Thanks


Multi Word Filter Queries

2011-05-13 Thread davaugust
I've recently installed Solr and built an index.  It's working for the most
part, but no matter what I do, I can't get filter queries to work where the
filtered value is multiple words.  By "work" I mean literally
filtering exactly by the phrase provided in "fq".

After trying many different things, here's the most recent field type I'm
trying:


When I fq by a multi-word string, it does something, but returns many
results with values not in the fq.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multi-Word-Filter-Queries-tp2935451p2935451.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Results with and without whitespace(soccer club and soccerclub)

2011-05-13 Thread Robert Muir
On Fri, May 13, 2011 at 7:07 AM, Paul Libbrecht  wrote:

> I sure wish such a compound-analysis would be done with a lucene-powered 
> dictionary!
> That would rock.
>

me too, but it's a chicken-and-egg problem (you would have to basically
index everything without decomposition to get the dictionary + freqs,
then use this as the decomposition dictionary and index again)


Re: When to use trie over standard

2011-05-13 Thread Mark

Great explanation. Thanks

On 5/13/11 8:25 AM, Jonathan Rochkind wrote:

Well, let's be clear about what we're talking about. The suggested numeric and 
date fields in the current Solr example schema are in fact ALL Trie based 
fields. 
http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/conf/schema.xml?view=markup

I don't think there is any downside to using a Trie-based field.  Trie based 
fields make for quicker range queries (or greater-than/less-than queries, the 
same thing), with no downside.

The old non-trie fields in the example schema are all marked 
"should only be used for compatibility with existing indexes." So the 
'standard' fields are in fact all Trie-based fields now.

Now, here's the thing though: using a Trie-based field, you still have a choice 
about the 'precision'.  The example schema includes trie-based fields with 
0-precision (int, float, long, double, date), and trie fields with greater than 
0 precision (tint, tfloat, tdouble, tlong, tdate).  You can also create your own 
field definitions for trie fields with the precision of your choice; you aren't 
limited to what's set for tint etc. in the example schema.xml.

So the real question might be how to decide what precision on trie fields is 
appropriate for your data and use cases. I am not exactly sure of the answer to 
that, but it _probably_ won't matter that much for many data/use-cases. So one 
answer is, don't worry too much about it unless you are seeing performance 
problems on range queries. I know I've seen something written explaining what a 
trie field is and what the precision means in the context of Solr, but I can't 
seem to find it now.



From: Mark [static.void@gmail.com]
Sent: Friday, May 13, 2011 11:14 AM
To: solr-user@lucene.apache.org
Subject: When to use trie over standard

When should one use Trie fields over the standard fields? What are the
pros and cons of each?

Thanks


Re: Results with and without whitespace(soccer club and soccerclub)

2011-05-13 Thread Paul Libbrecht



On 13 May 2011 at 17:32, Robert Muir wrote:

> On Fri, May 13, 2011 at 7:07 AM, Paul Libbrecht  wrote:
> 
>> I sure wish such a compound-analysis would be done with a lucene-powered 
>> dictionary!
>> That would rock.
>> 
> 
> me too, but its a chicken-and-egg problem (you would have to basically
> index everything without decomposition to get the dictionary+freqs,
> then use this as decomposition dictionary and index again)

I think this is OK.
It's a kind of research project.
The decompositions follow some language-specific rules, I think.
And they should be reviewed by humans.

Maybe a good GSoC project one day...

paul

Re: Multi Word Filter Queries

2011-05-13 Thread Tor Henning Ueland
On Fri, May 13, 2011 at 5:26 PM, davaugust  wrote:
> When I fq by a multi word string, it does something, but returns many
> results with values not in the FQ

Isn't it simply that you have forgotten to add quotation marks around
the phrase? If you have not changed the default operator and do not use
quotes, the query will become: fq=field:foo OR bar.
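To illustrate the parse with a hypothetical field: the unquoted form binds only the first word to the field, and the rest falls through to the default search field, while quoting filters on the whole phrase:

  fq=venue:soccer club     ->  venue:soccer <defaultField>:club
  fq=venue:"soccer club"   ->  a phrase filter on the venue field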

-- 
Best regards
Tor Henning Ueland
H3x.no


Schema Design Question

2011-05-13 Thread Zac Smith
Let's say I have a data model that involves books and bookshelves. I have tens 
of thousands of books and thousands of bookshelves. There is a many-many 
relationship between books & bookshelves. All of the books are indexed by SOLR.

I need to be able to query SOLR and get all the books for a given bookshelf. I 
see two schema design options here:


1)  Each book has a multi-value field that contains a list of all the 
bookshelf ID's. Many books will have thousands of bookshelf ID's. In this case 
the query is simple, I just send solr the bookshelf ID.

2)  I send solr a query with each book on the bookshelf e.g. 
q=book_id:(1+OR+2+OR+3 ). Many bookshelves will have thousands of book ID's 
so the query can get rather large.

Right now I am using option 2 and it seems to be working fine. I have had to 
crank 'maxBooleanClauses' right up but it does seem to be pretty fast.

Anyone have an opinion?



Editor loads wrong version of IndexSearcher while debugging - how to fix?

2011-05-13 Thread Gabriele Kahlout
Hello,

I'm debugging Solr, built as a Maven project, in NetBeans. When I step into the code
of a Lucene dependency, namely
org.apache.lucene.search.IndexSearcher.explain(..), the call stack expects
this method to be at line 599, while in the editor the class ends at line 304.

from solr-core's pom.xml:

<dependency>
  <groupId>${project.groupId}</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>${project.version}</version>
</dependency>

from solrj's pom.xml:

<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>${project.version}</version>
</dependency>

Looking up the actual class, it's indeed an 846-line class, and the editor is
loading a faulty sources.jar (the downloaded source code).
So the code in the sources.jar doesn't correspond to the binary code.
Now the big question is: why do I get sources different from the binary of
the same version for a dependency? How could this be debugged further? I don't
know how NetBeans downloads a dependency's sources (from googling, it seems
that each IDE has its own plugin for doing that).

-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


Re: Support for huge data set?

2011-05-13 Thread Jack Repenning
On May 13, 2011, at 7:59 AM, Shawn Heisey wrote:

> The entire archive is about 80 terabytes, but we only index a subset of the 
> metadata, stored in a MySQL database, which is about 100GB or so in size.
> 
> The Solr index (version 1.4.1) consists of six large shards, each about 16GB 
> in size,

This is really useful data, Shawn, thanks! It's particularly interesting 
because the numbers are in the same ball-park as a project I'm considering.

Can you clarify one thing? What's the relationship you're describing between 
MySQL and Solr? I think you're saying that there's a 80TB MySQL database, with 
a 100GB Solr system in front, is that right? Or is the entire 80TB accessible 
through Solr directly?

-==-
Jack Repenning
Technologist
Codesion Business Unit
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
twitter: http://twitter.com/jrep



Re: Support for huge data set?

2011-05-13 Thread Darren Govoni
Can I ask if you do any faceted or MLT type searches? Do those even work
across shards? 

On Fri, 2011-05-13 at 08:59 -0600, Shawn Heisey wrote:
> Our system, which I am not at liberty to disclose, consists of 55 
> million documents, mostly photos and text, but video is starting to 
> become prominent.  The entire archive is about 80 terabytes, but we only 
> index a subset of the metadata, stored in a MySQL database, which is 
> about 100GB or so in size.
> 
> The Solr index (version 1.4.1) consists of six large shards, each about 
> 16GB in size, plus a  seventh shard containing the most recent 7 days, 
> which is usually less than 1 GB.  The entire system is replicated to 
> slave servers.  Each virtual machine that houses a large shard has 9GB 
> of RAM, and there are three large shards on each of the four physical 
> hosts.  Each physical host is dual quad-core with 32GB of RAM, with a 
> six drive SATA RAID10.  We went with virtualization (Xen) for cost reasons.
> 
> Performance is good.  If we could move to physical machines instead of 
> virtualization, that would be optimal, but I think I'll have to settle 
> for a RAM upgrade instead.
> 
> The main reason I stuck with distributed search is because of index 
> rebuild time.  I can currently rebuild the entire index in 3-4 hours.  
> It would take 5-6 times that if I had a single large index.
> 
> 
> On 5/13/2011 12:37 AM, Otis Gospodnetic wrote:
> > With that many documents, I think GSA cost might be in millions of USD.  
> > Don't
> > go there.
> >
> > 300M docs might be called medium these days.  Of course, if those 
> > documents
> > themselves are huge, then it's more resource intensive.  10 TB sounds like 
> > a lot
> > when it comes to search, but it's hard to tell what that represents (e.g. 
> > are
> > those docs with lots of photos in them?  Presentations very light on text?
> > Plain text documents with 300 words per page? etc.)
> >
> > Anyhow, yes, Solr is a fine choice for this.
> >
> > Otis
> > 
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> >
> >
> > - Original Message 
> >> From: atreyu
> >> To: solr-user@lucene.apache.org
> >> Sent: Thu, May 12, 2011 12:59:28 PM
> >> Subject: Support for huge data set?
> >>
> >> Hi,
> >>
> >> I have about 300 million docs (or 10TB data) which is doubling every 3
> >> years, give or take.  The data mostly consists of Oracle records, webpage
> >> files (HTML/XML, etc.) and office doc files.  There are b/t two and four
> >> dozen concurrent users, typically.  The indexing server has > 27 GB of RAM,
> >> but it still gets extremely taxed, and this will only get worse.
> >>
> >> Would Solr be able to efficiently deal with a load of this  size?  I am
> >> trying to avoid the heavy cost of GSA,  etc...
> >>
> >> Thanks.
> >>
> >>
> >> --
> >> View this message in context:
> >> http://lucene.472066.n3.nabble.com/Support-for-huge-data-set-tp2932652p2932652.html
> >>
> >> Sent  from the Solr - User mailing list archive at Nabble.com.
> >>
> 




Boosting score by distance

2011-05-13 Thread Ian Eure
I have a bunch of documents representing points of interest indexed in Solr. 
I'm trying to boost the score of documents based on distance from an origin 
point, and having some difficulty.

I'm currently using the standard query parser and sending in this query:

(name:sushi OR tags:sushi OR classifiers:sushi) AND deleted:False AND 
owner:simplegeo

I'm also using the spatial search to limit results to ones found within 25km of 
my origin point. The issue I'm having is that I need the score to be a blend of 
the FT match _and_ distance from the origin point; If i sort by distance, lots 
of low quality matches clog up the results for simple searches, but if I sort 
by score, more distant results overwhelm nearby, though less relevant 
(according to Solr) results.

I think what I want to do is boost the score of documents based on the distance 
from the origin search point. Alternately, if there was some way to treat a 
match on any of the three fields as having equal weight, I believe that would 
get me much closer to what I want.

The examples I've seen for doing this kind of thing use dismax and its boost 
function ('bf') parameter. I don't know if my queries are translatable to 
dismax syntax as they are now, and it looks like the boost functions don't work 
with the standard query parser; at least, I have been completely unable to 
change the score when using it.

Is there some way to boost by the inverse of the distance using the standard 
query parser, or alternately, to filter my results by different fields with the 
dismax parser?
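One possible shape for this on Solr 3.1's spatial support, sketched with assumed field names ("location" as the spatial field) and untuned recip() constants:

  q=sushi&defType=dismax
    &qf=name tags classifiers
    &fq=deleted:False&fq=owner:simplegeo
    &fq={!geofilt}
    &sfield=location&pt=37.77,-122.42&d=25
    &bf=recip(geodist(),2,200,20)

Here {!geofilt} keeps the 25 km radius restriction, and recip(geodist(),2,200,20) decays the boost as distance grows, so nearby documents get a relevance bump without hard-sorting by distance.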

Re: DIH help request: nested xml entities and xpath

2011-05-13 Thread Weiss, Eric
I think my original question/thread was accidentally hijacked.  Let me take
this opportunity to refocus this thread on my original question about DIH,
nested entities, and xpath.  I'll try to ask a very simple question
instead:

Why doesn't this field xpath work?  By "not working" I mean the
MsgKeywordMF field does not populate in the index...unless I remove the
xpath filter.

<field column="MsgKeywordMF"
  xpath="/Report/MsgSet/Msg/MsgList/MsgItem[Category='Manufacturer']/Keyword" />

OR

<field column="MsgKeywordMF"
  xpath="/Report/MsgSet/Msg/MsgList/MsgItem[@Category='Manufacturer']/Keyword" />

- I modified the original xml so Category was an attribute of MsgItem
instead... it still does not work, despite this matching in other tools and
being explicitly documented in the DIH wiki page.
Full details below (in original post as well).

Thx,
-- Eric


On Fri, May 13, 2011 at 4:53 AM, Weiss, Eric  wrote:

>Apologies in advance if this topic/question has been previously answeredŠI
>have scoured the docs, mail archives, web looking for an answer(s) with no
>luck.  I am sure I am just being dense or missing something obviousŠplease
>point out my stupidity as my head hurts trying to get this working.
>
>Solr 3.1
>Java 1.6
>Eclipse/Tomcat 7/Maven 2.x
>
>Goal: to extract manufacturer names from a repeating list of keywords, each
>denoted by a Category, one of which is "Manufacturer", and load them into a
>MsgKeywordMF field (see xml below).
>
>I have xml files I am loading via DIH.  This is an abbreviated example of the
>xml data (each file has repeating "Report" items; each report has repeating
>MsgSet, Msg, MsgList, etc. items).  Notice the nested repeating groups,
>namely MsgItems, within each document (Report):
>
>
><Report>
>  <ReportDate>02/22/2011</ReportDate>
>  ...
>  <MsgSet>
>    <Msg>
>      <DocumentUrl>http://someurl.com/path/to/doc</DocumentUrl>
>      ...
>      <DocumentText>blah blah</DocumentText>
>      <MsgList>
>        <MsgItem>
>          <Type>SomeType</Type>
>          <Category>Location</Category>
>          <Keyword>USA</Keyword>
>        </MsgItem>
>        <MsgItem>
>          <Type>AnotherType</Type>
>          <Category>Manufacturer</Category>
>          <Keyword>Apple</Keyword>
>        </MsgItem>
>        ...
>      </MsgList>
>    </Msg>
>    ...
>  </MsgSet>
>  ...
></Report>
>
>Here is my data-config.xml:
>
><dataConfig>
>  <dataSource type="FileDataSource" encoding="UTF-8"/>
>  <document>
>    <entity name="fileload" processor="FileListEntityProcessor"
>            fileName="^.*\.xml$" recursive="false" baseDir="/files/xml/">
>      <entity name="data" rootEntity="true" pk="id"
>              url="${fileload.fileAbsolutePath}"
>              processor="XPathEntityProcessor"
>              forEach="/Report/MsgSet/Msg" onError="skip"
>              transformer="DateFormatTransformer,RegexTransformer">
>        <field column="DocumentText" xpath="/Report/MsgSet/Msg/DocumentText"/>
>        ...
>        <field column="MsgCategory" xpath="/Report/MsgSet/Msg/MsgList/MsgItem/Category" />
>        <field column="MsgKeyword" xpath="/Report/MsgSet/Msg/MsgList/MsgItem/Keyword" />
>        <field column="MsgKeywordMF"
>               xpath="/Report/MsgSet/Msg/MsgList/MsgItem[Category='Manufacturer']/Keyword" />
>        ...
>      </entity>
>    </entity>
>  </document>
></dataConfig>
>As seen in my config and sample data above, I am extracting the repeating
>"Keywords" into the MsgKeyword field.  Also, and this is the part that does
>NOT work, I am trying to extract into a separate field just the keywords that
>have a "Category" of "Manufacturer":
>
><field column="MsgKeywordMF"
>  xpath="/Report/MsgSet/Msg/MsgList/MsgItem[Category='Manufacturer']/Keyword" />
>
>I have also tried:
>
><field column="MsgKeywordMF"
>  xpath="/Report/MsgSet/Msg/MsgList/MsgItem[@Category='Manufacturer']/Keyword" />
>
>...after changing "Category" to an attribute of MsgItem
>(<MsgItem Category="Location">), but it too fails to match.
>
>I have tested my xpath notation against my xml data file using various
>xpath evaluator tools, like within Eclipse, and it matches perfectly... but I
>can't get it to match/work during import.
>
>As far as I understand it, DIH does not support nested/correlated
>entities, at least not with XML data sources using nested entity tags.  I've
>tried without success to nest entities, but I can't "correlate" the nested
>entity with the parent.  I think the way I'm trying should work, but no luck
>so far...
>
>BTW, I can't easily change the xml format, although it is possible with
>some pain...
>
>Any ideas?
>
>TIA,
>-- Eric
>
>






On 5/13/11 1:58 AM, "Gora Mohanty"  wrote:

>On Fri, May 13, 2011 at 10:18 AM, Ashique  wrote:
>> Hi All,
>>
>> I am a Java/J2ee programmer and very new to SOLR. I would  like to
>>index a
>> table in a postgresSql database to SOLR. Then searching the records
>>from a
>> GUI (Jsp Page) and showing the results in tabular form. Could any one
>>help
>> me out with a simple sample code.
>[...]
>
>This is too broad a question. Please start out by looking
>at the extensive Solr documentation:
>* Complete list: http://wiki.apache.org/solr/FrontPage
>* Initial tutorial: http://lucene.apache.org/solr/tutorial.html
>  It is a good idea to first ensure that you are able to get
>  this working.
>* If you are using Java, this should be of interest:
>  http://wiki.apache.org/solr/SolJava
>* For easy data import from a database, you could consider
>  using the DataImportHandler:
>  http://wiki.apache.org/solr/DataImportHandler
>
>You can ask here if you run into issues while trying these out.
>
>Regards,
>Gora



Re: Support for huge data set?

2011-05-13 Thread Shawn Heisey
The objects are in a number of filesystems, taking up 80TB of space.  
The MySQL database is about 128GB, 117GB of which is a table containing 
the metadata for the documents.  We don't use all that metadata, just a 
subset.  I don't have any way to really calculate the subset's size.


With seven shards, six of which are about 16.5GB and one that's about 
1GB, the entire Solr index takes up about 100GB.  In the near future we 
should be able to drop a number of stored fields from the index, making 
it smaller.


We use the dataimporthandler to get all this into Solr, with a custom 
build system written in Perl.



On 5/13/2011 11:05 AM, Jack Repenning wrote:

On May 13, 2011, at 7:59 AM, Shawn Heisey wrote:


The entire archive is about 80 terabytes, but we only index a subset of the 
metadata, stored in a MySQL database, which is about 100GB or so in size.

The Solr index (version 1.4.1) consists of six large shards, each about 16GB in 
size,

This is really useful data, Shawn, thanks! It's particularly interesting 
because the numbers are in the same ball-park as a project I'm considering.

Can you clarify one thing? What's the relationship you're describing between 
MySQL and Solr? I think you're saying that there's a 80TB MySQL database, with 
a 100GB Solr system in front, is that right? Or is the entire 80TB accessible 
through Solr directly?

-==-
Jack Repenning
Technologist
Codesion Business Unit
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
twitter: http://twitter.com/jrep













Re: Schema Design Question

2011-05-13 Thread Otis Gospodnetic
Hi Zac,

Solr 4.0 (trunk) has support for relationships/JOIN.  Have a look: 
http://search-lucene.com/?q=solr+join
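For a rough idea of the shape (field names assumed: shelf documents carrying a multi-valued book_id, book documents keyed by id), a trunk-style join query could look like:

  q={!join from=book_id to=id}shelf_id:42

which selects shelf 42 and returns the book documents its book_id values point at.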

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Zac Smith 
> To: "solr-user@lucene.apache.org" 
> Sent: Fri, May 13, 2011 12:28:35 PM
> Subject: Schema Design Question
> 
> Let's say I have a data model that involves books and bookshelves. I have tens
> of thousands of books and thousands of bookshelves. There is a many-many
> relationship between books & bookshelves. All of the books are indexed by
> SOLR.
>
> I need to be able to query SOLR and get all the books for a given bookshelf. I
> see two schema design options here:
>
> 1)   Each book has a multi-value field that contains a list of all the
> bookshelf ID's. Many books will have thousands of bookshelf ID's. In this case
> the query is simple, I just send solr the bookshelf ID.
>
> 2)   I send solr a query with each book on the bookshelf e.g.
> q=book_id:(1+OR+2+OR+3 ). Many bookshelves will have thousands of book ID's
> so the query can get rather large.
>
> Right now I am using option 2 and it seems to be working fine. I have had to
> crank 'maxBooleanClauses' right up but it does seem to be pretty fast.
>
> Anyone have an opinion?
> 
> 


Re: Sub query using SOLR?

2011-05-13 Thread Thalaiselvam
But we can't judge the subquery's return value. Is it possible to add more
than one ID in the subquery?

Thanks & Regards,
Thalaiselvam N 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Sub-query-using-SOLR-tp2193251p2931267.html
Sent from the Solr - User mailing list archive at Nabble.com.


Show filename in search result using a FileListEntityProcessor

2011-05-13 Thread Marcel Panse
Hi Solr community,

I'm new to Solr and trying to scan all pdf/doc files in a directory. This
works fine and I am able to scan all documents. The next thing I'm trying to
do is also retrieve the filename of the file in the search results. The
filename, however, never shows up. I tried a couple of things, but the
documentation is not very helpful about how to do this.

This is my dataConfig:
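The config itself did not survive the list archive, so here is a minimal sketch of the usual shape, assuming Solr 3.1's DIH with the Tika extras. The point is that FileListEntityProcessor exposes implicit columns (file, fileAbsolutePath, fileSize, fileLastModified) that can be mapped to a stored field on the outer entity:

  <dataConfig>
    <dataSource type="BinFileDataSource"/>
    <document>
      <entity name="files" processor="FileListEntityProcessor"
              baseDir="/path/to/docs" fileName=".*\.(pdf|doc)$"
              recursive="true" rootEntity="false" dataSource="null">
        <!-- map the implicit 'file' column to a stored schema field -->
        <field column="file" name="fileName"/>
        <entity name="doc" processor="TikaEntityProcessor"
                url="${files.fileAbsolutePath}" format="text">
          <field column="text" name="content"/>
        </entity>
      </entity>
    </document>
  </dataConfig>

The fileName and content schema fields are assumptions; whatever field receives the filename must be stored="true" for it to show up in search results.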

Thanks,
Marcel Panse


Re: Support for huge data set?

2011-05-13 Thread Renaud Delbru

Hi,

Our system [1] consists of 220+ million semi-structured web documents 
(RDF, Microformats, etc.), with both fairly small documents (a few KB) and 
large documents (a few MB). Each document additionally has a dozen 
extra fields for indexing and storing metadata about the document.


It runs on top of Solr 3.1 with the following configuration:
- 2 master indexes
- 2 slave indexes
Each server is a quad-core with 32GB of RAM and 4 SATA drives in RAID10.

Indexing performance is quite good.  We can reindex our full data 
collection in less than a day (using only the two master indexes).  Live 
updates (a few million documents per day) are processed continuously by 
our masters.  We replicate the changes every hour to the slave indexes. 
Query performance is also OK (you can try it yourself on [1]).


As a side note, we are using Solr 3.1 plus a plugin we have developed 
for indexing semi-structured data.  This plugin adds much more data 
to the index than plain Solr, so you can expect even better performance 
with plain Solr (with respect to indexing performance).


[1] http://sindice.com
--
Renaud Delbru

On 12/05/11 17:59, atreyu wrote:

Hi,

I have about 300 million docs (or 10TB data) which is doubling every 3
years, give or take.  The data mostly consists of Oracle records, webpage
files (HTML/XML, etc.) and office doc files.  There are b/t two and four
dozen concurrent users, typically.  The indexing server has > 27 GB of RAM,
but it still gets extremely taxed, and this will only get worse.

Would Solr be able to efficiently deal with a load of this size?  I am
trying to avoid the heavy cost of GSA, etc...

Thanks.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Support-for-huge-data-set-tp2932652p2932652.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: UIMA analysisEngine path

2011-05-13 Thread chamara
Hi, 
 Does this code at line 57 need to be changed to the location where the jar
files (library files) reside?
 URL url = this.getClass().getResource("");
I did change it, but no luck so far. Let me know what I am doing wrong.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/UIMA-analysisEngine-path-tp2895284p2935541.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Support for huge data set?

2011-05-13 Thread Shawn Heisey

On 5/13/2011 11:09 AM, Darren Govoni wrote:

Can I ask if you do any faceted or MLT type searches? Do those even work
across shards?


We currently aren't using facets in production, but we've done some data 
mining with them.  They work very well in distributed mode.  We plan to 
start incorporating them into our system, which will require a fair 
amount of work by our application developers.


MLT is not supported in distributed mode yet.  There is a feature 
request for it (SOLR-788) that includes a couple of patches against very 
old revisions of Solr 1.5-dev.  I recently tried to get it applied to 
version 3.1, but was unsuccessful.  The code has changed too much for me 
to figure out how to do it.  I have some experience with programming, 
but not in Java, and I'm not familiar with Lucene/Solr code.


If this is important to you, you can try to fix the patch or vote for 
the Jira issue.  I've done both.


https://issues.apache.org/jira/browse/SOLR-788



Re: Faceting question

2011-05-13 Thread lee carroll
Hi Mark,
I think you would need to issue two separate queries. It's also an (I was
going to say odd, but who am I to judge) interesting use case. If you have a
faceted navigation front end, you are in real danger of confusing your users.
I suppose it's a case of what you want to achieve; faceting may not be the
way to go.

lee c

On 13 May 2011 15:56, Mark  wrote:

> No mixup. I probably didn't explain myself correctly.
>
> Suppose my document has fields "title", "description" and "foo". When I
> search I would like to search across "title" and "description". I then would
> like facet counts on "foo" for documents that matched the "title" field
> only. I.e., I would like the faceting behavior on "foo" to be exactly as if I
> had searched against only the "title" field.
>
> Does that make sense?
>
>
> On 5/12/11 11:30 PM, Otis Gospodnetic wrote:
>
>> Hi,
>>
>> I think there is a bit of a mixup here.  Facets are not about which field
>> a
>> match was on, but about what values hits have in one or more fields you
>> facet
>> on.
>>
>> Otis
>> 
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> Lucene ecosystem search :: http://search-lucene.com/
>>
>>
>>
>> - Original Message 
>>
>>> From: Mark
>>> To: solr-user@lucene.apache.org
>>> Sent: Fri, May 13, 2011 1:19:10 AM
>>> Subject: Faceting question
>>>
>>> Is there any way to perform a search that searches across 2 fields yet
>>> only
>>> gives me facet counts for documents matching 1 field?
>>>
>>> For example
>>>
>>> If I have fields A & B and I perform a search across them, I would like to
>>> match my
>>> query across either of these two fields. I would then like facet counts
>>> for how
>>> many documents matched in field A only.
>>>
>>> Can this be accomplished? If not out of the box, what classes should I look
>>> into
>>> to create this  myself?
>>>
>>> Thanks
>>>
>>>


document storage

2011-05-13 Thread Mike Sokolov
Would anyone care to comment on the merits of storing indexed full-text 
documents in Solr versus storing them externally?


It seems there are three options for us:

1) store documents both in Solr and externally - this is what we are 
doing now, and gives us all sorts of flexibility, but doesn't seem like 
the most scalable option, at least in terms of storage space and I/O 
required when updating/inserting documents.


2) store documents externally: For the moment, the only thing that 
requires us to store documents in Solr is the need to highlight them, 
both in search result snippets and in full document views. We are 
considering hunting for or writing a Highlighter extension that could 
pull in the document text from an external source (eg filesystem).


3) store documents only in Solr.  We'd just retrieve document text as a 
Solr field value rather than reading from the filesystem.  Somehow this 
strikes me as the wrong thing to do, though I'm not sure why; it could 
work.  A lot of unnecessary merging I/O activity, perhaps.  It also makes 
it hard to grep the documents or use other filesystem tools, I suppose.
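
(For option 3, retrieval would just be a stored-field fetch; a sketch with a 
hypothetical field name:

http://localhost:8983/solr/select?q=id:doc42&fl=id,fulltext )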


Which one of these sounds best to you?  Under which circumstances? Are 
there other possibilities?


Thanks!

--

Michael Sokolov
Engineering Director
www.ifactory.com



Writing response from custom RequestHandler

2011-05-13 Thread logan.stinger
I am writing a custom RequestHandler by extending RequestHandlerBase.  I
would like this request handler to perform some work and then write the
response using the VelocityResponseWriter.  I am just getting started so
currently my custom RequestHandler looks like this:

@Override
public void handleRequestBody(SolrQueryRequest request, SolrQueryResponse
response) throws Exception {
log.info("I'm here");
}

 I have added the following to solrconfig.xml:

<requestHandler name="..." class="...">
  <lst name="defaults">
    <str name="wt">velocity</str>
    <str name="content-type">text/xml;charset=UTF-8</str>
    <str name="v.template">browse</str>
    <str name="v.layout">layout</str>
  </lst>
</requestHandler>


Right now I get the default xml response in my browser:

<?xml version="1.0" encoding="UTF-8"?>
<response>
...
</response>


Why doesn't it use VelocityResponseWriter and display the browse template
like I have specified? Do I need to tell it which writer to use? I assumed
that was being done by some base class examining the wt param.
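
As a sanity check, forcing the writer per request should work regardless of 
the defaults; a sketch assuming the handler is mapped at /mycustom (a 
hypothetical path):

http://localhost:8983/solr/mycustom?q=*:*&wt=velocity&v.template=browse&v.layout=layout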





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Writing-response-from-custom-RequestHandler-tp2936178p2936178.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: MoreLikeThis PDF search

2011-05-13 Thread Brian Lamb
Any thoughts on this one?

On Thu, May 12, 2011 at 10:46 AM, Brian Lamb
wrote:

> Hi all,
>
> I've become more and more familiar with the MoreLikeThis handler over the
> last several months. I'm curious whether it is possible to do a MoreLikeThis
> search by uploading a PDF? I looked at the ExtractingRequestHandler and that
> looks like it that is used to process PDF files and the like but is it
> possible to combine the two?
>
> Just to be clear, I don't want to send a PDF and have that be a part of the
> index. But rather, I'd like to be able to use the PDF as a MoreLikeThis
> search.
>
> Thanks,
>
> Brian Lamb
>
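
A two-step sketch of one way this might be wired up (URLs and parameter
values are illustrative, and /mlt assumes a handler registered with
solr.MoreLikeThisHandler): first pull the text out of the PDF without
indexing it via the extractOnly mode, then post that text to the MLT handler
as a content stream.

curl "http://localhost:8983/solr/update/extract?extractOnly=true" -F "file=@query.pdf"
# take the extracted text from the response, save it as extracted.txt, then:
curl "http://localhost:8983/solr/mlt?mlt.fl=text&mlt.mintf=1&mlt.interestingTerms=list" -H "Content-Type: text/plain" --data-binary @extracted.txt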


Re: Support for huge data set?

2011-05-13 Thread Darren Govoni
Thanks for the info Shawn. I'll look into the issue as well.

On Fri, 2011-05-13 at 12:34 -0600, Shawn Heisey wrote:

> On 5/13/2011 11:09 AM, Darren Govoni wrote:
> > Can I ask if you do any faceted or MLT type searches? Do those even work
> > across shards?
> 
> We currently aren't using facets in production, but we've done some data 
> mining with them.  They work very well in distributed mode.  We plan to 
> start incorporating them into our system, which will require a fair 
> amount of work by our application developers.
> 
> MLT is not supported in distributed mode yet.  There is a feature 
> request for it (SOLR-788) that includes a couple of patches against very 
> old revisions of Solr 1.5-dev.  I recently tried to get it applied to 
> version 3.1, but was unsuccessful.  The code has changed too much for me 
> to figure out how to do it.  I have some experience with programming, 
> but not in Java, and I'm not familiar with Lucene/Solr code.
> 
> If this is important to you, you can try to fix the patch or vote for 
> the Jira issue.  I've done both.
> 
> https://issues.apache.org/jira/browse/SOLR-788
> 




Re: Replication Clarification Please

2011-05-13 Thread Ravi Solr
Sorry guys, I spoke too soon, I guess. The replication still remains very
slow even after upgrading to 3.1 and setting compression off. Now
I am totally clueless. I have tried everything that I know of to
increase the speed of replication, but failed. If anybody has faced the
same issue, can you please tell me how you solved it?
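
For reference, the slave-side settings in play look roughly like this in
solrconfig.xml (master URL and interval are illustrative; compression is off
when the option is simply omitted):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master:8983/solr/core/replication</str>
    <str name="pollInterval">00:06:00</str>
    <!-- omit <str name="compression">internal</str> to disable compression -->
  </lst>
</requestHandler>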

Ravi Kiran Bhaskar

On Thu, May 12, 2011 at 6:42 PM, Ravi Solr  wrote:
> Thank you Mr. Bell and Mr. Kanarsky, as per your advice we have moved
> from 1.4.1 to 3.1 and have made several changes to configuration. The
> configuration changes have worked nicely till now and the replication
> is finishing within the interval and not backing up. The changes we
> made are as follows
>
> 1. Increased the mergeFactor from 10 to 15
> 2. Increased ramBufferSizeMB to 1024
> 3. Changed lockType to single (previously it was simple)
> 4. Set maxCommitsToKeep to 1 in the deletionPolicy
> 5. Set maxPendingDeletes to 0
> 6. Changed caches from LRUCache to FastLRUCache as we had hit ratios
> well over 75% to increase warming speed
> 7. Increased the poll interval to 6 minutes and re-indexed all content.
>
> Thanks,
>
> Ravi Kiran Bhaskar
>
> On Wed, May 11, 2011 at 6:00 PM, Alexander Kanarsky
>  wrote:
>> Ravi,
>>
>> if you have what looks like a full replication each time even if the
>> master generation is greater than slave, try to watch for the index on
>> both master and slave the same time to see what files are getting
>> replicated. You probably may need to adjust your merge factor, as Bill
>> mentioned.
>>
>> -Alexander
>>
>>
>>
>> On Tue, 2011-05-10 at 12:45 -0400, Ravi Solr wrote:
>>> Hello Mr. Kanarsky,
>>>                 Thank you very much for the detailed explanation,
>>> probably the best explanation I found regarding replication. Just to
>>> be sure, I wanted to test solr 3.1 to see if it alleviates the
>>> problems... I don't think it helped. The master index version and
>>> generation are greater than the slave, still the slave replicates the
>>> entire index from master (see replication admin screen output below).
>>> Any idea why it would get the whole index every time even in 3.1, or am
>>> I misinterpreting the output? However I must admit that 3.1 finished
>>> the replication unlike 1.4.1 which would hang and be backed up for
>>> ever.
>>>
>>> Master        http://masterurl:post/solr-admin/searchcore/replication
>>>       Latest Index Version:null, Generation: null
>>>       Replicatable Index Version:1296217097572, Generation: 12726
>>>
>>> Poll Interval         00:03:00
>>>
>>> Local Index   Index Version: 1296217097569, Generation: 12725
>>>
>>>       Location: /data/solr/core/search-data/index
>>>       Size: 944.32 MB
>>>       Times Replicated Since Startup: 148
>>>       Previous Replication Done At: Tue May 10 12:32:42 EDT 2011
>>>       Config Files Replicated At: null
>>>       Config Files Replicated: null
>>>       Times Config Files Replicated Since Startup: null
>>>       Next Replication Cycle At: Tue May 10 12:35:41 EDT 2011
>>>
>>> Current Replication Status    Start Time: Tue May 10 12:32:41 EDT 2011
>>>       Files Downloaded: 18 / 108
>>>       Downloaded: 317.48 KB / 436.24 MB [0.0%]
>>>       Downloading File: _ayu.nrm, Downloaded: 4 bytes / 4 bytes [100.0%]
>>>       Time Elapsed: 17s, Estimated Time Remaining: 23902s, Speed: 18.67 KB/s
>>>
>>>
>>> Thanks,
>>> Ravi Kiran Bhaskar
>>>
>>> On Tue, May 10, 2011 at 4:10 AM, Alexander Kanarsky
>>>  wrote:
>>> > Ravi,
>>> >
>>> > as far as I remember, this is how the replication logic works (see
>>> > SnapPuller class, fetchLatestIndex method):
>>> >
>>> >> 1. Does the Slave get the whole index every time during replication or
>>> >> just the delta since the last replication happened ?
>>> >
>>> >
>>> > It looks at the index version AND the index generation. If both slave's
>>> > version and generation are the same as on master, nothing gets
>>> > replicated. If the master's generation is greater than on slave, the
>>> > slave fetches the delta files only (even if the partial merge was done
>>> > on the master) and puts the new files from master into the same index
>>> > folder on slave (either index or index., see further
>>> > explanation). However, if the master's index generation is equal to or
>>> > less than the one on slave, the slave does the full replication by
>>> > fetching all files of the master's index and placing them into a
>>> > separate folder on slave (index.). Then, if the fetch is
>>> > successful, the slave updates (or creates) the index.properties file
>>> > and puts there the name of the "current" index folder. The "old"
>>> > index. folder(s) will be kept in 1.4.x - which was treated
>>> > as a bug - see SOLR-2156 (and this was fixed in 3.1). After this, the
>>> > slave does commit or reload core depending whether the config files
>>> > were replicated. There is another bug in 1.4.x that fails replication
>>> > if the slave need to do the full replication AND the config files were
>>> > changed - also fix

RE: SolrDispatchFilter

2011-05-13 Thread Chris Hostetter

: This problem is only occurring when using IE8 ( Chrome & FireFox fine )

if it only happens when using the form on the admin screen (and not when 
hitting the URL directly, via shift-reload for example), it may just be 
a different manifestation of this silly javascript bug...

https://issues.apache.org/jira/browse/SOLR-2455


-Hoss


Re: document storage

2011-05-13 Thread Rich Cariens
We've decided to store the original document in both Solr and external
repositories. This is to support the following:

   1. highlighting - We need to mark up the entire document with hit terms.
   However, if this were the only reason to store the text, I'd seriously consider
   calling out to the external repository via a custom highlighter.
   2. "hot" documents - We need to index user-generated data like activity
   streams, folksonomy tags, annotations, and comments. When our indexer is
   made aware of those events we decorate the existing SolrDocument with new
   fields and re-index it.
   3. in-place index rebuild - Our search service is still evolving so we
   periodically change our schema and indexing code. We believe it's more
   efficient, not to mention faster, to rebuild the index if we've got all the
   data.

Hope that helps!
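
A SolrJ sketch of how point 2 can look (field names are hypothetical; it
assumes all fields are stored so the document can be round-tripped, and that
server is an initialized SolrServer):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

// fetch the existing document, decorate it, and re-index it under the same id
SolrDocument existing = server.query(new SolrQuery("id:doc42")).getResults().get(0);
SolrInputDocument decorated = ClientUtils.toSolrInputDocument(existing);
decorated.addField("tag", "folksonomy-tag");  // user-generated decoration
server.add(decorated);                        // same uniqueKey, so this replaces
server.commit();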

On Fri, May 13, 2011 at 3:10 PM, Mike Sokolov  wrote:

> Would anyone care to comment on the merits of storing indexed full-text
> documents in Solr versus storing them externally?
>
> It seems there are three options for us:
>
> 1) store documents both in Solr and externally - this is what we are doing
> now, and gives us all sorts of flexibility, but doesn't seem like the most
> scalable option, at least in terms of storage space and I/O required when
> updating/inserting documents.
>
> 2) store documents externally: For the moment, the only thing that requires
> us to store documents in Solr is the need to highlight them, both in search
> result snippets and in full document views. We are considering hunting for
> or writing a Highlighter extension that could pull in the document text from
> an external source (eg filesystem).
>
> 3) store documents only in Solr.  We'd just retrieve document text as a
> Solr field value rather than reading from the filesystem.  Somehow this
> strikes me as the wrong thing to do, but it could work:  I'm not sure why.
>  A lot of unnecessary merging I/O activity perhaps.  Makes it hard to grep
> the documents or use other filesystem tools, I suppose.
>
> Which one of these sounds best to you?  Under which circumstances? Are
> there other possibilities?
>
> Thanks!
>
> --
>
> Michael Sokolov
> Engineering Director
> www.ifactory.com
>
>


Re: Solr Range Facets

2011-05-13 Thread Chris Hostetter

: I did try what you suggested, but I am not getting the expected results. The
: code is given below,

+5 points for posting the code you tried, but -10 points for not 
explaining how the results you get are different from the results you 
expect, and -5 more points for not even giving an example of the results 
you did get.

In the absence of any other info about how this doesn't match your 
expectations, my hunch is it's because you left out the crucial part of my 
suggestion...

: query.set("facet.range.start", "2010-01-01T00:00:00Z") ;

You said you wanted the facet results to be based on the user's local 
timezone, but you aren't including the "timezone offset" info that I 
mentioned you should add (unless this example is supposed to show the 
results for a user whose local timezone is UTC).

See below...

: -Original Message-
: From: Chris Hostetter
...
: Date faceting is entirely driven by query params, so if you index your 
: events using the "true" time that they happend at (formatted as a string 
: in UTC) you can then select your date ranges using whatever timezone 
: offset is specified by your user at query time as a UTC offset.

:   facet.range.start = 2011-01-01T00:00:00Z+${useroffset}MINUTES
:   facet.range.gap = +1DAY
:   etc...
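
In SolrJ terms that suggestion might look like the following sketch, where
userOffset is a hypothetical per-user value and created_at is an illustrative
date field:

String userOffset = "+300MINUTES"; // local midnight for a UTC-5 user falls at 05:00Z
query.set("facet", "true");
query.set("facet.range", "created_at");
query.set("facet.range.start", "2010-01-01T00:00:00Z" + userOffset);
query.set("facet.range.end",   "2012-01-01T00:00:00Z" + userOffset);
query.set("facet.range.gap", "+1DAY");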

-Hoss


Want to Delete Existing Index & create fresh index

2011-05-13 Thread Pawan Darira
Hi

I had an existing index created months back. Now my database schema has
changed, so I wanted to delete the current data/index directory and re-create
the index from scratch.

But it is saying that the "segments" file was not found, and it just creates
a blank data/index directory. Please help.

-- 
Thanks,
Pawan Darira


Re: Want to Delete Existing Index & create fresh index

2011-05-13 Thread Gabriele Kahlout
"curl --fail $solrIndex/update?commit=true -d
'*:*'" #empty index [1
]

did u try?


On Sat, May 14, 2011 at 7:26 AM, Pawan Darira wrote:

> Hi
>
> I had an existing index created months back. now my database schema has
> changed. i wanted to delete the current data/index directory & re-create
> the
> fresh index
>
> but it is saying that "segments" file not found & just create blank
> data/index directory. Please help
>
> --
> Thanks,
> Pawan Darira
>



-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


RE: Schema Design Question

2011-05-13 Thread Zac Smith
Thanks, that looks interesting. I don't think it helps my situation though, as I 
would have to index all the bookshelves and would still end up having to put 
thousands of book ID values in a multi-value field.

I guess the question I have is: is it more appropriate to load a multi-value 
field with a large number of values, or should you pass a large number of values 
in as Boolean clauses?
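
For what it's worth, a sketch of what the trunk join syntax could look like
here, assuming hypothetical "membership" documents in the same index that
carry just a book_id and a shelf_id:

q={!join from=book_id to=id}shelf_id:42

That would return the book documents on shelf 42 without either a huge
multi-value field on the books or a huge Boolean query.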

Zac

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Friday, May 13, 2011 10:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Schema Design Question

Hi Zac,

Solr 4.0 (trunk) has support for relationships/JOIN.  Have a look: 
http://search-lucene.com/?q=solr+join

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Zac Smith 
> To: "solr-user@lucene.apache.org" 
> Sent: Fri, May 13, 2011 12:28:35 PM
> Subject: Schema Design Question
> 
> Let's say I have a data model that involves books and bookshelves. I 
>have tens of thousands of books and thousands of bookshelves. There is 
>a many-many relationship between books & bookshelves. All of the books are 
>indexed by  SOLR.
> 
> I need to be able to query SOLR and get all the books for a given  
>bookshelf. I see two schema design options here:
> 
> 
> 1)   Each book has a multi-value field that contains a list of all the  
>bookshelf ID's. Many books will have thousands of bookshelf ID's. In 
>this case the query is simple, I just send solr the bookshelf ID.
> 
> 2)   I send solr a query with each book on the bookshelf e.g.  
>q=book_id:(1+OR+2+OR+3 ). Many bookshelves will have thousands of 
>book ID's so the query can get rather large.
> 
> Right now I am using option 2 and it  seems to be working fine. I have 
>had to crank 'maxBooleanClauses' right up but  it does seem to be pretty fast.
> 
> Anyone have an opinion?
> 
>