Re: What does it mean when you see a plus sign in between two words inside synonyms.txt?

2010-04-05 Thread Koji Sekiguchi

paulosalamat wrote:

Hi I'm new to this group,

I would like to ask a question:

What does it mean when you see a plus sign in between two words inside
synonyms.txt?

e.g. 


macbookair => macbook+air

Thanks,
Paulo
  

Welcome, Paulo!

It depends on your tokenizer. You can specify a tokenizer via the
tokenizerFactory attribute when you use SynonymFilterFactory.
That tokenizer is used when SynonymFilterFactory reads
synonyms.txt. If you do not specify one, WhitespaceTokenizer
is used by default.

In the above example, the term text "macbookair" will be
normalized to the term text "macbook+air", if WhitespaceTokenizer
is used.
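
For reference, specifying the tokenizer in schema.xml looks roughly like this
(the KeywordTokenizer choice here is only an illustration, not a recommendation):

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="false"
          tokenizerFactory="solr.KeywordTokenizerFactory"/>

With a tokenizer like KeywordTokenizer, a right-hand side such as "macbook air"
in synonyms.txt is kept as one token instead of being split on whitespace.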

Koji

--
http://www.rondhuit.com/en/



Re: What does it mean when you see a plus sign in between two words inside synonyms.txt?

2010-04-05 Thread paulosalamat

Hi Koji,

Thank you for the reply.

I have another question. If WhitespaceTokenizer is used, is the term text
"macbook+air" equal to "macbook air"?

Thank you,
Paulo


On Mon, Apr 5, 2010 at 5:50 PM, Koji Sekiguchi [via Lucene] <
ml-node+697386-2142071620-218...@n3.nabble.com
> wrote:

> paulosalamat wrote:
>
> > Hi I'm new to this group,
> >
> > I would like to ask a question:
> >
> > What does it mean when you see a plus sign in between two words inside
> > synonyms.txt?
> >
> > e.g.
> >
> > macbookair => macbook+air
> >
> > Thanks,
> > Paulo
> >
> Welcome, Paulo!
>
> It depends on your tokenizer. You can specify a tokenizer via
> tokenizerFactory attribute when you use SynonymFilterFactory.
> The tokenizer is used when SynonymFilterFactory reads the
> synonyms.txt. If you do not specify it, WhitespaceTokenizer
> will be used as default.
>
> In the above example, the term text "macbookair" will be
> normalized to the term text "macbook+air", if WhitespaceTokenizer
> is used.
>
> Koji
>
> --
> http://www.rondhuit.com/en/
>
>
>



Re: What does it mean when you see a plus sign in between two words inside synonyms.txt?

2010-04-05 Thread Koji Sekiguchi

paulosalamat wrote:

Hi Koji,

Thank you for the reply.

I have another question. If WhitespaceTokenizer is used, is the term text
"macbook+air" equal to "macbook air"?
  

No. In the field, "macbook air" will be a phrase (not a term).
You can define not only terms but phrases in synonyms.txt:

ex)
macbookair => macbook air

Koji

--
http://www.rondhuit.com/en/



Re: Obtaining SOLR index size on disk

2010-04-05 Thread Na_D

Hi,

I am using the piece of code given below:

  ReplicationHandler handler2 = new ReplicationHandler();
  System.out.println(handler2.getDescription());

  NamedList statistics = handler2.getStatistics();
  System.out.println("Statistics   " + statistics);

The result that I am getting (i.e. the printed statement) is:

Statistics
{handlerStart=1270469530218,requests=0,errors=0,timeouts=0,totalTime=0,avgTimePerRequest=NaN,avgRequestsPerSecond=NaN}


But the Statistics consist of other info too:

  class: org.apache.solr.handler.ReplicationHandler
  version: $Revision: 829682 $
  description: ReplicationHandler provides replication of index and
               configuration files from Master to Slaves
  handlerStart: 1270463612968
  requests: 0
  errors: 0
  timeouts: 0
  totalTime: 0
  avgTimePerRequest: NaN
  avgRequestsPerSecond: 0.0
  indexSize: 19.29 KB
  indexVersion: 1266984293131
  generation: 3
  indexPath: C:\solr\apache-solr-1.4.0\example\example-DIH\solr\db\data\index
  isMaster: true
  isSlave: false
  confFiles: schema.xml,stopwords.txt,elevate.xml
  replicateAfter: [commit, startup]
  replicationEnabled: true




This is where the problem lies: I need the size of the index, but I'm not finding
an API for it, nor does the statistics printout (sysout) show the same information.
How do I get the size of the index?


Re: cheking the size of the index using solrj API's

2010-04-05 Thread Na_D

(This message is a repost of "Re: Obtaining SOLR index size on disk" above, with
the same ReplicationHandler code snippet, the same statistics output, and the
same question about how to get the size of the index on disk.)


Re: cheking the size of the index using solrj API's

2010-04-05 Thread Peter Sturge
If you're using ReplicationHandler directly, you already have the xml from
which to extract the 'indexSize' attribute.
From a client, you can get the indexSize by issuing:
  http://hostname:8983/solr/core/replication?command=details
This will give you an xml response.
Use:
  http://hostname:8983/solr/core/replication?command=details&wt=json
to give you a json string that has 'indexSize' within it:

{"responseHeader":{"status":0,"QTime":0},"details":{"indexSize":"6.63
KB","indexPath":"usr//bin/solr/core0/index","commits":[["indexVersion",1259974360056,"generation",1572,"filelist",["segments_17o"]],["indexVersion",1259974360057,"generation",1573,"filelist",["segments_17p","_zv.fdx","_zv.fnm","_zv.fdt","_zv.nrm","_zv.tis","_zv.prx","_zv.tii","_zv.frq"]]],"isMaster":"true","isSlave":"false","indexVersion":1259974360057,"generation":1573,"backup":["startTime","Mon
Apr 05 14:28:46 BST
2010","fileCount",17,"status","success","snapshotCompletedAt","Mon Apr
05 14:28:47 BST 2010"]},"WARNING":"This response format is
experimental.  It is likely to change in the future."}

Either way, you'll need to have some sort of parsing logic or formatting to
get just the index size bit.
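
If you'd rather do it from Java, a minimal SolrJ sketch along these lines
(1.4-era SolrJ assumed; host and core name are placeholders) should return
the same 'indexSize' value:

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.request.QueryRequest;
  import org.apache.solr.common.params.ModifiableSolrParams;
  import org.apache.solr.common.util.NamedList;

  public class IndexSizeCheck {
    public static void main(String[] args) throws Exception {
      // point this at your own host/core
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/core");

      // ask the replication handler for its details instead of running a search
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("command", "details");
      QueryRequest req = new QueryRequest(params);
      req.setPath("/replication");

      NamedList<Object> rsp = server.request(req);
      NamedList<?> details = (NamedList<?>) rsp.get("details");
      System.out.println("indexSize = " + details.get("indexSize"));
    }
  }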


Re: Apache Lucene EuroCon Call For Participation: Prague, Czech Republic May 20 & 21, 2010

2010-04-05 Thread Grant Ingersoll
Just a reminder, just over one week left open on the CFP.  Some great talks 
entered already.  Keep it up!

On Mar 24, 2010, at 8:03 PM, Grant Ingersoll wrote:

> Apache Lucene EuroCon Call For Participation - Prague, Czech Republic May 20 
> & 21, 2010
>  
> All submissions must be received by Tuesday, April 13, 2010, 12 Midnight 
> CET/6 PM US EDT
> 
> The first European conference dedicated to Lucene and Solr is coming to 
> Prague from May 18-21, 2010. Apache Lucene EuroCon is running on a 
> not-for-profit basis, with net proceeds donated back to the Apache Software 
> Foundation. The conference is sponsored by Lucid Imagination with additional 
> support from community and other commercial co-sponsors.
> 
> Key Dates:
> 24 March 2010: Call For Participation Open
> 13 April 2010: Call For Participation Closes
> 16 April 2010: Speaker Acceptance/Rejection Notification
> 18-19 May 2010: Lucene and Solr Pre-conference Training Sessions
> 20-21 May 2010: Apache Lucene EuroCon
> 
> This conference creates a new opportunity for the Apache Lucene/Solr 
> community and marketplace, providing  the chance to gather, learn and 
> collaborate on the latest in Apache Lucene and Solr search technologies and 
> what's happening in the community and ecosystem. There will be two days of 
> Lucene and Solr training offered May 18 & 19, and followed by two days packed 
> with leading edge Lucene and Solr Open Source Search content and talks by 
> search and open source thought leaders.
> 
> We are soliciting 45-minute presentations for the conference, 20-21 May 2010 
> in Prague. The conference and all presentations will be in English.
> 
> Topics of interest include: 
> - Lucene and Solr in the Enterprise (case studies, implementation, return on 
> investment, etc.)
> - “How We Did It”  Development Case Studies
> - Spatial/Geo search
> - Lucene and Solr in the Cloud
> - Scalability and Performance Tuning
> - Large Scale Search
> - Real Time Search
> - Data Integration/Data Management
> - Tika, Nutch and Mahout
> - Lucene Connectors Framework
> - Faceting and Categorization
> - Relevance in Practice
> - Lucene & Solr for Mobile Applications
> - Multi-language Support
> - Indexing and Analysis Techniques
> - Advanced Topics in Lucene & Solr Development
> 
> All accepted speakers will qualify for discounted conference admission. 
> Financial assistance is available for speakers that qualify.
> 
> To submit a 45-minute presentation proposal, please send an email to 
> c...@lucene-eurocon.org containing the following information in plain text:
> 
> 1. Your full name, title, and organization
> 
> 2. Contact information, including your address, email, phone number
> 
> 3. The name of your proposed session (keep your title simple and relevant to 
> the topic)
> 
> 4. A 75-200 word overview of your presentation (in English); in addition to 
> the topic, describe whether your presentation is intended as a tutorial, 
> description of an implementation, a theoretical/academic discussion, etc.
> 
> 5. A 100-200-word speaker bio that includes prior conference speaking or 
> related experience (in English)
> 
> To be considered, proposals must be received by 12 Midnight CET Tuesday, 13 
> April 2010 (Tuesday 13 April 6 PM US Eastern time, 3 PM US Pacific Time).
> 
> Please email any questions regarding the conference to 
> i...@lucene-eurocon.org. To be added to the conference mailing list, please 
> email sig...@lucene-eurocon.org. If your organization is interested in 
> sponsorship opportunities, email
> spon...@lucene-eurocon.org
> 
> Key Dates
> 
> 24 March 2010: Call For Participation Open
> 13 April 2010: Call For Participation Closes
> 16 April 2010: Speaker Acceptance/Rejection Notification
> 18-19 May 2010  Lucene and Solr Pre-conference Training Sessions
> 20-21 May 2010: Apache Lucene EuroCon
> 
> We look forward to seeing you in Prague!
> 
> Grant Ingersoll
> Apache Lucene EuroCon Program Chair
> www.lucene-eurocon.org



Re: cheking the size of the index using solrj API's

2010-04-05 Thread Ryan McKinley
On Fri, Apr 2, 2010 at 7:07 AM, Na_D  wrote:
>
> hi,
>
>
> I need to monitor the index for the following information:
>
> 1. Size of the index
> 2 Last time the index was updated.
>

If by 'size of the index' you mean document count, then check the Luke
Request Handler
http://wiki.apache.org/solr/LukeRequestHandler
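
For example, with the stock example config (port and core are assumptions):

  http://localhost:8983/solr/admin/luke?numTerms=0

returns numDocs, maxDoc and other index-level details without walking the top terms.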

ryan


Re: add/update document as distinct operations? Is it possible?

2010-04-05 Thread Julian Davchev
Hi,
I got the picture now.
Not having distinct add/update actions forces me to implement a custom
queueing mechanism.
Thanks
Cheers.

Erick Erickson wrote:
> One of the most requested features in Lucene/SOLR is to be able
> to update only selected fields rather than the whole document. But
> that's not how it works at present. An update is really a delete and
> an add.
>
> So for your second message, you can't do a partial update, you must
> "update" the whole document.
>
> I'm a little confused by what you *want* in your first e-mail. But the
> current way SOLR works, if the SOLR server first received the delete
> then the update, the index would have the document in it. But the
> opposite order would delete the document.
>
> But this really doesn't sound like a SOLR issue, since SOLR can't
> magically divine the desired outcome. Somewhere you have
> to coordinate the requests or your index will not be what you expect.
> That is, you have to define what rules index modifications follow and
> enforce them. Perhaps you can consider a queueing mechanism of
> some sort (that you'd have to implement yourself...)
>
> HTH
> Erick
>
>
> On Thu, Apr 1, 2010 at 1:03 AM, Julian Davchev  wrote:
>
>   
>> Hi
>> I have a distributed messaging solution where I need to distinguish between
>> adding a document and just
>> trying to update it.
>>
>> Scenario:
>> 1. message sent for document to be updated
>> 2. meanwhile another message is sent for document to be deleted and is
>> executed before 1
>> As a result when 1 comes instead of ignoring the update as document is
>> no more...it will add it again.
>>
>> From what I see in the manual I cannot distinguish those operations.
>> Any pointers?
>>
>> Cheers
>>
>> 
>
>   



Re: add/update document as distinct operations? Is it possible?

2010-04-05 Thread Israel Ekpo
Chris,

I don't see anything in the headers suggesting that Julian's message was a
hijack of another thread

On Thu, Apr 1, 2010 at 2:17 PM, Chris Hostetter wrote:

>
> : Subject: add/update document as distinct operations? Is it possible?
> : References:
> :
> 
> : In-Reply-To:
> :
> 
>
> http://people.apache.org/~hossman/#threadhijack
> Thread Hijacking on Mailing Lists
>
> When starting a new discussion on a mailing list, please do not reply to
> an existing message, instead start a fresh email.  Even if you change the
> subject line of your email, other mail headers still track which thread
> you replied to and your question is "hidden" in that thread and gets less
> attention.   It makes following discussions in the mailing list archives
> particularly difficult.
> See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking
>
>
>
>
> -Hoss
>
>


-- 
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/


Re: Related terms/combined terms

2010-04-05 Thread Ahmet Arslan

> Not sure of the exact vocabulary I am looking for so I'll
> try to explain
> myself.
> 
> Given a search term is there anyway to return back a list
> of related/grouped
> keywords (based on the current state of the index) for that
> term. 
> 
> For example say I have a sports catalog and I search for
> "Callaway". Is
> there anything that could give me back
> 
> "Callaway Driver"
> "Callaway Golf Balls"
> "Callaway Hat"
> "Callaway Glove"
> 
> Since these words are always grouped to together/related.
> Note sure if
> something like this is even possible.

ShingleFilterFactory[1] plus TermsComponent[2] can give you grouped (phrase) 
keywords. You need to create an extra field (populate it via copyField) that 
constructs shingles (token n-grams). After that you can retrieve those bi-gram 
or trigram tokens starting with callaway: 
solr/terms?terms=true&terms.fl=yourNewField&terms.prefix=Callaway


[1]http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory

[2]http://wiki.apache.org/solr/TermsComponent
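
If it helps, a rough sketch of that extra field in schema.xml (field and type
names here are made up):

  <fieldType name="shingleText" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
    </analyzer>
  </fieldType>

  <field name="title_shingles" type="shingleText" indexed="true" stored="false"/>
  <copyField source="title" dest="title_shingles"/>

Note that with the LowerCaseFilter in the chain you would query
terms.prefix=callaway (lowercase) rather than Callaway.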




  


Re: add/update document as distinct operations? Is it possible?

2010-04-05 Thread Erick Erickson
I still don't see what the difference is. If there was a distinct
add/update process, how would that absolve you from having
to implement your own queueing? To have predictable index
content, you still must order your operations.

Best
Erick

On Mon, Apr 5, 2010 at 12:45 PM, Julian Davchev  wrote:

> Hi,
> I got the picture now.
> Not having distinct add/update actions force me to implement custom
> queueing mechanism.
> Thanks
> Cheers.
>
> Erick Erickson wrote:
> > One of the most requested features in Lucene/SOLR is to be able
> > to update only selected fields rather than the whole document. But
> > that's not how it works at present. An update is really a delete and
> > an add.
> >
> > So for your second message, you can't do a partial update, you must
> > "update" the whole document.
> >
> > I'm a little confused by what you *want* in your first e-mail. But the
> > current way SOLR works, if the SOLR server first received the delete
> > then the update, the index would have the document in it. But the
> > opposite order would delete the documen.
> >
> > But this really doesn't sound like a SOLR issue, since SOLR can't
> > magically divine the desired outcome. Somewhere you have
> > to coordinate the requests or your index will not be what you expect.
> > That is, you have to define what rules index modifications follow and
> > enforce them. Perhaps you can consider a queueing mechanism of
> > some sort (that you'd have to implement yourself...)
> >
> > HTH
> > Erick
> >
> >
> > On Thu, Apr 1, 2010 at 1:03 AM, Julian Davchev  wrote:
> >
> >
> >> Hi
> >> I have distributed messaging solution where I need to distinct between
> >> adding a document and just
> >> trying to update it.
> >>
> >> Scenario:
> >> 1. message sent for document to be updated
> >> 2. meanwhile another message is sent for document to be deleted and is
> >> executed before 1
> >> As a result when 1 comes instead of ignoring the update as document is
> >> no more...it will add it again.
> >>
> >> From what I see in manual I cannot distinct those operations which
> >> would. Any pointers?
> >>
> >> Cheers
> >>
> >>
> >
> >
>
>


Re: excluder filters and multivalued fields

2010-04-05 Thread Chris Hostetter

: name->john
: year->2009;year->2010;year->2011
: 
: And I query for:
: q=john&fq=-year:2010
: 
: Doc1 won't be in the matching results. Is there a way to make it appear
: because even having 2010 the document has also years that don't match the
: filter query?

Not natively -- but you can index an additional field num_years=3 and then 
make your filter query...

-(+year:2010 +num_years:1)


-Hoss



Re: Minimum Should Match the other way round

2010-04-05 Thread Grant Ingersoll

On Apr 3, 2010, at 10:18 AM, MitchK wrote:

> 
> Hello,
> 
> I want to tinker a little bit with Solr, so I need a little feedback:
> Is it possible to define a Minimum Should Match for the document itself?
> 
> I mean, it is possible to say, that a query "this is my query" should only
> match a document, if the document matches 3 of the four queried terms.
> 
> However, I am searching for a solution that does something like: "this is my
> query" and the document has to consist of this query plus at most - for
> example - two additional terms?
> 
> Example:
> Query: "this is my query"
> Doc1: "this is my favorite query"
> Doc2: "I am searching for a lot of stuff, so this is my query"
> Doc2: "I'd like to say: this is my query"
> 
> Saying that at most two additional terms may occur in the document, Solr
> should return only doc1.
> If this is not possible out-of-the-box, I think one has to work with
> TermVectors, am I right?

Not quite following.  It sounds like you are saying you want to favor docs that 
are shorter, while still maximizing the number of terms that match, right?

You might look at the Similarity class and the SimilarityFactory as well in the 
Solr/Lucene code.

> 
> I think it's possible to do so outside of Lucene/Solr by taking the response
> of the TermVectorsComponent and filtering the result-list. But I'd like to
> integrate this into Lucene/Solr itself.
> Any ideas which components I have to customize? 
> 
> At the moment I am speculating that I have to customize the class which is
> collecting the result, before it is passing it to the ResponseWriter. 
> 
> Kind regards
> - Mitch
> -- 
> View this message in context: 
> http://n3.nabble.com/Minimum-Should-Match-the-other-way-round-tp694867p694867.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Does Lucidimagination search uses Multi facet query filter or uses session?

2010-04-05 Thread Grant Ingersoll
We are using multiselect facets like what you have below (although I haven't 
verified your syntax).  So no, we are not using sessions.

See http://www.lucidimagination.com/search/?q=multiselect+faceting#/s:email for 
help.

-Grant
http://www.lucidimagination.com

On Apr 1, 2010, at 12:35 PM, bbarani wrote:

> 
> Hi,
> 
> I am trying to create a search functionality same as that of
> Lucidimagination search.
> 
> As of now I have formed the Facet query as below
> 
> http://localhost:8080/solr/db/select?q=*:*&fq={!tag=3DotHierarchyFacet}3DotHierarchyFacet:ABC&facet=on&facet.field={!ex=3DotHierarchyFacet}3DotHierarchyFacet&facet.field=ApplicationStatusFacet&facet.mincount=1
> 
> Since I am having multiple facets I have planned to form the query based on
> the user selection. Something like below...if the user selects (multiple
> facets) application status as 'P' I would form the query as below
> 
> http://localhost:8080/solr/db/select?q=*:*&fq={!tag=3DotHierarchyFacet}3DotHierarchyFacet:NTS&fq={!tag=ApplicationStatusFacet}ApplicationStatusFacet:P&facet=on&facet.field={!ex=3DotHierarchyFacet}3DotHierarchyFacet&&facet.field={!ex=ApplicationStatusFacet}&facet.mincount=1
> 
> Can someone let me know I am forming the correct query to perform
> multiselect facets? I just want to know if I am doing anything wrong in the
> query..
> 
> We are also trying to achieve this using sessions but if we are able to
> solve this by query I would prefer using query than using session
> variables..
> 
> Thanks,
> Barani
> -- 
> View this message in context: 
> http://n3.nabble.com/Does-Lucidimagination-search-uses-Multi-facet-query-filter-or-uses-session-tp691167p691167.html
> Sent from the Solr - User mailing list archive at Nabble.com.
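
For reference, the basic multi-select shape (using one of the fields from the
question; I have not run this against an instance) is to tag the fq and exclude
that tag on the matching facet.field:

  q=*:*
  &fq={!tag=status}ApplicationStatusFacet:P
  &facet=on
  &facet.field={!ex=status}ApplicationStatusFacet
  &facet.mincount=1

Each excluded facet.field still needs the field name right after the {!ex=...}
local param.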



Re: MoreLikeThis function queries

2010-04-05 Thread Blargy

Ok its now monday and everyone should have had their nice morning cup of
coffee :)


Re: feature request for ivalid data formats

2010-04-05 Thread Chris Hostetter
: 
: I don't know whether this is the good place to ask it, or there is a special
: tool for issue
: requests.

We use Jira for bug reports and feature requests, but it's always a good 
idea to start with a solr-user email before filing a new bug/request to 
help discuss the behavior you are seeing.

: 2010.03.23. 13:27:23 org.apache.solr.common.SolrException log
: SEVERE: java.lang.NumberFormatException: For input string: "1595-1600"
:at
: java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
:at java.lang.Integer.parseInt(Integer.java:456)
: 
: It would be great help in some cases, if I could know which field contained
: this data in wrong format.

you are 100% correct ... can you let us know what the rest of the stack 
trace is (beyond that last line you posted) so we can figure out exactly 
where the bug is?

: "SimplePostTool: FATAL: Solr returned an error: For_input_string_15951600
: __javalangNumberFormatException_For_input_string_15951600
: ___at_javalangNumberFormatExceptionforInputStringNumberFormat"
: 
: (I added some line breaks for the shake of readability.)
: 
: Could not be returned a string with the same format as in Solr log?

Solr relies on the servlet container to format the error and return it to 
the user. With Jetty, the error does actually come back in human-readable 
form as part of the response body -- what SimplePostTool is printing 
out there is actually the one-line HTTP "response message", which Jetty (in 
its infinite wisdom) sets using the entire response with the whitespace 
and newlines escaped.

If you use something like "curl -D -" to hit a Solr URL, you'll see what I 
mean about the response message vs. the response body, and if you use a 
different servlet container (like Tomcat) you'll see what I mean about the 
servlet container having control over what the error messages look like.


-Hoss



Re: dismax multi search?

2010-04-05 Thread Chris Hostetter

: I want to be able to direct some search terms to specific fields
: 
: I want to do something like this
: 
: keyword1 should search against book titles / authors
: 
: keyword2 should search against book contents / book info / user reviews

your question is a little vague ... will keyword1 and keyword2 be distinct 
params (ie: will the user tell you when certain words should be queried 
against titles/authors and when other keywords sould be queried against 
content/info/reviews) ... or are you going to have big ass giant workd 
lists, and anytime you see a word from one of those lists, you query a 
specific field for that word?

assuming you mean the first (and not the second) situation, you can use 
nested query parsers with param substitution to get some interesting 
results...

http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/
http://n3.nabble.com/How-to-compose-a-query-from-multiple-HTTP-URL-parameters-td519441.html#a679489
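
For example, something along these lines (field names are made up) sends each
keyword to its own dismax query via nested query parsers and param dereferencing:

  q=_query_:"{!dismax qf='title author' v=$qa}" AND _query_:"{!dismax qf='contents info reviews' v=$qb}"
  &qa=keyword1
  &qb=keyword2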



-Hoss



including external files in config by corename

2010-04-05 Thread Shawn Heisey
Is it possible to access the core name in a config file (such as 
solrconfig.xml) so I can include core-specific configlets into a common 
config file?  I would like to pull in different configurations for 
things like shards and replication, but have all the cores otherwise use 
an identical config file.


Also, I have been looking for the syntax to include a snippet and 
haven't turned anything up yet.


Thanks,
Shawn



Re: Related terms/combined terms

2010-04-05 Thread Blargy

Thanks for the response Mitch. 

I'm not too sure how well this will work for my needs but I'll certainly play
around with it. I think something more along the lines of Ahmet's solution
is what I was looking for. 


Re: no of cfs files are more that the mergeFactor

2010-04-05 Thread Chris Hostetter

This sounds completely normal from what I remember about mergeFactor.

Segments are merged "by level", meaning that with a mergeFactor of 5, once 
5 "level 1" segments are formed they are merged into a single "level 2" 
segment.  Then 5 more "level 1" segments are allowed to form before the 
next merge (resulting in 2 "level 2" segments).  Once you have 5 "level 2" 
segments, then they are all merged into a single "level 3" segment, etc...

: I had my mergeFactor as 5 , 
: but when i load a data with some 1,00,000 i got some 12 .cfs files in my
: data/index folder .
: 
: How come this is possible .
: in what context we can have more no of .cfs files 


-Hoss



Re: Related terms/combined terms

2010-04-05 Thread Blargy

Ahmet thanks, this sounds like what I was looking for. 

Would one recommend using the TermsComponent prefix search or the Faceted
prefix search for this sort of functionality? I know for auto-suggest
functionality the general consensus has been leaning towards the Faceted
prefix search over the TermsComponent. Wondering if this holds true for this
use case.

Thanks again


Re: no of cfs files are more that the mergeFactor

2010-04-05 Thread Mark Miller
I'm guessing the user is expecting there to be one cfs file for the 
index, and does not understand that its actually per segment.


On 04/05/2010 01:59 PM, Chris Hostetter wrote:

This sounds completley normal form what i remembe about mergeFactor.

Segmenets are merged "by level" meaning that with a mergeFactor of 5, once
5 "level 1" segments are formed they are merged into a single "level 2"
segment.  then 5 more "level 1" segments are allowed to form before the
next merge (resulting in 2 "legel 2" sements).  Once you have 5 "level 2"
sements, then they are all merged into a single "level 3" segment, etc...

: I had my mergeFactor as 5 ,
: but when i load a data with some 1,00,000 i got some 12 .cfs files in my
: data/index folder .
:
: How come this is possible .
: in what context we can have more no of .cfs files


-Hoss

   



--
- Mark

http://www.lucidimagination.com





Re: Getting solr response in HTML format : HTMLResponseWriter

2010-04-05 Thread Chris Hostetter

: so I have tried to attach the xslt steelsheet to the response of SOLR with
: passing this 2 variables wt=xslt&tr=example.xsl
: 
: while example.xsl is an included steelsheet to SOLR , but the response in
: HTML was'nt very perfect .

can you elaborate on what you mean by "wasn't very perfect"? ... what was 
wrong with it? ... was there an actual bug, or were you just not happy 
with how it looked?  did you try modifying the example.xsl?  (it's intended 
purely as an example ... it's not meant to work for everyone as is)

: So i have readen on the net that we can write an extension to the
: QueryResponseWriter class like XMLResponseWriter (default)
: and i m trying to build that .
...
: I m proceeding like XMLREsponseWriter to create HTMLResponseWriter and i

I would strongly suggest that instead of doing this, you take a look at 
the velocity response writer (in contrib) or tweak the XSL some more ... 
writing a custom HTMLResponseWriter isn't nearly as flexible as either of 
those other two options -- particularly because the ResponseWriter API 
requires you to deal with the Response objects in the order they are added 
by the RequestHandler -- which isn't necessarily the same order you want 
to deal with them in an HTML response.  (this isn't typically a problem 
for most ResponseWriters because they aren't typically intended to be read 
by humans)

: org.apache.solr.common.SolrException: Error loading class
: 'org.apache.solr.request.HTMLResponseWriter'

1) if you are writing a custom ResponseWriter, you should be using your 
own java package name, not "org.apache.solr.request"

: Caused by: java.lang.ClassNotFoundException:
: org.apache.solr.request.HTMLResponseWriter

2) it can't find your class.  did you compile it?  did you put it in a 
jar? where did you put the jar?  what does your solr install look like? 
... the details are the key to understanding why it can't find your class.



-Hoss



Re: exceptionhandling & error-reporting?

2010-04-05 Thread Chris Hostetter

: This client uses a simple user-agent that requires JSON-syntax while parsing 
: searchresults from solr, but when solr drops an exception, tomcat returns an 
: error-500 page to the client and it crashes. 

define "crashes" ? ... presumabl you are tlaking about the client crashing 
because it can't parse theerro response, correct? ... the best suggestion 
given the current state of Solr is to make hte client smart enough to not 
attempt parsing of hte response unless the response code is 200.

: I was wondering if theres already a way to prepare exceptions as 
error-reports 
: and integrate them into the search-result as a hint to the user? If it would 
: be just another element of the whole response-format, it would be possibly 
: compatible with any client out there. 

It's one of the oldest out standing "improvements" in the Solr issue 
tracker, but it hasn't gotten much love over the years...

https://issues.apache.org/jira/browse/SOLR-141

One possible workaround, if you are comfortable with Java and if you are 
willing to always get the errors in a single response format (ie: JSON)...

you can customize the solr.war to specify an "error jsp" that your servlet 
container will use to format all error responses.  you can make that JSP 
extract the error message from the Exception and output it in JSON format.
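
The servlet-spec half of that looks roughly like this in the webapp's web.xml
(the JSP name is made up; the JSP itself still has to pull the message off the
exception and emit JSON):

  <error-page>
    <error-code>500</error-code>
    <location>/error.jsp</location>
  </error-page>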



-Hoss



Need info on CachedSQLentity processor

2010-04-05 Thread bbarani

Hi,

I am using cachedSqlEntityprocessor in DIH to index the data. Please find
below my dataconfig structure,

 entity x  ---> object 
 entity y (sub-entity)  ---> object properties 

For each and every object I would be retrieving the corresponding object
properties (in my subqueries).

I get into OOM very often and I think that's a trade-off if I use
cachedSqlEntityprocessor. 

My assumption is that when I use cachedSqlEntityprocessor the indexing
happens as follows,

First entity x will get executed and the entire table gets stored in cache

next entity y gets executed and entire table gets stored in cache 

Finally the comparison happens through a hash map.

So do I always need to have the memory allocated to the SOLR JVM greater than
or equal to the data present in the tables?


Now my final question: even after SOLR completes indexing, the memory
used previously is not getting released. I could still see the JVM consuming
1.5 GB after the indexing completes. I tried to use Java HotSpot options but
didn't see any difference.

Any thoughts / confirmation on my assumptions above would be of great help
to me to get in to  a decision of choosing cachedSqlEntityprocessor or not.

Thanks,
BB





Re: Is this a bug of the RessourceLoader?

2010-04-05 Thread Chris Hostetter

: Some applications (such as Windows Notepad), insert a UTF-8 Byte Order Mark
: (BOM) as the first character of the file. So, perhaps the first word in your
: stopwords list contains a UTF-8 BOM and thats why you are seeing this
: behavior.

Robert: BOMs are one of those things that strike me as being abhorrent and 
inherently evil because they seem to cause nothing but problems -- but in 
truth I understand very little about them and have no idea if/when they 
actually add value.

If text files that start with a BOM aren't properly being dealt with by 
Solr right now, should we consider that a bug?  Is there something we 
can/should be doing in SolrResourceLoader to make Solr handle this 
situation better?


-Hoss



RE: Query time only Ranges

2010-04-05 Thread Chris Hostetter

: Actually I needed time upto seconds granularity, so did you mean I 
: should index the field after conversion into seconds

it doesn't really matter what granularity you need -- the point is if you 
need to query for things based on time of day, independent of the actual 
date, then the best way to do this is probably to ignore the Solr 
DateField completely and just use a numeric field to index some unit of 
time as a number (it doesn't matter whether it's hours, minutes, seconds, 
or milliseconds -- use whatever makes the most sense for your needs)

: if you only need to store the hour of the day, and query on the hour of 
: the day, then i would just use a numeric integer field containing the hour 
: of the day.
: 
: if you want minute or second (or even millisecond) granularity, but you 
: still only care about the time of day (and not the *date*) then i would 
: still use an integer field, and just index the numeric value in whatever 
: granularity you need.
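
as a rough illustration (field name is made up; "tint" is the integer type from 
the 1.4 example schema): index seconds-since-midnight in

  <field name="time_of_day" type="tint" indexed="true" stored="true"/>

and then 9am-5pm becomes

  fq=time_of_day:[32400 TO 61200]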

-Hoss



Re: selecting documents older than 4 hours

2010-04-05 Thread Chris Hostetter

: NOW/HOUR-5HOURS evaluates to 2010-03-31T21:00:00 which should not be the
: case if the current time is Wed Mar 31 19:50:48 PDT 2010. Is SOLR converting
: NOW to GMT time? 

1) "NOW" means "Now" ... what moment in time is happening right at this 
moment is independent of what locale you are in and how you want to format 
that moment to represent it as a string.

2) Solr always parses/formats date time values in UTC because Solr has no 
way of knowing what timezone the clients are in (or if some clients are in 
different timezones from each other, or if the index is being replicated 
from a server in one timezone to a server in a different timezone, 
etc...).  The documentation for DateField is very explicit about this 
(it's why the trailing "Z" is mandatory)

3) Rounding is always done relative UTC, largely for all of the same 
reasons listed above.  If you want a specific offset you have to add it in 
using the DateMath syntax, ie...

last_update_date:[NOW/DAY-7DAYS+8HOURS TO NOW/HOUR-5HOURS+8HOURS]


-Hoss



Re: Is this a bug of the RessourceLoader?

2010-04-05 Thread Yonik Seeley
On Mon, Apr 5, 2010 at 2:28 PM, Chris Hostetter
 wrote:
> If text files that start with a BOM aren't properly being dealt with by
> Solr right now, should we consider that a bug?

It's a Java bug:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058

But we should fix if it's practical to do so, rather than passing the buck.

-Yonik
http://www.lucidimagination.com


Re: Minimum Should Match the other way round

2010-04-05 Thread Chris Hostetter
: > However, I am searching for a solution that does something like: "this is my
: > query" and the document has to consist of this query plus maximal - for
: > example - two another terms?
...
: Not quite following.  It sounds like you are saying you want to favor 
: docs that are shorter, while still maximizing the number of terms that 
: match, right?

I'm pretty sure he's looking for more than what Similarity can provide 
w/lengthNorms -- note that he specifically wants to eliminate matches that 
contain more than X additional terms besides what's included in the query.  
(so the doc "how now brown sexy cow" would match a query for q=how+cow&x=3 
but it would not match a query for q=how+cow&x=2, because there are more 
than 2 "left over" words in the document)

This sounds a lot like a usecase that I mentioned in my "Beyond The Box" 
talk at ACUS2008...
   http://people.apache.org/~hossman/apachecon2008us/btb/

...take a look at slides 32-35.  The first approach is how the person I 
spoke to (anonymous) actually solved this problem for their company (note: 
it was not actually a movie title domain space, that's my own example) and 
the second approach is an example of how I would have probably attempted 
to tackle this problem.  (Note: in hindsight, you can't have a generic 
numeric field with a tokenizer, so that "titleLen" field would need to be 
a TextField and you'd have to use oldschool zero padding tricks to make 
the range query work properly -- but for this type of usecase the numbers 
aren't likely to ever be more than 100 anyway so it's not too heinous)
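
as a bare-bones illustration of that second approach (the padding width is made 
up): for the 2-term query "how cow" with x=2 extra terms allowed, index each 
document's term count zero-padded into titleLen and filter with

  fq=titleLen:[002 TO 004]

so "how now brown sexy cow" (5 terms) falls outside the range and is dropped.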


-Hoss



Re: Is this a bug of the RessourceLoader?

2010-04-05 Thread Robert Muir
On Mon, Apr 5, 2010 at 2:28 PM, Chris Hostetter wrote:

>
> Robert: BOMs are one of those things that strike me as being abhorent and
> inheriently evil because they seem to cause nothing but problems --
>

Yes.


>
> If text files that start with a BOM aren't properly being dealt with by
> Solr right now, should we consider that a bug?


No.


> Is there something we
> can/should be doing in SolrResourceLoader to make Solr handle this
> situation better?
>
>
Yes, we can ignore them for the first line of the file to be more
user-friendly. I'll open an issue.

-- 
Robert Muir
rcm...@gmail.com


Re: Read Time Out Exception while trying to upload a huge SOLR input xml

2010-04-05 Thread Lance Norskog
Solr also has a feature to stream from a local file rather than over
the network. The parameter
stream.file=/full/local/file/name.txt
means 'read this file from the local disk instead of the POST upload'.
Of course, you have to get the entire file onto the Solr indexer
machine (or a common file server).

http://wiki.apache.org/solr/UpdateRichDocuments#Parameters
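
For example (host and path are placeholders, and remote streaming has to be
enabled via enableRemoteStreaming="true" in solrconfig.xml's <requestParsers>):

  curl 'http://localhost:8983/solr/update?stream.file=/full/local/file/name.txt&stream.contentType=text/xml&commit=true'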

On Thu, Apr 1, 2010 at 9:27 PM, Mark Fletcher
 wrote:
> Hi Eric, Shawn,
>
> Thank you for your reply.
>
> Luckily just on the second time itself my 13GB SOLR XML (more than a million
> docs) went in fine into SOLR without any problem and I uploaded another 2
> more sets of 1.2million+ docs fine without any hassle.
>
> I will try for lesser sized more xmls next time as well as the auto commit
> suggestion.
>
> Best Rgds,
> Mark.
>
> On Thu, Apr 1, 2010 at 6:18 PM, Shawn Smith  wrote:
>
>> The error might be that your http client doesn't handle really large
>> files (32-bit overflow in the Content-Length header?) or something in
>> your network is killing your long-lived socket?  Solr can definitely
>> accept a 13GB xml document.
>>
>> I've uploaded large files into Solr successfully, including recently a
>> 12GB XML input file with ~4 million documents.  My Solr instance had
>> 2GB of memory and it took about 2 hours.  Solr streamed the XML in
>> nicely.  I had to jump through a couple of hoops, but in my case it
>> was easier than writing a tool to split up my 12GB XML file...
>>
>> 1. I tried to use curl to do the upload, but it didn't handle files
>> that large.  For my quick and dirty testing, netcat (nc) did the
>> trick--it doesn't buffer the file in memory and it doesn't overflow
>> the Content-Length header.  Plus I could pipe the data through pv to
>> get a progress bar and estimated time of completion.  Not recommended
>> for production!
>>
>>  FILE=documents.xml
>>  SIZE=$(stat --format %s $FILE)
>>  (echo "POST /solr/update HTTP/1.1
>>  Host: localhost:8983
>>  Content-Type: text/xml
>>  Content-Length: $SIZE
>>  " ; cat $FILE ) | pv -s $SIZE | nc localhost 8983
>>
>> 2. Indexing seemed to use less memory if I configured Solr to auto
>> commit periodically in solrconfig.xml.  This is what I used:
>>
>>    <updateHandler ...>
>>        <autoCommit>
>>            <maxDocs>25000</maxDocs>
>>            <maxTime>30</maxTime>
>>        </autoCommit>
>>    </updateHandler>
>>
>> Shawn
>>
>> On Thu, Apr 1, 2010 at 10:10 AM, Erick Erickson 
>> wrote:
>> > Don't do that. For many reasons . By trying to batch so many docs
>> > together, you're just *asking* for trouble. Quite apart from whether
>> it'll
>> > work once, having *any* HTTP-based protocol work reliably with 13G is
>> > fragile...
>> >
>> > For instance, I don't want to have my know whether the XML parsing in
>> > SOLR parses the entire document into memory before processing or
>> > not. But I sure don't want my application to change behavior if SOLR
>> > changes it's mind and wants to process the other way. My perfectly
>> > working application (assuming an event-driven parser) could
>> > suddenly start requiring over 13G of memory... Oh my aching head!
>> >
>> > Your specific error might even be dependent upon GCing, which will
>> > cause it to break differently, sometimes, maybe..
>> >
>> > So do break things up and transmit multiple documents. It'll save you
>> > a world of hurt.
>> >
>> > HTH
>> > Erick
>> >
>> > On Thu, Apr 1, 2010 at 4:34 AM, Mark Fletcher
>> > wrote:
>> >
>> >> Hi,
>> >>
>> >> For the first time I tried uploading a huge input SOLR xml having about
>> 1.2
>> >> million *docs* (13GB in size). After some time I get the following
>> >> exception:-
>> >>
>> >>  The server encountered an internal error ([was class
>> >> java.net.SocketTimeoutException] Read timed out
>> >> java.lang.RuntimeException: [was class java.net.SocketTimeoutException]
>> >> Read
>> >> timed out
>> >>  at
>> >>
>> >>
>> com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>> >>  at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
>> >>  at
>> >>
>> >>
>> com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
>> >>  at
>> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>> >>  at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:279)
>> >>  at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:138)
>> >>  at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
>> >>  at
>> >>
>> >>
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>> >>  at
>> >>
>> >>
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>> >>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>> >>  at
>> >>
>> >>
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>> >>  at
>> >>
>> >>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>> >>  at
>> >>
>> >>
>> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Applicati

Re: Unable to load MailEntityProcessor or org.apache.solr.handler.dataimport.MailEntityProcessor

2010-04-05 Thread Andrew McCombe
Hi

Can no-one help me with this?

Andrew

On 2 April 2010 22:24, Andrew McCombe  wrote:
> Hi
>
> I am experimenting with Solr to index my gmail and am experiencing an error:
>
> 'Unable to load MailEntityProcessor or
> org.apache.solr.handler.dataimport.MailEntityProcessor'
>
> I downloaded a fresh 1.4 tgz, extracted it and added the following to
> example/solr/config/solrconfig.xml:
>
>
>  <requestHandler name="/dataimport"
>      class="org.apache.solr.handler.dataimport.DataImportHandler">
>    <lst name="defaults">
>      <str
> name="config">/home/andrew/bin/apache-solr-1.5-dev/example/solr/conf/email-data-config.xml</str>
>    </lst>
>  </requestHandler>
>
> email-data-config.xml contained the following:
>
> <dataConfig>
> <document>
>    <entity processor="MailEntityProcessor"
>            user="eupe...@gmail.com"
>            password="xx"
>            host="imap.gmail.com"
>            protocol="imaps"
>            folders = "inbox"/>
> </document>
> </dataConfig>
>
> Whenever I try to import data using /dataimport?command=full-import I
> am seeing the error below:
>
> Apr 2, 2010 10:14:51 PM
> org.apache.solr.handler.dataimport.DataImporter doFullImport
> SEVERE: Full Import failed
> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable
> to load EntityProcessor implementation for entity:11418758786959
> Processing Document # 1
>        at 
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>        at 
> org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:805)
>        at 
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:536)
>        at 
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:261)
>        at 
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:185)
>        at 
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
>        at 
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
>        at 
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
> Caused by: java.lang.ClassNotFoundException: Unable to load
> MailEntityProcessor or
> org.apache.solr.handler.dataimport.MailEntityProcessor
>        at 
> org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:966)
>        at 
> org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:802)
>        ... 6 more
> Caused by: org.apache.solr.common.SolrException: Error loading class
> 'MailEntityProcessor'
>        at 
> org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
>        at 
> org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:956)
>        ... 7 more
> Caused by: java.lang.ClassNotFoundException: MailEntityProcessor
>        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>        at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:592)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
>        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
>        at java.lang.Class.forName0(Native Method)
>        at java.lang.Class.forName(Class.java:247)
>        at 
> org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:357)
>        ... 8 more
> Apr 2, 2010 10:14:51 PM org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: start rollback
> Apr 2, 2010 10:14:51 PM org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: end_rollback
>
>
> Am I missing a step somewhere? I have tried this with the standard
> apache 1.4, a nightly of 1.5 and also the LucidWorks release and get
> the same issue with each.  The wiki isn't very detailed either. My
> background isn't in Java so a lot of this is new to me.
>
>
> Regards
> Andrew McCombe
>


Re: Experience with indexing billions of documents?

2010-04-05 Thread Lance Norskog
The 2B limitation is within one shard, due to using a signed 32-bit
integer. There is no limit in that regard in sharding- Distributed
Search uses the stored unique document id rather than the internal
docid.

On Fri, Apr 2, 2010 at 10:31 AM, Rich Cariens  wrote:
> A colleague of mine is using native Lucene + some home-grown
> patches/optimizations to index over 13B small documents in a 32-shard
> environment, which is around 406M docs per shard.
>
> If there's a 2B doc id limitation in Lucene then I assume he's patched it
> himself.
>
> On Fri, Apr 2, 2010 at 1:17 PM,  wrote:
>
>> My guess is that you will need to take advantage of Solr 1.5's upcoming
>> cloud/cluster renovations and use multiple indexes to comfortably achieve
>> those numbers. Hypthetically, in that case, you won't be limited by single
>> index docid limitations of Lucene.
>>
>> > We are currently indexing 5 million books in Solr, scaling up over the
>> > next few years to 20 million.  However we are using the entire book as a
>> > Solr document.  We are evaluating the possibility of indexing individual
>> > pages as there are some use cases where users want the most relevant
>> pages
>> > regardless of what book they occur in.  However, we estimate that we are
>> > talking about somewhere between 1 and 6 billion pages and have concerns
>> > over whether Solr will scale to this level.
>> >
>> > Does anyone have experience using Solr with 1-6 billion Solr documents?
>> >
>> > The lucene file format document
>> > (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
>> > mentions a limit of about 2 billion document ids.   I assume this is the
>> > lucene internal document id and would therefore be a per index/per shard
>> > limit.  Is this correct?
>> >
>> >
>> > Tom Burton-West.
>> >
>> >
>> >
>> >
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Minimum Should Match the other way round

2010-04-05 Thread MitchK

Thank you both for responding.

Hoss,

what you've pointed out was exactly what I am looking for.
However, I would *always* prefer the second implementation, because you
have to compute the number of terms for all records only *once*. :-)

At the moment I would feel like writing a TokenCountingTokenFilter and
implement the QParser this way:
extending my favorite QParser and in the constructor I would do something
like:

- creating a StringReader from the query-string
- let a Tokenizer tokenize my query-string (without a factory, just
instantiate something like Tokenizer t = new WhitespaceTokenizer(reader);)
- maybe filtering the tokenized query with other filters
- give my query to the TokenCountingTokenFilter and set the number of tokens
of the query with its help.
- getting MAX_LEN with the help of a getParam-Method.

However, I have some doubts about this: what about queries that should be
filtered with the WordDelimiterFilter? This could make a large difference
compared to a non-delimiter-filtered MAX_LEN, *and* it has a protwords param. I
can't instantiate a new WordDelimiterFilter every time I do a query, so how
can I put my already instantiated filters into a cache for such use cases?
I think solving this problem would perhaps also make
multiword synonyms at query time possible. 

Do you know which class stores the produced filters from the FilterFactories
and how I can access them?

Kind regards
- Mitch


Re: Index db data

2010-04-05 Thread MitchK

It seems to work ;).

However, trueman, you should subscribe to solr-user@lucene.apache.org, since 
not everybody looks up Nabble for mailing-list postings. 

- Mitch


Re: Solr caches and nearly static indexes

2010-04-05 Thread Lance Norskog
In a word: "no".

What you can do instead of deleting them is to add them to a growing
list of "don't search for these documents". This could be listed in a
filter query.

We had exactly this problem in a consumer app; we had a small but
continuously growing list of obscene documents in the index, and did
not want to display these. So, we had a filter query with all of the
obscene words, and used this with every query.
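
A bare-bones version of that kind of exclusion filter (field name and terms
are made up) is just a negative fq appended to every request:

  q=some user query
  &fq=-text:(badword1 badword2 badword3)

Since filter queries are cached separately, the growing exclusion list costs
one cached filter rather than a rebuild of the index.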

Lance

On Fri, Apr 2, 2010 at 6:34 PM, Shawn Heisey  wrote:
> My index has a number of shards that are nearly static, each with about 7
> million documents.  By nearly static, I mean that the only changes that
> normally happen to them are document deletions, done with the xml update
> handler.  The process that does these deletions runs once every two minutes,
> and does them with a query on a field other than the one that's used for
> uniqueKey.  Once a day, I will be adding data to these indexes with the DIH
> delta-import.  One of my shards gets all new data once every two minutes,
> but it is less than 5% the size of the others.
>
> The problem that I'm running into is that every time a delete is committed,
> my caches are suddenly invalid and I seem to have two options: Spend a lot
> of time and I/O rewarming them, or suffer with slow (3 seconds or longer)
> search times.  Is there any way to have the index keep its caches when the
> only thing that happens is deletions, then invalidate them when it's time to
> actually add data?  It would have to be something I can dynamically change
> when switching between deletions and the daily import.
>
> Thanks,
> Shawn
>
>



-- 
Lance Norskog
goks...@gmail.com


Some help for folks trying to get new Solr/Lucene up in Eclipse

2010-04-05 Thread Mattmann, Chris A (388J)
Hey All,

Just to save some folks some time in case you are trying to get new
Lucene/Solr up and running in Eclipse. If you continue to get weird errors,
e.g., in solr/src/test/TestConfig.java regarding
org.w3c.dom.Node#getTextContent(), I found for me this error was caused by
including the Tidy.jar (which includes its own version of the Node API) in
the build path. If you take that out, you should be good.

Wanted to pass that along.

Cheers,
Chris


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




Re: Obtaining SOLR index size on disk

2010-04-05 Thread Lance Norskog
This information is not available via the API. If you would like this
information added to the statistics request, please file a JIRA
requesting it.

Without knowing the size of the index files to be transferred, the
client cannot monitor its own disk space. This would be useful for the
cloud management features.

On Mon, Apr 5, 2010 at 5:35 AM, Na_D  wrote:
>
>  hi,
>
>   I am using the piece of code given below
>
>          ReplicationHandler handler2 = new ReplicationHandler();
>                 System.out.println( handler2.getDescription());
>
>
>                 NamedList statistics = handler2.getStatistics();
>                 System.out.println("Statistics   "+ statistics);
>
> The result that i am getting (ie the printed statment is :
> Statistics
> {handlerStart=1270469530218,requests=0,errors=0,timeouts=0,totalTime=0,avgTimePerRequest=NaN,avgRequestsPerSecond=NaN}
>
>
> But the Statistics consist of other info too:
>
> [The same ReplicationHandler statistics output quoted in the original message
> earlier in this digest.]
>
>
>
> this is where the problem lies: I need the size of the index and I'm not finding
> the API,
> nor is the statistics printout (sysout) the same.
> How do I get the size of the index?
> --
> View this message in context: 
> http://n3.nabble.com/Obtaining-SOLR-index-size-on-disk-tp500095p697599.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Minimum Should Match the other way round

2010-04-05 Thread MitchK

Sorry for doubleposting, but to avoid any missunderstanding: 
Accessing instantiated filters is not really a good idea, since a new Filter
must be instantiated all the time. However, what I meant was: if I
create a WordDelimiterFilter or a StopFilter and I have set a param for a
file like stopwords.txt or protwords.txt, I want to access those (as I
understand, cached) resources. 

- Mitch
-- 
View this message in context: 
http://n3.nabble.com/Minimum-Should-Match-the-other-way-round-tp694867p698796.html
Sent from the Solr - User mailing list archive at Nabble.com.


one particular doc in results should always come first for a particular query

2010-04-05 Thread Mark Fletcher
Hi,

Suppose I search for the word *international*. A particular record (say
*recordX*) that I am looking for is currently coming back as the Nth result.
I have a requirement that when a user queries for *international*, I need
recordX to always be the first result. How can I achieve this?

Note:- When a user searches with a *different* keyword, *recordX* need not be
the expected first result; it may be a different record that has to
be made to come first in the results for that keyword.

Is there a way to achieve this requirement? I am using dismax.

Thanks in advance.

BR,
Mark


Re: Unable to load MailEntityProcessor or org.apache.solr.handler.dataimport.MailEntityProcessor

2010-04-05 Thread Lance Norskog
The MailEntityProcessor is an "extra" and does not come normally with
the DataImportHandler. The wiki page should mention this.

In the Solr distribution it should be in the dist/ directory as
dist/apache-solr-dataimporthandler-extras-1.4.jar. The class it wants
is in this jar. (Run 'unzip -l' on a jar to list the classes inside it.)

You have to make a lib/ directory in the Solr core you are using, and
copy this jar into there.

On Mon, Apr 5, 2010 at 1:15 PM, Andrew McCombe  wrote:
> Hi
>
> Can no-one help me with this?
>
> Andrew
>
> On 2 April 2010 22:24, Andrew McCombe  wrote:
>> Hi
>>
>> I am experimenting with Solr to index my gmail and am experiencing an error:
>>
>> 'Unable to load MailEntityProcessor or
>> org.apache.solr.handler.dataimport.MailEntityProcessor'
>>
>> I downloaded a fresh 1.4 tgz, extracted it and added the following to
>> example/solr/config/solrconfig.xml:
>>
>> <requestHandler name="/dataimport"
>> class="org.apache.solr.handler.dataimport.DataImportHandler">
>>    <lst name="defaults">
>>      <str
>> name="config">/home/andrew/bin/apache-solr-1.5-dev/example/solr/conf/email-data-config.xml</str>
>>    </lst>
>>  </requestHandler>
>>
>> email-data-config.xml contains the following:
>>
>> <dataConfig>
>> <document>
>>   <entity processor="MailEntityProcessor"
>>           user="eupe...@gmail.com"
>>           password="xx"
>>           host="imap.gmail.com"
>>           protocol="imaps"
>>           folders = "inbox"/>
>> </document>
>> </dataConfig>
>>
>> Whenever I try to import data using /dataimport?command=full-import I
>> am seeing the error below:
>>
>> Apr 2, 2010 10:14:51 PM
>> org.apache.solr.handler.dataimport.DataImporter doFullImport
>> SEVERE: Full Import failed
>> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable
>> to load EntityProcessor implementation for entity:11418758786959
>> Processing Document # 1
>>        at 
>> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>>        at 
>> org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:805)
>>        at 
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:536)
>>        at 
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:261)
>>        at 
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:185)
>>        at 
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
>>        at 
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
>>        at 
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
>> Caused by: java.lang.ClassNotFoundException: Unable to load
>> MailEntityProcessor or
>> org.apache.solr.handler.dataimport.MailEntityProcessor
>>        at 
>> org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:966)
>>        at 
>> org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:802)
>>        ... 6 more
>> Caused by: org.apache.solr.common.SolrException: Error loading class
>> 'MailEntityProcessor'
>>        at 
>> org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
>>        at 
>> org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:956)
>>        ... 7 more
>> Caused by: java.lang.ClassNotFoundException: MailEntityProcessor
>>        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
>>        at java.security.AccessController.doPrivileged(Native Method)
>>        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
>>        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>>        at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:592)
>>        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
>>        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
>>        at java.lang.Class.forName0(Native Method)
>>        at java.lang.Class.forName(Class.java:247)
>>        at 
>> org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:357)
>>        ... 8 more
>> Apr 2, 2010 10:14:51 PM org.apache.solr.update.DirectUpdateHandler2 rollback
>> INFO: start rollback
>> Apr 2, 2010 10:14:51 PM org.apache.solr.update.DirectUpdateHandler2 rollback
>> INFO: end_rollback
>>
>>
>> Am I missing a step somewhere? I have tried this with the standard
>> apache 1.4, a nightly of 1.5 and also the LucidWorks release and get
>> the same issue with each.  The wiki isn't very detailed either. My
>> background isn't in Java so a lot of this is new to me.
>>
>>
>> Regards
>> Andrew McCombe
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: including external files in config by corename

2010-04-05 Thread Lance Norskog
Making snippets is part of highlighting.

http://www.lucidimagination.com/search/s:lucid/li:cdrg?q=snippet

On Mon, Apr 5, 2010 at 10:53 AM, Shawn Heisey  wrote:
> Is it possible to access the core name in a config file (such as
> solrconfig.xml) so I can include core-specific configlets into a common
> config file?  I would like to pull in different configurations for things
> like shards and replication, but have all the cores otherwise use an
> identical config file.
>
> Also, I have been looking for the syntax to include a snippet and haven't
> turned anything up yet.
>
> Thanks,
> Shawn
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: no of cfs files are more that the mergeFactor

2010-04-05 Thread Lance Norskog
mergeFactor=5 means that if there are 42 documents, the index ends up with
segments of three sizes:

1 with 25 documents,
3 with 5 documents, and
1 with 2 documents.

Imagine making change with coins of 1 document, 5 documents, 5^2
documents, 5^3 documents, etc.
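
To make the coin analogy concrete, here is a rough sketch (not the actual
Lucene merge policy) that decomposes a document count into the segment sizes
described above, with any leftover documents ending up in one small segment:

public class MergeFactorChange {
    public static void main(String[] args) {
        int docs = 42;
        int mergeFactor = 5;

        // Find the largest "coin" (power of mergeFactor) that fits.
        int coin = 1;
        while (coin * mergeFactor <= docs) {
            coin *= mergeFactor;
        }

        // Greedily hand out segments of each coin size, largest first.
        while (coin > 1) {
            int count = docs / coin;
            if (count > 0) {
                System.out.println(count + " segment(s) of " + coin + " docs");
            }
            docs -= count * coin;
            coin /= mergeFactor;
        }
        if (docs > 0) {
            // Whatever is left over sits in one small segment.
            System.out.println("1 segment of " + docs + " docs");
        }
        // For 42 docs and mergeFactor 5 this prints:
        //   1 segment(s) of 25 docs
        //   3 segment(s) of 5 docs
        //   1 segment of 2 docs
    }
}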

On Mon, Apr 5, 2010 at 10:59 AM, Chris Hostetter
 wrote:
>
> This sounds completely normal from what I remember about mergeFactor.
>
> Segments are merged "by level", meaning that with a mergeFactor of 5, once
> 5 "level 1" segments are formed they are merged into a single "level 2"
> segment.  Then 5 more "level 1" segments are allowed to form before the
> next merge (resulting in 2 "level 2" segments).  Once you have 5 "level 2"
> segments, they are all merged into a single "level 3" segment, etc...
>
> : I had my mergeFactor as 5 ,
> : but when i load a data with some 1,00,000 i got some 12 .cfs files in my
> : data/index folder .
> :
> : How come this is possible .
> : in what context we can have more no of .cfs files
>
>
> -Hoss
>
>



-- 
Lance Norskog
goks...@gmail.com


exact match coming as second record

2010-04-05 Thread Mark Fletcher
Hi,

I am using the dismax handler.
I have a field named *myfield* which has a value say XXX.YYY.ZZZ. I have
boosted myfield^20.0.
Even with such a high boost (in fact, among the qf fields specified, this
field has the highest boost), when I search for XXX.YYY.ZZZ I see my
record as the second one in the results, and a record of the form
XXX.YYY.ZZZ.AAA.BBB appearing as the first one.

Can anyone help me understand why this is so? I thought an exact match
on a heavily boosted field would place the exact-match record first in
dismax.

Thanks and Rgds,
Mark


Re: one particular doc in results should always come first for a particular query

2010-04-05 Thread Erick Erickson
Hmmm, how do you know which particular record corresponds to which keyword?
Is this a list known at index time, as in "this record should come up first
whenever 'bonkers' is the keyword"?

If that's the case, you could copy the magic keyword to a different field
(say magic_keyword) and boost it right into orbit as an OR clause
(magic_keyword:bonkers ^1). This kind of assumes that a magic keyword
corresponds to one and only one document.
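
For illustration, one way to express that kind of boost with dismax is a boost
query (bq) on the magic field. A hedged SolrJ sketch - the URL, field names and
boost value are made up, and this is only one of several ways to wire it up:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PinnedDocQuery {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("international");
        query.set("defType", "dismax");
        query.set("qf", "name^10.0 content^1.0");
        // A huge boost on the "magic" field pushes the pinned doc to the top.
        query.set("bq", "magic_keyword:international^10000");

        QueryResponse rsp = server.query(query);
        System.out.println("Top hit: " + rsp.getResults().get(0).getFieldValue("id"));
    }
}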

If this is way off base, perhaps you could characterize how keywords map to
specific documents you want at the top.

Best
Erick

P.S. It threw me for a minute when you used asterisks (*) for emphasis; they're
easily confused with wildcards.

On Mon, Apr 5, 2010 at 5:30 PM, Mark Fletcher
wrote:

> Hi,
>
> Suppose I search for the word  *international. *A particular record (say *
> recordX*) I am looking for is coming as the Nth result now.
> I have a requirement that when a user queries for *international *I need
> recordX to always be the first result. How can I achieve this.
>
> Note:- When user searches with a *different* keyword, *recordX*  need not
> be
> the expected first result record; it may be a different record that has to
> be made to come as the first in the result for that keyword.
>
> Is there a way to achieve this requirement. I am using dismax.
>
> Thanks in advance.
>
> BR,
> Mark
>


Re: exact match coming as second record

2010-04-05 Thread Erick Erickson
What do you get back when you specify &debugQuery=on?

Best
Erick

On Mon, Apr 5, 2010 at 7:31 PM, Mark Fletcher
wrote:

> Hi,
>
> I am using the dismax handler.
> I have a field named *myfield* which has a value say XXX.YYY.ZZZ. I have
> boosted myfield^20.0.
> Even with such a high boost (in fact among the qf fields specified this
> field has the max boost given), when I search for XXX.YYY.ZZZ I see my
> record as the second one in the results and a record of  the form
> XXX.YYY.ZZZ.AAA.BBB is appearing as the first one.
>
> Can any one help me understand why is this so, as I thought an exact match
> on a heavily boosted field would give the exact match record first in
> dismax.
>
> Thanks and Rgds,
> Mark
>


Re: one particular doc in results should always come first for a particular query

2010-04-05 Thread Chris Hostetter

: If that's the case, you could copy the magic keyword to a different field
: (say magic_keyword) and boost it right into orbit as an OR clause
: (magic_keyword:bonkers ^1). This kind of assumes that a magic keyword
: corresponds to one and only one document
: 
: If this is way off base, perhaps you could characterize how keywords map to
: specific documents you want at the top.

This smells like...

http://wiki.apache.org/solr/QueryElevationComponent

-Hoss


Re: Multicore and TermVectors

2010-04-05 Thread Chris Hostetter

: Subject: Multicore and TermVectors

It doesn't sound like Multicore is your issue ... it seems like what you 
mean is that you are using distributed search with TermVectors, and that 
is causing a problem.  Can you please clarify exactly what you mean ... 
describe your exact setup (ie: how many machines, how many solr ports 
running on each of those machines, what the solr.xml looks like on each of 
those ports, how many SolrCores running in each of those ports, what 
the solrconfig.xml looks like for each of those instances, which instances 
coordinate distributed searches of which shards, what urls your client 
hits, what URLs get hit on each of your shards (according to the logs) as 
a result, etc... 

details, details, details.


-Hoss



Re: Solr caches and nearly static indexes

2010-04-05 Thread Chris Hostetter

: times.  Is there any way to have the index keep its caches when the only thing
: that happens is deletions, then invalidate them when it's time to actually add
: data?  It would have to be something I can dynamically change when switching
: between deletions and the daily import.

The problem is that a delete is a genuine change that invalidates the cache 
objects.  The worst case is the QueryResultCache, where a deleted doc would 
require shifting all of the other docs up in any result set that it 
matched on -- even if that doc isn't in the actual DocSlice that's cached 
(ie: the cached version of results 50-100 is affected by deleting a doc 
from 1-50).

In theory something like the filterCache could be warmed by copying 
entries from the old cache and just unsetting the bits corresponding to 
the deleted docs -- except that I'm pretty sure even if all you do is 
delete some docs, a MergePolicy *could* decide to merge segments and 
collapse away the docids of the deleted docs.


-Hoss



Re: Solr caches and nearly static indexes

2010-04-05 Thread Chris Hostetter

: We had exactly this problem in a consumer app; we had a small but
: continuously growing list of obscene documents in the index, and did
: not want to display these. So, we had a filter query with all of the
: obscene words, and used this with every query.

that doesn't seem like it would really help with the caching issue ... 
reusing the FieldCache seems like the only thing that would be 
advantageous in that case; the filterCache and queryResultCache are going 
to have a low cache hit rate as the filter queries involved keep changing 
when new doc keys get added to the filter query.

or am I completely misunderstanding how you had this working?



-Hoss



Re: Solr caches and nearly static indexes

2010-04-05 Thread Yonik Seeley
On Mon, Apr 5, 2010 at 9:04 PM, Chris Hostetter
 wrote:
> ... the reusing the FieldCache seems like hte only thing that would be
> advantageous in that case

And FieldCache entries are currently reused when there have only been
deletions on a segment (since Solr 1.4).

-Yonik
http://www.lucidimagination.com


Re: Solr caches and nearly static indexes

2010-04-05 Thread Chris Hostetter


: > ... the reusing the FieldCache seems like hte only thing that would be
: > advantageous in that case
: 
: And FieldCache entries are currently reused when there have only been
: deletions on a segment (since Solr 1.4).

But that's kind of orthogonal to (what I think) Lance's point was: that 
instead of deleting docs and opening a new searcher, you could just 
add the doc keys to a (negated) filter query (and never open a new 
searcher at all).
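
A small sketch of that idea with SolrJ - keep the "deleted" unique keys on the
application side and send them as a negated filter query, so the index (and its
caches) never sees the deletes. The field name and the way the ids are tracked
are hypothetical:

import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;

public class HideDeletedDocs {

    // Build a query that hides "deleted" documents via a negated filter query,
    // so no delete/commit (and no searcher reopen) is needed.
    public static SolrQuery buildQuery(String userQuery, List<String> deletedIds) {
        SolrQuery query = new SolrQuery(userQuery);
        if (!deletedIds.isEmpty()) {
            StringBuilder fq = new StringBuilder("-id:(");
            for (int i = 0; i < deletedIds.size(); i++) {
                if (i > 0) {
                    fq.append(" OR ");
                }
                fq.append(deletedIds.get(i));
            }
            fq.append(")");
            query.addFilterQuery(fq.toString());
        }
        return query;
    }
}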




-Hoss



Re: Solr caches and nearly static indexes

2010-04-05 Thread Yonik Seeley
On Mon, Apr 5, 2010 at 9:10 PM, Chris Hostetter
 wrote:
>
>
> : > ... the reusing the FieldCache seems like hte only thing that would be
> : > advantageous in that case
> :
> : And FieldCache entries are currently reused when there have only been
> : deletions on a segment (since Solr 1.4).
>
> But that's kind of orthogina

Yeah - just coming into the middle and pointing out the FieldCache
reuse thing (which is new for 1.4).

>l to (what i think) Lance's point was: that
> instead of deleting docs and open a new searcher, you could instead just
> add the doc keys to a (negated) filter query (and never open a new
> searcher at all)

I guess as long as you versioned the filter that could work.
It would have the effect of invalidating all of the query cache, but
wouldn't affect the filter cache.

-Yonik
http://www.lucidimagination.com


Re: exact match coming as second record

2010-04-05 Thread Mark Fletcher
Hi Erick,

Thanks very much for your mail!
Please find attached the debugQuery results.

Thanks!
Mark

On Mon, Apr 5, 2010 at 7:38 PM, Erick Erickson wrote:

> What do you get back when you specify &debugQuery=on?
>
> Best
> Erick
>
> On Mon, Apr 5, 2010 at 7:31 PM, Mark Fletcher
> wrote:
>
>  > Hi,
> >
> > I am using the dismax handler.
> > I have a field named *myfield* which has a value say XXX.YYY.ZZZ. I have
> > boosted myfield^20.0.
> > Even with such a high boost (in fact among the qf fields specified this
> > field has the max boost given), when I search for XXX.YYY.ZZZ I see my
> > record as the second one in the results and a record of  the form
> > XXX.YYY.ZZZ.AAA.BBB is appearing as the first one.
> >
> > Can any one help me understand why is this so, as I thought an exact
> match
> > on a heavily boosted field would give the exact match record first in
> > dismax.
> >
> > Thanks and Rgds,
> > Mark
> >
>
A personal note:-
I have given the id field the highest boost among the qf values specified in my 
dismax config. 
Even then, when I search for an id, say XX.YYY.ZZZ, instead of pushing the record 
with id=XX.YYY.ZZZ into first place, it displays another record, 
XX.YYY.ZZZ.ME.PK, as the first one... There are 4 results in total, but I have 
included details of only the first and second. I am surprised that XX.YYY.ZZZ 
doesn't come as the first record even though an exact match is found in it.

My qf fields in dismax:-

<str name="qf">
name^10.0 id^20.0 subtopic1^1.0 indicator_value^1.0 country_name^1.0 
country_code^1.0 source^0.8 database^1.4 definition^1.2 dr_report_name^1.0 
dr_header^1.0 dr_footer^1.0 dr_mdx_query^1.0 dr_reportmetadata^1.0 content^1.0 
aag_indicators^1.0 type^1.0 text^.3
</str>

<str name="pf">
id^6.0
</str>

<str name="bq">
type:Timeseries^1000.0
</str>

Debug Report:-


 xx.yyy.
 xx.yyy.
 +DisjunctionMaxQuery((text:"(xx.yyy.zzz xx) yyy 
"^0.3 | definition:"(xx.yyy.zzz xx) yyy "^0.2 | 
indicator_value:"(xx.yyy.zzz xx) yyy " | subtopic1:"(xx.yyy.zzz xx) yyy 
" | dr_report_name:"(xx.yyy.zzz xx) yyy " | 
dr_reportmetadata:"(xx.yyy.zzz xx) yyy " | dr_footer:"(xx.yyy.zzz xx) yyy 
" | type:"(xx.yyy.zzz xx) yyy " | country_code:"(xx.yyy.zzz xx) yyy 
"^2.0 | country_name:"(xx.yyy.zzz xx) yyy "^2.0 | database:"(xx.yyy.zzz 
xx) yyy "^1.4 | aag_indicators:"(xx.yyy.zzz xx) yyy " | 
content:"(xx.yyy.zzz xx) yyy " | id:xx.yyy.^1000.0 | 
dr_mdx_query:"(xx.yyy.zzz xx) yyy " | source:"(xx.yyy.zzz xx) yyy "^0.2 
| name:"(xx.yyy.zzz xx) yyy "^10.0 | dr_header:"(xx.yyy.zzz xx) yyy 
")~0.01) DisjunctionMaxQuery((id:xx.yyy.^6.0)~0.01) 
type:timeseries^1000.0
 +(text:"(xx.yyy.zzz xx) yyy "^0.3 | 
definition:"(xx.yyy.zzz xx) yyy "^0.2 | indicator_value:"(xx.yyy.zzz xx) 
yyy " | subtopic1:"(xx.yyy.zzz xx) yyy " | dr_report_name:"(xx.yyy.zzz 
xx) yyy " | dr_reportmetadata:"(xx.yyy.zzz xx) yyy " | 
dr_footer:"(xx.yyy.zzz xx) yyy " | type:"(xx.yyy.zzz xx) yyy " | 
country_code:"(xx.yyy.zzz xx) yyy "^2.0 | country_name:"(xx.yyy.zzz xx) yyy 
"^2.0 | database:"(xx.yyy.zzz xx) yyy "^1.4 | 
aag_indicators:"(xx.yyy.zzz xx) yyy " | content:"(xx.yyy.zzz xx) yyy " 
| id:xx.yyy.^1000.0 | dr_mdx_query:"(xx.yyy.zzz xx) yyy " | 
source:"(xx.yyy.zzz xx) yyy "^0.2 | name:"(xx.yyy.zzz xx) yyy "^10.0 | 
dr_header:"(xx.yyy.zzz xx) yyy ")~0.01 (id:xx.yyy.^6.0)~0.01 
type:timeseries^1000.0
 

0.15786289 = (MATCH) sum of:
  6.086512E-4 = (MATCH) max plus 0.01 times others of:
6.086512E-4 = (MATCH) weight(text:"(xx.yyy. sp) yyy "^0.3 in 1004), 
product of:
  7.562088E-4 = queryWeight(text:"(xx.yyy. xx) yyy "^0.3), product 
of:
0.3 = boost
20.604721 = idf(text:"(xx.yyy. xx) yyy "^0.3)
1.2233584E-4 = queryNorm
  0.8048719 = (MATCH) fieldWeight(text:"(xx.yyy. xx) yyy "^0.3 in 
1004), product of:
1.0 = tf(phraseFreq=1.0)
20.604721 = idf(text:"(xx.yyy. xx) yyy "^0.3)
0.0390625 = fieldNorm(field=text, doc=1004)
  0.15725423 = (MATCH) weight(type:timeseries^1000.0 in 1004), product of:
0.1387005 = queryWeight(type:timeseries^1000.0), product of:
  1000.0 = boost
  1.1337683 = idf(docFreq=1054, maxDocs=1206)
  1.2233584E-4 = queryNorm
1.1337683 = (MATCH) fieldWeight(type:timeseries in 1004), product of:
  1.0 = tf(termFreq(type:timeseries)=1)
  1.1337683 = idf(docFreq=1054, maxDocs=1206)
  1.0 = fieldNorm(field=type, doc=1004)

  
0.15774116 = (MATCH) sum of:
  4.8692097E-4 = (MATCH) max plus 0.01 times others of:
4.8692097E-4 = (MATCH) weight(text:"(xx.yyy. xx) yyy "^0.3 in 
1003), product of:
  7.562088E-4 = queryWeight(text:"(xx.yyy. xx) yyy "^0.3), product 
of:
0.3 = boost
20.604721 = idf(text:"(xx.yyy.zzz xx) yyy "^0.3)
1.2233584E-4 = queryNorm
  0.64389753 = (MATCH) fieldWeight(text:"(xx.yyy. xx) yyy "^0.3 in 
1003

Re: including external files in config by corename

2010-04-05 Thread Mark Miller

On 04/05/2010 01:53 PM, Shawn Heisey wrote:
Is it possible to access the core name in a config file (such as 
solrconfig.xml) so I can include core-specific configlets into a 
common config file?  I would like to pull in different configurations 
for things like shards and replication, but have all the cores 
otherwise use an identical config file.


Also, I have been looking for the syntax to include a snippet and 
haven't turned anything up yet.


Thanks,
Shawn



The best you have to work with at the moment is XInclude:

http://wiki.apache.org/solr/SolrConfigXml#XInclude

and System Property Substitution:

http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution

--
- Mark

http://www.lucidimagination.com





Re: Need info on CachedSQLentity processor

2010-04-05 Thread Mark Miller

On 04/05/2010 02:28 PM, bbarani wrote:

Hi,

I am using cachedSqlEntityprocessor in DIH to index the data. Please find
below my dataconfig structure,

  entity x   --->  object
    entity y  -->  object properties

For each and every object I would be retrieving the corresponding object
properties (in my subqueries).

I run into OOM very often and I think that's a trade-off if I use
cachedSqlEntityprocessor.

My assumption is that when I use cachedSqlEntityprocessor the indexing
happens as follows,

First entity x will get executed and the entire table gets stored in cache

next entity y gets executed and entire table gets stored in cache

Finally the comparison happens through a hash map.

So I always need to have the memory allocated to the Solr JVM be greater than or
equal to the data present in the tables?


Now my final question: even after Solr completes indexing, the memory
used previously is not getting released. I can still see the JVM consuming
1.5 GB after the indexing completes. I tried Java HotSpot options but
didn't see any difference.

Any thoughts / confirmation on my assumptions above would be of great help
to me to get in to  a decision of choosing cachedSqlEntityprocessor or not.

Thanks,
BB



   


You are right - CachedSqlEntityProcessor: the cache is an unbounded 
HashMap, with no option to bound it.


IMO this should be fixed - want to make a JIRA issue? I've brought it up 
on the list before, but I don't think I ever got around to making an issue.
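
For illustration only, a bounded cache can be as small a change as an
access-ordered LinkedHashMap with an eviction limit - a generic sketch,
not the actual DIH code:

import java.util.LinkedHashMap;
import java.util.Map;

// A simple size-bounded LRU cache: the eldest entry is evicted once
// the configured limit is exceeded (illustrative only).
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // access-order gives LRU behaviour
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}

Trading some cache hits for a memory ceiling like this is roughly the kind of
knob such a JIRA issue would ask for.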


As to why it's not getting released - that is odd. Perhaps a GC has just 
not been triggered yet and it will be released? If not, that's a pretty 
nasty bug. Can you try forcing a GC to see (say, with jconsole)?


--
- Mark

http://www.lucidimagination.com





Re: including external files in config by corename

2010-04-05 Thread Chris Hostetter

: The best you have to work with at the moment is Xincludes:
: 
: http://wiki.apache.org/solr/SolrConfigXml#XInclude
: 
: and System Property Substitution:
: 
: http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution

Except that XInclude is a feature of the XML parser, while property 
substitution is something Solr does after the XML has been parsed into a 
DOM -- so you can't have an XInclude of a file whose name is determined by 
a property (like the core name).

What you can do, however, is have a distinct solrconfig.xml for each core, 
which is just a thin shell that uses XInclude to include big chunks of 
frequently reused declarations, and some cores can exclude some of these 
includes.  (ie: turn the problem inside out)



-Hoss



Re: Some help for folks trying to get new Solr/Lucene up in Eclipse

2010-04-05 Thread Lance Norskog
I had a slight hiccup that I just ignored. Even when I used Java 1.6
JDK mode, Eclipse did not know this method. I had to comment out the
three places that use this method.

javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(true)

Lance Norskog

On Mon, Apr 5, 2010 at 1:49 PM, Mattmann, Chris A (388J)
 wrote:
> Hey All,
>
> Just to save some folks some time in case you are trying to get new
> Lucene/Solr up and running in Eclipse. If you continue to get weird errors,
> e.g., in solr/src/test/TestConfig.java regarding
> org.w3c.dom.Node#getTextContent(), I found for me this error was caused by
> including the Tidy.jar (which includes its own version of the Node API) in
> the build path. If you take that out, you should be good.
>
> Wanted to pass that along.
>
> Cheers,
> Chris
>
>
> ++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.mattm...@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Need info on CachedSQLentity processor

2010-04-05 Thread bbarani

Mark,

I have opened a JIRA issue - https://issues.apache.org/jira/browse/SOLR-1867

Thanks,
Barani
-- 
View this message in context: 
http://n3.nabble.com/Need-info-on-CachedSQLentity-processor-tp698418p699329.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Multicore and TermVectors

2010-04-05 Thread Lance Norskog
There is no query parameter. The query parser throws an NPE if there
is no query parameter:

http://issues.apache.org/jira/browse/SOLR-435

It does not look like term vectors are processed in distributed search anyway.

On Mon, Apr 5, 2010 at 4:45 PM, Chris Hostetter
 wrote:
>
> : Subject: Multicore and TermVectors
>
> It doesn't sound like Multicore is your issue ... it seems like what you
> mean is that you are using distributed search with TermVectors, and that
> is causing a problem.  Can you please clarify exactly what you mean ...
> describe your exact setup (ie: how many machines, how many solr ports
> running on each of those machines, what the solr.xml looks like on each of
> those ports, how many SolrCores running in each of those ports, what
> the solrconfig.xml looks like for each of those instances, which instances
> coordinate distributed searches of which shards, what urls your client
> hits, what URLs get hit on each of your shards (according to the logs) as
> a result, etc...
>
> details, details, details.
>
>
> -Hoss
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: including external files in config by corename

2010-04-05 Thread Mark Miller

On 04/05/2010 10:12 PM, Chris Hostetter wrote:

: The best you have to work with at the moment is Xincludes:
:
: http://wiki.apache.org/solr/SolrConfigXml#XInclude
:
: and System Property Substitution:
:
: http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution

Except that XInclude is a feature of the XML parser, while property
substitution is something Solr does after the XML has been parsed into a
DOM -- so you can't have an XInclude of a file whose name is determined by
a property (like the core name)

Didn't suggest he could - just giving him the features he has to work with.

--
- Mark

http://www.lucidimagination.com





Re: How to add new entity to the solr index without having to re-index previously stored data.

2010-04-05 Thread MitchK

Maddy,

you need to reindex the whole record if you change or add any kind of data
that belongs to it. 

Please note that you need to subscribe to the solr-user mailing list, since
not everyone uses Nabble to read mailing-list postings. 

Kind regards,
- Mitch


Maddy.Jsh wrote:
> 
> I indexed my solr using DIH. This is my config file:
> <document>
>   <entity name="db1" query="select id FROM db1">
>     ...
>     <entity name="db2" query="select name FROM db2 where
>             id='${db1.id}'">
>       ...
>     </entity>
>   </entity>
> </document>
> 
> I now need to add another entity "entity3". Is there a way to only index
> entity3?
> 
-- 
View this message in context: 
http://n3.nabble.com/How-to-add-new-entity-to-the-solr-index-without-having-to-re-index-previously-stored-data-tp699537p699561.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Realtime search and facets with very frequent commits

2010-04-05 Thread Janne Majaranta
Yeah, thanks for pointing this out.
I'm not using any relevancy functions (yet). The data indexed for my app is
basically log events.
The most relevant events are the newest ones, so sorting by timestamp is
enough.

BTW, your book is great ;)

-Janne

2010/3/31 Smiley, David W. 

> Janne,
>Have you found your query relevancy to deteriorate with this setup?
>  Something to be aware of with distributed searches is that the relevancy of
> each Solr core response is based on the local index to that core.  So if
> your distributed Solr setup does not distribute documents randomly (as is
> certainly the case for you) your relevancy scores will be poor.
>
> ~ David Smiley
> Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/
>
> On Feb 11, 2010, at 12:35 PM, Janne Majaranta wrote:
> ...
> >
> > I have tested putting a second solr instance on the same server and
> sending
> > the updates to that new instance.
> > Warming up the new small instance is very fast while the large instance
> has
> > very hot caches.
> ...
> >
> > Best Regards,
> >
> > Janne Majaranta
>
>