timezone DIH and dataimport.properties
Hello. How can I set the Java timezone via system properties? My problem is that dataimport.properties contains the wrong timezone, and I don't know how to set the correct one. Thanks.
-
--- System
One server, 12 GB RAM, 2 Solr instances, 7 cores;
1 core with 31 million documents, the other cores < 100,000.
- Solr1 for search requests - commit every minute - 5 GB Xmx
- Solr2 for update requests - delta every minute - 4 GB Xmx
--
View this message in context: http://lucene.472066.n3.nabble.com/timezone-DIH-and-dataimport-properties-tp2864928p2864928.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: how to concatenate two nodes of xml with xpathentityprocessor
Vishal, I don't really understand what you're trying to achieve. Indexing what (complete/sample documents, valid if possible)? And getting what exactly as a result? Regards Stefan On Mon, Apr 25, 2011 at 5:01 PM, vrpar...@gmail.com wrote: > hello, > > I am using the XPathEntityProcessor to index xml files > > below is my xml file > > > >CustomerA > ThisB > AnyC > > > now I want to concatenate in the index, so that when I search it gives the below > result > > CData with id attribute--- like CustomerA id="2">ThisB or something like that > > is it possible with the RegexTransformer or TemplateTransformer? I googled > a little for both but could not find an exact/useful solution > > Thanks > > Vishal Parekh > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/how-to-concatenate-two-nodes-of-xml-with-xpathentityprocessor-tp2861260p2861260.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: timezone DIH and dataimport.properties
java -Duser.timezone=UTC -jar start.jar ? On Tue, Apr 26, 2011 at 9:54 AM, stockii wrote: > Hello. > > How can i set the timezone oft java in my java properties ? > > my problem is, that in the dataimport-properties is a wrong timezone and i > dont know how to set the correct timezone ... !?!? thx > > - > --- System > > > One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, > 1 Core with 31 Million Documents other Cores < 100.000 > > - Solr1 for Search-Requests - commit every Minute - 5GB Xmx > - Solr2 for Update-Request - delta every Minute - 4GB Xmx > -- > View this message in context: > http://lucene.472066.n3.nabble.com/timezone-DIH-and-dataimport-properties-tp2864928p2864928.html > Sent from the Solr - User mailing list archive at Nabble.com. >
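For context, a sketch of the moving parts (the timestamp value below is invented): the DataImportHandler records the time of the last import in conf/dataimport.properties, formatted in the JVM's default timezone, and substitutes it into delta queries as ${dataimporter.last_index_time}. Forcing the JVM timezone at startup, as suggested above, therefore controls what gets written:

```properties
# Start Solr with an explicit timezone:
#   java -Duser.timezone=UTC -jar start.jar
#
# conf/dataimport.properties - maintained by the DataImportHandler.
# The timestamp is rendered in the JVM's default timezone:
last_index_time=2011-04-26 08\:00\:00
```

If the database clock is in a different timezone than the JVM, the delta query may also need an explicit conversion on the database side (e.g. CONVERT_TZ in MySQL).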
Problem with autogeneratePhraseQueries
Hi, I'm new to Solr. My Solr instance version is: Solr Specification Version: 3.1.0 Solr Implementation Version: 3.1.0 1085815 - grantingersoll - 2011-03-26 18:00:07 Lucene Specification Version: 3.1.0 Lucene Implementation Version: 3.1.0 1085809 - 2011-03-26 18:06:58 Current Time: Tue Apr 26 08:01:09 CEST 2011 Server Start Time: Tue Apr 26 07:59:05 CEST 2011 I have the following definition for the textgen type: I'm using this type for the name field in my index. As you can see, I'm using autoGeneratePhraseQueries="false", but for the query sony vaio 4gb I'm getting the following query in debug: sony vaio 4gb sony vaio 4gb +name:sony +name:vaio +MultiPhraseQuery(name:"(4gb 4) gb") +name:sony +name:vaio +name:"(4gb 4) gb" Do you have any idea how I can avoid this MultiPhraseQuery? Best Regards, solr_beginner
Re: Problem with autogeneratePhraseQueries
What do you have in solrconfig.xml for luceneMatchVersion? If you don't set this, then it's going to default to "Lucene 2.9" emulation so that old Solr 1.4 configs work the same way. I tried your example and it worked fine here, and I'm guessing this is probably what's happening. The default in the example solrconfig.xml looks like this: <luceneMatchVersion>LUCENE_31</luceneMatchVersion> On Tue, Apr 26, 2011 at 6:51 AM, Solr Beginner wrote: > Hi, > > I'm new to solr. My solr instance version is: > > Solr Specification Version: 3.1.0 > Solr Implementation Version: 3.1.0 1085815 - grantingersoll - 2011-03-26 > 18:00:07 > Lucene Specification Version: 3.1.0 > Lucene Implementation Version: 3.1.0 1085809 - 2011-03-26 18:06:58 > Current Time: Tue Apr 26 08:01:09 CEST 2011 > Server Start Time:Tue Apr 26 07:59:05 CEST 2011 > > I have following definition for textgen type: > > autoGeneratePhraseQueries="false"> > > > words="stopwords.txt" enablePositionIncrements="true" /> > generateNumberParts="1" catenateWords="1" catenateNumbers="1" > preserveOriginal="1"/> > > side="front" preserveOriginal="1"/> > > > > ignoreCase="true" expand="true"/> > ignoreCase="true" > words="stopwords.txt" > enablePositionIncrements="true"/> > generateNumberParts="1" catenateWords="0" catenateNumbers="0" > catenateAll="0" preserveOriginal="1"/> > > > > > > I'm using this type for name field in my index. As you can see I'm > using autoGeneratePhraseQueries="false" but for query sony vaio 4gb I'm > getting following query in debug: > > > sony vaio 4gb > sony vaio 4gb > +name:sony +name:vaio +MultiPhraseQuery(name:"(4gb > 4) gb") > +name:sony +name:vaio +name:"(4gb 4) > gb" > > Do you have any idea how can I avoid this MultiPhraseQuery? > > Best Regards, > solr_beginner >
Re: Query regarding solr plugin.
Sorry, but there's too much here to debug remotely. I strongly advise you to back way up. Undo (but save) all your changes. Start by doing the simplest thing you can: just get a dummy class in place and get it called. Perhaps create a really dumb logger method that opens a text file, writes a message, and closes the file. Inefficient, I know, but this is just to find the problem. Debugging by println is an ancient technique... Once you're certain the dummy class is called, gradually build it up into the complex class you eventually want. One problem here is that you've changed a bunch of moving parts and copied jars around (it's unclear whether you have two copies of solr-core in your classpath, for instance). So knowing exactly which one of those is the issue is very difficult, especially since you may have forgotten one of the things you did. I know when I've been trying to do something for days, lots of details get lost. Try to avoid changing the underlying Solr code; can you do what you want by subclassing instead and calling your new class? That would avoid a bunch of problems. If you can't subclass, copy the whole thing, rename it to something new, and call *that* rather than re-using the SynonymFilterFactory. The only jar you should copy to the directory would be the one containing your new class. I can't emphasize strongly enough that you'll save yourself lots of grief if you start with a fresh install and build up gradually, rather than try to unravel the current code. It feels wasteful, but winds up being faster in my experience... Good Luck! Erick On Tue, Apr 26, 2011 at 12:41 AM, rajini maski wrote: > Thanks Erick. I have added my replies to the points you mentioned. I am > going wrong somewhere. Do I need to club both the jars together or something? > If yes, how do I do that? I do not have much idea about java and jar files. > Please guide me here. > > A couple of things to try.
> > 1> when you do a 'jar -tfv ", you should see > output like: > 1183 Sun Jun 06 01:31:14 EDT 2010 > org/apache/lucene/analysis/sinks/TokenTypeSinkTokenizer.class > and your statement may need the whole path, in this example... > (note, > this > is just an example of the pathing, this class has nothing to do with > your filter)... > > I could see this output.. > > 2> But I'm guessing your path is actually OK, because I'd expect to be > seeing a > "class not found" error. So my guess is that your class depends on > other jars that > aren't packaged up in your jar and if you find which ones they are and copy > them > to your lib directory you'll be OK. Or your code is throwing an error > on load. Or > something like that... > > There is jar - "apache-solr-core-1.4.1.jar" this has the > BaseTokenFilterFacotry class and the Synonymfilterfactory class..I made the > changes in second class file and created it as new. Now i created a jar of > that java file and placed this in solr home/lib and also placed > "apache-solr-core-1.4.1.jar" file in lib folder of solr home. [solr home - > c:\orch\search\solr lib path - c:\orch\search\solr\lib] > > 3> to try to understand what's up, I'd back up a step. Make a really > stupid class > that doesn't do anything except derive from BaseTokenFilterFacotry and see > if > you can load that. If you can, then your process is OK and you need to > find out what classes your new filter depend on. If you still can't, then we > can > see what else we can come up with.. > > > I am perhaps doing same. In the synonymfilterfactory class, there is a > function parse rules which takes delimiters as one of the input parameter. > Here i changed comma ',' to '~' tilde symbol and thats it. > > > Regards, > Rajani > > > On Mon, Apr 25, 2011 at 6:26 PM, Erick Erickson > wrote: > >> Looking at things more carefully, it may be one of your dependent classes >> that's not being found. >> >> A couple of things to try. 
>> >> 1> when you do a 'jar -tfv ", you should see >> output like: >> 1183 Sun Jun 06 01:31:14 EDT 2010 >> org/apache/lucene/analysis/sinks/TokenTypeSinkTokenizer.class >> and your statement may need the whole path, in this example... >> (note, >> this >> is just an example of the pathing, this class has nothing to do with >> your filter)... >> >> 2> But I'm guessing your path is actually OK, because I'd expect to be >> seeing a >> "class not found" error. So my guess is that your class depends on >> other jars that >> aren't packaged up in your jar and if you find which ones they are and copy >> them >> to your lib directory you'll be OK. Or your code is throwing an error >> on load. Or >> something like that... >> >> 3> to try to understand what's up, I'd back up a step. Make a really >> stupid class >> that doesn't do anything except derive from BaseTokenFilterFacotry and see >> if >> you can load that. If you can, then your process is OK and you need to >> find out what classes your new filter depend on. If you still can't, then >> we can >> see what else we can come up with.. >>
Re: Problem with autogeneratePhraseQueries
Thank you very much for answer. You were right. There was no luceneMatchVersion in solrconfig.xml of our dev core. We thought that values not present in core configuration are copied from main solrconfig.xml. I will investigate if our administrators did something wrong during upgrade to 3.1. On Tue, Apr 26, 2011 at 1:35 PM, Robert Muir wrote: > What do you have in solrconfig.xml for luceneMatchVersion? > > If you don't set this, then its going to default to "Lucene 2.9" > emulation so that old solr 1.4 configs work the same way. I tried your > example and it worked fine here, and I'm guessing this is probably > whats happening. > > the default in the example/solrconfig.xml looks like this: > > > LUCENE_31 > > On Tue, Apr 26, 2011 at 6:51 AM, Solr Beginner > wrote: > > Hi, > > > > I'm new to solr. My solr instance version is: > > > > Solr Specification Version: 3.1.0 > > Solr Implementation Version: 3.1.0 1085815 - grantingersoll - 2011-03-26 > > 18:00:07 > > Lucene Specification Version: 3.1.0 > > Lucene Implementation Version: 3.1.0 1085809 - 2011-03-26 18:06:58 > > Current Time: Tue Apr 26 08:01:09 CEST 2011 > > Server Start Time:Tue Apr 26 07:59:05 CEST 2011 > > > > I have following definition for textgen type: > > > > positionIncrementGap="100" > > autoGeneratePhraseQueries="false"> > > > > > > > words="stopwords.txt" enablePositionIncrements="true" /> > > > generateNumberParts="1" catenateWords="1" catenateNumbers="1" > > preserveOriginal="1"/> > > > > maxGramSize="15" > > side="front" preserveOriginal="1"/> > > > > > > > > > ignoreCase="true" expand="true"/> > > > ignoreCase="true" > > words="stopwords.txt" > > enablePositionIncrements="true"/> > > > generateNumberParts="1" catenateWords="0" catenateNumbers="0" > > catenateAll="0" preserveOriginal="1"/> > > > > > > > > > > > > I'm using this type for name field in my index. 
As you can see I'm > > using autoGeneratePhraseQueries="false" but for query sony vaio 4gb I'm > > getting following query in debug: > > > > > > sony vaio 4gb > > sony vaio 4gb > > +name:sony +name:vaio > +MultiPhraseQuery(name:"(4gb > > 4) gb") > > +name:sony +name:vaio +name:"(4gb 4) > > gb" > > > > Do you have any idea how can I avoid this MultiPhraseQuery? > > > > Best Regards, > > solr_beginner > > >
Re: how to concatenate two nodes of xml with xpathentityprocessor
Thanks Stefan. Currently my data-config file uses the XPathEntityProcessor, and when I search I get the following search result: CustomerA AnyC 1 3 But I want the following result: 1,CustomerA 3,AnyC OR CustomerA AnyC (or any other format - I just want both values combined). Thanks Vishal Parekh -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-concatenate-two-nodes-of-xml-with-xpathentityprocessor-tp2861260p2865508.html Sent from the Solr - User mailing list archive at Nabble.com.
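One possible approach (a sketch - the original data-config did not survive in the message, so the entity, column, and xpath names below are invented for illustration): the TemplateTransformer can build a combined column out of two already-extracted columns:

```xml
<entity name="rec" processor="XPathEntityProcessor"
        url="data.xml" forEach="/root/record"
        transformer="TemplateTransformer">
  <!-- hypothetical source columns -->
  <field column="id"    xpath="/root/record/@id"/>
  <field column="cdata" xpath="/root/record/CData"/>
  <!-- combined value, e.g. "1,CustomerA" -->
  <field column="id_cdata" template="${rec.id},${rec.cdata}"/>
</entity>
```

Searching and displaying the id_cdata field would then return both values together.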
What initializes a new searcher?
Hi, I'm reading the Solr cache documentation - http://wiki.apache.org/solr/SolrCaching - and found there: "The current Index Searcher serves requests and when a new searcher is opened...". Could you explain when a new searcher is opened? Does it have something to do with index commits? Best Regards, Solr Beginner
TermsComponent + Dist. Search + Large Index + HEAP SPACE
Hi! We've got one index split into 4 shards of roughly 70,000 records each, containing large full-text data from (very dirty) OCR. Thus we have a lot of "unique" terms. Now we are trying to obtain the 400 most common words for the CommonGramsFilter via the TermsComponent, but the request always runs out of memory. The VM is equipped with 32 GB of RAM, with 16-26 GB allocated to the Java VM. Any ideas how to get the most common terms without increasing the VM's memory? Thanks & best regards, Sebastian -- View this message in context: http://lucene.472066.n3.nabble.com/TermsCompoment-Dist-Search-Large-Index-HEAP-SPACE-tp2865609p2865609.html Sent from the Solr - User mailing list archive at Nabble.com.
org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.dataimport.DataImportHandler'
Hello, I get the following error: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.dataimport.DataImportHandler' at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:389) at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:423) at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:459) ... This error only occurs in Solr 3.1; in Solr 1.4.1 it works fine. How can I solve this problem? Thanks Vishal Parekh -- View this message in context: http://lucene.472066.n3.nabble.com/org-apache-solr-common-SolrException-Error-loading-class-org-apache-solr-handler-dataimport-DataImpo-tp2865625p2865625.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.dataimport.DataImportHandler'
http://www.lucidimagination.com/blog/2011/04/01/solr-powered-isfdb-part-8/ On Tue, Apr 26, 2011 at 3:34 PM, vrpar...@gmail.com wrote: > Hello, > > i got following source > > org.apache.solr.common.SolrException: Error loading class > 'org.apache.solr.handler.dataimport.DataImportHandler' at > org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:389) > at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:423) at > org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:459) . > > actually this error comes in solr 3.1 only in solr 1.4.1 it works fine > > how to solve this problem? > > Thanks > > Vishal Parekh > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/org-apache-solr-common-SolrException-Error-loading-class-org-apache-solr-handler-dataimport-DataImpo-tp2865625p2865625.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: Automatic synonyms for multiple variations of a word
On Tue, Apr 26, 2011 at 12:24 AM, Otis Gospodnetic wrote: > But somehow this feels bad (well, so does sticking word variations in what's > supposed to be a synonyms file), partly because it means that the person > adding > new synonyms would need to know what they stem to (or always check it against > Solr before editing the file). when creating the synonym map from your input file, currently the factory actually uses your Tokenizer only to pre-process the synonyms file. One idea would be to use the tokenstream up to the synonymfilter itself (including filters). This way if you put a stemmer before the synonymfilter, it would stem your synonyms file, too. I haven't totally thought the whole thing through to see if theres a big reason why this wouldn't work (the synonymsfilter is complicated, sorry). But it does seem like it would produce more consistent results... and perhaps the inconsistency isnt so obvious since in the default configuration the synonymfilter is directly after the tokenizer.
WhitespaceTokenizer and scoring(field length)
Hello, I have a problem with the WhitespaceTokenizer and scoring. An example: id Titel 1 Manchester united 2 Manchester With the WhitespaceTokenizer, "Manchester united" will be split into "Manchester" and "united". When I search for "manchester" I get ids 1 and 2 in my results. What I want is for id 2 to score higher (field length). How can I fix this? -- View this message in context: http://lucene.472066.n3.nabble.com/WhitespaceTokenizer-and-scoring-field-length-tp2865784p2865784.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: TermsComponent + Dist. Search + Large Index + HEAP SPACE
Don't know your use case, but if you just want a list of the 400 most common words you can use the lucene contrib. HighFreqTerms.java with the - t flag. You have to point it at your lucene index. You also probably don't want Solr to be running and want to give the JVM running HighFreqTerms a lot of memory. http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java?view=log Tom http://www.hathitrust.org/blogs/large-scale-search -Original Message- From: mdz-munich [mailto:sebastian.lu...@bsb-muenchen.de] Sent: Tuesday, April 26, 2011 9:29 AM To: solr-user@lucene.apache.org Subject: TermsCompoment + Dist. Search + Large Index + HEAP SPACE Hi! We've got one index splitted into 4 shards á 70.000 records of large full-text data from (very dirty) OCR. Thus we got a lot of "unique" terms. No we try to obtain the first 400 most common words for "CommonGramsFilter" via TermsComponent but the request runs allways out of memory. The VM is equipped with 32 GB of RAM, 16-26 GB alocated to the Java-VM. Any Ideas how to get the most common terms without increasing VMs Memory? Thanks & best regards, Sebastian -- View this message in context: http://lucene.472066.n3.nabble.com/TermsCompoment-Dist-Search-Large-Index-HEAP-SPACE-tp2865609p2865609.html Sent from the Solr - User mailing list archive at Nabble.com.
Apache Solr 3.1.0
I'm trying to tokenize email and IP addresses using the StandardTokenizerFactory. It correctly tokenizes IP addresses, but it divides an email address into two tokens: one with the value before the '@' and the other with the value after it. It works correctly under Solr 1.4.1. Has anybody else tried something similar on Solr 3.1.0 successfully, or is this a potential bug? Thanks, Wlodek S. -- View this message in context: http://lucene.472066.n3.nabble.com/Apache-Solr-3-1-0-tp2866007p2866007.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Apache Solr 3.1.0
Hi Wodek, UAX29URLEmailTokenizer includes all of StandardTokenizer's rules and adds rules to tokenize URLs and Emails: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory Steve > -Original Message- > From: Wodek Siebor [mailto:siebor_wlo...@bah.com] > Sent: Tuesday, April 26, 2011 11:29 AM > To: solr-user@lucene.apache.org > Subject: Apache Solr 3.1.0 > > I'm trying to tokenize email and IP addresses using > StandardTokenizerFactory. > It does correctly tokenize IP address but it divides email address into > two > tokens one with value before '@' and the other with value after that. > > It works correctly under Solr 1.4.1 > > Has anybody else tried similar thing on Solr 3.1.0 successfully or is it > a > potential bug? > > Thanks, > Wlodek S. > > > > -- > View this message in context: http://lucene.472066.n3.nabble.com/Apache- > Solr-3-1-0-tp2866007p2866007.html > Sent from the Solr - User mailing list archive at Nabble.com.
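A minimal field type using that tokenizer might look like the following (a sketch; the filter chain is only an example):

```xml
<fieldType name="text_urlemail" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- keeps e-mail addresses and URLs as single tokens -->
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```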
Problems with Spellchecker in 3.1
Hi, all. Sorry for any duplication - it seems what I sent yesterday never made it through... We're having some trouble with the Solr spellcheck response. We're running version 3.1.

Overview: If we search for something really ugly like "kljhklsdjahfkljsdhf book rck", then the response contains a suggestions list for 'rck', but no suggestions list for the other two words. For 'book' that's fine, because it is 'spelled correctly' (i.e. we got hits on the word) and there shouldn't be any suggestions. For the ugly thing, though, there aren't any hits. The problem is that when we're handling the result, we can't tell the difference between no suggestions for a 'correctly spelled' term and no suggestions for something odd like this. (This also happens with searches that aren't so obviously garbage - i.e. real words that just don't show up in the index and have no suggestions - the example above is just to illustrate the point.)

Our setup: We're running multiple shards, which may be part of the issue. For example, 'book' might be found in one of the shards but not another. I don't *think* this has anything to do with our schema, since it's really about how the search suggestions are returned to us. But here are some bits and pieces: From schema.xml: From solrconfig.xml: textSpell default textSpell ./spellchecker

What we'd really like to see is the response coming back with an indication that a word wasn't found / had no suggestions. We've hacked around in the code a little bit to do this, but were wondering if anyone has come across this, and what approaches you've taken. We created new classes which extend IndexBasedSpellChecker and SpellCheckComponent, as follows (package and imports excluded for (sort of) brevity). The methods are as taken from the overridden classes, with changes noted by "SD" comments...

/**
 * This is a slight modification of Solr's AbstractLuceneSpellChecker.getSuggestions(SpellingOptions).
 * The modification allows correctly spelled words to be returned in the suggestions. This modification,
 * working in tandem with the SirsiDynixSpellCheckComponent, allows words with no suggestions to be
 * returned from the spell check component even in a sharded search.
 * Changes are marked with SD in the comments.
 */
public class SirsiDynixIndexBasedSpellChecker extends IndexBasedSpellChecker {

    @Override
    public SpellingResult getSuggestions(SpellingOptions options) throws IOException {
        boolean shardRequest = false;
        SolrParams params = options.customParams;
        if (params != null) {
            shardRequest = "true".equals(params.get(ShardParams.IS_SHARD));
        }
        SpellingResult result = new SpellingResult(options.tokens);
        IndexReader reader = determineReader(options.reader);
        Term term = field != null ? new Term(field, "") : null;
        float theAccuracy = (options.accuracy == Float.MIN_VALUE)
                ? spellChecker.getAccuracy() : options.accuracy;
        int count = Math.max(options.count, AbstractLuceneSpellChecker.DEFAULT_SUGGESTION_COUNT);

        for (Token token : options.tokens) {
            String tokenText = new String(token.buffer(), 0, token.length());
            String[] suggestions = spellChecker.suggestSimilar(tokenText, count,
                    field != null ? reader : null, // workaround LUCENE-1295
                    field, options.onlyMorePopular, theAccuracy);
            if (suggestions.length == 1 && suggestions[0].equals(tokenText)) {
                // These are spelled the same, continue on
                List suggList = Arrays.asList(suggestions); // SD added
                result.add(token, suggList); // SD added
                continue;
            }
            if (options.extendedResults == true && reader != null && field != null) {
                term = term.createTerm(tokenText);
                result.add(token, reader.docFreq(term));
                int countLimit = Math.min(options.count, suggestions.length);
                if (countLimit > 0) {
                    for (int i = 0; i < countLimit; i++) {
                        term = term.createTerm(suggestions[i]);
                        result.add(token, suggestions[i], reader.docFreq(term));
                    }
                } else if (shardRequest) {
                    List suggList = Collections.emptyList();
                    result.add(token, suggList);
                }
            } else {
                if (suggestions.length > 0) {
                    List suggList = Arrays.asList(suggestions);
                    if (suggestions.length > options.count) {
                        suggList = suggList.subList(0, options.count);
                    }
                    result.add(token, suggList);
                } else if (shardRequest) {
                    List suggList = Collections.emptyList();
                    result.add(token, suggList);
                }
            }
        }
        return result;
    }
}

/**
 * This is a
Ebay Kleinanzeigen and Auto Suggest
Hi, Someone told me that eBay is using Solr. I was looking at their auto-suggest implementation, and I guess they are using shingles and the TermsComponent. I managed to get a satisfactory implementation, but I have a problem with category-specific filtering. eBay's suggestions are sensitive to categories like Cars and Pets. As far as I understand, it is not possible to use filters with a terms query. Unless one uses multiple fields or special prefixes for the indexed words, I cannot think how to implement this. Is there perhaps a workaround for this limitation? Best Regards EricZ --- I have a shingle type like: and a query like http://localhost:8983/solr/terms?q=*%3A*&terms.fl=suggest_text&terms.sort=count&terms.prefix=audi
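Since the field type itself did not survive in the message above, here is a sketch of the kind of shingle chain being described (names and parameters are illustrative guesses):

```xml
<fieldType name="shingle_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

As for the category problem: one possible workaround (an idea, not a built-in feature) is to index the shingles with a category prefix, e.g. "audi a4" in the Cars category indexed as "cars|audi a4", and then query with terms.prefix=cars|audi - effectively smuggling the filter into the term prefix, at the cost of index size.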
Solr Newbie: Starting embedded server with multicore
I'm just starting with Solr. I'm using Solr 3.1.0, and I want to use EmbeddedSolrServer with a multicore setup, even though I currently have only one core (various documents I read suggest starting that way even if you have one core, to get the better administrative tools supported by multicore). I have two questions: 1. Does the first code sample below start the server with multicore or not? 2. Why does the first sample work while the second does not? My solr.xml looks like this: It's in a directory called solrhome in war/WEB-INF. I can get the server to come up cleanly if I follow an example in the Packt Solr book (p. 231), but I'm not sure whether this enables multicore or not:

File solrXML = new File("war/WEB-INF/solrhome/solr.xml");
String solrHome = solrXML.getParentFile().getAbsolutePath();
String dataDir = solrHome + "/data";
coreContainer = new CoreContainer(solrHome);
SolrConfig solrConfig = new SolrConfig(solrHome, "solrconfig.xml", null);
CoreDescriptor coreDescriptor = new CoreDescriptor(coreContainer, "mycore", solrHome);
SolrCore solrCore = new SolrCore("mycore", dataDir + "/" + "mycore", solrConfig, null, coreDescriptor);
coreContainer.register(solrCore, false);
embeddedSolr = new EmbeddedSolrServer(coreContainer, "mycore");

The documentation on the Solr wiki says I should configure the EmbeddedSolrServer for multicore like this:

File home = new File( "/path/to/solr/home" );
File f = new File( home, "solr.xml" );
CoreContainer container = new CoreContainer();
container.load( "/path/to/solr/home", f );
EmbeddedSolrServer server = new EmbeddedSolrServer( container, "core name as defined in solr.xml" );

When I try to do this, I get an error saying that it cannot find solrconfig.xml:

File solrXML = new File("war/WEB-INF/solrhome/solr.xml");
String solrHome = solrXML.getParentFile().getAbsolutePath();
coreContainer = new CoreContainer();
coreContainer.load(solrHome, solrXML);
embeddedSolr = new EmbeddedSolrServer(coreContainer, "mycore");

The message says it is looking in an odd place (I removed my user name from this). Why is it looking in solrhome/mycore/conf for solrconfig.xml? Both that and my schema.xml are in solrhome/conf. How can I point it at the right place? I tried adding "\workspace-Solr\institution-webapp\war\WEB-INF\solrhome\conf" to the classpath, but got the same result:

SEVERE: java.lang.RuntimeException: Can't find resource 'solrconfig.xml' in classpath or '\workspace-Solr\institution-webapp\war\WEB-INF\solrhome\mycore\conf/', cwd=\workspace-Solr\institution-webapp
at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
at org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:234)
at org.apache.solr.core.Config.(Config.java:141)
at org.apache.solr.core.SolrConfig.(SolrConfig.java:132)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:430)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
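For what it's worth: CoreContainer.load() resolves each core's config relative to that core's instanceDir (instanceDir/conf/solrconfig.xml), which is why the error points at solrhome/mycore/conf. So either move conf/ under solrhome/mycore/, or declare the core with instanceDir="." so it shares solrhome/conf. A solr.xml along these lines (illustrative, since the original was stripped from the message) should make the second sample work:

```xml
<!-- war/WEB-INF/solrhome/solr.xml -->
<solr persistent="false">
  <cores adminPath="/admin/cores">
    <!-- instanceDir is resolved relative to solr home; with "." this core
         shares solrhome/conf for solrconfig.xml and schema.xml -->
    <core name="mycore" instanceDir="."/>
  </cores>
</solr>
```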
RE: TermsComponent + Dist. Search + Large Index + HEAP SPACE
Thanks for your suggestion. The problem seems to be the combination of shards and the TermsComponent. Now we simply request shard-by-shard, without the "shards" and "shards.qt" params, and merge the results via XSLT. Sebastian -- View this message in context: http://lucene.472066.n3.nabble.com/TermsCompoment-Dist-Search-Large-Index-HEAP-SPACE-tp2865609p2866499.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: What initializes a new searcher?
You're on the right track. In a system where the indexing process and search process are on the same machine, commits by the index process cause a new searcher to opened. In a master/slave situation (assuming you are indexing on the master and searching on the slave), then the searchers are reopened on the slaves after a replication. Replications happen after 1> a commit happens on the master and 2> the slave polls the master and pulls down the new commits. Hope that helps Erick On Tue, Apr 26, 2011 at 8:50 AM, Solr Beginner wrote: > Hi, > > I'm reading solr cache documentation - > http://wiki.apache.org/solr/SolrCaching I found there "The current > Index Searcher serves requests and when a new searcher is opened...". > Could you explain when new searcher is opened? Does it have something > to do with index commit? > > Best Regards, > Solr Beginner >
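To make the single-machine case concrete: besides explicit commits from the indexing client, a commit (and therefore a new searcher) can be triggered automatically from solrconfig.xml (the values here are examples only):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>  <!-- commit after this many pending docs -->
    <maxTime>60000</maxTime>  <!-- ... or after this many milliseconds -->
  </autoCommit>
</updateHandler>
```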
Re: WhitespaceTokenizer and scoring(field length)
First, you can give us some more data to work with... In particular, attach &debugQuery=on to your http request and post the results. That will show how the documents got their score. Also, show us the field type definition and the field definition for the field in question. Best Erick On Tue, Apr 26, 2011 at 10:27 AM, roySolr wrote: > Hello, > > I have a problem with the whitespaceTokenizer and scoring. An example: > > id Titel > 1 Manchester united > 2 Manchester > > With the whitespaceTokenizer "Manchester united" will be splitted to > "Manchester" and "united". When > i search for "manchester" i get id 1 and 2 in my results. What i want is > that id 2 scores higher(field length). > How can i fix this? > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/WhitespaceTokenizer-and-scoring-field-length-tp2865784p2865784.html > Sent from the Solr - User mailing list archive at Nabble.com. >
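The request in question would look something like this (URL is illustrative):

```
http://localhost:8983/solr/select?q=name:manchester&fl=id,name,score&debugQuery=on
```

The "explain" section of the debug output breaks each document's score into factors; the fieldNorm factor is where field length enters, so comparing it between the two documents shows whether length normalization is being applied at all (norms are dropped if the field is indexed with omitNorms="true").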
Question on Batch process
I am sure that this question has been asked a few times, but I can't seem to find the sweet spot for indexing. I have about 100,000 files, each containing 1,000 XML documents ready to be posted to Solr. My desire is to have it index as quickly as possible; once that's completed, the daily stream of ADDs will be small in comparison. The individual documents are small - essentially web postings from the net: title, postPostContent, date. What would be the ideal configuration for ramBufferSizeMB, mergeFactor, maxBufferedDocs, etc.? My machine is a quad-core with hyper-threading, so it shows up as 8 CPUs in top. I have 16 GB of available RAM. Thanks in advance. Charlie
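There is no single right answer, but a common bulk-loading starting point in solrconfig.xml looks like this (values are illustrative; measure and tune):

```xml
<indexDefaults>
  <!-- flush by RAM consumption rather than by document count -->
  <ramBufferSizeMB>128</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
  <!-- leave maxBufferedDocs unset so ramBufferSizeMB governs flushing -->
</indexDefaults>
```

Beyond the settings, the two things that usually matter most are posting from several client threads in parallel (to use all 8 CPUs) and committing once at the end of the bulk load rather than per batch.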
Re: Automatic synonyms for multiple variations of a word
Suppose your analysis stack includes lower-casing, but your synonyms are only supposed to apply to upper-case tokens. For example, "PET" might be a synonym of "positron emission tomography", but "pet" wouldn't be. -Mike On 04/26/2011 09:51 AM, Robert Muir wrote: On Tue, Apr 26, 2011 at 12:24 AM, Otis Gospodnetic wrote: But somehow this feels bad (well, so does sticking word variations in what's supposed to be a synonyms file), partly because it means that the person adding new synonyms would need to know what they stem to (or always check it against Solr before editing the file). when creating the synonym map from your input file, currently the factory actually uses your Tokenizer only to pre-process the synonyms file. One idea would be to use the tokenstream up to the synonymfilter itself (including filters). This way if you put a stemmer before the synonymfilter, it would stem your synonyms file, too. I haven't totally thought the whole thing through to see if theres a big reason why this wouldn't work (the synonymsfilter is complicated, sorry). But it does seem like it would produce more consistent results... and perhaps the inconsistency isnt so obvious since in the default configuration the synonymfilter is directly after the tokenizer.
Re: Automatic synonyms for multiple variations of a word
Mike, thanks a lot for your example: the idea here would be that you put the LowerCaseFilter after the SynonymFilter, and then you get this exact flexibility?

e.g.

WhitespaceTokenizer
SynonymFilter -> no lowercasing of tokens is done, as it "analyzes" your synonyms with just the tokenizer
LowerCaseFilter

but

WhitespaceTokenizer
LowerCaseFilter
SynonymFilter -> the synonyms are lowercased, as it "analyzes" synonyms with the tokenizer+filter

It's already inconsistent today, because if you do:

LowerCaseTokenizer
SynonymFilter

then your synonyms are in fact all being lowercased... it's just arbitrary that they are only being analyzed with the "tokenizer".

On Tue, Apr 26, 2011 at 4:13 PM, Mike Sokolov wrote: > Suppose your analysis stack includes lower-casing, but your synonyms are > only supposed to apply to upper-case tokens. For example, "PET" might be a > synonym of "positron emission tomography", but "pet" wouldn't be. > > -Mike > > On 04/26/2011 09:51 AM, Robert Muir wrote: >> >> On Tue, Apr 26, 2011 at 12:24 AM, Otis Gospodnetic >> wrote: >> >> >>> >>> But somehow this feels bad (well, so does sticking word variations in >>> what's >>> supposed to be a synonyms file), partly because it means that the person >>> adding >>> new synonyms would need to know what they stem to (or always check it >>> against >>> Solr before editing the file). >>> >> >> when creating the synonym map from your input file, currently the >> factory actually uses your Tokenizer only to pre-process the synonyms >> file. >> >> One idea would be to use the tokenstream up to the synonymfilter >> itself (including filters). This way if you put a stemmer before the >> synonymfilter, it would stem your synonyms file, too. >> >> I haven't totally thought the whole thing through to see if theres a >> big reason why this wouldn't work (the synonymsfilter is complicated, >> sorry). But it does seem like it would produce more consistent >> results... 
and perhaps the inconsistency isnt so obvious since in the >> default configuration the synonymfilter is directly after the >> tokenizer. >> >
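To make the ordering Robert describes concrete, here is a hypothetical schema.xml fragment; the field type names and attribute values are illustrative, not taken from the thread:

```xml
<!-- SynonymFilter before lower-casing: tokens reach it with their
     original case, so "PET" and "pet" can behave differently -->
<fieldType name="text_syn_first" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true" ignoreCase="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- LowerCaseFilter before SynonymFilter: all tokens are lowercased
     before synonym matching, so case distinctions are lost -->
<fieldType name="text_lc_first" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true" ignoreCase="false"/>
  </analyzer>
</fieldType>
```

As Robert notes, either way the synonyms *file* is currently pre-processed with only the tokenizer, which is where the inconsistency comes from.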
Re: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.dataimport.DataImportHandler'
I experienced the same issue. With Solr 1.x, I was copying out the 'example' directory to make my solr installation. However, for the Solr 3.x distributions, the DataImportHandler class exists in a directory that is at the same level as example: "dist", not a directory within. You'll either want to take the entire apache 3.1 directory, or modify solrconfig to point to the new place you've copied it: On Tue, Apr 26, 2011 at 6:38 AM, Stefan Matheis wrote: > http://www.lucidimagination.com/blog/2011/04/01/solr-powered-isfdb-part-8/ > > On Tue, Apr 26, 2011 at 3:34 PM, vrpar...@gmail.com > wrote: >> Hello, >> >> i got following source >> >> org.apache.solr.common.SolrException: Error loading class >> 'org.apache.solr.handler.dataimport.DataImportHandler' at >> org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:389) >> at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:423) at >> org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:459) . >> >> actually this error comes in solr 3.1 only in solr 1.4.1 it works fine >> >> how to solve this problem? >> >> Thanks >> >> Vishal Parekh >> >> >> -- >> View this message in context: >> http://lucene.472066.n3.nabble.com/org-apache-solr-common-SolrException-Error-loading-class-org-apache-solr-handler-dataimport-DataImpo-tp2865625p2865625.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >
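The solrconfig.xml pointer mentioned above got stripped from the archived mail; it would look something like the following sketch. The `dir` path and jar-name regex depend on where you copied the 3.1 distribution relative to your Solr home, so treat them as placeholders:

```xml
<!-- load the DataImportHandler jars from the 3.1 "dist" directory,
     which sits next to (not inside) the example directory -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
```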
RE: term position question from analyzer stack for WordDelimiterFilterFactory
OK this is even more weird... everything is working much better except for one thing: I was testing use cases with our top query terms to make sure the below query settings wouldn't break any existing behavior, and got this most unusual result. The analyzer stack completely eliminated the word McAfee from the query terms! I'm like huh? Here is the analyzer page output for that search term:

Query Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  term position: 1, term text: McAfee, term type: word, source start,end: 0,6

org.apache.solr.analysis.SynonymFilterFactory {synonyms=query_synonyms.txt, expand=true, ignoreCase=true}
  term position: 1, term text: McAfee, term type: word, source start,end: 0,6

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
  term position: 1, term text: McAfee, term type: word, source start,end: 0,6

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0, generateNumberParts=0, catenateWords=0, generateWordParts=0, catenateAll=0, catenateNumbers=0}
  (no tokens)

org.apache.solr.analysis.LowerCaseFilterFactory {}
  (no tokens)

com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt}
  (no tokens)

org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
  (no tokens)

-Original Message- From: Robert Petersen [mailto:rober...@buy.com] Sent: Monday, April 25, 2011 11:27 AM To: solr-user@lucene.apache.org; yo...@lucidimagination.com Subject: RE: term position question from analyzer stack for WordDelimiterFilterFactory Aha! I knew something must be awry, but when I looked at the analysis page output, well it sure looked like it should match. :) OK here is the query side WDF that finally works, I just turned everything off. 
(yay) First I tried just completely removing WDF from the query side analyzer stack, but that didn't work. So anyway I suppose I should turn off the catenate-all plus the preserve-original settings, reindex, and see if I still get a match, huh? (PS thank you very much for the help!!!) -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Monday, April 25, 2011 9:24 AM To: solr-user@lucene.apache.org Subject: Re: term position question from analyzer stack for WordDelimiterFilterFactory On Mon, Apr 25, 2011 at 12:15 PM, Robert Petersen wrote: > The search and index analyzer stack are the same. Ahhh, they should not be! Using both generate and catenate in WDF at query time is a no-no. Same reason you can't have multi-word synonyms at query time: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory I'd recommend going back to the WDF settings in the solr example server as a starting point. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
Re: WhitespaceTokenizer and scoring(field length)
Hi, If you run your query with debugQuery=true you will see the explanation about how Lucene/Solr went about scoring your 2 docs. If you can't figure out what's going on from there, send the relevant part to the list, along with the parsed query (which you can also see from debugQuery=true output) and maybe we can help. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: roySolr > To: solr-user@lucene.apache.org > Sent: Tue, April 26, 2011 10:27:44 AM > Subject: WhitespaceTokenizer and scoring(field length) > > Hello, > > I have a problem with the whitespaceTokenizer and scoring. An example: > > id Titel > 1 Manchester united > 2 Manchester > > With the whitespaceTokenizer "Manchester united" will be splitted to > "Manchester" and "united". When > i search for "manchester" i get id 1 and 2 in my results. What i want is > that id 2 scores higher(field length). > How can i fix this? > > > -- > View this message in context: >http://lucene.472066.n3.nabble.com/WhitespaceTokenizer-and-scoring-field-length-tp2865784p2865784.html > > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: What initialize new searcher?
Hi, Yes, typically after your index has been replicated from master to a slave a commit will be issued and the new searcher will be opened. Before being exposed to regular clients it's a good practice to warm things up. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Solr Beginner > To: solr-user@lucene.apache.org > Sent: Tue, April 26, 2011 8:50:21 AM > Subject: What initialize new searcher? > > Hi, > > I'm reading solr cache documentation - > http://wiki.apache.org/solr/SolrCaching I found there "The current > Index Searcher serves requests and when a new searcher is opened...". > Could you explain when new searcher is opened? Does it have something > to do with index commit? > > Best Regards, > Solr Beginner >
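The warm-up Otis mentions is typically configured in solrconfig.xml with a newSearcher event listener; a minimal sketch (the queries themselves are placeholders you would replace with representative traffic):

```xml
<!-- fire a few representative queries against the new searcher
     before it is exposed to regular clients -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">some popular query</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>
```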
Re: Automatic synonyms for multiple variations of a word
Yes, I see. Makes sense. It is a bit hard to see a "bad" case for your proposal in that light. Here is one other example; I'm not sure whether it presents difficulties or not, and it may be a bit contrived, but hey, food for thought at least. Say you have set up synonyms between names and commonly-used pseudonyms or alternate names that should not be stemmed:

Malcolm X <=> Malcolm Little
Prince <=> Rogers Nelson Prince
Little Kim <=> Kimberly Denise Jones
Biggy Smalls
etc.

You don't want "Malcolm Littler" or "Littlest Kim" or "Big Small" to match anything. And Princely shouldn't bring up the artist. But you also have regular linguistic synonyms (not names) that *should* be stemmed (as in the original example). So little <=> small should imply littler <=> smaller and so on via stemming. Ideally you could put one SynonymFilter before the stemming and the other one after. In that case do the SynonymFilters get composed? I can't think of a believable example where that would cause a problem, but maybe you can?

-Mike

On 04/26/2011 04:25 PM, Robert Muir wrote: Mike, thanks a lot for your example: the idea here would be you would put the lowercasefilter after the synonymfilter, and then you get this exact flexibility? e.g. WhitespaceTokenizer SynonymFilter -> no lowercasing of tokens are done as it "analyzes" your synonyms with just the tokenizer LowerCaseFilter but WhitespaceTokenizer LowerCaseFilter SynonymFilter -> the synonyms are lowercased, as it "analyzes" synonyms with the tokenizer+filter its already inconsistent today, because if you do: LowerCaseTokenizer SynonymFilter then your synonyms are in fact all being lowercased... its just arbitrary that they are only being analyzed with the "tokenizer". On Tue, Apr 26, 2011 at 4:13 PM, Mike Sokolov wrote: Suppose your analysis stack includes lower-casing, but your synonyms are only supposed to apply to upper-case tokens. For example, "PET" might be a synonym of "positron emission tomography", but "pet" wouldn't be. 
-Mike On 04/26/2011 09:51 AM, Robert Muir wrote: On Tue, Apr 26, 2011 at 12:24 AM, Otis Gospodnetic wrote: But somehow this feels bad (well, so does sticking word variations in what's supposed to be a synonyms file), partly because it means that the person adding new synonyms would need to know what they stem to (or always check it against Solr before editing the file). when creating the synonym map from your input file, currently the factory actually uses your Tokenizer only to pre-process the synonyms file. One idea would be to use the tokenstream up to the synonymfilter itself (including filters). This way if you put a stemmer before the synonymfilter, it would stem your synonyms file, too. I haven't totally thought the whole thing through to see if theres a big reason why this wouldn't work (the synonymsfilter is complicated, sorry). But it does seem like it would produce more consistent results... and perhaps the inconsistency isnt so obvious since in the default configuration the synonymfilter is directly after the tokenizer.
Re: Ebay Kleinanzeigen and Auto Suggest
Hi Eric, Before using the terms component, allow me to point out: * http://sematext.com/products/autocomplete/index.html (used on http://search-lucene.com/ for example) * http://wiki.apache.org/solr/Suggester Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Eric Grobler > To: solr-user@lucene.apache.org > Sent: Tue, April 26, 2011 1:11:11 PM > Subject: Ebay Kleinanzeigen and Auto Suggest > > Hi > > Someone told me that ebay is using solr. > I was looking at their Auto Suggest implementation and I guess they are > using Shingles and the TermsComponent. > > I managed to get a satisfactory implementation but I have a problem with > category specific filtering. > Ebay suggestions are sensitive to categories like Cars and Pets. > > As far as I understand it is not possible to using filters with a term > query. > Unless one uses multiple fields or special prefixes for the words to index I > cannot think how to implement this. > > Is their perhaps a workaround for this limitation? > > Best Regards > EricZ > > --- > > I am have a shingle type like: > positionIncrementGap="100"> > > > maxShingleSize="4" /> > > > > > > > and a query like >http://localhost:8983/solr/terms?q=*%3A*&terms.fl=suggest_text&terms.sort=count&terms.prefix=audi >i >
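Eric's shingle field type lost its XML tags in the archive; a plausible reconstruction follows. Only positionIncrementGap="100" and maxShingleSize="4" are visible in the original mail, so the class names and other attributes here are assumptions:

```xml
<fieldType name="shingle_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- maxShingleSize="4" is the one attribute preserved in the mail -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4"/>
  </analyzer>
</fieldType>
```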
SynonymFilterFactory case changes
So if there is a hit in the synonym filter factory, do I need to put in the various case changes for a term so that the following WordDelimiterFilter analyzer can do its 'split on case changes' work? Here we see SynonymFilterFactory makes all terms lowercase, because this is what is in my synonyms.txt file and I have ignoreCase=true: "macafee, mcafee"

Index Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  term position: 1, term text: McAfee, term type: word, source start,end: 0,6

org.apache.solr.analysis.SynonymFilterFactory {synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
  term position: 1, term text: macafee / mcafee (stacked), term type: word / word, source start,end: 0,6 / 0,6
Re: Question on Batch process
Charlie, How's this:

* -Xmx2g
* ramBufferSizeMB 512
* mergeFactor 10 (default, but you could up it to 20 or 30 if ulimit -n allows)
* ignore/delete maxBufferedDocs - not used if you set ramBufferSizeMB
* use StreamingUpdateSolrServer (with params matching your number of CPU cores), or send batches of, say, 1000 docs with the other SolrServer impl using N threads (N = # of your CPU cores)

Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Charles Wardell > To: solr-user@lucene.apache.org > Sent: Tue, April 26, 2011 2:32:29 PM > Subject: Question on Batch process > > I am sure that this question has been asked a few times, but I can't seem to >find the sweetspot for indexing. > > I have about 100,000 files each containing 1,000 xml documents ready to be >posted to Solr. My desire is to have it index as quickly as possible and then >once completed the daily stream of ADDs will be small in comparison. > > The individual documents are small. Essentially web postings from the net. >Title, postPostContent, date. > > > What would be the ideal configuration? For RamBufferSize, mergeFactor, >MaxbufferedDocs, etc.. > > My machine is a quad core hyper-threaded. So it shows up as 8 cpu's in TOP > I have 16GB of available ram. > > > Thanks in advance. > Charlie
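The multi-threaded batching Otis suggests can be sketched roughly as follows. This is not Solr API code: `post_batch` is a hypothetical stand-in for the real client call (StreamingUpdateSolrServer in SolrJ, or an HTTP POST to /update); only the chunking and threading logic is shown:

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(docs, size=1000):
    """Yield successive batches of `size` docs."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def index_all(docs, post_batch, threads=8, batch_size=1000):
    """Send batches concurrently; returns the number of batches sent.
    post_batch is a hypothetical callback that posts one batch to Solr."""
    batches = list(chunked(docs, batch_size))
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(post_batch, batches))  # drain the iterator to surface errors
    return len(batches)

# dry run with a collecting stub instead of a live Solr connection
sent = []
n = index_all(list(range(2500)), sent.append, threads=4)
```

With 2,500 dummy docs and the default batch size of 1000, this sends three batches (1000, 1000, 500); the thread count would be matched to the 8 CPUs Charlie mentions.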
Re: term position question from analyzer stack for WordDelimiterFilterFactory
Hi Robert, I'm no WDFF expert, but all these zeros look suspicious:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0, generateNumberParts=0, catenateWords=0, generateWordParts=0, catenateAll=0, catenateNumbers=0}

A quick visit to http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory makes me think you want:

splitOnCaseChange=1 (if you want Mc Afee for some reason?)
generateWordParts=1 (if you want Mc Afee for some reason?)
preserveOriginal=1

Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Robert Petersen > To: solr-user@lucene.apache.org; yo...@lucidimagination.com > Sent: Tue, April 26, 2011 4:39:49 PM > Subject: RE: term position question from analyzer stack for >WordDelimiterFilterFactory > > OK this is even more weird... everything is working much better except > for one thing: I was testing use cases with our top query terms to make > sure the below query settings wouldn't break any existing behavior, and > got this most unusual result. The analyzer stack completely eliminated > the word McAfee from the query terms! I'm like huh? 
Here is the > analyzer page output for that search term: > > Query Analyzer > org.apache.solr.analysis.WhitespaceTokenizerFactory {} > term position 1 > term text McAfee > term type word > source start,end 0,6 > payload > org.apache.solr.analysis.SynonymFilterFactory > {synonyms=query_synonyms.txt, expand=true, ignoreCase=true} > term position 1 > term text McAfee > term type word > source start,end 0,6 > payload > org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, > ignoreCase=true} > term position 1 > term text McAfee > term type word > source start,end 0,6 > payload > org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0, > generateNumberParts=0, catenateWords=0, generateWordParts=0, > catenateAll=0, catenateNumbers=0} > term position > term text > term type > source start,end > payload > org.apache.solr.analysis.LowerCaseFilterFactory {} > term position > term text > term type > source start,end > payload > com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory > {protected=protwords.txt} > term position > term text > term type > source start,end > payload > org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} > term position > term text > term type > source start,end > payload > > > > -Original Message- > From: Robert Petersen [mailto:rober...@buy.com] > Sent: Monday, April 25, 2011 11:27 AM > To: solr-user@lucene.apache.org; yo...@lucidimagination.com > Subject: RE: term position question from analyzer stack for > WordDelimiterFilterFactory > > Aha! I knew something must be awry, but when I looked at the analysis > page output, well it sure looked like it should match. :) > > OK here is the query side WDF that finally works, I just turned > everything off. (yay) First I tried just completely removeing WDF from > the query side analyzer stack but that didn't work. So anyway I suppose > I should turn off the catenate all plus the preserve original settings, > reindex, and see if I still get a match huh? 
(PS thank you very much > for the help!!!) > > generateWordParts="0" > generateNumberParts="0" > catenateWords="0" > catenateNumbers="0" > catenateAll="0" > preserveOriginal="0" > /> > > > > -Original Message- > From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik > Seeley > Sent: Monday, April 25, 2011 9:24 AM > To: solr-user@lucene.apache.org > Subject: Re: term position question from analyzer stack for > WordDelimiterFilterFactory > > On Mon, Apr 25, 2011 at 12:15 PM, Robert Petersen > wrote: > > The search and index analyzer stack are the same. > > Ahhh, they should not be! > Using both generate and catenate in WDF at query time is a no-no. > Same reason you can't have multi-word synonyms at query time: > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.Synonym > FilterFactory > > I'd recommend going back to the WDF settings in the solr example > server as a starting point. > > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco >
RE: term position question from analyzer stack for WordDelimiterFilterFactory
Yeah I am about to try turning one on at a time and see what happens. I had a meeting so couldn't do it yet... (darn those meetings) (lol) -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Tuesday, April 26, 2011 2:37 PM To: solr-user@lucene.apache.org Subject: Re: term position question from analyzer stack for WordDelimiterFilterFactory Hi Robert, I'm no WDFF expert, but all these zero look suspicious: org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0, generateNumberParts=0, catenateWords=0, generateWordParts=0, catenateAll=0, catenateNumbers=0} A quick visit to http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDel imiterFilterFactory makes me think you want: splitOnCaseChange=1 (if you want Mc Afee for some reason?) generateWordParts=1 (if you want Mc Afee for some reason?) preserveOriginal=1 Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Robert Petersen > To: solr-user@lucene.apache.org; yo...@lucidimagination.com > Sent: Tue, April 26, 2011 4:39:49 PM > Subject: RE: term position question from analyzer stack for >WordDelimiterFilterFactory > > OK this is even more weird... everything is working much better except > for one thing: I was testing use cases with our top query terms to make > sure the below query settings wouldn't break any existing behavior, and > got this most unusual result. The analyzer stack completely eliminated > the word McAfee from the query terms! I'm like huh? 
Here is the > analyzer page output for that search term: > > Query Analyzer > org.apache.solr.analysis.WhitespaceTokenizerFactory {} > term position 1 > term text McAfee > term type word > source start,end 0,6 > payload > org.apache.solr.analysis.SynonymFilterFactory > {synonyms=query_synonyms.txt, expand=true, ignoreCase=true} > term position 1 > term text McAfee > term type word > source start,end 0,6 > payload > org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, > ignoreCase=true} > term position 1 > term text McAfee > term type word > source start,end 0,6 > payload > org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0, > generateNumberParts=0, catenateWords=0, generateWordParts=0, > catenateAll=0, catenateNumbers=0} > term position > term text > term type > source start,end > payload > org.apache.solr.analysis.LowerCaseFilterFactory {} > term position > term text > term type > source start,end > payload > com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory > {protected=protwords.txt} > term position > term text > term type > source start,end > payload > org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} > term position > term text > term type > source start,end > payload > > > > -Original Message- > From: Robert Petersen [mailto:rober...@buy.com] > Sent: Monday, April 25, 2011 11:27 AM > To: solr-user@lucene.apache.org; yo...@lucidimagination.com > Subject: RE: term position question from analyzer stack for > WordDelimiterFilterFactory > > Aha! I knew something must be awry, but when I looked at the analysis > page output, well it sure looked like it should match. :) > > OK here is the query side WDF that finally works, I just turned > everything off. (yay) First I tried just completely removeing WDF from > the query side analyzer stack but that didn't work. So anyway I suppose > I should turn off the catenate all plus the preserve original settings, > reindex, and see if I still get a match huh? 
(PS thank you very much > for the help!!!) > > generateWordParts="0" > generateNumberParts="0" > catenateWords="0" > catenateNumbers="0" > catenateAll="0" > preserveOriginal="0" > /> > > > > -Original Message- > From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik > Seeley > Sent: Monday, April 25, 2011 9:24 AM > To: solr-user@lucene.apache.org > Subject: Re: term position question from analyzer stack for > WordDelimiterFilterFactory > > On Mon, Apr 25, 2011 at 12:15 PM, Robert Petersen > wrote: > > The search and index analyzer stack are the same. > > Ahhh, they should not be! > Using both generate and catenate in WDF at query time is a no-no. > Same reason you can't have multi-word synonyms at query time: > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.Synonym > FilterFactory > > I'd recommend going back to the WDF settings in the solr example > server as a starting point. > > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco >
Reader per query request
Hi, I was wondering: does Solr open a new Lucene IndexReader for every query request? From a performance point of view, is there any problem with opening a lot of IndexReaders concurrently, or should the application have some logic to reuse the same IndexReader? Thanks, cy -- View this message in context: http://lucene.472066.n3.nabble.com/Reader-per-query-request-tp2867778p2867778.html Sent from the Solr - User mailing list archive at Nabble.com.
Field Length and Highlight
Hi, I've been using Solr with ColdFusion 9. I've made a couple of adjustments to it in order to fulfill the needs of my client. I'm using Solr as a document search engine for an online library which has documents larger than 20MB, and some of them have more than 20 pages. The thing is that... at first Solr didn't index all the text; I already fixed that by changing the maxFieldLength number in the collections. Now, when I search for some word at the end of a document that has like 150 pages, it shows me the document but won't highlight the words that are almost at the end. Any ideas?
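For reference, two separate limits are involved here, and raising only the indexing one leaves the highlighter capped. A hedged solrconfig.xml sketch (values are illustrative, and hl.maxAnalyzedChars would go in the search handler's defaults or on the request itself):

```xml
<!-- indexing side: index the whole document instead of truncating -->
<maxFieldLength>2147483647</maxFieldLength>

<!-- highlighting side: the highlighter only analyzes the first
     hl.maxAnalyzedChars characters of a field, so matches near the end
     of very long documents won't be highlighted unless this is raised -->
<str name="hl.maxAnalyzedChars">2147483647</str>
```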
Re: SynonymFilterFactory case changes
Yes, order does matter. You're right, putting, say, lowercase in front of WordDelimiter... will mess up the operations of WDFF. The admin/analysis page is *extremely* useful for understanding what happens in the analysis of input. Make sure to check the "verbose" checkbox. Best Erick On Tue, Apr 26, 2011 at 5:10 PM, Robert Petersen wrote: > So if there is a hit in the synonym filter factory, do I need to put the > various case changes for a term so that the following > WordDelimiterFilter analyzer can do its 'split on case changes' work? > Here we see SynonymFilterFactory makes all terms lowercase because this > is what is in my synonmyms.txt file and I have ignoreCase=true: > "macafee, mcafee" > > Index Analyzer > org.apache.solr.analysis.WhitespaceTokenizerFactory {} > term position 1 > term text McAfee > term type word > source start,end 0,6 > payload > org.apache.solr.analysis.SynonymFilterFactory > {synonyms=index_synonyms.txt, expand=true, ignoreCase=true} > term position 1 > term text macafee > mcafee > term type word > word > source start,end 0,6 > 0,6 > payload > >
Re: term position question from analyzer stack for WordDelimiterFilterFactory
I second Otis' comments. Is it possible that you've gotten twisted around by trying to modify these settings and would be better off going back to the WDF settings in the example schema? I've sometimes found that to be very useful. Also (although I don't think it applies in this case) be aware that the analysis page may introduce its own errors, so when you see something really wonky, try a query with &debugQuery=on and see if the parsed query squares with the results on the analysis page... Best Erick On Tue, Apr 26, 2011 at 5:44 PM, Robert Petersen wrote: > Yeah I am about to try turning one on at a time and see what happens. I > had a meeting so couldn't do it yet... (darn those meetings) (lol) > > > -Original Message- > From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] > Sent: Tuesday, April 26, 2011 2:37 PM > To: solr-user@lucene.apache.org > Subject: Re: term position question from analyzer stack for > WordDelimiterFilterFactory > > Hi Robert, > > I'm no WDFF expert, but all these zero look suspicious: > > org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0, > generateNumberParts=0, catenateWords=0, generateWordParts=0, > catenateAll=0, catenateNumbers=0} > > A quick visit to > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDel > imiterFilterFactory > makes me think you want: > > splitOnCaseChange=1 (if you want Mc Afee for some reason?) > generateWordParts=1 (if you want Mc Afee for some reason?) > preserveOriginal=1 > > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > - Original Message >> From: Robert Petersen >> To: solr-user@lucene.apache.org; yo...@lucidimagination.com >> Sent: Tue, April 26, 2011 4:39:49 PM >> Subject: RE: term position question from analyzer stack for >>WordDelimiterFilterFactory >> >> OK this is even more weird... 
everything is working much better except >> for one thing: I was testing use cases with our top query terms to > make >> sure the below query settings wouldn't break any existing behavior, > and >> got this most unusual result. The analyzer stack completely > eliminated >> the word McAfee from the query terms! I'm like huh? Here is the >> analyzer page output for that search term: >> >> Query Analyzer >> org.apache.solr.analysis.WhitespaceTokenizerFactory {} >> term position 1 >> term text McAfee >> term type word >> source start,end 0,6 >> payload >> org.apache.solr.analysis.SynonymFilterFactory >> {synonyms=query_synonyms.txt, expand=true, ignoreCase=true} >> term position 1 >> term text McAfee >> term type word >> source start,end 0,6 >> payload >> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, >> ignoreCase=true} >> term position 1 >> term text McAfee >> term type word >> source start,end 0,6 >> payload >> org.apache.solr.analysis.WordDelimiterFilterFactory > {preserveOriginal=0, >> generateNumberParts=0, catenateWords=0, generateWordParts=0, >> catenateAll=0, catenateNumbers=0} >> term position >> term text >> term type >> source start,end >> payload >> org.apache.solr.analysis.LowerCaseFilterFactory {} >> term position >> term text >> term type >> source start,end >> payload >> com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory >> {protected=protwords.txt} >> term position >> term text >> term type >> source start,end >> payload >> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} >> term position >> term text >> term type >> source start,end >> payload >> >> >> >> -Original Message- >> From: Robert Petersen [mailto:rober...@buy.com] >> Sent: Monday, April 25, 2011 11:27 AM >> To: solr-user@lucene.apache.org; yo...@lucidimagination.com >> Subject: RE: term position question from analyzer stack for >> WordDelimiterFilterFactory >> >> Aha! 
I knew something must be awry, but when I looked at the > analysis >> page output, well it sure looked like it should match. :) >> >> OK here is the query side WDF that finally works, I just turned >> everything off. (yay) First I tried just completely removeing WDF > from >> the query side analyzer stack but that didn't work. So anyway I > suppose >> I should turn off the catenate all plus the preserve original > settings, >> reindex, and see if I still get a match huh? (PS thank you very > much >> for the help!!!) >> >> > generateWordParts="0" >> generateNumberParts="0" >> catenateWords="0" >> catenateNumbers="0" >> catenateAll="0" >> preserveOriginal="0" >> /> >> >> >> >> -Original Message- >> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik >> Seeley >> Sent: Monday, April 25, 2011 9:24 AM >> To: solr-user@lucene.apache.org >> Subject: Re: term position qu
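The example-schema query-side WDF settings Erick points back to look roughly like this; this is recalled from the 1.4/3.1 example schema rather than quoted from it, so verify against your own distribution:

```xml
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="0" catenateNumbers="0" catenateAll="0"
        splitOnCaseChange="1"/>
```

The key point from Yonik's reply is that the query side does not use the catenate options, avoiding the multi-token-at-query-time problem.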
Re: Reader per query request
See below On Tue, Apr 26, 2011 at 6:15 PM, cyang2010 wrote: > Hi, > > I was wondering if solr open a new lucene IndexReader for every query > request? > no, absolutely not. Solr only opens a reader when the underlying index has changed, say a commit or a replication happens. > From performance point of view, is there any problem of opening a lot of > IndexReaders concurrently, or application shall have some logic to reuse the > same IndexReader? Every time you open a reader, a whole new set of caches are initiated. I have a hard time imagining a situation in which opening a new searcher for each request would be a good idea. Opening a new reader, especially for a large index is a very expensive operation and should be done as rarely as possible. But Solr will do this automatically for you, by and large you don't have to think about it. Best Erick > > > Thanks, > > > cy > > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Reader-per-query-request-tp2867778p2867778.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: Too many open files exception related to solrj getServer too often?
Just bumping the topic and looking for answers. -- View this message in context: http://lucene.472066.n3.nabble.com/Too-many-open-files-exception-related-to-solrj-getServer-too-often-tp2808718p2867976.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Reader per query request
Thanks a lot. That makes sense. -- CY -- View this message in context: http://lucene.472066.n3.nabble.com/Reader-per-query-request-tp2867778p2867995.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: SynonymFilterFactory case changes
But in this case lowercase is after WDF. The question is: when you get a hit in the SynonymFilter, and the entries in the synonyms.txt file are all lower case, do I need to add the case-changing versions to make WDF work on case changes? It appears the synonym text is replaced verbatim by what is in the txt file, and that defeats the WDF filter. In fact, adding the case-changing versions of this term to the synonyms.txt file makes this use case work. (yay) -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Tuesday, April 26, 2011 3:39 PM To: solr-user@lucene.apache.org Subject: Re: SynonymFilterFactory case changes Yes, order does matter. You're right, putting, say, lowercase in front of WordDelimiter... will mess up the operations of WDFF. The admin/analysis page is *extremely* useful for understanding what happens in the analysis of input. Make sure to check the "verbose" checkbox. Best Erick On Tue, Apr 26, 2011 at 5:10 PM, Robert Petersen wrote: > So if there is a hit in the synonym filter factory, do I need to put the > various case changes for a term so that the following > WordDelimiterFilter analyzer can do its 'split on case changes' work? > Here we see SynonymFilterFactory makes all terms lowercase because this > is what is in my synonyms.txt file and I have ignoreCase=true: > "macafee, mcafee" > > Index Analyzer > org.apache.solr.analysis.WhitespaceTokenizerFactory {} > term position 1 > term text McAfee > term type word > source start,end 0,6 > payload > org.apache.solr.analysis.SynonymFilterFactory > {synonyms=index_synonyms.txt, expand=true, ignoreCase=true} > term position 1 > term text macafee > mcafee > term type word > word > source start,end 0,6 > 0,6 > payload > >
Re: Field Length and Highlight
(11/04/27 7:35), Alejandro Delgadillo wrote: Hi, I've been using Solr with ColdFusion 9. I've made a couple of adjustments to it in order to fulfill the needs of my client. I'm using Solr as a document search engine for an online library which has documents larger than 20 MB, and some of them have more than 20 pages. The thing is that... at first Solr didn't index all the text; I already fixed that by changing the maxFieldLength number in the collections. Now when I search for some word at the end of a document that has like 150 pages, it shows me the document but won't highlight the words that are almost at the end. Any ideas? So your maxAnalyzedChars is too small? http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars Koji -- http://www.rondhuit.com/en/
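Koji's pointer can be applied per request or, more conveniently, as a handler default. A sketch, assuming the body field is named `text` (the field name and the size value are illustrative; size the limit to the largest document you index):

```xml
<!-- solrconfig.xml: raise the highlighter's analysis window so terms
     near the end of very large documents are still highlighted. The
     default is far smaller than a 20 MB document. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">text</str>
    <str name="hl.maxAnalyzedChars">51200000</str>
  </lst>
</requestHandler>
```

The same parameter can also be passed ad hoc on the query string as `hl.maxAnalyzedChars`.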
Re: Question on Batch process
Thank you Otis. Without trying to appear too stupid: when you refer to having the params match my # of CPU cores, are you talking about the # of threads I can spawn with the StreamingUpdateSolrServer object? Up until now, I have been just utilizing post.sh or post.jar. Are these capable of that, or do I need to write some code to collect a bunch of files into the buffer and send it off? Also, do you have a sense for how long it should take to index 100,000 files, or in my case 100,000,000 documents? StreamingUpdateSolrServer public StreamingUpdateSolrServer(String solrServerUrl, int queueSize, int threadCount) throws MalformedURLException Thanks again, Charlie -- Best Regards, Charles Wardell Blue Chips Technology, Inc. www.bcsolution.com On Tuesday, April 26, 2011 at 5:12 PM, Otis Gospodnetic wrote: > Charlie, > > How's this: > * -Xmx2g > * ramBufferSizeMB 512 > * mergeFactor 10 (default, but you could up it to 20, 30, if ulimit -n allows) > * ignore/delete maxBufferedDocs - not used if you ran ramBufferSizeMB > * use SolrStreamingUpdateServer (with params matching your number of CPU > cores) > or send batches of say 1000 docs with the other SolrServer impl using N > threads > (N=# of your CPU cores) > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > - Original Message > > From: Charles Wardell > > To: solr-user@lucene.apache.org > > Sent: Tue, April 26, 2011 2:32:29 PM > > Subject: Question on Batch process > > > > I am sure that this question has been asked a few times, but I can't seem > > to > > find the sweetspot for indexing. > > > > I have about 100,000 files each containing 1,000 xml documents ready to be > > posted to Solr. My desire is to have it index as quickly as possible and > > then > > once completed the daily stream of ADDs will be small in comparison. > > > > The individual documents are small. Essentially web postings from the net. 
> > Title, postPostContent, date. > > > > > > What would be the ideal configuration? For RamBufferSize, mergeFactor, > > MaxbufferedDocs, etc.. > > > > My machine is a quad core hyper-threaded. So it shows up as 8 cpu's in TOP > > I have 16GB of available ram. > > > > > > Thanks in advance. > > Charlie >
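Otis's tuning suggestions map onto the indexing section of solrconfig.xml roughly as follows. This is a sketch for Solr 1.4/3.x-era configs; the values are the ones from his reply, not numbers tuned for any particular hardware:

```xml
<!-- solrconfig.xml, <indexDefaults> (or <mainIndex>) section -->
<indexDefaults>
  <!-- flush the in-memory buffer at 512 MB; when ramBufferSizeMB is
       set, remove (or ignore) maxBufferedDocs entirely -->
  <ramBufferSizeMB>512</ramBufferSizeMB>
  <!-- 10 is the default; raising it to 20-30 means fewer, larger
       merges, but needs a higher "ulimit -n" for the extra open files -->
  <mergeFactor>10</mergeFactor>
</indexDefaults>
```

The threading advice (N threads, N = # of CPU cores) then corresponds to the `threadCount` argument of the StreamingUpdateSolrServer constructor quoted above; post.sh/post.jar are single-threaded convenience tools, so a SolrJ client is needed to exploit it.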
Re: SynonymFilterFactory case changes
Ahhh, I misread your post. First, it's not the SynonymFilterFactory that's lowercasing anything. The ignoreCase="true" affects the matching, not the output. The output is probably lowercased because you have it that way in the synonyms.txt file. At least that's what I just saw using the analysis page from the Solr admin page. So yes, if you want the WDF to do anything on tokens put into the input stream by SynonymFilterFactory, you need to make the replacement be the accurate case. But I think you already figured all that out. Best Erick On Tue, Apr 26, 2011 at 7:19 PM, Robert Petersen wrote: > But in this case lowercase is after WDF. The question is that when you get a > hit in the SynonymFilter on a synonym and where the entries in the synonyms.txt > file are all in lower case do I need to add the case changing versions to > make WDF work on case changes because it appears the synonym text is replaced > verbatim by what is in the txt file and so that defeats the WDF filter. In > fact, adding the case changing versions of this term to the synonyms.txt file > makes this use case work. (yay) > > -Original Message- > From: Erick Erickson [mailto:erickerick...@gmail.com] > Sent: Tuesday, April 26, 2011 3:39 PM > To: solr-user@lucene.apache.org > Subject: Re: SynonymFilterFactory case changes > > Yes, order does matter. You're right, putting, say, lowercase in front > of WordDelimiter... will mess up the operations of WDFF. > > The admin/analysis page is *extremely* useful for understanding what > happens in the analysis of input. Make sure to check the "verbose" > checkbox. > > Best > Erick > > On Tue, Apr 26, 2011 at 5:10 PM, Robert Petersen wrote: >> So if there is a hit in the synonym filter factory, do I need to put the >> various case changes for a term so that the following >> WordDelimiterFilter analyzer can do its 'split on case changes' work? 
>> Here we see SynonymFilterFactory makes all terms lowercase because this >> is what is in my synonyms.txt file and I have ignoreCase=true: >> "macafee, mcafee" >> >> Index Analyzer >> org.apache.solr.analysis.WhitespaceTokenizerFactory {} >> term position 1 >> term text McAfee >> term type word >> source start,end 0,6 >> payload >> org.apache.solr.analysis.SynonymFilterFactory >> {synonyms=index_synonyms.txt, expand=true, ignoreCase=true} >> term position 1 >> term text macafee >> mcafee >> term type word >> word >> source start,end 0,6 >> 0,6 >> payload >> >> >
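The upshot of this thread can be sketched as a synonyms file (the entries are illustrative, built from the "macafee, mcafee" example in the post): because SynonymFilterFactory emits the replacement tokens verbatim from the file, any case-changing variant that a downstream WordDelimiterFilterFactory is supposed to split on must be listed explicitly.

```text
# synonyms.txt sketch. ignoreCase=true only affects *matching*; the
# emitted token is exactly what appears here. To let a later
# WordDelimiterFilterFactory split on case changes, include the
# cased form as well as the lowercase forms:
macafee, mcafee, McAfee
```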
Suggester or spellcheck return stored fields
Hello all, I am trying to build an autocomplete solution for a website that I run. The current implementation is going to be used for choosing whom you want to send PMs to. I have it basically working up to this point: the UI is done, and the suggester is returning possible completions without any major problems. The problem I am currently running into is that the suggestions it returns are not necessarily unique. To solve this, I would like to return the user ID (a stored field) along with the suggestion. This would help in other areas as well, but would ensure things are unique. Is it possible to make the suggester return these other fields, or does it strictly return text, as I assume is the case? I know I am likely stretching what the suggester is supposed to do, so I am OK rolling back to a different plan using normal queries. But I would prefer to be able to use the suggester if possible. Thanks for the help, Cameron
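The Suggester component only returns terms, not stored fields. If Cameron falls back to the "normal queries" plan, one common pattern is an edge-ngram field queried through the regular search handler, which returns whole documents, stored user ID included. A sketch, with field-type and attribute values invented for illustration:

```xml
<!-- schema.xml: prefix-matching field type for autocomplete. Querying
     it with /select returns full documents, so a stored userId field
     comes back alongside each suggested name, making results unique. -->
<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```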
Re: How to Update Value of One Field of a Document in Index?
My schema: id, name, checksum, body, notes, date I'd like for a user to be able to add notes to the notes field, and not have to re-index the document (since the body field may contain 100MB of text). Some ideas: 1) How about creating another core which only contains id, checksum, and notes? Then, "updating" (delete followed by add) wouldn't be that painful? 2) What about using a multiValued field? Could you just keep adding values as the user enters more notes? Pete On Sep 9, 2010, at 11:06 PM, Liam O'Boyle wrote: > Hi Savannah, > > You can only reindex the entire document; if you only have the ID, > then do a search to retrieve the rest of the data, then reindex. This > assumes that all of the fields you need to index are stored (so that > you can retrieve them) and not just indexed. > > Liam > > On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett > wrote: >> >> I use nutch to crawl and index to Solr. My code is working. Now, I want to >> update the value of one of the fields of a document in the solr index after >> the >> document was already indexed, and I have only the document id. How do I do >> that? >> >> Thanks. >> >> >>
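Liam's point, that an "update" is really a full re-add keyed on the uniqueKey, can be sketched as the XML actually posted to Solr (field names follow Pete's schema; the values are made up):

```xml
<!-- Re-adding a document whose uniqueKey (id) already exists replaces
     the whole document, so every field must be re-sent, not just the
     new note. A multiValued notes field just gets repeated elements. -->
<add>
  <doc>
    <field name="id">doc-42</field>
    <field name="name">example.pdf</field>
    <field name="checksum">abc123</field>
    <field name="body">full body text, re-sent unchanged</field>
    <field name="notes">first note</field>
    <field name="notes">second note added later</field>
    <field name="date">2010-09-10T00:00:00Z</field>
  </doc>
</add>
```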
Re: What initialize new searcher?
Thank you for the answers. I'm moving forward and have a few more questions, but for separate threads. On Tue, Apr 26, 2011 at 10:47 PM, Otis Gospodnetic wrote: > Hi, > > Yes, typically after your index has been replicated from master to a slave a > commit will be issued and the new searcher will be opened. Before being > exposed > to regular clients it's a good practice to warm things up. > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > - Original Message >> From: Solr Beginner >> To: solr-user@lucene.apache.org >> Sent: Tue, April 26, 2011 8:50:21 AM >> Subject: What initialize new searcher? >> >> Hi, >> >> I'm reading solr cache documentation - >> http://wiki.apache.org/solr/SolrCaching I found there "The current >> Index Searcher serves requests and when a new searcher is opened...". >> Could you explain when new searcher is opened? Does it have something >> to do with index commit? >> >> Best Regards, >> Solr Beginner >> >
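The replicate-then-commit sequence Otis describes is the slave side of Solr's built-in ReplicationHandler; a minimal sketch (the master URL and poll interval are placeholders): after each successful poll that copies new index files, a commit is issued on the slave, which is exactly what triggers the new searcher.

```xml
<!-- solrconfig.xml on the slave core -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/core0/replication</str>
    <!-- check the master for a newer index generation once a minute -->
    <str name="pollInterval">00:01:00</str>
  </lst>
</requestHandler>
```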
fieldCache only on stats page
Hi, I can see only fieldCache (nothing about the filter, query or document caches) on the stats page. What am I doing wrong? We have two servers with replication. There are two cores (prod, dev) on each server. Maybe I have to add something to the solrconfig.xml of the cores? Best Regards, Solr Beginner
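The writer's own guess is likely right: filterCache, queryResultCache and documentCache only appear on the stats page when they are declared in each core's solrconfig.xml, whereas fieldCache is a Lucene-internal cache that shows up regardless. A sketch using the stock example sizes (the numbers are the shipped defaults, not a tuning recommendation):

```xml
<!-- solrconfig.xml, inside the <query> section of each core -->
<query>
  <filterCache class="solr.FastLRUCache"
               size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache"
                    size="512" initialSize="512" autowarmCount="0"/>
  <documentCache class="solr.LRUCache"
                 size="512" initialSize="512" autowarmCount="0"/>
</query>
```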
DataImportHandler in Solr 3.1.0: not updating dataimport.properties last_index_time on delta-import?
Title pretty much says it all; I've configured the DIH in 3.1.0, and it works great, except the delta-imports are always from the last time a full-import happened, not a delta-import. After a delta-import, dataimport.properties is completely untouched. The documentation implies that the delta-import should update the last_index_time: "The DataImportHandler exposes a variable called last_index_time which is a timestamp value denoting the last time full-import 'or' delta-import was run" - http://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example Is there a configuration preventing delta-import from updating dataimport.properties? It updates properly on each full-import.
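For reference, the delta setup that consumes last_index_time looks roughly like the fragment below (table and column names are invented for illustration; whether dataimport.properties is rewritten after a delta-import, the open question of this post, does not change the config shape):

```xml
<!-- DIH data-config.xml: deltaQuery selects the keys of rows changed
     since the timestamp DIH recorded in dataimport.properties, and
     deltaImportQuery fetches each changed row by that key. -->
<entity name="item" pk="id"
        query="SELECT * FROM item"
        deltaQuery="SELECT id FROM item
                    WHERE updated_at &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM item
                          WHERE id = '${dataimporter.delta.item.id}'"/>
```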
Re: Ebay Kleinanzeigen and Auto Suggest
Thanks for the links Otis, I will have a look. Regards Ericz On Tue, Apr 26, 2011 at 10:06 PM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote: > Hi Eric, > > Before using the terms component, allow me to point out: > > * http://sematext.com/products/autocomplete/index.html (used on > http://search-lucene.com/ for example) > > * http://wiki.apache.org/solr/Suggester > > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > - Original Message > > From: Eric Grobler > > To: solr-user@lucene.apache.org > > Sent: Tue, April 26, 2011 1:11:11 PM > > Subject: Ebay Kleinanzeigen and Auto Suggest > > > > Hi > > > > Someone told me that ebay is using solr. > > I was looking at their Auto Suggest implementation and I guess they are > > using Shingles and the TermsComponent. > > > > I managed to get a satisfactory implementation but I have a problem with > > category specific filtering. > > Ebay suggestions are sensitive to categories like Cars and Pets. > > > > As far as I understand it is not possible to use filters with a terms > > query. > > Unless one uses multiple fields or special prefixes for the words to index, I > > cannot think how to implement this. > > > > Is there perhaps a workaround for this limitation? > > > > Best Regards > > EricZ > > > > --- > > > > I have a shingle type like: > > > > <fieldType ... positionIncrementGap="100"> <analyzer> ... <filter class="solr.ShingleFilterFactory" maxShingleSize="4" /> ... </analyzer> </fieldType> > > > > and a query like > > http://localhost:8983/solr/terms?q=*%3A*&terms.fl=suggest_text&terms.sort=count&terms.prefix=audi > >
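Eric's own workaround idea, prefixing the indexed suggestion terms with a category token so that terms.prefix does the filtering, can be sketched as follows (the field name `suggest_text` comes from his post; the category codes and separator are invented for illustration):

```text
# Index each suggestion into suggest_text with its category prepended,
# e.g. "cars|audi a4", "pets|aquarium pump". A category-scoped
# suggestion request then becomes an ordinary TermsComponent prefix
# query; the client strips the "cars|" prefix before display:
http://localhost:8983/solr/terms?terms.fl=suggest_text&terms.sort=count&terms.prefix=cars|audi
```

This keeps a single field and a single request per keystroke, at the cost of indexing each suggestion once per category it belongs to.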