Re: Solr 3.5 takes very long to commit gradually

2012-04-12 Thread Jan Høydahl
What operating system?
Are you using spellchecker with buildOnCommit?
Anything special in your Update Chain?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
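
For reference, a hedged sketch of the kind of spellchecker setup being asked about (the component name, source field and directory are assumptions, not the actual config); with buildOnCommit set to true the spellcheck index is rebuilt on every commit, which can make commits very slow on large cores:

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
      <!-- rebuilds this spellcheck index on every commit -->
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>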

On 12. apr. 2012, at 06:45, Rohit wrote:

> We recently migrated from solr3.1 to solr3.5, we have one master and one
> slave configured. The master has two cores,
> 
> 1) Core1 - 44555972 documents
> 
> 2) Core2 - 29419244 documents
> 
> We commit every 5000 documents, but lately the commit time gradually
> increase and solr is taking as very long 15 minutes plus in some cases. What
> could have caused this, I have checked the logs and the only warning i can
> see is,
> 
> "WARNING: Use of deprecated update request parameter update.processor
> detected. Please use the new parameter update.chain instead, as support for
> update.processor will be removed in a later version."
> 
> Memory details:
> 
> export JAVA_OPTS="$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g"
> 
> Solr Config:
> 
> false
> 
> 10
> 
> 32
> 
> 
> 
> 1
> 
> 1000
> 
> 1
> 
> Also noticed, that top command show almost 350GB of Virtual memory usage.
> 
> What could be causing this, as everything was running fine a few days back?
> 
> 
> 
> 
> 
> Regards,
> 
> Rohit
> 
> Mobile: +91-9901768202
> 
> About Me:   http://about.me/rohitg
> 
> 
> 



Re: Solr 3.5 takes very long to commit gradually

2012-04-12 Thread Tirthankar Chatterjee
Hi Rohit,
What would be the average size of your documents? Also, can you please share
your idea behind having 2 cores in the master. I just wanted to know the
reasoning behind the design.

Thanks in advance 

Tirthankar
On Apr 12, 2012, at 3:19 AM, Jan Høydahl wrote:

> What operating system?
> Are you using spellchecker with buildOnCommit?
> Anything special in your Update Chain?
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
> 
> On 12. apr. 2012, at 06:45, Rohit wrote:
> 
>> We recently migrated from solr3.1 to solr3.5, we have one master and one
>> slave configured. The master has two cores,
>> 
>> 1) Core1 - 44555972 documents
>> 
>> 2) Core2 - 29419244 documents
>> 
>> We commit every 5000 documents, but lately the commit time gradually
>> increase and solr is taking as very long 15 minutes plus in some cases. What
>> could have caused this, I have checked the logs and the only warning i can
>> see is,
>> 
>> "WARNING: Use of deprecated update request parameter update.processor
>> detected. Please use the new parameter update.chain instead, as support for
>> update.processor will be removed in a later version."
>> 
>> Memory details:
>> 
>> export JAVA_OPTS="$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g"
>> 
>> Solr Config:
>> 
>> false
>> 
>> 10
>> 
>> 32
>> 
>> 
>> 
>> 1
>> 
>> 1000
>> 
>> 1
>> 
>> Also noticed, that top command show almost 350GB of Virtual memory usage.
>> 
>> What could be causing this, as everything was running fine a few days back?
>> 
>> 
>> 
>> 
>> 
>> Regards,
>> 
>> Rohit
>> 
>> Mobile: +91-9901768202
>> 
>> About Me:   http://about.me/rohitg
>> 
>> 
>> 
> 



Re: Solr 3.5 takes very long to commit gradually

2012-04-12 Thread Tirthankar Chatterjee


Hi Rohit,
Can you please check the solrconfig.xml in 3.5 and compare it with 3.1 to see
whether any warming queries are specified for the searchers opened after a commit.

Thanks,
Tirthankar
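
For reference, a hedged sketch of the kind of warming configuration being referred to in solrconfig.xml (the queries are placeholders); a heavy newSearcher listener runs every time a commit opens a new searcher and can add considerably to commit time:

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- each entry is executed against the new searcher to warm its caches -->
      <lst><str name="q">solr</str><str name="start">0</str><str name="rows">10</str></lst>
    </arr>
  </listener>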
On Apr 12, 2012, at 3:30 AM, Tirthankar Chatterjee wrote:

> Hi Rohit,
> What would be the average size of your documents and also can you please 
> share your idea of having 2 cores in the master. I just wanted to know the 
> reasoning behind the design. 
> 
> Thanks in advance 
> 
> Tirthankar
> On Apr 12, 2012, at 3:19 AM, Jan Høydahl wrote:
> 
>> What operating system?
>> Are you using spellchecker with buildOnCommit?
>> Anything special in your Update Chain?
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Solr Training - www.solrtraining.com
>> 
>> On 12. apr. 2012, at 06:45, Rohit wrote:
>> 
>>> We recently migrated from solr3.1 to solr3.5, we have one master and one
>>> slave configured. The master has two cores,
>>> 
>>> 1) Core1 - 44555972 documents
>>> 
>>> 2) Core2 - 29419244 documents
>>> 
>>> We commit every 5000 documents, but lately the commit time gradually
>>> increase and solr is taking as very long 15 minutes plus in some cases. What
>>> could have caused this, I have checked the logs and the only warning i can
>>> see is,
>>> 
>>> "WARNING: Use of deprecated update request parameter update.processor
>>> detected. Please use the new parameter update.chain instead, as support for
>>> update.processor will be removed in a later version."
>>> 
>>> Memory details:
>>> 
>>> export JAVA_OPTS="$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g"
>>> 
>>> Solr Config:
>>> 
>>> false
>>> 
>>> 10
>>> 
>>> 32
>>> 
>>> 
>>> 
>>> 1
>>> 
>>> 1000
>>> 
>>> 1
>>> 
>>> Also noticed, that top command show almost 350GB of Virtual memory usage.
>>> 
>>> What could be causing this, as everything was running fine a few days back?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Regards,
>>> 
>>> Rohit
>>> 
>>> Mobile: +91-9901768202
>>> 
>>> About Me:   http://about.me/rohitg
>>> 
>>> 
>>> 
>> 
> 



RE: Solr 3.5 takes very long to commit gradually

2012-04-12 Thread Rohit
Hi Tirthankar,

The average size of documents would be a few KB; these are mostly tweets
which are being saved. The two cores store different kinds of data, and
nothing else.

Regards,
Rohit
Mobile: +91-9901768202
About Me: http://about.me/rohitg

-Original Message-
From: Tirthankar Chatterjee [mailto:tchatter...@commvault.com] 
Sent: 12 April 2012 13:14
To: solr-user@lucene.apache.org
Subject: Re: Solr 3.5 takes very long to commit gradually

Hi Rohit,
What would be the average size of your documents and also can you please
share your idea of having 2 cores in the master. I just wanted to know the
reasoning behind the design. 

Thanks in advance 

Tirthankar
On Apr 12, 2012, at 3:19 AM, Jan Høydahl wrote:

> What operating system?
> Are you using spellchecker with buildOnCommit?
> Anything special in your Update Chain?
> 
> --
> Jan Høydahl, search solution architect Cominvent AS - 
> www.cominvent.com Solr Training - www.solrtraining.com
> 
> On 12. apr. 2012, at 06:45, Rohit wrote:
> 
>> We recently migrated from solr3.1 to solr3.5, we have one master and 
>> one slave configured. The master has two cores,
>> 
>> 1) Core1 - 44555972 documents
>> 
>> 2) Core2 - 29419244 documents
>> 
>> We commit every 5000 documents, but lately the commit time gradually 
>> increase and solr is taking as very long 15 minutes plus in some 
>> cases. What could have caused this, I have checked the logs and the 
>> only warning i can see is,
>> 
>> "WARNING: Use of deprecated update request parameter update.processor 
>> detected. Please use the new parameter update.chain instead, as 
>> support for update.processor will be removed in a later version."
>> 
>> Memory details:
>> 
>> export JAVA_OPTS="$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g"
>> 
>> Solr Config:
>> 
>> false
>> 
>> 10
>> 
>> 32
>> 
>> 
>> 
>> 1
>> 
>> 1000
>> 
>> 1
>> 
>> Also noticed, that top command show almost 350GB of Virtual memory usage.
>> 
>> What could be causing this, as everything was running fine a few days
back?
>> 
>> 
>> 
>> 
>> 
>> Regards,
>> 
>> Rohit
>> 
>> Mobile: +91-9901768202
>> 
>> About Me:   http://about.me/rohitg
>> 
>> 
>> 
> 





RE: Solr 3.5 takes very long to commit gradually

2012-04-12 Thread Rohit
Operating system is Linux (Ubuntu).
No, not using the spellchecker.
Only language detection in my update chain.

Regards,
Rohit
Mobile: +91-9901768202
About Me: http://about.me/rohitg
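
For reference, a hedged sketch of a language-detection update chain and of the parameter rename mentioned in the warning (the chain name and field names are assumptions; Solr 3.5 ships Tika- and LangDetect-based identifiers in the langid contrib):

  <updateRequestProcessorChain name="langdetect">
    <!-- detects the language of the listed fields and writes it to langid.langField -->
    <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <str name="langid.fl">text</str>
      <str name="langid.langField">language</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

Requests would then select it with update.chain=langdetect instead of the deprecated update.processor=langdetect.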


-Original Message-
From: Jan Høydahl [mailto:jan@cominvent.com] 
Sent: 12 April 2012 12:50
To: solr-user@lucene.apache.org
Subject: Re: Solr 3.5 takes very long to commit gradually

What operating system?
Are you using spellchecker with buildOnCommit?
Anything special in your Update Chain?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 12. apr. 2012, at 06:45, Rohit wrote:

> We recently migrated from solr3.1 to solr3.5, we have one master and 
> one slave configured. The master has two cores,
> 
> 1) Core1 - 44555972 documents
> 
> 2) Core2 - 29419244 documents
> 
> We commit every 5000 documents, but lately the commit time gradually 
> increase and solr is taking as very long 15 minutes plus in some 
> cases. What could have caused this, I have checked the logs and the 
> only warning i can see is,
> 
> "WARNING: Use of deprecated update request parameter update.processor 
> detected. Please use the new parameter update.chain instead, as 
> support for update.processor will be removed in a later version."
> 
> Memory details:
> 
> export JAVA_OPTS="$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g"
> 
> Solr Config:
> 
> false
> 
> 10
> 
> 32
> 
> 
> 
> 1
> 
> 1000
> 
> 1
> 
> Also noticed, that top command show almost 350GB of Virtual memory usage.
> 
> What could be causing this, as everything was running fine a few days
back?
> 
> 
> 
> 
> 
> Regards,
> 
> Rohit
> 
> Mobile: +91-9901768202
> 
> About Me:   http://about.me/rohitg
> 
> 
> 




Re: Solr 3.5 takes very long to commit gradually

2012-04-12 Thread Tirthankar Chatterjee
Thanks, Rohit, for the information.
On Apr 12, 2012, at 4:08 AM, Rohit wrote:

> Hi Tirthankar,
> 
> The average size of documents would be a few Kb's this is mostly tweets
> which are being saved. The two cores are storing different kind of data and
> nothing else.
> 
> Regards,
> Rohit
> Mobile: +91-9901768202
> About Me: http://about.me/rohitg
> 
> -Original Message-
> From: Tirthankar Chatterjee [mailto:tchatter...@commvault.com] 
> Sent: 12 April 2012 13:14
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 3.5 takes very long to commit gradually
> 
> Hi Rohit,
> What would be the average size of your documents and also can you please
> share your idea of having 2 cores in the master. I just wanted to know the
> reasoning behind the design. 
> 
> Thanks in advance 
> 
> Tirthankar
> On Apr 12, 2012, at 3:19 AM, Jan Høydahl wrote:
> 
>> What operating system?
>> Are you using spellchecker with buildOnCommit?
>> Anything special in your Update Chain?
>> 
>> --
>> Jan Høydahl, search solution architect Cominvent AS - 
>> www.cominvent.com Solr Training - www.solrtraining.com
>> 
>> On 12. apr. 2012, at 06:45, Rohit wrote:
>> 
>>> We recently migrated from solr3.1 to solr3.5, we have one master and 
>>> one slave configured. The master has two cores,
>>> 
>>> 1) Core1 - 44555972 documents
>>> 
>>> 2) Core2 - 29419244 documents
>>> 
>>> We commit every 5000 documents, but lately the commit time gradually 
>>> increase and solr is taking as very long 15 minutes plus in some 
>>> cases. What could have caused this, I have checked the logs and the 
>>> only warning i can see is,
>>> 
>>> "WARNING: Use of deprecated update request parameter update.processor 
>>> detected. Please use the new parameter update.chain instead, as 
>>> support for update.processor will be removed in a later version."
>>> 
>>> Memory details:
>>> 
>>> export JAVA_OPTS="$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g"
>>> 
>>> Solr Config:
>>> 
>>> false
>>> 
>>> 10
>>> 
>>> 32
>>> 
>>> 
>>> 
>>> 1
>>> 
>>> 1000
>>> 
>>> 1
>>> 
>>> Also noticed, that top command show almost 350GB of Virtual memory usage.
>>> 
>>> What could be causing this, as everything was running fine a few days
> back?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Regards,
>>> 
>>> Rohit
>>> 
>>> Mobile: +91-9901768202
>>> 
>>> About Me:   http://about.me/rohitg
>>> 
>>> 
>>> 
>> 
> 
> 
> 



Problem to integrate Solr in Jetty (the first example in the Apache Solr 3.1 Cookbook)

2012-04-12 Thread Bastian Hepp
Hi,

I'm using Apache Solr 3.5.0 and Jetty 8.1.2 with Windows 7. (The versions used
in the book are Solr 3.1 and Jetty 6.1.26.)

I've tried to get Solr running with Jetty:
- I copied the jetty.xml and the webdefault.xml from the Solr example.
- I copied the solr.war to webapps.
- I copied the solr directory from the example dir to the jetty dir.

When I try to start I get this error message:

C:\jetty-solr>java -jar start.jar
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.eclipse.jetty.start.Main.invokeMain(Main.java:457)
at org.eclipse.jetty.start.Main.start(Main.java:602)
at org.eclipse.jetty.start.Main.main(Main.java:82)
Caused by: java.lang.ClassNotFoundException: org.mortbay.jetty.Server
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at org.eclipse.jetty.util.Loader.loadClass(Loader.java:92)
at
org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.nodeClass(XmlConfiguration.java:349)
at
org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.configure(XmlConfiguration.java:327)
at
org.eclipse.jetty.xml.XmlConfiguration.configure(XmlConfiguration.java:291)
at
org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1203)
at java.security.AccessController.doPrivileged(Native Method)
at
org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1138)
... 7 more
Usage: java -jar start.jar [options] [properties] [configs]
   java -jar start.jar --help  # for more information

Thanks for your help,
Bastian


Re: Facets involving multiple fields

2012-04-12 Thread Marc SCHNEIDER
Hi,

Thanks for your answer.
Let's say I have two fields: 'keywords' and 'short_title'.
For these fields I'd like to make a faceted search: if 'Computer' is
stored in at least one of these fields for a document, I'd like to get
it added to my results.
doc1 => keywords : 'Computer' / short_title : 'Computer'
doc2 => keywords : 'Computer'
doc3 => short_title : 'Computer'

In this case I'd like to have : Computer (3)

I don't see how to solve this with facet.query.

Thanks,
Marc.

On Wed, Apr 11, 2012 at 5:13 PM, Erick Erickson  wrote:
> Have you considered facet.query? You can specify an arbitrary query
> to facet on which might do what you want. Otherwise, I'm not sure what
> you mean by "faceted search using two fields". How should these fields
> be combined into a single facet? What that means practically is not at
> all obvious from your problem statement.
>
> Best
> Erick
>
> On Tue, Apr 10, 2012 at 8:55 AM, Marc SCHNEIDER
>  wrote:
>> Hi,
>>
>> I'd like to make a faceted search using two fields. I want to have a
>> single result and not a result by field (like when using
>> facet.field=f1,facet.field=f2).
>> I don't want to use a copy field either because I want it to be
>> dynamic at search time.
>> As far as I know this is not possible for Solr 3.x...
>> But I saw a new parameter named "group.facet" for Solr4. Could that
>> solve my problem? If yes could somebody give me an example?
>>
>> Thanks,
>> Marc.


Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
Given an input of "Windjacke" (probably "wind jacket" in English), I'd
like the code that prepares the data for the index (tokenizer etc) to
understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
would include the "Windjacke" document in its result set.

It appears to me that such an analysis requires a dictionary-backed
approach, which doesn't have to be perfect at all; a list of the most
common 2000 words would probably do the job and fulfil a criterion of
reasonable usefulness.

Do you know of any implementation techniques or working implementations
to do this kind of lexical analysis for German language data? (Or other
languages, for that matter?) What are they, where can I find them?

I'm sure there is something out there (commercial or free) because I've seen
lots of engines grokking German and the way it builds words.

Failing that, what are the proper terms to refer to these techniques, so
you can search more successfully?

Michael


Re: Large Index and OutOfMemoryError: Map failed

2012-04-12 Thread Michael McCandless
Your largest index has 66 segments (690 files) ... biggish but not
insane.  With 64K maps you should be able to have ~47 searchers open
on each core.

Enabling compound file format (not the opposite!) will mean fewer maps
... ie should improve this situation.
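
A hedged sketch of what enabling it looks like in solrconfig.xml (in 3.x/4.0-era configs the flag sits inside the indexDefaults or mainIndex section; exact placement varies by version):

  <useCompoundFile>true</useCompoundFile>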

I don't understand why Solr defaults to compound file off... that
seems dangerous.

Really we need a Solr dev here... to answer "how long is a stale
searcher kept open".  Is it somehow possible 46 old searchers are
being left open...?

I don't see any other reason why you'd run out of maps.  Hmm, unless
MMapDirectory didn't think it could safely invoke unmap in your JVM.
Which exact JVM are you using?  If you can print the
MMapDirectory.UNMAP_SUPPORTED constant, we'd know for sure.

Yes, switching away from MMapDir will sidestep the "too many maps"
issue, however, 1) MMapDir has better perf than NIOFSDir, and 2) if
there really is a leak here (Solr not closing the old searchers or a
Lucene bug or something...) then you'll eventually run out of file
descriptors (ie, same  problem, different manifestation).
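
A minimal sketch of printing that constant (assumes the Lucene core jar is on the classpath; the class name here is arbitrary):

  import org.apache.lucene.store.MMapDirectory;

  public class CheckUnmap {
      public static void main(String[] args) {
          // true means MMapDirectory can unmap mapped buffers when an input is closed
          System.out.println("UNMAP_SUPPORTED = " + MMapDirectory.UNMAP_SUPPORTED);
      }
  }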

Mike McCandless

http://blog.mikemccandless.com

2012/4/11 Gopal Patwa :
>
> I have not change the mergefactor, it was 10. Compound index file is disable
> in my config but I read from below post, that some one had similar issue and
> it was resolved by switching from compound index file format to non-compound
> index file.
>
> and some folks resolved by "changing lucene code to disable MMapDirectory."
> Is this best practice to do, if so is this can be done in configuration?
>
> http://lucene.472066.n3.nabble.com/MMapDirectory-failed-to-map-a-23G-compound-index-segment-td3317208.html
>
> I have index document of core1 = 5 million, core2=8million and
> core3=3million and all index are hosted in single Solr instance
>
> I am going to use Solr for our site StubHub.com, see attached "ls -l" list
> of index files for all core
>
> SolrConfig.xml:
>
>
>   
>   false
>   10
>   2147483647
>   1
>   4096
>   10
>   1000
>   1
>   single
>   
>   
> 0.0
> 10.0
>   
>
>   
> false
> 0
>   
>   
>   
>
>
>   
>   1000
>
>  90
>  false
>
>
>  ${inventory.solr.softcommit.duration:1000}
>
>   
>   
>
>
> Forwarded conversation
> Subject: Large Index and OutOfMemoryError: Map failed
> 
>
> From: Gopal Patwa 
> Date: Fri, Mar 30, 2012 at 10:26 PM
> To: solr-user@lucene.apache.org
>
>
> I need help!!
>
>
>
>
>
> I am using Solr 4.0 nightly build with NRT and I often get this error during
> auto commit "java.lang.OutOfMemoryError: Map failed". I have search this
> forum and what I found it is related to OS ulimit setting, please se below
> my ulimit settings. I am not sure what ulimit setting I should have? and we
> also get "java.net.SocketException: Too many open files" NOT sure how many
> open file we need to set?
>
>
> I have 3 core with index size : core1 - 70GB, Core2 - 50GB and Core3 - 15GB,
> with Single shard
>
>
> We update the index every 5 seconds, soft commit every 1 second and hard
> commit every 15 minutes
>
>
> Environment: Jboss 4.2, JDK 1.6 , CentOS, JVM Heap Size = 24GB
>
>
> ulimit:
>
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 401408
> max locked memory       (kbytes, -l) 1024
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 10240
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 401408
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
>
>
> ERROR:
>
>
>
>
>
> 2012-03-29 15:14:08,560 [] priority=ERROR app_name= thread=pool-3-thread-1
> location=CommitTracker line=93 auto commit error...:java.io.IOException: Map
> failed
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:748)
>   at
> org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(MMapDirectory.java:293)
>   at 
> org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:221)
>   at
> org.apache.lucene.codecs.lucene40.Lucene40PostingsReader.<init>(Lucene40PostingsReader.java:58)
>   at
> org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat.fieldsProducer(Lucene40PostingsFormat.java:80)
>   at
> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader$1.visitOneFormat(PerFieldPostingsFormat.java:189)
>   at
> org.apac

codecs for sorted indexes

2012-04-12 Thread Carlos Gonzalez-Cadenas
Hello,

We're using a sorted index in order to implement early termination
efficiently over an index of hundreds of millions of documents. As of now,
we're using the default codecs coming with Lucene 4, but we believe that
due to the fact that the docids are sorted, we should be able to do much
better in terms of storage and achieve much better performance, especially
decompression performance.

In particular, Robert Muir is commenting on these lines here:

https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411

We're aware that the in the bulkpostings branch there are different codecs
being implemented and different experiments being done. We don't know
whether we should implement our own codec (i.e. using some RLE-like
techniques) or we should use one of the codecs implemented there (PFOR,
Simple64, ...).

Can you please give us some advice on this?

Thanks
Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
> Given an input of "Windjacke" (probably "wind jacket" in English),
> I'd like the code that prepares the data for the index (tokenizer
> etc) to understand that this is a "Jacke" ("jacket") so that a
> query for "Jacke" would include the "Windjacke" document in its
> result set.
> 
> It appears to me that such an analysis requires a dictionary-
> backed approach, which doesn't have to be perfect at all; a list
> of the most common 2000 words would probably do the job and fulfil
> a criterion of reasonable usefulness.

A simple approach would obviously be a word list and a regular
expression. There will, however, be nuts and bolts to take care of.
A more sophisticated and tested approach might be known to you.

Michael


Re: Lexical analysis tools for German language data

2012-04-12 Thread Paul Libbrecht

Michael,

I've been on this list and the lucene list for several years and have not found
this yet. It's been one of the "neglected topics", to my taste.

There is a CompoundAnalyzer but it requires the compounds to be dictionary 
based, as you indicate.

I am convinced there's a way to build the de-compounding word list efficiently from
a broad corpus, but I have never seen it (and the experts at DFKI I asked
also told me they didn't know of one).

paul

Le 12 avr. 2012 à 11:52, Michael Ludwig a écrit :

> Given an input of "Windjacke" (probably "wind jacket" in English), I'd
> like the code that prepares the data for the index (tokenizer etc) to
> understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
> would include the "Windjacke" document in its result set.
> 
> It appears to me that such an analysis requires a dictionary-backed
> approach, which doesn't have to be perfect at all; a list of the most
> common 2000 words would probably do the job and fulfil a criterion of
> reasonable usefulness.
> 
> Do you know of any implementation techniques or working implementations
> to do this kind of lexical analysis for German language data? (Or other
> languages, for that matter?) What are they, where can I find them?
> 
> I'm sure there is something out (commercial or free) because I've seen
> lots of engines grokking German and the way it builds words.
> 
> Failing that, what are the proper terms do refer to these techniques so
> you can search more successfully?
> 
> Michael



Re: Lexical analysis tools for German language data

2012-04-12 Thread Bernd Fehling

You might have a look at:
http://www.basistech.com/lucene/


Am 12.04.2012 11:52, schrieb Michael Ludwig:
> Given an input of "Windjacke" (probably "wind jacket" in English), I'd
> like the code that prepares the data for the index (tokenizer etc) to
> understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
> would include the "Windjacke" document in its result set.
> 
> It appears to me that such an analysis requires a dictionary-backed
> approach, which doesn't have to be perfect at all; a list of the most
> common 2000 words would probably do the job and fulfil a criterion of
> reasonable usefulness.
> 
> Do you know of any implementation techniques or working implementations
> to do this kind of lexical analysis for German language data? (Or other
> languages, for that matter?) What are they, where can I find them?
> 
> I'm sure there is something out (commercial or free) because I've seen
> lots of engines grokking German and the way it builds words.
> 
> Failing that, what are the proper terms do refer to these techniques so
> you can search more successfully?
> 
> Michael


Re: EmbeddedSolrServer and StreamingUpdateSolrServer

2012-04-12 Thread pcrao
Hi Mikhail Khludnev,

Thank you for the reply.
I think the index is getting corrupted because StreamingUpdateSolrServer is
keeping references to some index files that are being deleted by EmbeddedSolrServer
during the commit/optimize process.
As a result, when I do a full index using EmbeddedSolrServer and then an
incremental index using StreamingUpdateSolrServer, it fails with a
FileNotFoundException.
A special note: we don't optimize the index after incremental
indexing (StreamingUpdateSolrServer), but we do optimize it after the full
index (EmbeddedSolrServer). Please see the log below and let me know
if you need further information.
---
Mar 29, 2012 12:05:03 AM org.apache.solr.update.processor.LogUpdateProcessor
finish 
INFO: {add=[035405]} 0 28
Mar 29, 2012 12:05:03 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract
params={stream.type=text/html&literal.stream_source_info=/snps/docs/customer/q_and_a/html/035405.html&literal.stream_name=035405.html&wt=javabin&collectionName=docs&version=2}
status=0 QTime=28
Mar 29, 2012 12:05:03 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start
commit(optimize=false,waitSearcher=true,expungeDeletes=false,softCommit=false)
Mar 29, 2012 12:05:03 AM org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {commit=} 0 10
Mar 29, 2012 12:05:03 AM org.apache.solr.common.SolrException log
SEVERE: java.io.FileNotFoundException:
/opt/solr/home/data/docs_index/index/_3d.cfs (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
at
org.apache.lucene.store.MMapDirectory.createSlicer(MMapDirectory.java:229)
at
org.apache.lucene.store.CompoundFileDirectory.<init>(CompoundFileDirectory.java:65)
at
org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:82)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:112)
at
org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:700)
at
org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:263)
at
org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:2852)
at
org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:2843)
at
org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2616)
at
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2731)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2719)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2703)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:325)
at
org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:84)
at
org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:154)
at
org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:107)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:52)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1477)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
-

Thanks,
PC Rao.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/EmbeddedSolrServer-and-StreamingUpdateSolrServer-tp3889073p3905071.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Lexical analysis tools for German language data

2012-04-12 Thread Valeriy Felberg
If you want that query "jacke" matches a document containing the word
"windjacke" or "kinderjacke", you could use a custom update processor.
This processor could search the indexed text for words matching the
pattern ".*jacke" and inject the word "jacke" into an additional field
which you can search against. You would need a whole list of possible
suffixes, of course. It would slow down the update process but you
don't need to split words during search.

Best,
Valeriy
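
A hedged sketch of how such a processor would be wired into solrconfig.xml (the factory class and field names are hypothetical; the factory itself would implement the suffix matching described above):

  <updateRequestProcessorChain name="decompound">
    <!-- hypothetical processor: for tokens ending in a known suffix such as "jacke",
         injects the base word into a separate searchable field -->
    <processor class="com.example.SuffixInjectUpdateProcessorFactory">
      <str name="sourceField">product_text</str>
      <str name="targetField">decompound_terms</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>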

On Thu, Apr 12, 2012 at 12:39 PM, Paul Libbrecht  wrote:
>
> Michael,
>
> I'm on this list and the lucene list since several years and have not found 
> this yet.
> It's been one "neglected topics" to my taste.
>
> There is a CompoundAnalyzer but it requires the compounds to be dictionary 
> based, as you indicate.
>
> I am convinced there's a way to build the de-compounding words efficiently 
> from a broad corpus but I have never seen it (and the experts at DFKI I asked 
> for for also told me they didn't know of one).
>
> paul
>
> Le 12 avr. 2012 à 11:52, Michael Ludwig a écrit :
>
>> Given an input of "Windjacke" (probably "wind jacket" in English), I'd
>> like the code that prepares the data for the index (tokenizer etc) to
>> understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
>> would include the "Windjacke" document in its result set.
>>
>> It appears to me that such an analysis requires a dictionary-backed
>> approach, which doesn't have to be perfect at all; a list of the most
>> common 2000 words would probably do the job and fulfil a criterion of
>> reasonable usefulness.
>>
>> Do you know of any implementation techniques or working implementations
>> to do this kind of lexical analysis for German language data? (Or other
>> languages, for that matter?) What are they, where can I find them?
>>
>> I'm sure there is something out (commercial or free) because I've seen
>> lots of engines grokking German and the way it builds words.
>>
>> Failing that, what are the proper terms do refer to these techniques so
>> you can search more successfully?
>>
>> Michael
>


Re: Lexical analysis tools for German language data

2012-04-12 Thread Paul Libbrecht
Bernd,

can you please say a little more?
I think this list is ok to contain some description for commercial solutions 
that satisfy a request formulated on list.

Is there any product at BASIS Tech that provides a compound-analyzer with a big
dictionary of decomposed compounds in German? If yes, for which domain? The
Google search results (I wonder if it is politically correct to not have yours
;-)) show me that there's an amount of work done in this direction (e.g. Gärten
to match Garten), but being precise on this question would be more helpful!

paul


Le 12 avr. 2012 à 12:46, Bernd Fehling a écrit :

> 
> You might have a look at:
> http://www.basistech.com/lucene/
> 
> 
> Am 12.04.2012 11:52, schrieb Michael Ludwig:
>> Given an input of "Windjacke" (probably "wind jacket" in English), I'd
>> like the code that prepares the data for the index (tokenizer etc) to
>> understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
>> would include the "Windjacke" document in its result set.
>> 
>> It appears to me that such an analysis requires a dictionary-backed
>> approach, which doesn't have to be perfect at all; a list of the most
>> common 2000 words would probably do the job and fulfil a criterion of
>> reasonable usefulness.
>> 
>> Do you know of any implementation techniques or working implementations
>> to do this kind of lexical analysis for German language data? (Or other
>> languages, for that matter?) What are they, where can I find them?
>> 
>> I'm sure there is something out (commercial or free) because I've seen
>> lots of engines grokking German and the way it builds words.
>> 
>> Failing that, what are the proper terms do refer to these techniques so
>> you can search more successfully?
>> 
>> Michael



Solr Scoring

2012-04-12 Thread Kissue Kissue
Hi,

I have a field in my index called itemDesc to which I am applying
EnglishMinimalStemFilterFactory. So if I index a value to this field
containing "Edges", the EnglishMinimalStemFilterFactory applies stemming
and "Edges" becomes "Edge". Now when I search for "Edges", documents with
"Edge" score better than documents with the actual search word, "Edges".
Is there a way I can make documents containing the actual search word, in this
case "Edges", score better than documents with "Edge"?

I am using Solr 3.5. My field definition is shown below:


  

   
 


  
  







  


Thanks.


two structures in solr

2012-04-12 Thread tkoomzaaskz
Hi all,

I'm a solr newbie, so sorry if I do anything wrong ;)

I want to use SOLR not only for fast text search, but mainly to create a
very fast search engine for a high-traffic system (MySQL would not do the
job if the db grows too big).

I need to store *two big structures* in SOLR: projects and contractors.
Contractors will search for available projects and project owners will
search for contractors who would do it for them.

So far, I have found a solr tutorial for newbies
http://www.solrtutorial.com, where I found the schema file which defines the
data structure: http://www.solrtutorial.com/schema-xml.html. But my case is
that *I want to have two structures*. I guess running two parallel solr
instances is not the idea. I took a look at
http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/conf/schema.xml?view=markup
and I can see that the schema goes like:



  
...
  

  
  
  
  
  ...



But still, this is a single structure. And I need 2.

Many thanks in advance for any help. There are not many tutorials for Solr
on the web.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/two-structures-in-solr-tp3905143p3905143.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Question about solr.WordDelimiterFilterFactory

2012-04-12 Thread Erick Erickson
WordDelimiterFilterFactory will _almost_ do what you want
by setting things like catenateWords=0 and catenateNumbers=1,
_except_ that the punctuation will be removed. So
12.34 -> 1234
ab,cd -> ab cd

is that "close enough"?

Otherwise, writing a simple Filter is probably the way to go.

Best
Erick
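
A hedged sketch of an analyzer along those lines (the field type name is arbitrary): WordDelimiterFilterFactory splits on the punctuation, catenateNumbers="1" additionally glues adjacent number parts back together (12.34 -> 12, 34, 1234), and catenateWords="0" leaves ab,cd as separate tokens:

  <fieldType name="text_num" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- split on punctuation; re-join number parts only -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1"
              catenateWords="0" catenateNumbers="1" catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>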

On Wed, Apr 11, 2012 at 1:59 PM, Jian Xu  wrote:
> Hello,
>
> I am new to solr/lucene. I am tasked to index a large number of documents. 
> Some of these documents contain decimal points. I am looking for a way to 
> index these documents so that adjacent numeric characters (such as [0-9.,]) 
> are treated as single token. For example,
>
> 12.34 => "12.34"
> 12,345 => "12,345"
>
> However, "," and "." should be treated as usual when around non-digital 
> characters. For example,
>
> ab,cd => "ab" "cd".
>
> It is so that searching for "12.34" will match "12.34" not "12 34". Searching 
> for "ab.cd" should match both "ab.cd" and "ab cd".
>
> After doing some research on solr, It seems that there is a build-in analyzer 
> called solr.WordDelimiterFilter that supports a "types" attribute which map 
> special characters as different delimiters.  However, it isn't exactly what I 
> want. It doesn't provide context check such as "," or "." must surround by 
> digital characters, etc.
>
> Does anyone have any experience configuring solr to meet this requirements?  
> Is writing my own plugin necessary for this simple thing?
>
> Thanks in advance!
>
> -Jian


Dismax request handler differences Between Solr Version 3.5 and 1.4

2012-04-12 Thread mechravi25
Hi,

We are currently using solr (version 1.4.0.2010.01.13.08.09.44). we have a
strange situation in dismax request handler. when we search for a keyword
and append qt=dismax, we are not getting the any results. The solr request
is as follows: 
http://local:8983/solr/core2/select/?q=Bank&version=2.2&start=0&rows=10&indent=on&defType=dismax&debugQuery=on

The Response is as follows : 

  
  rawquerystring: Bank
  querystring: Bank
  parsedquery: +() ()
  parsedquery_toString: +() ()
  QParser: DisMaxQParser
  (the surrounding XML element names were stripped by the list archive;
  all of the debug timing entries are 0.0)


We are currently testing Solr version 3.5, and the same query is working fine
in that version.

Also, the query alternative params are not working properly in Solr 1.4 when
compared with version 3.5. The request seems to be the same, but I don't know
where the issue comes from. Please help me out. Thanks in advance.

Regards,
Sivaganesh


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dismax-request-handler-differences-Between-Solr-Version-3-5-and-1-4-tp3905192p3905192.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Facets involving multiple fields

2012-04-12 Thread Erick Erickson
facet.query=keywords:computer short_title:computer
seems like what you're asking for.
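
A hedged sketch of a request along those lines (field names taken from Marc's example; spaces would need URL encoding in practice):

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.query=keywords:Computer OR short_title:Computer

The count comes back as a single entry under facet_counts/facet_queries, so doc1, doc2 and doc3 above would each be counted once, giving 3.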

On Thu, Apr 12, 2012 at 3:19 AM, Marc SCHNEIDER
 wrote:
> Hi,
>
> Thanks for your answer.
> Let's say I have to fields : 'keywords' and 'short_title'.
> For these fields I'd like to make a faceted search : if 'Computer' is
> stored in at least one of these fields for a document I'd like to get
> it added in my results.
> doc1 => keywords : 'Computer' / short_title : 'Computer'
> doc2 => keywords : 'Computer'
> doc3 => short_title : 'Computer'
>
> In this case I'd like to have : Computer (3)
>
> I don't see how to solve this with facet.query.
>
> Thanks,
> Marc.
>
> On Wed, Apr 11, 2012 at 5:13 PM, Erick Erickson  
> wrote:
>> Have you considered facet.query? You can specify an arbitrary query
>> to facet on which might do what you want. Otherwise, I'm not sure what
>> you mean by "faceted search using two fields". How should these fields
>> be combined into a single facet? What that means practically is not at
>> all obvious from your problem statement.
>>
>> Best
>> Erick
>>
>> On Tue, Apr 10, 2012 at 8:55 AM, Marc SCHNEIDER
>>  wrote:
>>> Hi,
>>>
>>> I'd like to make a faceted search using two fields. I want to have a
>>> single result and not a result by field (like when using
>>> facet.field=f1,facet.field=f2).
>>> I don't want to use a copy field either because I want it to be
>>> dynamic at search time.
>>> As far as I know this is not possible for Solr 3.x...
>>> But I saw a new parameter named "group.facet" for Solr4. Could that
>>> solve my problem? If yes could somebody give me an example?
>>>
>>> Thanks,
>>> Marc.


Re: Lexical analysis tools for German language data

2012-04-12 Thread Bernd Fehling
Paul,

Nearly two years ago I requested an evaluation license and tested BASIS Tech
Rosette for Lucene & Solr. It worked excellently, but the price was much, much
too high.

Yes, they also have compound analysis for several languages, including German.
Just configure your pipeline in Solr and set up the processing pipeline in
Rosette Language Processing (RLP), and that's it.

Example from my very old schema.xml config:


   
 
 
 
   
   
 
 
 
  


So you just point the tokenizer to RLP and have two RLP pipelines configured,
one for indexing (rlp-index-context.xml) and one for querying
(rlp-query-context.xml).

Example from my rlp-index-context.xml config:


  


  
  
Unicode Converter
Language Identifier
Encoding and Character Normalizer
European Language Analyzer

Stopword Locator
Base Noun Phrase Locator

Exact Match Entity Extractor
Pattern Match Entity Extractor
Entity Redactor
REXML Writer
  


As you can see I used the "European Language Analyzer".

Bernd



Am 12.04.2012 12:58, schrieb Paul Libbrecht:
> Bernd,
> 
> can you please say a little more?
> I think this list is ok to contain some description for commercial solutions 
> that satisfy a request formulated on list.
> 
> Is there any product at BASIS Tech that provides a compound-analyzer with a 
> big dictionary of decomposed compounds in German? 
> If yes, for which domain? 
> The Google Search result (I wonder if this is politically correct to not have 
> yours ;-)) shows me that there's an amount 
> of job done in this direction (e.g. Gärten to match Garten) but being precise 
> for this question would be more helpful!
> 
> paul
> 
> 


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-12 Thread Darren Govoni
You could use SolrCloud (for the automatic scaling) and just mount a
fuse[1] HDFS directory and configure solr to use that directory for its
data. 

[1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS
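
A hedged sketch of the idea (mount point and NameNode address are placeholders; hadoop-fuse-dfs is part of the CDH package linked above):

  # mount HDFS via FUSE so it looks like an ordinary local directory
  hadoop-fuse-dfs dfs://namenode-host:8020 /mnt/hdfs

and then point the core's index at the mounted path in solrconfig.xml:

  <dataDir>/mnt/hdfs/solr/core1/data</dataDir>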

On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:
> Hi,
> 
> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
> crawled + indexed every *4 weeks, *with a search latency of less than 0.5
> seconds.
> 
> Needless to mention, the search index needs to scale to 5Billion pages. It
> is also possible that I might need to store multiple indexes -- one for
> crawled content, and one for ancillary data that is also very large. Each
> of these indices would likely require a logically distributed and
> replicated index.
> 
> However, I would like for such a system to be homogenous with the Hadoop
> infrastructure that is already installed on the cluster (for the crawl). In
> other words, I would much prefer if the replication and distribution of the
> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
> using another scalability framework (such as SolrCloud). In addition, it
> would be ideal if this environment was flexible enough to be dynamically
> scaled based on the size requirements of the index and the search traffic
> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
> enough to automatically provision additional processing power into the
> cluster without requiring server re-starts).
> 
> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
> mature enough and would be the right architectural choice to go along with
> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
> above.
> 
> Lastly, how much hardware (assuming a medium sized EC2 instance) would you
> estimate my needing with this setup, for regular web-data (HTML text) at
> this scale?
> 
> Any architectural guidance would be greatly appreciated. The more details
> provided, the wider my grin :).
> 
> Many many thanks in advance.
> 
> Thanks,
> Safdar




is there a downside to combining search fields with copyfield?

2012-04-12 Thread geeky2
hello everyone,

can people give me their thoughts on this.

currently, my schema has individual fields to search on.

are there advantages or disadvantages to taking several of the individual
search fields and combining them into a single search field?

would this affect search times, term tokenization or possibly other things?

example of individual fields

brand
category
partno

example of a single combined search field

part_info (would combine brand, category and partno)

thank you for any feedback
mark
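
A hedged sketch of the copyField approach in schema.xml (field names from the example above; types and attributes are assumptions):

  <field name="brand"     type="text_general" indexed="true" stored="true"/>
  <field name="category"  type="text_general" indexed="true" stored="true"/>
  <field name="partno"    type="string"       indexed="true" stored="true"/>
  <!-- combined search field, populated at index time; indexed but not stored -->
  <field name="part_info" type="text_general" indexed="true" stored="false" multiValued="true"/>

  <copyField source="brand"    dest="part_info"/>
  <copyField source="category" dest="part_info"/>
  <copyField source="partno"   dest="part_info"/>

The combined field simplifies queries to a single field, at the cost of some extra index size and of losing per-field boosting within that one field.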





--
View this message in context: 
http://lucene.472066.n3.nabble.com/is-there-a-downside-to-combining-search-fields-with-copyfield-tp3905349p3905349.html
Sent from the Solr - User mailing list archive at Nabble.com.


AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
> Von: Valeriy Felberg

> If you want that query "jacke" matches a document containing the word
> "windjacke" or "kinderjacke", you could use a custom update processor.
> This processor could search the indexed text for words matching the
> pattern ".*jacke" and inject the word "jacke" into an additional field
> which you can search against. You would need a whole list of possible
> suffixes, of course.

Merci, Valeriy - I agree on the feasibility of such an approach. The
list would likely have to be composed of the most frequently used terms
for your specific domain.

In our case, it's things people would buy in shops. Reducing overly
complicated and convoluted product descriptions to proper basic terms -
that would do the job. It's like going to a restaurant boasting fancy
and unintelligible names for the dishes you may order when they are
really just ordinary stuff like pork and potatoes.

Thinking some more about it, giving sufficient boost to the attached
category data might also do the job. That would shift the burden of
supplying proper semantics to the guys doing the categorization.

> It would slow down the update process but you don't need to split
> words during search.

> > Le 12 avr. 2012 à 11:52, Michael Ludwig a écrit :
> >
> >> Given an input of "Windjacke" (probably "wind jacket" in English),
> >> I'd like the code that prepares the data for the index (tokenizer
> >> etc) to understand that this is a "Jacke" ("jacket") so that a
> >> query for "Jacke" would include the "Windjacke" document in its
> >> result set.

A query for "Windjacke" or "Kinderjacke" would probably not have to be
de-specialized to "Jacke" because, well, that's the user input and users
looking for specific things are probably doing so for a reason. If no
matches are found you can still tell them to just broaden their search.

Michael


Re: Lexical analysis tools for German language data

2012-04-12 Thread Markus Jelsma
Hi,

We've done a lot of tests with the HyphenationCompoundWordTokenFilter, using a
FOP XML hyphenation file generated from TeX for the Dutch language, and have seen
decent results. A bonus is that some tokens can now be stemmed properly, because
not all compounds are listed in the dictionary for the HunspellStemFilter.

It does introduce a recall/precision problem but it at least returns results 
for those many users that do not properly use compounds in their search query.

There seems to be a small issue with the filter where minSubwordSize=N yields
subwords of size N-1.
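
For reference, a hedged sketch of that filter in an index-time analyzer (the hyphenation grammar and optional word list are placeholders; the setup described above uses a Dutch FOP/TeX hyphenation file, a German one would be analogous):

  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
          hyphenator="hyph_de.xml" dictionary="compound-parts_de.txt"
          minWordSize="5" minSubwordSize="3" maxSubwordSize="15"
          onlyLongestMatch="false"/>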

Cheers,

On Thursday 12 April 2012 12:39:44 Paul Libbrecht wrote:
> Michael,
> 
> I'm on this list and the lucene list since several years and have not found
> this yet. It's been one "neglected topics" to my taste.
> 
> There is a CompoundAnalyzer but it requires the compounds to be dictionary
> based, as you indicate.
> 
> I am convinced there's a way to build the de-compounding words efficiently
> from a broad corpus but I have never seen it (and the experts at DFKI I
> asked for for also told me they didn't know of one).
> 
> paul
> 
> Le 12 avr. 2012 à 11:52, Michael Ludwig a écrit :
> > Given an input of "Windjacke" (probably "wind jacket" in English), I'd
> > like the code that prepares the data for the index (tokenizer etc) to
> > understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
> > would include the "Windjacke" document in its result set.
> > 
> > It appears to me that such an analysis requires a dictionary-backed
> > approach, which doesn't have to be perfect at all; a list of the most
> > common 2000 words would probably do the job and fulfil a criterion of
> > reasonable usefulness.
> > 
> > Do you know of any implementation techniques or working implementations
> > to do this kind of lexical analysis for German language data? (Or other
> > languages, for that matter?) What are they, where can I find them?
> > 
> > I'm sure there is something out (commercial or free) because I've seen
> > lots of engines grokking German and the way it builds words.
> > 
> > Failing that, what are the proper terms do refer to these techniques so
> > you can search more successfully?
> > 
> > Michael

-- 
Markus Jelsma - CTO - Openindex


Further questions about behavior in ReversedWildcardFilterFactory

2012-04-12 Thread neosky
I asked the question in
http://lucene.472066.n3.nabble.com/A-little-onfusion-with-maxPosAsterisk-tt3889226.html
However, when I did some implementation, I got further questions.
1. Suppose I don't use ReversedWildcardFilterFactory at index time; it
seems that Solr doesn't allow a leading wildcard search and returns
the error:
org.apache.lucene.queryParser.ParseException: Cannot parse
'sequence:*A*': '*' or '?' not allowed as first character in
WildcardQuery
But when I use the ReversedWildcardFilterFactory, I can use *A* in
the query. But as far as I know, the ReversedWildcardFilterFactory should work on
the index side and should not affect query behavior. If that is true, how
does this happen?
2. Based on the question above,
suppose I have these tokens in the index:
1.AB/MNO/UUFI
2.BC/MNO/IUYT
3.D/MNO/QEWA
4./MNO/KGJGLI
5.QOEOEF/MNO/
Suppose I use Lucene; I can set the QueryParser with
setAllowLeadingWildcard(true) to search for *MNO*, and
it should return the tokens above (1-5).
But in Solr, when I run *MNO* with the ReversedWildcardFilterFactory
applied at index time but use the StandardAnalyzer in the query, I don't know what
happens here.
The leading-wildcard *MNO should be fast to match 5 with
ReversedWildcardFilterFactory.
The trailing MNO* should be fast to match 4.
But what about *MNO*?
Thanks! 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Further-questions-about-behavior-in-ReversedWildcardFilterFactory-tp3905416p3905416.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-12 Thread Ali S Kureishy
Thanks Darren.

Actually, I would like the system to be homogenous - i.e., use Hadoop based
tools that already provide all the necessary scaling for the lucene index
(in terms of throughput, latency of writes/reads etc). Since SolrCloud adds
its own layer of sharding/replication that is outside Hadoop, I feel that
using SolrCloud would be redundant, and a step in the opposite
direction, which is what I'm trying to avoid in the first place. Or am I
mistaken?

Thanks,
Safdar


On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni  wrote:

> You could use SolrCloud (for the automatic scaling) and just mount a
> fuse[1] HDFS directory and configure solr to use that directory for its
> data.
>
> [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS
>
> On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:
> > Hi,
> >
> > I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
> > using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
> > crawled + indexed every *4 weeks, *with a search latency of less than 0.5
> > seconds.
> >
> > Needless to mention, the search index needs to scale to 5Billion pages.
> It
> > is also possible that I might need to store multiple indexes -- one for
> > crawled content, and one for ancillary data that is also very large. Each
> > of these indices would likely require a logically distributed and
> > replicated index.
> >
> > However, I would like for such a system to be homogenous with the Hadoop
> > infrastructure that is already installed on the cluster (for the crawl).
> In
> > other words, I would much prefer if the replication and distribution of
> the
> > Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
> > using another scalability framework (such as SolrCloud). In addition, it
> > would be ideal if this environment was flexible enough to be dynamically
> > scaled based on the size requirements of the index and the search traffic
> > at the time (i.e. if it is deployed on an Amazon cluster, it should be
> easy
> > enough to automatically provision additional processing power into the
> > cluster without requiring server re-starts).
> >
> > However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
> > be ideal for this scenario. I've heard mention of Solr-on-HBase,
> Solandra,
> > Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these
> is
> > mature enough and would be the right architectural choice to go along
> with
> > a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling
> aspects
> > above.
> >
> > Lastly, how much hardware (assuming a medium sized EC2 instance) would
> you
> > estimate my needing with this setup, for regular web-data (HTML text) at
> > this scale?
> >
> > Any architectural guidance would be greatly appreciated. The more details
> > provided, the wider my grin :).
> >
> > Many many thanks in advance.
> >
> > Thanks,
> > Safdar
>
>
>


AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
> Von: Markus Jelsma

> We've done a lot of tests with the HyphenationCompoundWordTokenFilter
> using a from TeX generated FOP XML file for the Dutch language and
> have seen decent results. A bonus was that now some tokens can be
> stemmed properly because not all compounds are listed in the
> dictionary for the HunspellStemFilter.

Thank you for pointing me to these two filter classes.

> It does introduce a recall/precision problem but it at least returns
> results for those many users that do not properly use compounds in
> their search query.

Could you define what the term "recall" should be taken to mean in this
context? I've also encountered it on the BASIStech website. Okay, I
found a definition:

http://en.wikipedia.org/wiki/Precision_and_recall

Dank je wel!

Michael


RE: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-12 Thread Darren Govoni

SolrCloud or any other tech-specific replication isn't going to 'just work' with
Hadoop replication. But with some significant custom coding, anything should be
possible. Interesting idea.

--- Original Message ---
On 4/12/2012 09:21 AM, Ali S Kureishy wrote:

Thanks Darren.

Actually, I would like the system to be homogenous - i.e., use Hadoop based
tools that already provide all the necessary scaling for the lucene index
(in terms of throughput, latency of writes/reads etc). Since SolrCloud adds
its own layer of sharding/replication that is outside Hadoop, I feel that
using SolrCloud would be redundant, and a step in the opposite
direction, which is what I'm trying to avoid in the first place. Or am I
mistaken?

Thanks,
Safdar


On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni  wrote:

> You could use SolrCloud (for the automatic scaling) and just mount a
> fuse[1] HDFS directory and configure solr to use that directory for its
> data.
>
> [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS
>
> On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:
> > Hi,
> >
> > I'm trying to setup a large scale *Crawl + Index + Search 
*infrastructure
> > using Nutch and Solr/Lucene. The targeted scale is *5 Billion web 
pages*,
> > crawled + indexed every *4 weeks, *with a search latency of less than 
0.5
> > seconds.
> >
> > Needless to mention, the search index needs to scale to 5Billion pages.
> It
> > is also possible that I might need to store multiple indexes -- one for
> > crawled content, and one for ancillary data that is also very large. 
Each
> > of these indices would likely require a logically distributed and
> > replicated index.
> >
> > However, I would like for such a system to be homogenous with the Hadoop
> > infrastructure that is already installed on the cluster (for the crawl).
> In
> > other words, I would much prefer if the replication and distribution of
> the
> > Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead 
of
> > using another scalability framework (such as SolrCloud). In addition, it
> > would be ideal if this environment was flexible enough to be dynamically
> > scaled based on the size requirements of the index and the search 
traffic
> > at the time (i.e. if it is deployed on an Amazon cluster, it should be
> easy
> > enough to automatically provision additional processing power into the
> > cluster without requiring server re-starts).
> >
> > However, I'm not sure which Solr-based tool in the Hadoop ecosystem 
would
> > be ideal for this scenario. I've heard mention of Solr-on-HBase,
> Solandra,
> > Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these
> is
> > mature enough and would be the right architectural choice to go along
> with
> > a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling
> aspects
> > above.
> >
> > Lastly, how much hardware (assuming a medium sized EC2 instance) would
> you
> > estimate my needing with this setup, for regular web-data (HTML text) at
> > this scale?
> >
> > Any architectural guidance would be greatly appreciated. The more 
details
> > provided, the wider my grin :).
> >
> > Many many thanks in advance.
> >
> > Thanks,
> > Safdar
>
>
>



Re: Question about solr.WordDelimiterFilterFactory

2012-04-12 Thread Jian Xu
Erick,

Thank you for your response! 

The problem with this approach is that searching for "12:34" will also match 
"12.34" which is not what I want.



 From: Erick Erickson 
To: solr-user@lucene.apache.org; Jian Xu  
Sent: Thursday, April 12, 2012 8:01 AM
Subject: Re: Question about solr.WordDelimiterFilterFactory
 
WordDelimiterFilterFactory will _almost_ do what you want
by setting things like catenateWords=0 and catenateNumbers=1,
_except_ that the punctuation will be removed. So
12.34 -> 1234
ab,cd -> ab cd

is that "close enough"?

Otherwise, writing a simple Filter is probably the way to go.

Best
Erick

On Wed, Apr 11, 2012 at 1:59 PM, Jian Xu  wrote:
> Hello,
>
> I am new to solr/lucene. I am tasked to index a large number of documents. 
> Some of these documents contain decimal points. I am looking for a way to 
> index these documents so that adjacent numeric characters (such as [0-9.,]) 
> are treated as single token. For example,
>
> 12.34 => "12.34"
> 12,345 => "12,345"
>
> However, "," and "." should be treated as usual when around non-digital 
> characters. For example,
>
> ab,cd => "ab" "cd".
>
> It is so that searching for "12.34" will match "12.34" not "12 34". Searching 
> for "ab.cd" should match both "ab.cd" and "ab cd".
>
> After doing some research on solr, It seems that there is a build-in analyzer 
> called solr.WordDelimiterFilter that supports a "types" attribute which map 
> special characters as different delimiters.  However, it isn't exactly what I 
> want. It doesn't provide context check such as "," or "." must surround by 
> digital characters, etc.
>
> Does anyone have any experience configuring solr to meet this requirements?  
> Is writing my own plugin necessary for this simple thing?
>
> Thanks in advance!
>
> -Jian
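
For what it's worth, a toy sketch (plain Java regex, not a Lucene TokenFilter) of the
context-sensitive rule Jian describes: split on '.' or ',' only when the character is
not sitting between two digits. The pattern and token handling are illustrative, not a
drop-in analyzer:

import java.util.Arrays;
import java.util.regex.Pattern;

public class DigitAwareSplit {
  // Split on whitespace, or on '.'/',' unless it sits between two digits.
  private static final Pattern SPLIT =
      Pattern.compile("\\s+|(?<!\\d)[.,]|[.,](?!\\d)");

  public static void main(String[] args) {
    String text = "ab,cd ab.cd 12.34 12,345";
    System.out.println(Arrays.toString(SPLIT.split(text)));
    // prints [ab, cd, ab, cd, 12.34, 12,345]
  }
}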

RE: SOLR 3.3 DIH and Java 1.6

2012-04-12 Thread randolf.julian
Thanks guys for all the help. We moved to an upgraded O.S. version and the
java script worked.

- Randolf

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-3-3-DIH-and-Java-1-6-tp3841355p3905583.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr 3.4 with nTiers >= 2: usage of ids param causes NullPointerException (NPE)

2012-04-12 Thread Dmitry Kan
Can anyone help me out with this? Is this too complicated / unclear? I
could share more detail if needed.

On Wed, Apr 11, 2012 at 3:16 PM, Dmitry Kan  wrote:

> Hello,
>
> Hopefully this question is not too complex to handle, but I'm currently
> stuck with it.
>
> We have a system with nTiers, that is:
>
> Solr front base ---> Solr front --> shards
>
> Inside QueryComponent there is a method createRetrieveDocs(ResponseBuilder
> rb) which collects doc ids of each shard and sends them in different
> queries using the ids parameter:
>
> [code]
> sreq.params.add(ShardParams.IDS, StrUtils.join(ids, ','));
> [/code]
>
> This actually produces NPE (same as in
> https://issues.apache.org/jira/browse/SOLR-1477) in the first tier,
> because Solr front (on the second tier) fails to process such a query. I
> have tried to fix this by using a unique field with a value of ids ORed
> (the following code substitutes the code above):
>
> [code]
>   StringBuffer idsORed = new StringBuffer();
>   for (Iterator<String> iterator = ids.iterator(); iterator.hasNext(); ) {
> String next = iterator.next();
>
> if (iterator.hasNext()) {
>   idsORed.append(next).append(" OR ");
> } else {
>   idsORed.append(next);
> }
>   }
>
>   sreq.params.add(rb.req.getSchema().getUniqueKeyField().getName(),
> idsORed.toString());
> [/code]
>
> This works perfectly if for rows=n there is n or less hits from a
> distributed query. However, if there are more than 2*n hits, the querying
> fails with an NPE in a completely different component, which is
> HighlightComponent (highlights are requested in the same query with
> hl=true&hl.fragsize=5&hl.requireFieldMatch=true&hl.fl=targetTextField):
>
> SEVERE: java.lang.NullPointerException
> at
> org.apache.solr.handler.component.HighlightComponent.finishStage(HighlightComponent.java:161)
> at
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
> at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
> at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> at
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> at
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> at
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> at
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> at
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
> at
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
> at
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
> at
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
> at java.lang.Thread.run(Thread.java:619)
>
> It sounds like the ids of documents somehow get shuffled and the
> instruction (only a hypothesis)
>
> [code]
> ShardDoc sdoc = rb.resultIds.get(id);
> [/code]
>
> returns sdoc=null, which causes the next line of code to fail with an NPE:
>
> [code]
> int idx = sdoc.positionInResponse;
> [/code]
>
> Am I missing anything? Can something be done for solving this issue?
>
> Thanks.
>
> --
> Regards,
>
> Dmitry Kan
>



-- 
Regards,

Dmitry Kan


Re: Error

2012-04-12 Thread Erick Erickson
Please review:

http://wiki.apache.org/solr/UsingMailingLists

You haven't said whether, for instance, you're using trunk which
is the only version that supports the "termfreq" function.

Best
Erick

On Thu, Apr 12, 2012 at 4:08 AM, Abhishek tiwari
 wrote:
> http://xyz.com:8080/newschema/mainsearch/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on&sort=termfreq%28cuisine_priorities_list,%27Chinese%27%29%20desc
>
> Error :  HTTP Status 400 - Missing sort order.
> Why i am getting error ?


Import null values from XML file

2012-04-12 Thread randolf.julian
We import an XML file directly to SOLR using a the script called post.sh in
the exampledocs. This is the script:

FILES=$*
URL=http://localhost:8983/solr/update

for f in $FILES; do
  echo Posting file $f to $URL
  curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8'
  echo
done

# send the commit command to make sure all the changes are flushed and visible
curl $URL --data-binary '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
echo

Our XML file looks something like this:


  
D22BF0B9-EE3A-49AC-A4D6-000B07CDA18A
D22BF0B9-EE3A-49AC-A4D6-000B07CDA18A
1000
CK4475
CK4475
NULL
NULL
840655037330
NULL
EBC CLUTCH KIT
EBC CLUTCH KIT
  


How can I tell solr that the "NULL" value should be treated as null?

Thanks,
Randolf 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Import-null-values-from-XML-file-tp3905600p3905600.html
Sent from the Solr - User mailing list archive at Nabble.com.
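
A hedged sketch of one possible approach (not discussed in the thread): a small custom
UpdateRequestProcessor that drops fields whose value is the literal string "NULL" before
the document is indexed. Class and method names are assumed from the 3.x/4.x processor
API, and the factory would still need to be registered in an update chain in
solrconfig.xml:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class NullFieldFilterFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        // collect first to avoid modifying the field map while iterating
        List<String> toRemove = new ArrayList<String>();
        for (String name : doc.getFieldNames()) {
          if ("NULL".equals(doc.getFieldValue(name))) {
            toRemove.add(name);
          }
        }
        for (String name : toRemove) {
          doc.removeField(name);
        }
        super.processAdd(cmd);
      }
    };
  }
}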


Re: Lexical analysis tools for German language data

2012-04-12 Thread Walter Underwood
German noun decompounding is a little more complicated than it might seem.

There can be transformations or inflections, like the "s" in "Weihnachtsbaum"
(Weihnachten/Baum).

Internal nouns should be recapitalized, like "Baum" above.

Some compounds probably should not be decompounded, like "Fahrrad"
(fahren/Rad). With a dictionary-based stemmer, you might decide to avoid
decompounding for words in the dictionary.

Verbs get more complicated inflections, and might need to be decapitalized, 
like "farhren" above.

Und so weiter.

Note that highlighting gets pretty weird when you are matching only part of a 
word.

Luckily, a lot of compounds are simple, and you could well get a measurable 
improvement with a very simple algorithm. There isn't anything complicated 
about compounds like Orgelmusik or Netzwerkbetreuer.

The Basis Technology linguistic analyzers aren't cheap or small, but they work 
well. 

wunder

On Apr 12, 2012, at 3:58 AM, Paul Libbrecht wrote:

> Bernd,
> 
> can you please say a little more?
> I think this list is ok to contain some description for commercial solutions 
> that satisfy a request formulated on list.
> 
> Is there any product at BASIS Tech that provides a compound-analyzer with a 
> big dictionary of decomposed compounds in German? If yes, for which domain? 
> The Google Search result (I wonder if this is politically correct to not have 
> yours ;-)) shows me that there's an amount of job done in this direction 
> (e.g. Gärten to match Garten) but being precise for this question would be 
> more helpful!
> 
> paul
> 
> 
> Le 12 avr. 2012 à 12:46, Bernd Fehling a écrit :
> 
>> 
>> You might have a look at:
>> http://www.basistech.com/lucene/
>> 
>> 
>> Am 12.04.2012 11:52, schrieb Michael Ludwig:
>>> Given an input of "Windjacke" (probably "wind jacket" in English), I'd
>>> like the code that prepares the data for the index (tokenizer etc) to
>>> understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
>>> would include the "Windjacke" document in its result set.
>>> 
>>> It appears to me that such an analysis requires a dictionary-backed
>>> approach, which doesn't have to be perfect at all; a list of the most
>>> common 2000 words would probably do the job and fulfil a criterion of
>>> reasonable usefulness.
>>> 
>>> Do you know of any implementation techniques or working implementations
>>> to do this kind of lexical analysis for German language data? (Or other
>>> languages, for that matter?) What are they, where can I find them?
>>> 
>>> I'm sure there is something out (commercial or free) because I've seen
>>> lots of engines grokking German and the way it builds words.
>>> 
>>> Failing that, what are the proper terms do refer to these techniques so
>>> you can search more successfully?
>>> 
>>> Michael






[Solr 4.0] Is it possible to do soft commit from code and not configuration only

2012-04-12 Thread Lyuba Romanchuk
Hi,



I need to configure Solr so that the open searcher will see a new
document immediately after it is added to the index.

And I don't want to perform a commit each time a new document is added.

I tried to configure maxDocs=1 under autoSoftCommit in solrconfig.xml but
it didn't help.

Is there way to perform soft commit from code in Solr 4.0 ?


Thank you in advance.

Best regards,

Lyuba


AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
> Von: Walter Underwood

> German noun decompounding is a little more complicated than it might
> seem.
> 
> There can be transformations or inflections, like the "s" in
> "Weinachtsbaum" (Weinachten/Baum).

I remember from my linguistics studies that the terminus technicus for
these is "Fugenmorphem" (interstitial or joint morpheme). But there aren't
many of them - phrased as a regex, it's /e?[ns]/. The Weihnachtsbaum
in the example above is built from the singular (die Weihnacht), then "s",
then Baum. Still, it's much more complex than, say, English or Italian.

> Internal nouns should be recapitalized, like "Baum" above.

Casing won't matter for indexing, I think. The way I would go about
obtaining stems from compound words is by using a dictionary of stems
and a regex. We'll see how far that'll take us.

> Some compounds probably should not be decompounded, like "Fahrrad"
> (farhren/Rad). With a dictionary-based stemmer, you might decide to
> avoid decompounding for words in the dictionary.

Good point.

> Note that highlighting gets pretty weird when you are matching only
> part of a word.

Guess it'll be weird when you get it wrong, like "Noten" in
"Notentriegelung".

> Luckily, a lot of compounds are simple, and you could well get a
> measurable improvement with a very simple algorithm. There isn't
> anything complicated about compounds like Orgelmusik or
> Netzwerkbetreuer.

Exactly.

> The Basis Technology linguistic analyzers aren't cheap or small, but
> they work well.

We will consider our needs and options. Thanks for your thoughts.

Michael
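
As a rough illustration of the dictionary-plus-regex approach Michael sketches (a toy
example only; the dictionary contents, the minimum stem length, and the list of joint
morphemes derived from /e?[ns]/ are all illustrative assumptions):

import java.util.*;

public class SimpleDecompounder {
  private final Set<String> stems;

  public SimpleDecompounder(Collection<String> dictionary) {
    this.stems = new HashSet<String>();
    for (String s : dictionary) stems.add(s.toLowerCase(Locale.GERMAN));
  }

  // Try to split a compound into two dictionary stems, allowing an
  // optional joint morpheme (Fugenmorphem, roughly e?[ns]) after the head.
  public List<String> decompound(String word) {
    String w = word.toLowerCase(Locale.GERMAN);
    for (int i = 3; i <= w.length() - 3; i++) {   // require stems of length >= 3
      String head = w.substring(0, i);
      String tail = w.substring(i);
      if (stems.contains(head) && stems.contains(tail)) {
        return Arrays.asList(head, tail);
      }
      for (String joint : new String[]{"s", "n", "es", "en"}) {
        if (head.endsWith(joint)) {
          String base = head.substring(0, head.length() - joint.length());
          if (stems.contains(base) && stems.contains(tail)) {
            return Arrays.asList(base, tail);
          }
        }
      }
    }
    return Collections.singletonList(word);       // no split found
  }

  public static void main(String[] args) {
    SimpleDecompounder d = new SimpleDecompounder(
        Arrays.asList("wind", "jacke", "weihnacht", "baum"));
    System.out.println(d.decompound("Windjacke"));       // [wind, jacke]
    System.out.println(d.decompound("Weihnachtsbaum"));  // [weihnacht, baum]
  }
}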


Re: [Solr 4.0] Is it possible to do soft commit from code and not configuration only

2012-04-12 Thread Mark Miller

On Apr 12, 2012, at 11:28 AM, Lyuba Romanchuk wrote:

> Hi,
> 
> 
> 
> I need to configure the solr so that the opened searcher will see a new
> document immidiately after it was adding to the index.
> 
> And I don't want to perform commit each time a new document is added.
> 
> I tried to configure maxDocs=1 under autoSoftCommit in solrconfig.xml but
> it didn't help.

Can you elaborate on didn't help? You couldn't find any docs unless you did an 
explicit commit? If that is true and there is no user error, this would be a 
bug.

> 
> Is there way to perform soft commit from code in Solr 4.0 ?

Yes - check out the wiki docs - I can't remember how it is offhand (I think it 
was slightly changed recently).

> 
> 
> Thank you in advance.
> 
> Best regards,
> 
> Lyuba

- Mark Miller
lucidimagination.com
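
For reference, a minimal SolrJ sketch of adding a document and then asking for a soft
commit from code. This is a hedged example: HttpSolrServer (CommonsHttpSolrServer in
older builds) and the three-argument commit(waitFlush, waitSearcher, softCommit)
overload are assumptions about the 4.x client, so check the wiki page Mark mentions for
the current form.

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SoftCommitExample {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");     // unique key field from the thread
    doc.addField("type", "main");    // "type" field from the thread
    server.add(doc);

    // waitFlush=true, waitSearcher=true, softCommit=true:
    // opens a new searcher without flushing segments to stable storage
    server.commit(true, true, true);
  }
}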













Re: I've broken delete in SolrCloud and I'm a bit clueless as to how

2012-04-12 Thread Mark Miller
Please see the documentation: 
http://wiki.apache.org/solr/SolrCloud#Required_Config

schema.xml

You must have a _version_ field defined:



On Apr 11, 2012, at 9:10 AM, Benson Margulies wrote:

> I didn't have a _version_ field, since nothing in the schema says that
> it's required!
> 
> On Wed, Apr 11, 2012 at 6:35 AM, Darren Govoni  wrote:
>> Hard to say why its not working for you. Start with a fresh Solr and
>> work forward from there or back out your configs and plugins until it
>> works again.
>> 
>> On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote:
>>> In my cloud configuration, if I push
>>> 
>>> 
>>>   *:*
>>> 
>>> 
>>> followed by:
>>> 
>>> 
>>> 
>>> I get no errors, the log looks happy enough, but the documents remain
>>> in the index, visible to /query.
>>> 
>>> Here's what seems my relevant bit of solrconfig.xml. My URP only
>>> implements processAdd.
>>> 
>>>
>>> 
>>> >> class="com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory"/>
>>> 
>>> 
>>> 
>>>   
>>> 
>>> 
>>>   >>   class="solr.XmlUpdateRequestHandler">
>>> 
>>>   RNI
>>> 
>>> 
>>> 
>> 
>> 

- Mark Miller
lucidimagination.com













Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Paul Libbrecht

Le 12 avr. 2012 à 17:46, Michael Ludwig a écrit :
>> Some compounds probably should not be decompounded, like "Fahrrad"
>> (farhren/Rad). With a dictionary-based stemmer, you might decide to
>> avoid decompounding for words in the dictionary.
> 
> Good point.

More or less, Fahrrad is generally abbreviated as Rad.
(even though Rad can mean wheel and bike)

>> Note that highlighting gets pretty weird when you are matching only
>> part of a word.
> 
> Guess it'll be a weird when you get it wrong, like "Noten" in
> "Notentriegelung".

This decomposition should not happen because Noten-triegelung does not have a 
correct second term.

>> The Basis Technology linguistic analyzers aren't cheap or small, but
>> they work well.
> 
> We will consider our needs and options. Thanks for your thoughts.

My question remains as to which domain it aims at covering.
We had such need for mathematics texts... I would be pleasantly surprised if, 
for example, Differenzen-quotient  would be decompounded.

paul

Re: Problem to integrate Solr in Jetty (the first example in the Apache Solr 3.1 Cookbook)

2012-04-12 Thread Shawn Heisey

On 4/12/2012 2:21 AM, Bastian Hepp wrote:

When I try to start I get this error message:

C:\\jetty-solr>java -jar start.jar
java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
 at java.lang.reflect.Method.invoke(Unknown Source)
 at org.eclipse.jetty.start.Main.invokeMain(Main.java:457)
 at org.eclipse.jetty.start.Main.start(Main.java:602)
 at org.eclipse.jetty.start.Main.main(Main.java:82)
Caused by: java.lang.ClassNotFoundException: org.mortbay.jetty.Server
 at java.net.URLClassLoader$1.run(Unknown Source)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(Unknown Source)
 at java.lang.ClassLoader.loadClass(Unknown Source)
 at java.lang.ClassLoader.loadClass(Unknown Source)
 at org.eclipse.jetty.util.Loader.loadClass(Loader.java:92)
 at
org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.nodeClass(XmlConfiguration.java:349)
 at
org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.configure(XmlConfiguration.java:327)
 at
org.eclipse.jetty.xml.XmlConfiguration.configure(XmlConfiguration.java:291)
 at
org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1203)
 at java.security.AccessController.doPrivileged(Native Method)
 at
org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1138)


Bastian,

The jetty.xml included with Solr is littered with org.mortbay class 
references, which are appropriate for Jetty 6.  Jetty 7 and 8 use the 
org.eclipse prefix, and from the very small amount of investigation I 
did a few weeks ago, have also made other changes to the package names, 
so you might not be able to simply replace org.mortbay with org.eclipse.


The absolutely easiest option would be to just use the jetty included 
with Solr, not version 8.  If you want to keep using Jetty 8, you will 
need to find/make a new jetty.xml file.


If I were set on using Jetty 8 and had to make it work, I would check 
out trunk (Lucene/Solr 4.0) from the Apache SVN server, find the example 
jetty.xml there, and use it instead.  It's possible that you may need to 
still make changes, but that is probably the path of least resistance.  
The jetty version has been upgraded in trunk.


Another option would be to download Jetty 6, find its jetty.xml, and 
compare it with the one in Solr, to find out what the Lucene developers 
changed from default.  Then you would have to take the default jetty.xml 
from Jetty 8 and make similar changes to make a new config.


Apparently Jetty 8 no longer supports JSP with the JRE, so you're 
probably going to need the JDK.  The developers have eliminated JSP from 
trunk, so it will still work with the JRE.


Thanks,
Shawn



Re: Large Index and OutOfMemoryError: Map failed

2012-04-12 Thread Mark Miller

On Apr 12, 2012, at 6:07 AM, Michael McCandless wrote:

> Your largest index has 66 segments (690 files) ... biggish but not
> insane.  With 64K maps you should be able to have ~47 searchers open
> on each core.
> 
> Enabling compound file format (not the opposite!) will mean fewer maps
> ... ie should improve this situation.
> 
> I don't understand why Solr defaults to compound file off... that
> seems dangerous.
> 
> Really we need a Solr dev here... to answer "how long is a stale
> searcher kept open".  Is it somehow possible 46 old searchers are
> being left open...?

Probably only if there is a bug. When a new Searcher is opened, any previous 
Searcher is closed as soon as there are no more references to it (eg all in 
flight requests to that Searcher finish).

> 
> I don't see any other reason why you'd run out of maps.  Hmm, unless
> MMapDirectory didn't think it could safely invoke unmap in your JVM.
> Which exact JVM are you using?  If you can print the
> MMapDirectory.UNMAP_SUPPORTED constant, we'd know for sure.
> 
> Yes, switching away from MMapDir will sidestep the "too many maps"
> issue, however, 1) MMapDir has better perf than NIOFSDir, and 2) if
> there really is a leak here (Solr not closing the old searchers or a
> Lucene bug or something...) then you'll eventually run out of file
> descriptors (ie, same  problem, different manifestation).
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 2012/4/11 Gopal Patwa :
>> 
>> I have not change the mergefactor, it was 10. Compound index file is disable
>> in my config but I read from below post, that some one had similar issue and
>> it was resolved by switching from compound index file format to non-compound
>> index file.
>> 
>> and some folks resolved by "changing lucene code to disable MMapDirectory."
>> Is this best practice to do, if so is this can be done in configuration?
>> 
>> http://lucene.472066.n3.nabble.com/MMapDirectory-failed-to-map-a-23G-compound-index-segment-td3317208.html
>> 
>> I have index document of core1 = 5 million, core2=8million and
>> core3=3million and all index are hosted in single Solr instance
>> 
>> I am going to use Solr for our site StubHub.com, see attached "ls -l" list
>> of index files for all core
>> 
>> SolrConfig.xml:
>> 
>> 
>>  
>>  false
>>  10
>>  2147483647
>>  1
>>  4096
>>  10
>>  1000
>>  1
>>  single
>>  
>>  
>>0.0
>>10.0
>>  
>> 
>>  
>>false
>>0
>>  
>>  
>>  
>> 
>> 
>>  
>>  1000
>>   
>> 90
>> false
>>   
>>   
>> ${inventory.solr.softcommit.duration:1000}
>>   
>>  
>>  
>> 
>> 
>> Forwarded conversation
>> Subject: Large Index and OutOfMemoryError: Map failed
>> 
>> 
>> From: Gopal Patwa 
>> Date: Fri, Mar 30, 2012 at 10:26 PM
>> To: solr-user@lucene.apache.org
>> 
>> 
>> I need help!!
>> 
>> 
>> 
>> 
>> 
>> I am using Solr 4.0 nightly build with NRT and I often get this error during
>> auto commit "java.lang.OutOfMemoryError: Map failed". I have search this
>> forum and what I found it is related to OS ulimit setting, please se below
>> my ulimit settings. I am not sure what ulimit setting I should have? and we
>> also get "java.net.SocketException: Too many open files" NOT sure how many
>> open file we need to set?
>> 
>> 
>> I have 3 core with index size : core1 - 70GB, Core2 - 50GB and Core3 - 15GB,
>> with Single shard
>> 
>> 
>> We update the index every 5 seconds, soft commit every 1 second and hard
>> commit every 15 minutes
>> 
>> 
>> Environment: Jboss 4.2, JDK 1.6 , CentOS, JVM Heap Size = 24GB
>> 
>> 
>> ulimit:
>> 
>> core file size  (blocks, -c) 0
>> data seg size   (kbytes, -d) unlimited
>> scheduling priority (-e) 0
>> file size   (blocks, -f) unlimited
>> pending signals (-i) 401408
>> max locked memory   (kbytes, -l) 1024
>> max memory size (kbytes, -m) unlimited
>> open files  (-n) 1024
>> pipe size(512 bytes, -p) 8
>> POSIX message queues (bytes, -q) 819200
>> real-time priority  (-r) 0
>> stack size  (kbytes, -s) 10240
>> cpu time   (seconds, -t) unlimited
>> max user processes  (-u) 401408
>> virtual memory  (kbytes, -v) unlimited
>> file locks  (-x) unlimited
>> 
>> 
>> 
>> ERROR:
>> 
>> 
>> 
>> 
>> 
>> 2012-03-29 15:14:08,560 [] priority=ERROR app_name= thread=pool-3-thread-1
>> location=CommitTracker line=93 auto commit error...:java.io.IOException: Map
>> failed
>>  at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:748)
>>  at
>> org.apache.lucene.store.MMapDirectory$MMapIndexInput.(MMapDirectory.java:293)
>>  at 
>> o

Re: I've broken delete in SolrCloud and I'm a bit clueless as to how

2012-04-12 Thread Benson Margulies
On Thu, Apr 12, 2012 at 11:56 AM, Mark Miller  wrote:
> Please see the documentation: 
> http://wiki.apache.org/solr/SolrCloud#Required_Config

Did I fail to find this in google or did I just goad you into a writing job?

I'm inclined to write a JIRA asking for _version_ to be configurable
just like the uniqueKey in the schema.



>
> schema.xml
>
> You must have a _version_ field defined:
>
> 
>
> On Apr 11, 2012, at 9:10 AM, Benson Margulies wrote:
>
>> I didn't have a _version_ field, since nothing in the schema says that
>> it's required!
>>
>> On Wed, Apr 11, 2012 at 6:35 AM, Darren Govoni  wrote:
>>> Hard to say why its not working for you. Start with a fresh Solr and
>>> work forward from there or back out your configs and plugins until it
>>> works again.
>>>
>>> On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote:
 In my cloud configuration, if I push

 
   *:*
 

 followed by:

 

 I get no errors, the log looks happy enough, but the documents remain
 in the index, visible to /query.

 Here's what seems my relevant bit of solrconfig.xml. My URP only
 implements processAdd.

    
     
     >>> class="com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory"/>
     
     
     
   

     
   >>>                   class="solr.XmlUpdateRequestHandler">
     
       RNI
     
     

>>>
>>>
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>


Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Walter Underwood
On Apr 12, 2012, at 8:46 AM, Michael Ludwig wrote:

> I remember from my linguistics studies that the terminus technicus for
> these is "Fugenmorphem" (interstitial or joint morpheme). 

That is some excellent linguistic jargon. I'll file that with "hapax legomenon".

If you don't highlight, you can get good results with pretty rough analyzers, 
but highlighting exposes those, even when they don't affect relevance. For 
example, you can get good relevance just indexing bigrams in Chinese, but it 
looks awful when you highlight them. As soon as you highlight, you need a 
dictionary-based segmenter.

wunder
--
Walter Underwood
wun...@wunderwood.org





Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Markus Jelsma
On Thursday 12 April 2012 18:00:14 Paul Libbrecht wrote:
> Le 12 avr. 2012 à 17:46, Michael Ludwig a écrit :
> >> Some compounds probably should not be decompounded, like "Fahrrad"
> >> (farhren/Rad). With a dictionary-based stemmer, you might decide to
> >> avoid decompounding for words in the dictionary.
> > 
> > Good point.
> 
> More or less, Fahrrad is generally abbreviated as Rad.
> (even though Rad can mean wheel and bike)
> 
> >> Note that highlighting gets pretty weird when you are matching only
> >> part of a word.
> > 
> > Guess it'll be a weird when you get it wrong, like "Noten" in
> > "Notentriegelung".
> 
> This decomposition should not happen because Noten-triegelung does not have
> a correct second term.
> 
> >> The Basis Technology linguistic analyzers aren't cheap or small, but
> >> they work well.
> > 
> > We will consider our needs and options. Thanks for your thoughts.
> 
> My question remains as to which domain it aims at covering.
> We had such need for mathematics texts... I would be pleasantly surprised
> if, for example, Differenzen-quotient  would be decompounded.

The HyphenationCompoundWordTokenFilter can do those things but those words 
must be listed in the dictionary or you'll get strange results. It still 
yields strange results when it emits tokens that are subwords of a subword.

> 
> paul

-- 
Markus Jelsma - CTO - Openindex


Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Walter Underwood
On Apr 12, 2012, at 9:00 AM, Paul Libbrecht wrote:

> More or less, Fahrrad is generally abbreviated as Rad.
> (even though Rad can mean wheel and bike)

A synonym could handle this, since "fahren" would not be a good match. It is a
judgement call, but this seems more like an equivalence "Fahrrad = Rad" than
decompounding.

wunder
--
Walter Underwood
wun...@wunderwood.org





Re: codecs for sorted indexes

2012-04-12 Thread Michael McCandless
Do you mean you are pre-sorting the documents (by what criteria?)
yourself, before adding them to the index?

In which case... you should already be seeing some benefits (smaller
index size) than had you "randomly" added them (ie the vInts should
take fewer bytes), I think.  (Probably the savings would be greater
for better intblock codecs like PForDelta, SimpleX, but I'm not
sure...).

Or do you mean having a codec re-sort the documents (on flush/merge)?
I think this should be possible w/ the Codec API... but nobody has
tried it yet that I know of.

Note that the bulkpostings branch is effectively dead (nobody is
iterating on it, and we've removed the old bulk API from trunk), but
there is likely a GSoC project to add a PForDelta codec to trunk:

https://issues.apache.org/jira/browse/LUCENE-3892

Mike McCandless

http://blog.mikemccandless.com



On Thu, Apr 12, 2012 at 6:13 AM, Carlos Gonzalez-Cadenas
 wrote:
> Hello,
>
> We're using a sorted index in order to implement early termination
> efficiently over an index of hundreds of millions of documents. As of now,
> we're using the default codecs coming with Lucene 4, but we believe that
> due to the fact that the docids are sorted, we should be able to do much
> better in terms of storage and achieve much better performance, especially
> decompression performance.
>
> In particular, Robert Muir is commenting on these lines here:
>
> https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411
>
> We're aware that the in the bulkpostings branch there are different codecs
> being implemented and different experiments being done. We don't know
> whether we should implement our own codec (i.e. using some RLE-like
> techniques) or we should use one of the codecs implemented there (PFOR,
> Simple64, ...).
>
> Can you please give us some advice on this?
>
> Thanks
> Carlos
>
> Carlos Gonzalez-Cadenas
> CEO, ExperienceOn - New generation search
> http://www.experienceon.com
>
> Mobile: +34 652 911 201
> Skype: carlosgonzalezcadenas
> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


Re: Error

2012-04-12 Thread Abhishek tiwari
I am using Solr version 3.4... please assist...

On Thu, Apr 12, 2012 at 8:41 PM, Erick Erickson wrote:

> Please review:
>
> http://wiki.apache.org/solr/UsingMailingLists
>
> You haven't said whether, for instance, you're using trunk which
> is the only version that supports the "termfreq" function.
>
> Best
> Erick
>
> On Thu, Apr 12, 2012 at 4:08 AM, Abhishek tiwari
>  wrote:
> >
> http://xyz.com:8080/newschema/mainsearch/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on&sort=termfreq%28cuisine_priorities_list,%27Chinese%27%29%20desc
> >
> > Error :  HTTP Status 400 - Missing sort order.
> > Why i am getting error ?
>


Re: EmbeddedSolrServer and StreamingUpdateSolrServer

2012-04-12 Thread Shawn Heisey

On 4/12/2012 4:52 AM, pcrao wrote:

I think the index is getting corrupted because StreamingUpdateSolrServer is
keeping reference
to some index files that are being deleted by EmbeddedSolrServer during
commit/optimize process.
As a result when I Index(Full) using EmbeddedSolrServer and then do
Incremental index using StreamingUpdateSolrServer it fails with a
FileNotFound exception.
  A special note: we don't optimize the index after Incremental
indexing(StreamingUpdateSolrServer) but we do optimize it after the Full
index(EmbeddedSolrServer). Please see the below log and let me know
if you need further information.


I am a relative newbie to all this, and I've never used 
EmbeddedSolrServer, only CommonsHttpSolrServer and 
StreamingUpdateSolrServer.  I'm not even sure the embedded object is an 
option unless your program is running in the same JVM as Solr.  Mine is 
separate.


If I am right about ESS needing to be in the same JVM as Solr, then that 
means it can do a more direct interaction with Solr and therefore might 
not be coordinated with the HTTP access that SUSS uses.  I have read 
multiple times that the developers don't recommend using ESS.  If you 
are going to use it, you probably have to do everything with it.


SUSS does everything in the background, so you have no guarantees as to 
when it will happen, as well as no ability to check for completion or 
errors.  Because of the lack of error detection, I had to stop using SUSS.


Thanks,
Shawn
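
As a side note on the SUSS error-detection gap Shawn mentions, a common workaround is to
subclass StreamingUpdateSolrServer and override handleError so failures at least get
surfaced. A hedged sketch, with the constructor and method signature assumed from the
3.x SolrJ client:

import java.net.MalformedURLException;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;

public class LoggingUpdateServer {
  public static StreamingUpdateSolrServer create(String url)
      throws MalformedURLException {
    // queue of 100 documents, 4 background writer threads (illustrative values)
    return new StreamingUpdateSolrServer(url, 100, 4) {
      @Override
      public void handleError(Throwable ex) {
        // SUSS normally just logs and keeps going; surface the failure here
        // (e.g. set a flag the indexing code checks when it finishes)
        ex.printStackTrace();
      }
    };
  }
}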



Re: [Solr 4.0] Is it possible to do soft commit from code and not configuration only

2012-04-12 Thread Lyuba Romanchuk
Hi Mark,

Thank you for reply.

I tried to normalize data like in relational databases:

   - there are several types of documents, where:
      - documents with the same type have the same fields
      - documents with different types may have different fields
      - but all documents have a "type" field and the unique key field "id"
   - there is a "main" type (all records of this type contain "pointers"
     to the corresponding records of other types)

There is the configuration that defines what information should be stored
in each type.
When I get new data for indexing, I first check whether such a document is
already in the index, using facets on the corresponding fields and a query on
the relevant type.
I add documents to the Solr index without a commit from the code, but with
autocommit and autoSoftCommit with maxDocs=1 in solrconfig.xml.
But here there is a problem: if I add a new record for some type, the
searcher doesn't see it immediately.
As a result I get duplicate records with the same type but different
ids (unique key).

If I do a commit from code after each document is added it works OK, but that's
not a solution.
So I wanted to try to do a soft commit from code after adding documents with a
"not-main" type. I searched the wiki documents
but found only commit without parameters and commit with parameters that
don't seem to be what I need.

Best regards,
Lyuba

On Thu, Apr 12, 2012 at 6:55 PM, Mark Miller  wrote:

>
> On Apr 12, 2012, at 11:28 AM, Lyuba Romanchuk wrote:
>
> > Hi,
> >
> >
> >
> > I need to configure the solr so that the opened searcher will see a new
> > document immidiately after it was adding to the index.
> >
> > And I don't want to perform commit each time a new document is added.
> >
> > I tried to configure maxDocs=1 under autoSoftCommit in solrconfig.xml but
> > it didn't help.
>
> Can you elaborate on didn't help? You couldn't find any docs unless you
> did an explicit commit? If that is true and there is no user error, this
> would be a bug.
>
> >
> > Is there way to perform soft commit from code in Solr 4.0 ?
>
> Yes - check out the wiki docs - I can't remember how it is offhand (I
> think it was slightly changed recently).
>
> >
> >
> > Thank you in advance.
> >
> > Best regards,
> >
> > Lyuba
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>
>


Re: Error

2012-04-12 Thread Erick Erickson
The "termfreq" function is only valid for trunk.
You're using 3.4. Since 'termfreq' is not recognized, Solr
gets confused.

Best
Erick

On Thu, Apr 12, 2012 at 10:20 AM, Abhishek tiwari
 wrote:
> i am using 3.4 solr version... please assist...
>
> On Thu, Apr 12, 2012 at 8:41 PM, Erick Erickson 
> wrote:
>
>> Please review:
>>
>> http://wiki.apache.org/solr/UsingMailingLists
>>
>> You haven't said whether, for instance, you're using trunk which
>> is the only version that supports the "termfreq" function.
>>
>> Best
>> Erick
>>
>> On Thu, Apr 12, 2012 at 4:08 AM, Abhishek tiwari
>>  wrote:
>> >
>> http://xyz.com:8080/newschema/mainsearch/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on&sort=termfreq%28cuisine_priorities_list,%27Chinese%27%29%20desc
>> >
>> > Error :  HTTP Status 400 - Missing sort order.
>> > Why i am getting error ?
>>


Re: is there a downside to combining search fields with copyfield?

2012-04-12 Thread Shawn Heisey

On 4/12/2012 7:27 AM, geeky2 wrote:

currently, my schema has individual fields to search on.

are there advantages or disadvantages to taking several of the individual
search fields and combining them in to a single search field?

would this affect search times, term tokenization or possibly other things.

example of individual fields

brand
category
partno

example of a single combined search field

part_info (would combine brand, category and partno)


You end up with one multivalued field, which means that you can only 
have one analyzer chain.  With separate fields, each field can be 
analyzed differently.  Also, if you are indexing and/or storing the 
individual fields, you may have data duplication in your index, making 
it larger and increasing your disk/RAM requirements.  That field will 
have a higher termcount than the individual fields, which means that 
searches against it will naturally be just a little bit slower.  Your 
application will not have to do as much work to construct a query, though.


If you are already planning to use dismax/edismax, then you don't need 
the overhead of a copyField.  You can simply provide access to (e)dismax 
search with the qf (and possibly pf) parameters predefined, or your 
application can provide these parameters.


http://wiki.apache.org/solr/ExtendedDisMax

Thanks,
Shawn
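
To make Shawn's last point concrete, a hedged SolrJ sketch of querying the individual
fields with edismax instead of a combined copyField. The field names come from the
question, the boosts are illustrative, and qf/pf could just as well live in the request
handler defaults:

import org.apache.solr.client.solrj.SolrQuery;

public class PartSearchQuery {
  public static SolrQuery build(String userInput) {
    SolrQuery q = new SolrQuery(userInput);
    q.set("defType", "edismax");
    // search the separate fields directly, each with its own boost,
    // instead of one big copyField
    q.set("qf", "brand^2.0 category^1.0 partno^5.0");
    q.set("pf", "brand category partno");
    return q;
  }
}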



Re: Problem to integrate Solr in Jetty (the first example in the Apache Solr 3.1 Cookbook)

2012-04-12 Thread Bastian Hepp
Thanks Shawn,

I think I'll stay with the built-in one. I had problems with Solr Cell, but I
could fix it.

Greetings,
Bastian

Am 12. April 2012 18:02 schrieb Shawn Heisey :
>
> Bastian,
>
> The jetty.xml included with Solr is littered with org.mortbay class
> references, which are appropriate for Jetty 6.  Jetty 7 and 8 use the
> org.eclipse prefix, and from the very small amount of investigation I did a
> few weeks ago, have also made other changes to the package names, so you
> might not be able to simply replace org.mortbay with org.eclipse.
>
> The absolutely easiest option would be to just use the jetty included with
> Solr, not version 8.  If you want to keep using Jetty 8, you will need to
> find/make a new jetty.xml file.
>
> If I were set on using Jetty 8 and had to make it work, I would check out
> trunk (Lucene/Solr 4.0) from the Apache SVN server, find the example
> jetty.xml there, and use it instead.  It's possible that you may need to
> still make changes, but that is probably the path of least resistance.  The
> jetty version has been upgraded in trunk.
>
> Another option would be to download Jetty 6, find its jetty.xml, and
> compare it with the one in Solr, to find out what the Lucene developers
> changed from default.  Then you would have to take the default jetty.xml
> from Jetty 8 and make similar changes to make a new config.
>
> Apparently Jetty 8 no longer supports JSP with the JRE, so you're probably
> going to need the JDK.  The developers have eliminated JSP from trunk, so
> it will still work with the JRE.
>
> Thanks,
> Shawn
>
>


Re: I've broken delete in SolrCloud and I'm a bit clueless as to how

2012-04-12 Thread Mark Miller
Google must not have found it - I put that in a month or so ago, I believe -
at least a few weeks. As you can see, there is still a bit to fill in, but it
covers the high level. I'd like to add example snippets for the rest soon.

On Thu, Apr 12, 2012 at 12:04 PM, Benson Margulies wrote:

> On Thu, Apr 12, 2012 at 11:56 AM, Mark Miller 
> wrote:
> > Please see the documentation:
> http://wiki.apache.org/solr/SolrCloud#Required_Config
>
> Did I fail to find this in google or did I just goad you into a writing
> job?
>
> I'm inclined to write a JIRA asking for _version_ to be configurable
> just like the uniqueKey in the schema.
>
>
>
> >
> > schema.xml
> >
> > You must have a _version_ field defined:
> >
> > 
> >
> > On Apr 11, 2012, at 9:10 AM, Benson Margulies wrote:
> >
> >> I didn't have a _version_ field, since nothing in the schema says that
> >> it's required!
> >>
> >> On Wed, Apr 11, 2012 at 6:35 AM, Darren Govoni 
> wrote:
> >>> Hard to say why its not working for you. Start with a fresh Solr and
> >>> work forward from there or back out your configs and plugins until it
> >>> works again.
> >>>
> >>> On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote:
>  In my cloud configuration, if I push
> 
>  
>    *:*
>  
> 
>  followed by:
> 
>  
> 
>  I get no errors, the log looks happy enough, but the documents remain
>  in the index, visible to /query.
> 
>  Here's what seems my relevant bit of solrconfig.xml. My URP only
>  implements processAdd.
> 
> 
>  
>   class="com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory"/>
>  
>  
>  
>    
> 
>  
>    class="solr.XmlUpdateRequestHandler">
>  
>    RNI
>  
>  
> 
> >>>
> >>>
> >
> > - Mark Miller
> > lucidimagination.com
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>



-- 
- Mark

http://www.lucidimagination.com


Re: I've broken delete in SolrCloud and I'm a bit clueless as to how

2012-04-12 Thread Chris Hostetter

: Please see the documentation: 
http://wiki.apache.org/solr/SolrCloud#Required_Config
: 
: schema.xml
: 
: You must have a _version_ field defined:
: 
: 

Seems like this is the kind of thing that should make Solr fail hard and 
fast on SolrCore init if it sees you are running in cloud mode and yet it 
doesn't find this -- similar to how some other features fail hard and fast 
if you don't have uniqueKey.


-Hoss


Re: solr 3.4 with nTiers >= 2: usage of ids param causes NullPointerException (NPE)

2012-04-12 Thread Mikhail Khludnev
Dmitry,

The last NPE in HighlightComponent is just a sad coding issue.
A few rows later we can see that the developer expected some docs not to be
found:
// remove nulls in case not all docs were able to be retrieved
  rb.rsp.add("highlighting", SolrPluginUtils.removeNulls(new
SimpleOrderedMap(arr)));
But as you already know, he forgot to check if(sdoc!=null){.
Is there anything stopping you from contributing the patch, besides
the lack of time, of course?

About the core issue, I can't quite follow it, particularly how using a
disjunction query in place of the ids parameter can help you. Could you please provide
more detailed info like stacktraces, etc. Btw, have you checked trunk for
your case?

On Thu, Apr 12, 2012 at 7:08 PM, Dmitry Kan  wrote:

> Can anyone help me out with this? Is this too complicated / unclear? I
> could share more detail if needed.
>
> On Wed, Apr 11, 2012 at 3:16 PM, Dmitry Kan  wrote:
>
> > Hello,
> >
> > Hopefully this question is not too complex to handle, but I'm currently
> > stuck with it.
> >
> > We have a system with nTiers, that is:
> >
> > Solr front base ---> Solr front --> shards
> >
> > Inside QueryComponent there is a method
> createRetrieveDocs(ResponseBuilder
> > rb) which collects doc ids of each shard and sends them in different
> > queries using the ids parameter:
> >
> > [code]
> > sreq.params.add(ShardParams.IDS, StrUtils.join(ids, ','));
> > [/code]
> >
> > This actually produces NPE (same as in
> > https://issues.apache.org/jira/browse/SOLR-1477) in the first tier,
> > because Solr front (on the second tier) fails to process such a query. I
> > have tried to fix this by using a unique field with a value of ids ORed
> > (the following code substitutes the code above):
> >
> > [code]
> >   StringBuffer idsORed = new StringBuffer();
> >   for (Iterator<String> iterator = ids.iterator(); iterator.hasNext(); ) {
> > String next = iterator.next();
> >
> > if (iterator.hasNext()) {
> >   idsORed.append(next).append(" OR ");
> > } else {
> >   idsORed.append(next);
> > }
> >   }
> >
> >   sreq.params.add(rb.req.getSchema().getUniqueKeyField().getName(),
> > idsORed.toString());
> > [/code]
> >
> > This works perfectly if for rows=n there is n or less hits from a
> > distributed query. However, if there are more than 2*n hits, the querying
> > fails with an NPE in a completely different component, which is
> > HighlightComponent (highlights are requested in the same query with
> >
> hl=true&hl.fragsize=5&hl.requireFieldMatch=true&hl.fl=targetTextField):
> >
> > SEVERE: java.lang.NullPointerException
> > at
> >
> org.apache.solr.handler.component.HighlightComponent.finishStage(HighlightComponent.java:161)
> > at
> >
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295)
> > at
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
> > at
> >
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> > at
> >
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> > at
> >
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> > at
> >
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> > at
> >
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> > at
> >
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> > at
> >
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> > at
> >
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
> > at
> >
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
> > at
> >
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
> > at
> > org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
> > at java.lang.Thread.run(Thread.java:619)
> >
> > It sounds like the ids of documents somehow get shuffled and the
> > instruction (only a hypothesis)
> >
> > [code]
> > ShardDoc sdoc = rb.resultIds.get(id);
> > [/code]
> >
> > returns sdoc=null, which causes the next line of code to fail with an
> NPE:
> >
> > [code]
> > int idx = sdoc.positionInResponse;
> > [/code]
> >
> > Am I missing anything? Can something be done for solving this issue?
> >
> > Thanks.
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
>
>
>
> --
> Regards,
>
> Dmitry Kan
>



-- 
Sincerely yours
Mikhail Khludn
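
For reference, the null guard Mikhail describes would look roughly like this inside
HighlightComponent.finishStage (a fragment assembled from the lines quoted in the
thread, not a complete patch; the surrounding loop is assumed):

// inside HighlightComponent.finishStage(), per returned doc id:
ShardDoc sdoc = rb.resultIds.get(id);
if (sdoc == null) {
  // the id is not in the merged result list; skip it instead of
  // dereferencing null (the later removeNulls call already tolerates
  // missing entries)
  continue;
}
int idx = sdoc.positionInResponse;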

Re: solr 3.4 with nTiers >= 2: usage of ids param causes NullPointerException (NPE)

2012-04-12 Thread Yonik Seeley
On Wed, Apr 11, 2012 at 8:16 AM, Dmitry Kan  wrote:
> We have a system with nTiers, that is:
>
> Solr front base ---> Solr front --> shards

Although the architecture had this in mind (multi-tier), all of the
pieces are not yet in place to allow it.
The errors you see are a direct result of that.

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10


RE: solr 3.5 taking long to index

2012-04-12 Thread Rohit
Thanks for pointing these out, but I still have one concern: why is the
virtual memory usage running at 300GB+?

Regards,
Rohit
Mobile: +91-9901768202
About Me: http://about.me/rohitg


-Original Message-
From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de] 
Sent: 12 April 2012 11:58
To: solr-user@lucene.apache.org
Subject: Re: solr 3.5 taking long to index


There were some changes in solrconfig.xml between solr3.1 and solr3.5.
Always read CHANGES.txt when switching to a new version.
Also helpful is comparing both versions of solrconfig.xml from the examples.

Are you sure you need a MaxPermSize of 5g?
Use jvisualvm to see what you really need.
This is also for all other JAVA_OPTS.



Am 11.04.2012 19:42, schrieb Rohit:
> We recently migrated from solr3.1 to solr3.5,  we have one master and 
> one slave configured. The master has two cores,
> 
>  
> 
> 1) Core1 - 44555972 documents
> 
> 2) Core2 - 29419244 documents
> 
>  
> 
> We commit every 5000 documents, but lately the commit is taking very 
> long 15 minutes plus in some cases. What could have caused this, I 
> have checked the logs and the only warning i can see is,
> 
>  
> 
> "WARNING: Use of deprecated update request parameter update.processor 
> detected. Please use the new parameter update.chain instead, as 
> support for update.processor will be removed in a later version."
> 
>  
> 
> Memory details:
> 
>  
> 
> export JAVA_OPTS="$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g"
> 
>  
> 
> Solr Config:
> 
>  
> 
> false
> 
> 10
> 
> 32
> 
> 
> 
>   1
> 
>   1000
> 
>   1
> 
>  
> 
> What could be causing this, as everything was running fine a few days
back?
> 
>  
> 
>  
> 
> Regards,
> 
> Rohit
> 
> Mobile: +91-9901768202
> 
> About Me:   http://about.me/rohitg
> 
>  
> 
> 




RE: Solr 3.5 takes very long to commit gradually

2012-04-12 Thread Rohit
Thanks for pointing these out, but I still have one concern: why is the
virtual memory usage running at 300GB+?

Regards,
Rohit


-Original Message-
From: Tirthankar Chatterjee [mailto:tchatter...@commvault.com] 
Sent: 12 April 2012 13:43
To: solr-user@lucene.apache.org
Subject: Re: Solr 3.5 takes very long to commit gradually

thanks Rohit.. for the information.
On Apr 12, 2012, at 4:08 AM, Rohit wrote:

> Hi Tirthankar,
> 
> The average size of documents would be a few Kb's this is mostly 
> tweets which are being saved. The two cores are storing different kind 
> of data and nothing else.
> 
> Regards,
> Rohit
> Mobile: +91-9901768202
> About Me: http://about.me/rohitg
> 
> -Original Message-
> From: Tirthankar Chatterjee [mailto:tchatter...@commvault.com]
> Sent: 12 April 2012 13:14
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 3.5 takes very long to commit gradually
> 
> Hi Rohit,
> What would be the average size of your documents and also can you 
> please share your idea of having 2 cores in the master. I just wanted 
> to know the reasoning behind the design.
> 
> Thanks in advance
> 
> Tirthankar
> On Apr 12, 2012, at 3:19 AM, Jan Høydahl wrote:
> 
>> What operating system?
>> Are you using spellchecker with buildOnCommit?
>> Anything special in your Update Chain?
>> 
>> --
>> Jan Høydahl, search solution architect Cominvent AS - 
>> www.cominvent.com Solr Training - www.solrtraining.com
>> 
>> On 12. apr. 2012, at 06:45, Rohit wrote:
>> 
>>> We recently migrated from solr3.1 to solr3.5, we have one master and 
>>> one slave configured. The master has two cores,
>>> 
>>> 1) Core1 - 44555972 documents
>>> 
>>> 2) Core2 - 29419244 documents
>>> 
>>> We commit every 5000 documents, but lately the commit time gradually 
>>> increase and solr is taking as very long 15 minutes plus in some 
>>> cases. What could have caused this, I have checked the logs and the 
>>> only warning i can see is,
>>> 
>>> "WARNING: Use of deprecated update request parameter 
>>> update.processor detected. Please use the new parameter update.chain 
>>> instead, as support for update.processor will be removed in a later
version."
>>> 
>>> Memory details:
>>> 
>>> export JAVA_OPTS="$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g"
>>> 
>>> Solr Config:
>>> 
>>> false
>>> 
>>> 10
>>> 
>>> 32
>>> 
>>> 
>>> 
>>> 1
>>> 
>>> 1000
>>> 
>>> 1
>>> 
>>> Also noticed, that top command show almost 350GB of Virtual memory
usage.
>>> 
>>> What could be causing this, as everything was running fine a few 
>>> days
> back?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Regards,
>>> 
>>> Rohit
>>> 
>>> Mobile: +91-9901768202
>>> 
>>> About Me:   http://about.me/rohitg
>>> 
>>> 
>>> 
>> 
> 
> **Legal Disclaimer***
> "This communication may contain confidential and privileged material 
> for the sole use of the intended recipient. Any unauthorized review, 
> use or distribution by others is strictly prohibited. If you have 
> received the message in error, please advise the sender by reply email 
> and delete the message. Thank you."
> *
> 
> 

**Legal Disclaimer***
"This communication may contain confidential and privileged material for the
sole use of the intended recipient. Any unauthorized review, use or
distribution by others is strictly prohibited. If you have received the
message in error, please advise the sender by reply email and delete the
message. Thank you."
*




Re: term frequency outweighs exact phrase match

2012-04-12 Thread alxsss
In that case documents 1 and 2 will not be in the results. We need them to also be
shown in the results, but ranked after the docs with the exact match.
I think omitting term frequency when calculating the ranking for phrase queries would
solve this issue, but I do not see such a parameter in the configs.
I see omitTermFreqAndPositions="true" but am not sure if it is the setting I need,
because its description is too vague.

Thanks.
Alex.


 

 

 

-Original Message-
From: Erick Erickson 
To: solr-user 
Sent: Wed, Apr 11, 2012 8:23 am
Subject: Re: term frequency outweighs exact phrase match


Consider boosting on phrase with a SHOULD clause, something
like field:"apache solr"^2..

Best
Erick


On Tue, Apr 10, 2012 at 12:46 PM,   wrote:
> Hello,
>
> I use solr 3.5 with edismax. I have the following issue with phrase search. 
For example if I have three documents with content like
>
> 1.apache apache
> 2. solr solr
> 3.apache solr
>
> then search for apache solr displays documents in the order 1, 2, 3 instead of
> 3, 2, 1 because term frequency in the first and second documents is higher than
> in the third document. We want results to be displayed in the order 3, 2, 1
> since the third document has the exact match.
>
> My request handler is as follows.
>
> 
> 
> edismax
> explicit
> 0.01
> host^30  content^0.5 title^1.2
> host^30  content^20 title^22 
> url,id, site ,title
> 2<-1 5<-2 6<90%
> 1
> true
> *:*
> content
> 0
> 165
> title
> 0
> url
> regex
> true
> true
> 5
> true
> site
> true
> 
> 
>  spellcheck
> 
> 
>
> Any ideas how to fix this issue?
>
> Thanks in advance.
> Alex.
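
One concrete way to do what Erick suggests with edismax is a boost query (bq) built
by the application from the user's input; a rough sketch (the field name and boost
value are just illustrative):

  ...&defType=edismax&q=apache solr&qf=content&bq=content:"apache solr"^10

Because bq is an optional (SHOULD) clause, documents 1 and 2 still match on the
individual terms, but document 3 also matches the quoted phrase and gets the extra
boost, so it should come out on top.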

 


Re: solr 3.4 with nTiers >= 2: usage of ids param causes NullPointerException (NPE)

2012-04-12 Thread Dmitry Kan
Mikhail,

Thanks for sharing your thoughts. Yes, I have tried checking for NULL and
the entire chain of queries between tiers seems to work. But I suspect
that some docs will be missing. In principle, unless there is an
OutOfMemory or a shard down, the doc ids should be retrieving valid
documents. So this is just a design limitation, as Yonik pointed out.

I would be willing to contribute a patch, it is just a matter of
understanding what exactly should be fixed in the architecture, and I
suspect it isn't a small change...

Dmitry

On Thu, Apr 12, 2012 at 9:22 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Dmitry,
>
> The last NPE in HighlightComponent is just a sad coding issue.
> A few rows later we can see that the developer expected to have some docs not
> found:
> // remove nulls in case not all docs were able to be retrieved
>  rb.rsp.add("highlighting", SolrPluginUtils.removeNulls(new
> SimpleOrderedMap(arr)));
> But as you already know, he forgot to check if(sdoc!=null){.
> Is there anything stopping you from contributing the patch, besides the
> lack of time, of course?
>
> About the core issue, I can't get into it and, particularly, how using a
> disjunction query in place of IDS can help you. Could you please provide
> more detailed info like stack traces, etc. Btw, have you checked trunk for
> your case?
>
> On Thu, Apr 12, 2012 at 7:08 PM, Dmitry Kan  wrote:
>
> > Can anyone help me out with this? Is this too complicated / unclear? I
> > could share more detail if needed.
> >
> > On Wed, Apr 11, 2012 at 3:16 PM, Dmitry Kan 
> wrote:
> >
> > > Hello,
> > >
> > > Hopefully this question is not too complex to handle, but I'm currently
> > > stuck with it.
> > >
> > > We have a system with nTiers, that is:
> > >
> > > Solr front base ---> Solr front --> shards
> > >
> > > Inside QueryComponent there is a method
> > createRetrieveDocs(ResponseBuilder
> > > rb) which collects doc ids of each shard and sends them in different
> > > queries using the ids parameter:
> > >
> > > [code]
> > > sreq.params.add(ShardParams.IDS, StrUtils.join(ids, ','));
> > > [/code]
> > >
> > > This actually produces NPE (same as in
> > > https://issues.apache.org/jira/browse/SOLR-1477) in the first tier,
> > > because Solr front (on the second tier) fails to process such a query.
> I
> > > have tried to fix this by using a unique field with a value of ids ORed
> > > (the following code substitutes the code above):
> > >
> > > [code]
> > >   StringBuffer idsORed = new StringBuffer();
> > >   for (Iterator iterator = ids.iterator();
> > iterator.hasNext();
> > > ) {
> > > String next = iterator.next();
> > >
> > > if (iterator.hasNext()) {
> > >   idsORed.append(next).append(" OR ");
> > > } else {
> > >   idsORed.append(next);
> > > }
> > >   }
> > >
> > >   sreq.params.add(rb.req.getSchema().getUniqueKeyField().getName(),
> > > idsORed.toString());
> > > [/code]
> > >
> > > This works perfectly if for rows=n there is n or less hits from a
> > > distributed query. However, if there are more than 2*n hits, the
> querying
> > > fails with an NPE in a completely different component, which is
> > > HighlightComponent (highlights are requested in the same query with
> > >
> >
> hl=true&hl.fragsize=5&hl.requireFieldMatch=true&hl.fl=targetTextField):
> > >
> > > SEVERE: java.lang.NullPointerException
> > > at
> > >
> >
> org.apache.solr.handler.component.HighlightComponent.finishStage(HighlightComponent.java:161)
> > > at
> > >
> >
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295)
> > > at
> > >
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
> > > at
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
> > > at
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
> > > at
> > >
> >
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> > > at
> > >
> >
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> > > at
> > >
> >
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> > > at
> > >
> >
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> > > at
> > >
> >
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> > > at
> > >
> >
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> > > at
> > >
> >
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> > > at
> > >
> >
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
> > > at
> > >
> >
> org

Wildcard searching

2012-04-12 Thread Kissue Kissue
Hi,

I am using the edismax query handler with solr 3.5. From the Solr admin
interface when i do a wildcard search with the string: edge*, all documents
are returned with exactly the same score. When i do the same search from my
application using SolrJ to the same solr instance, only a few documents
have the same maximum score and all the rest have the minimum score. I was
expecting all to have the same score just like in the Solr Admin.

Any pointers why this is happening?

Thanks.


Re: solr 3.4 with nTiers >= 2: usage of ids param causes NullPointerException (NPE)

2012-04-12 Thread Dmitry Kan
Thanks Yonik,

This is what I expected. How big the change would be, if I'd start just
with Query and Highlight components? Did the change to QueryComponent I
made make any sense to you? It would of course mean a custom solution,
which I'm willing to contribute as a patch (in case anyone interested). To
make it part of a releasable trunk, one would most probably need to provide
some way to configure "1st tier level".

Thanks,

Dmitry

On Thu, Apr 12, 2012 at 9:34 PM, Yonik Seeley wrote:

> On Wed, Apr 11, 2012 at 8:16 AM, Dmitry Kan  wrote:
> > We have a system with nTiers, that is:
> >
> > Solr front base ---> Solr front --> shards
>
> Although the architecture had this in mind (multi-tier), all of the
> pieces are not yet in place to allow it.
> The errors you see are a direct result of that.
>
> -Yonik
> lucenerevolution.com - Lucene/Solr Open Source Search Conference.
> Boston May 7-10
>



-- 
Regards,

Dmitry Kan


Re: I've broken delete in SolrCloud and I'm a bit clueless as to how

2012-04-12 Thread Mark Miller
I think someone already made a JIRA issue like that. I think Yonik might
have had an opinion about it that I cannot remember right now.

On Thu, Apr 12, 2012 at 2:21 PM, Chris Hostetter
wrote:

>
> : Please see the documentation:
> http://wiki.apache.org/solr/SolrCloud#Required_Config
> :
> : schema.xml
> :
> : You must have a _version_ field defined:
> :
> : 
>
> Seems like this is the kind of thing that should make Solr fail hard and
> fast on SolrCore init if it sees you are running in cloud mode and yet it
> doesn't find this -- similar to how some other features fail hard and fast
> if you don't have uniqueKey.
>
>
> -Hoss
>



-- 
- Mark

http://www.lucidimagination.com
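
The schema.xml line the wiki calls for is just a long field; a minimal sketch
(exact attributes may vary between versions):

  <field name="_version_" type="long" indexed="true" stored="true"/>

That field is where Solr keeps the per-document version used for update forwarding
and optimistic locking, which is why the cloud update path breaks without it.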


Re: Wildcard searching

2012-04-12 Thread Kissue Kissue
Correction: this difference between Solr admin scores and SolrJ scores
happens with leading wildcard queries, e.g. *edge


On Thu, Apr 12, 2012 at 8:13 PM, Kissue Kissue  wrote:

> Hi,
>
> I am using the edismax query handler with solr 3.5. From the Solr admin
> interface when i do a wildcard search with the string: edge*, all documents
> are returned with exactly the same score. When i do the same search from my
> application using SolrJ to the same solr instance, only a few documents
> have the same maximum score and all the rest have the minimum score. I was
> expecting all to have the same score just like in the Solr Admin.
>
> Any pointers why this is happening?
>
> Thanks.
>


Re: is there a downside to combining search fields with copyfield?

2012-04-12 Thread geeky2

>>
You end up with one multivalued field, which means that you can only
have one analyzer chain.
<<

actually two of the three fields being considered for combination into a
single field ARE multivalued fields.

would this be an issue?

>>
  With separate fields, each field can be
analyzed differently.  Also, if you are indexing and/or storing the
individual fields, you may have data duplication in your index, making
it larger and increasing your disk/RAM requirements.
<<

this makes sense


>>
  That field will
have a higher termcount than the individual fields, which means that
searches against it will naturally be just a little bit slower.
<<

ok

>>
  Your
application will not have to do as much work to construct a query, though.
<<

actually this is the primary reason this came up.  

>>
If you are already planning to use dismax/edismax, then you don't need
the overhead of a copyField.  You can simply provide access to (e)dismax
search with the qf (and possibly pf) parameters predefined, or your
application can provide these parameters.

http://wiki.apache.org/solr/ExtendedDisMax
<<

can you elaborate on this and how EDisMax would preclude the need for
copyfield?

i am using extended dismax now in my response handlers.

here is an example of one of my requestHandlers

  

  edismax
  all
  5
  itemNo^1.0
  *:*


  itemType:1
  rankNo asc, score desc


  false

  






Thanks,
Shawn 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/is-there-a-downside-to-combining-search-fields-with-copyfield-tp3905349p3906265.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Suggester not working for digit starting terms

2012-04-12 Thread jmlucjav
Well now I am really lost...

1. yes I want to suggest whole sentences too, I want the tokenizer to be
taken into account, and apparently it is working for me in 3.5.0?? I get
suggestions that are like "foo bar abc".  Maybe what you mention is only for
file based dictionaries? I am using the field itself.

2. But for the digit issue, in that case nothing is suggested, not even the
term 500 that is there, because I can find it with this query:
http://localhost:8983/solr/select/?q={!prefix f=a_suggest}500 

I tried setting the threshold to 0 in case the term was being removed, and it is not
that.

Moving to 3.6.0 is not a problem (I had already downloaded the rc actually)
but I still see weird things here.

xab

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Suggester-not-working-for-digit-starting-terms-tp3893433p3906303.html
Sent from the Solr - User mailing list archive at Nabble.com.


searching across multiple fields using edismax - am i setting this up right?

2012-04-12 Thread geeky2
hello all,

i just want to check to make sure i have this right.

i was reading on this page: http://wiki.apache.org/solr/ExtendedDisMax,
thanks to shawn for educating me.

*i want the user to be able to fire a requestHandler but search across
multiple fields (itemNo, productType and brand) WITHOUT them having to
specify in the query url what fields they want / need to search on*

this is what i have in my request handler


  

  edismax
  all
  5
  *itemNo^1.0 productType^.8 brand^.5*
  *:*


  rankNo asc, score desc


  false

  

this would be an example of a single term search going against all three of
the fields

http://bogus:bogus/somecore/select?qt=partItemNoSearch&q=*dishwasher*&debugQuery=on&rows=100

this would be an example of a multiple term search across all three of the
fields

http://bogus:bogus/somecore/select?qt=partItemNoSearch&q=*dishwasher
123-xyz*&debugQuery=on&rows=100


do i understand this correctly?

thank you,
mark




--
View this message in context: 
http://lucene.472066.n3.nabble.com/searching-across-multiple-fields-using-edismax-am-i-setting-this-up-right-tp3906334p3906334.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Responding to Requests with Chunks/Streaming

2012-04-12 Thread Mikhail Khludnev
Hello Developers,

I just want to ask: don't you think that response streaming can be useful
for things like OLAP? E.g., if you have a sharded index presorted and
pre-joined the BJQ way, you can calculate counts in many cube cells in
parallel.
The essential distributed test for response streaming just passed:
https://github.com/m-khl/solr-patches/blob/ec4db7c0422a5515392a7019c5bd23ad3f546e4b/solr/core/src/test/org/apache/solr/response/RespStreamDistributedTest.java

branch is https://github.com/m-khl/solr-patches/tree/streaming

Regards

On Mon, Apr 2, 2012 at 10:55 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

>
> Hello,
>
> Small update - reading streamed response is done via callback. No
> SolrDocumentList in memory.
> https://github.com/m-khl/solr-patches/tree/streaming
> here is the test
> https://github.com/m-khl/solr-patches/blob/d028d4fabe0c20cb23f16098637e2961e9e2366e/solr/core/src/test/org/apache/solr/response/ResponseStreamingTest.java#L138
>
> no progress in distributed search via streaming yet.
>
> Pls let me know if you don't want to have updates from my playground.
>
> Regards
>
>
> On Thu, Mar 29, 2012 at 1:02 PM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
>> @All
>> Why nobody desires such a pretty cool feature?
>>
>> Nicholas,
>> I have a tiny progress: I'm able to stream in javabin codec format while
>> searching, It implies sorting by _docid_
>>
>> here is the diff
>>
>> https://github.com/m-khl/solr-patches/commit/2f9ff068c379b3008bb983d0df69dff714ddde95
>>
>> The current issue is that reading response by SolrJ is done as whole.
> >> Reading by callback is supported by EmbeddedServer only. Anyway it should
> >> not be a big deal. ResponseStreamingTest.java somehow works.
>> I'm stuck on introducing response streaming in distributes search, it's
>> actually more challenging  - RespStreamDistributedTest fails
>>
>> Regards
>>
>>
>> On Fri, Mar 16, 2012 at 3:51 PM, Nicholas Ball > > wrote:
>>
>>>
>>> Mikhail & Ludovic,
>>>
>>> Thanks for both your replies, very helpful indeed!
>>>
>>> Ludovic, I was actually looking into just that and did some tests with
>>> SolrJ, it does work well but needs some changes on the Solr server if we
>>> want to send out individual documents a various times. This could be done
>>> with a write() and flush() to the FastOutputStream (daos) in
>>> JavBinCodec. I
>>> therefore think that a combination of this and Mikhail's solution would
>>> work best!
>>>
>>> Mikhail, you mention that your solution doesn't currently work and not
>>> sure why this is the case, but could it be that you haven't flushed the
>>> data (os.flush()) you've written in the collect method of
>>> DocSetStreamer? I
>>> think placing the output stream into the SolrQueryRequest is the way to
>>> go,
>>> so that we can access it and write to it how we intend. However, I think
>>> using the JavaBinCodec would be ideal so that we can work with SolrJ
>>> directly, and not mess around with the encoding of the docs/data etc...
>>>
>>> At the moment the entry point to JavaBinCodec is through the
>>> BinaryResponseWriter which calls the highest level marshal() method which
>>> decodes and sends out the entire SolrQueryResponse (line 49 @
>>> BinaryResponseWriter). What would be ideal is to be able to break up the
>>> response and call the JavaBinCodec for pieces of it with a flush after
>>> each
>>> call. Did a few tests with a simple Thread.sleep and a flush to see if
>>> this
>>> would actually work and looks like it's working out perfectly. Just
>>> trying
>>> to figure out the best way to actually do it now :) any ideas?
>>>
>>> An another note, for a solution to work with the chunked transfer
>>> encoding
>>> (and therefore web browsers), a lot more development is going to be
>>> needed.
>>> Not sure if it's worth trying yet but might look into it later down the
>>> line.
>>>
>>> Nick
>>>
>>> On Fri, 16 Mar 2012 07:29:20 +0300, Mikhail Khludnev
>>>  wrote:
>>> > Ludovic,
>>> >
>>> > I looked through. First of all, it seems to me you don't amend regular
>>> > "servlet" solr server, but the only embedded one.
>>> > Anyway, the difference is that you stream DocList via callback, but it
>>> > means that you've instantiated it in memory and keep it there until it
>>> will
>>> > be completely consumed. Think about a billion numfound. Core idea of my
>>> > approach is keep almost zero memory for response.
>>> >
>>> > Regards
>>> >
>>> > On Fri, Mar 16, 2012 at 12:12 AM, lboutros  wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> I was looking for something similar.
>>> >>
>>> >> I tried this patch :
>>> >>
>>> >> https://issues.apache.org/jira/browse/SOLR-2112
>>> >>
>>> >> it's working quite well (I've back-ported the code in Solr 3.5.0...).
>>> >>
>>> >> Is it really different from what you are trying to achieve ?
>>> >>
>>> >> Ludovic.
>>> >>
>>> >> -
>>> >> Jouve
>>> >> France.
>>> >> --
>>> >> View this message in context:
>>> >>
>>>
>>> http://lucene.472066.n3.nabble.com/Responding-to-Requests-

Re: I've broken delete in SolrCloud and I'm a bit clueless as to how

2012-04-12 Thread Yonik Seeley
On Thu, Apr 12, 2012 at 2:21 PM, Chris Hostetter
 wrote:
>
> : Please see the documentation: 
> http://wiki.apache.org/solr/SolrCloud#Required_Config
> :
>
> : schema.xml
> :
> : You must have a _version_ field defined:
> :
> : 
>
> Seems like this is the kind of thing that should make Solr fail hard and
> fast on SolrCore init if it sees you are running in cloud mode and yet it
> doesn't find this -- similar to how some other features fail hard and fast
> if you don't have uniqueKey.

Off the top of my head:
_version_ is needed for solr cloud where a leader forwards updates to
replicas, unless you're handling update distribution yourself or
providing pre-built shards.
_version_ is needed for realtime-get and optimistic locking

We should document for sure... but at this point it's not clear what
we should enforce. (not saying we shouldn't enforce anything... just
that I haven't really thought about it)

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10


[ANNOUNCE] Apache Solr 3.6 released

2012-04-12 Thread Robert Muir
12 April 2012, Apache Solr™ 3.6.0 available
The Lucene PMC is pleased to announce the release of Apache Solr 3.6.0.

Solr is the popular, blazing fast open source enterprise search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search, dynamic clustering, database
integration, rich document (e.g., Word, PDF) handling, and geospatial search.
Solr is highly scalable, providing distributed search and index replication,
and it powers the search and navigation features of many of the world's
largest internet sites.

This release contains numerous bug fixes, optimizations, and
improvements, some of which are highlighted below.  The release
is available for immediate download at:
   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html (see
note below).

See the CHANGES.txt file included with the release for a full list of
details.

Solr 3.6.0 Release Highlights:

 * New SolrJ client connector using Apache Http Components http client
   (SOLR-2020)

 * Many analyzer factories are now "multi term query aware" allowing for things
   like field type aware lowercasing when building prefix & wildcard queries.
   (SOLR-2438)

 * New Kuromoji morphological analyzer tokenizes Japanese text, producing
   both compound words and their segmentation. (SOLR-3056)

 * Range Faceting (Dates & Numbers) is now supported in distributed search
   (SOLR-1709)

 * HTMLStripCharFilter has been completely re-implemented, fixing many bugs
   and greatly improving the performance (LUCENE-3690)

 * StreamingUpdateSolrServer now supports the javabin format (SOLR-1565)

 * New LFU Cache option for use in Solr's internal caches. (SOLR-2906)

 * Memory performance improvements to all FST based suggesters (SOLR-2888)

 * New WFSTLookupFactory suggester supports finer-grained ranking for
   suggestions. (LUCENE-3714)

 * New options for configuring the amount of concurrency used in distributed
   searches (SOLR-3221)

 * Many bug fixes

Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases.  It is possible that the mirror you are using may not
have replicated the release yet.  If that is the case, please try another
mirror.  This also goes for Maven access.

Happy searching,

Lucene/Solr developers


Re: I've broken delete in SolrCloud and I'm a bit clueless as to how

2012-04-12 Thread Chris Hostetter

: Off the top of my head:
: _version_ is needed for solr cloud where a leader forwards updates to
: replicas, unless you're handing update distribution yourself or
: providing pre-built shards.
: _version_ is needed for realtime-get and optimistic locking
: 
: We should document for sure... but at this point it's not clear what
: we should enforce. (not saying we shouldn't enforce anything... just
: that I haven't really thought about it)

well ... it may eventually make sense to globally enforce it for 
consistency, but in the meantime the individual components that depend on 
it can certainly enforce it (just like my uniqueKey example; the search 
components that require it check for themselves on init and fail fast)

(ie: sounds like the RealTimeGetHandler and the existing 
DistributedUpdateProcessor should fail fast on init if the schema doesn't 
have it)


-Hoss


RE: [ANNOUNCE] Apache Solr 3.6 released

2012-04-12 Thread Robert Petersen
I think this page needs updating...  it says it's not out yet.  

https://wiki.apache.org/solr/Solr3.6


-Original Message-
From: Robert Muir [mailto:rm...@apache.org] 
Sent: Thursday, April 12, 2012 1:33 PM
To: d...@lucene.apache.org; solr-user@lucene.apache.org; Lucene mailing list; 
announce
Subject: [ANNOUNCE] Apache Solr 3.6 released

12 April 2012, Apache Solr™ 3.6.0 available
The Lucene PMC is pleased to announce the release of Apache Solr 3.6.0.

Solr is the popular, blazing fast open source enterprise search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search, dynamic clustering, database
integration, rich document (e.g., Word, PDF) handling, and geospatial search.
Solr is highly scalable, providing distributed search and index replication,
and it powers the search and navigation features of many of the world's
largest internet sites.

This release contains numerous bug fixes, optimizations, and
improvements, some of which are highlighted below.  The release
is available for immediate download at:
   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html (see
note below).

See the CHANGES.txt file included with the release for a full list of
details.

Solr 3.6.0 Release Highlights:

 * New SolrJ client connector using Apache Http Components http client
   (SOLR-2020)

 * Many analyzer factories are now "multi term query aware" allowing for things
   like field type aware lowercasing when building prefix & wildcard queries.
   (SOLR-2438)

 * New Kuromoji morphological analyzer tokenizes Japanese text, producing
   both compound words and their segmentation. (SOLR-3056)

 * Range Faceting (Dates & Numbers) is now supported in distributed search
   (SOLR-1709)

 * HTMLStripCharFilter has been completely re-implemented, fixing many bugs
   and greatly improving the performance (LUCENE-3690)

 * StreamingUpdateSolrServer now supports the javabin format (SOLR-1565)

 * New LFU Cache option for use in Solr's internal caches. (SOLR-2906)

 * Memory performance improvements to all FST based suggesters (SOLR-2888)

 * New WFSTLookupFactory suggester supports finer-grained ranking for
   suggestions. (LUCENE-3714)

 * New options for configuring the amount of concurrency used in distributed
   searches (SOLR-3221)

 * Many bug fixes

Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases.  It is possible that the mirror you are using may not
have replicated the release yet.  If that is the case, please try another
mirror.  This also goes for Maven access.

Happy searching,

Lucene/Solr developers


Re: [ANNOUNCE] Apache Solr 3.6 released

2012-04-12 Thread Robert Muir
Hi,

Just edit it! It's a wiki page anyone can edit! There are probably
other out-of-date ones too.

On Thu, Apr 12, 2012 at 5:57 PM, Robert Petersen  wrote:
> I think this page needs updating...  it says it's not out yet.
>
> https://wiki.apache.org/solr/Solr3.6
>
>
> -Original Message-
> From: Robert Muir [mailto:rm...@apache.org]
> Sent: Thursday, April 12, 2012 1:33 PM
> To: d...@lucene.apache.org; solr-user@lucene.apache.org; Lucene mailing list; 
> announce
> Subject: [ANNOUNCE] Apache Solr 3.6 released
>
> 12 April 2012, Apache Solr™ 3.6.0 available
> The Lucene PMC is pleased to announce the release of Apache Solr 3.6.0.
>
> Solr is the popular, blazing fast open source enterprise search platform from
> the Apache Lucene project. Its major features include powerful full-text
> search, hit highlighting, faceted search, dynamic clustering, database
> integration, rich document (e.g., Word, PDF) handling, and geospatial search.
> Solr is highly scalable, providing distributed search and index replication,
> and it powers the search and navigation features of many of the world's
> largest internet sites.
>
> This release contains numerous bug fixes, optimizations, and
> improvements, some of which are highlighted below.  The release
> is available for immediate download at:
>   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html (see
> note below).
>
> See the CHANGES.txt file included with the release for a full list of
> details.
>
> Solr 3.6.0 Release Highlights:
>
>  * New SolrJ client connector using Apache Http Components http client
>   (SOLR-2020)
>
>  * Many analyzer factories are now "multi term query aware" allowing for 
> things
>   like field type aware lowercasing when building prefix & wildcard queries.
>   (SOLR-2438)
>
>  * New Kuromoji morphological analyzer tokenizes Japanese text, producing
>   both compound words and their segmentation. (SOLR-3056)
>
>  * Range Faceting (Dates & Numbers) is now supported in distributed search
>   (SOLR-1709)
>
>  * HTMLStripCharFilter has been completely re-implemented, fixing many bugs
>   and greatly improving the performance (LUCENE-3690)
>
>  * StreamingUpdateSolrServer now supports the javabin format (SOLR-1565)
>
>  * New LFU Cache option for use in Solr's internal caches. (SOLR-2906)
>
>  * Memory performance improvements to all FST based suggesters (SOLR-2888)
>
>  * New WFSTLookupFactory suggester supports finer-grained ranking for
>   suggestions. (LUCENE-3714)
>
>  * New options for configuring the amount of concurrency used in distributed
>   searches (SOLR-3221)
>
>  * Many bug fixes
>
> Note: The Apache Software Foundation uses an extensive mirroring network for
> distributing releases.  It is possible that the mirror you are using may not
> have replicated the release yet.  If that is the case, please try another
> mirror.  This also goes for Maven access.
>
> Happy searching,
>
> Lucene/Solr developers



-- 
lucidimagination.com


Re: I've broken delete in SolrCloud and I'm a bit clueless as to how

2012-04-12 Thread Benson Margulies
I'm probably confused, but it seems to me that the case I hit does not
meet any of Yonik's criteria.

I have no replicas. I'm running SolrCloud in the simple mode where
each doc ends up in exactly one place.

I think that it's just a bug that the code refuses to do the local
deletion when there's no version info.

However, if I am confused, it sure seems like a candidate for the 'at
least throw instead of failing silently' policy.


Re: codecs for sorted indexes

2012-04-12 Thread Carlos Gonzalez-Cadenas
Hello Michael,

Yes, we are pre-sorting the documents before adding them to the index. We
have a score associated to every document (not an IR score but a
document-related score that reflects its "importance"). Therefore, the
document with the biggest score will have the lowest docid (we add it first
to the index). We do this in order to apply early termination effectively.
With the current codec, we haven't seen much of a difference in terms of
space when we have the index sorted vs not sorted.

So, the question would be: if we force the docids to be sorted, what is the
best way to encode them? We don't really care if the codec doesn't work
for cases where the documents are not sorted (i.e. if it throws an
exception if documents are not ordered when creating the index). Our idea
here is that it may be possible to trade off generality but achieve very
significant improvements for the specific case.

Would something along the lines of RLE coding work? i.e. if we have to
store docids 1 to 1500, we can represent it as "1::1499" (it would be 2
ints to represent 1500 docids).

Thanks a lot for your help,
Carlos
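
To make the run-length idea concrete, here is a toy sketch (plain Java, not Lucene
codec code) of collapsing a sorted docid list into (start, length) runs:

  import java.util.ArrayList;
  import java.util.List;

  public class RunLengthDocIds {
      // Encode a strictly increasing list of docids as (start, length) runs.
      static List<int[]> encode(int[] sortedDocIds) {
          List<int[]> runs = new ArrayList<int[]>();
          int i = 0;
          while (i < sortedDocIds.length) {
              int start = sortedDocIds[i];
              int len = 1;
              // extend the run while the docids stay consecutive
              while (i + len < sortedDocIds.length
                      && sortedDocIds[i + len] == start + len) {
                  len++;
              }
              runs.add(new int[] { start, len });
              i += len;
          }
          return runs;
      }

      public static void main(String[] args) {
          int[] ids = new int[1500];
          for (int d = 0; d < 1500; d++) ids[d] = d + 1; // docids 1..1500
          // prints "1 run(s)": two ints standing in for 1500 docids
          System.out.println(encode(ids).size() + " run(s)");
      }
  }

The "1::1499" example above corresponds to a single run here; a real postings list
would mix long runs with isolated docids, so the savings depend on how often
consecutive documents actually share a term.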

On Thu, Apr 12, 2012 at 6:19 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Do you mean you are pre-sorting the documents (by what criteria?)
> yourself, before adding them to the index?



> In which case... you should already be seeing some benefits (smaller
> index size) than had you "randomly" added them (ie the vInts should
> take fewer bytes), I think.  (Probably the savings would be greater
> for better intblock codecs like PForDelta, SimpleX, but I'm not
> sure...).
>
> Or do you mean having a codec re-sort the documents (on flush/merge)?
> I think this should be possible w/ the Codec API... but nobody has
> tried it yet that I know of.
>
> Note that the bulkpostings branch is effectively dead (nobody is
> iterating on it, and we've removed the old bulk API from trunk), but
> there is likely a GSoC project to add a PForDelta codec to trunk:
>
>https://issues.apache.org/jira/browse/LUCENE-3892
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
>
> On Thu, Apr 12, 2012 at 6:13 AM, Carlos Gonzalez-Cadenas
>  wrote:
> > Hello,
> >
> > We're using a sorted index in order to implement early termination
> > efficiently over an index of hundreds of millions of documents. As of
> now,
> > we're using the default codecs coming with Lucene 4, but we believe that
> > due to the fact that the docids are sorted, we should be able to do much
> > better in terms of storage and achieve much better performance,
> especially
> > decompression performance.
> >
> > In particular, Robert Muir is commenting on these lines here:
> >
> >
> https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411
> >
> > We're aware that the in the bulkpostings branch there are different
> codecs
> > being implemented and different experiments being done. We don't know
> > whether we should implement our own codec (i.e. using some RLE-like
> > techniques) or we should use one of the codecs implemented there (PFOR,
> > Simple64, ...).
> >
> > Can you please give us some advice on this?
> >
> > Thanks
> > Carlos
> >
> > Carlos Gonzalez-Cadenas
> > CEO, ExperienceOn - New generation search
> > http://www.experienceon.com
> >
> > Mobile: +34 652 911 201
> > Skype: carlosgonzalezcadenas
> > LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
>


Re: I've broken delete in SolrCloud and I'm a bit clueless as to how

2012-04-12 Thread Benson Margulies
On Thu, Apr 12, 2012 at 2:14 PM, Mark Miller  wrote:
> google must not have found it - i put that in a month or so ago I believe -
> at least weeks. As you can see, there is still a bit to fill in, but it
> covers the high level. I'd like to add example snippets for the rest soon.

Mark, is it all true? I don't have an update log or a replication
handler, and neither does the default, and it all works fine in the
simple case from the top of that wiki page.


Re: is there a downside to combining search fields with copyfield?

2012-04-12 Thread Shawn Heisey

On 4/12/2012 1:37 PM, geeky2 wrote:

can you elaborate on this and how EDisMax would preclude the need for
copyfield?

i am using extended dismax now in my response handlers.

here is an example of one of my requestHandlers

   
 
   edismax
   all
   5
   itemNo^1.0
   *:*
 
 
   itemType:1
   rankNo asc, score desc
 
 
   false
 
   


I'm not sure whether or not you can use a multiValued field as the 
source for copyField.  This is the sort of thing that the devs tend to 
think of, so my initial thought would be that it should work, though I 
would definitely test it to be absolutely sure.


Your request handler above has qf set to include the field called 
itemNo.  If you made another that had the following in it, you could do 
without a copyField, by using that request handler.  You would want to 
customize the field boosts:


<str name="qf">brand^2.0 category^3.0 partno</str>

To really leverage edismax, assuming that you are using a tokenizer that 
splits any of these fields into multiple tokens, and that you want to 
use relevancy ranking, you might want to consider defining pf as well.


Some observations about your handler above... you are free to ignore 
this: I believe that you don't really need the ^1.0 that's in qf, 
because there's only one field, and 1.0 is the default boost.  Also, 
from what I can tell, because you are only using one qf field and are 
not using any of the dismax-specific goodies like pf or mm, you don't 
really need edismax at all here.  If I'm right, to remove edismax, just 
specify itemNo as the value for the df parameter (default field) and 
remove the defType.  The q.alt parameter might also need to come out.


Solr 3.6 (should be released soon) has deprecated the defaultSearchField 
and defaultOperator parameters in schema.xml, the df and q.op handler 
parameters are the replacement.  This will be enforced in Solr 4.0.


http://wiki.apache.org/solr/SearchHandler#Query_Params

Thanks,
Shawn



Re: Solr Scoring

2012-04-12 Thread Erick Erickson
No, I don't think there's an OOB way to make this happen. It's
a recurring theme, "make exact matches score higher than
stemmed matches".

Best
Erick

On Thu, Apr 12, 2012 at 5:18 AM, Kissue Kissue  wrote:
> Hi,
>
> I have a field in my index called itemDesc which i am applying
> EnglishMinimalStemFilterFactory to. So if i index a value to this field
> containing "Edges", the EnglishMinimalStemFilterFactory applies stemming
> and "Edges" becomes "Edge". Now when i search for "Edges", documents with
> "Edge" score better than documents with the actual search word - "Edges".
> Is there a way i can make documents with the actual search word in this
> case "Edges" score better than document with "Edge"?
>
> I am using Solr 3.5. My field definition is shown below:
>
> 
>      
>        
>                synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>                             ignoreCase="true"
>                words="stopwords_en.txt"
>                enablePositionIncrements="true"
>             
>    
>        
>      
>      
>        
>         ignoreCase="true" expand="true"/>
>                        ignoreCase="true"
>                words="stopwords_en.txt"
>                enablePositionIncrements="true"
>                />
>        
>    
>         protected="protwords.txt"/>
>        
>      
>    
>
> Thanks.


Re: Solr Scoring

2012-04-12 Thread Walter Underwood
It is easy. Create two fields, text_exact and text_stem. Don't use the stemmer 
in the first chain, do use the stemmer in the second. Give the text_exact a 
bigger weight than text_stem.

wunder
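
A sketch of what that can look like for the itemDesc case from the original question,
assuming two hypothetical field types, text_exact (no stemmer) and text_stem (with
EnglishMinimalStemFilterFactory):

  <field name="itemDescExact" type="text_exact" indexed="true" stored="false"/>
  <field name="itemDescStem"  type="text_stem"  indexed="true" stored="false"/>
  <copyField source="itemDesc" dest="itemDescExact"/>
  <copyField source="itemDesc" dest="itemDescStem"/>

and then weight the exact field higher at query time, e.g. with edismax:

  <str name="qf">itemDescExact^4 itemDescStem</str>

A query for "Edges" matches both fields for documents that contain "Edges", but only
the stemmed field for documents that contain "Edge", so the exact matches end up
scoring higher.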

On Apr 12, 2012, at 4:34 PM, Erick Erickson wrote:

> No, I don't think there's an OOB way to make this happen. It's
> a recurring theme, "make exact matches score higher than
> stemmed matches".
> 
> Best
> Erick
> 
> On Thu, Apr 12, 2012 at 5:18 AM, Kissue Kissue  wrote:
>> Hi,
>> 
>> I have a field in my index called itemDesc which i am applying
>> EnglishMinimalStemFilterFactory to. So if i index a value to this field
>> containing "Edges", the EnglishMinimalStemFilterFactory applies stemming
>> and "Edges" becomes "Edge". Now when i search for "Edges", documents with
>> "Edge" score better than documents with the actual search word - "Edges".
>> Is there a way i can make documents with the actual search word in this
>> case "Edges" score better than document with "Edge"?
>> 
>> I am using Solr 3.5. My field definition is shown below:
>> 
>> 
>>  
>>
>>   > synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>> >ignoreCase="true"
>>words="stopwords_en.txt"
>>enablePositionIncrements="true"
>> 
>>
>>
>>  
>>  
>>
>>> ignoreCase="true" expand="true"/>
>>>ignoreCase="true"
>>words="stopwords_en.txt"
>>enablePositionIncrements="true"
>>/>
>>
>>
>>> protected="protwords.txt"/>
>>
>>  
>>
>> 
>> Thanks.







Re: two structures in solr

2012-04-12 Thread Erick Erickson
You have to take off your DB hat when using Solr ...

There is no problem at all having documents in the
same index that are of different types. There is no
penalty for field definitions that aren't used. That is, you
can easily have two different types of documents in the
same index.

It's all about simply populating the two types of documents
with different fields. in your case, I suspect you'll have a
"type" field with two valid values, "project" and "contractor"
or some such. Then just attach a filter query depending on
what you want, i.e. &fq=type:project or &fq=type:contractor
and your searches will be restricted to the proper documents.

Best
Erick
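
A small sketch of that layout, with hypothetical project/contractor fields; the only
structural requirement is the shared type field:

  <field name="type"   type="string" indexed="true" stored="true"/>
  <field name="title"  type="text"   indexed="true" stored="true"/>
  <field name="skills" type="text"   indexed="true" stored="true" multiValued="true"/>

  /select?q=web+shop&fq=type:project
  /select?q=java+developer&fq=type:contractor

Documents of each kind simply leave the other kind's fields empty, and as noted
above, the unused field definitions don't cost anything.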

On Thu, Apr 12, 2012 at 5:41 AM, tkoomzaaskz  wrote:
> Hi all,
>
> I'm a solr newbie, so sorry if I do anything wrong ;)
>
> I want to use SOLR not only for fast text search, but mainly to create a
> very fast search engine for a high-traffic system (MySQL would not do the
> job if the db grows too big).
>
> I need to store *two big structures* in SOLR: projects and contractors.
> Contractors will search for available projects and project owners will
> search for contractors who would do it for them.
>
> So far, I have found a solr tutorial for newbies
> http://www.solrtutorial.com, where I found the schema file which defines the
> data structure: http://www.solrtutorial.com/schema-xml.html. But my case is
> that *I want to have two structures*. I guess running two parallel solr
> instances is not the idea. I took a look at
> http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/conf/schema.xml?view=markup
> and I can see that the schema goes like:
>
> 
> 
>  
>    ...
>  
>    
>       required="true" />
>       stored="true" omitNorms="true"/>
>      
>       stored="false"/>
>      ...
>    
> 
>
> But still, this is a single structure. And I need 2.
>
> Great thanks in advance for any help. There are not many tutorials for SOLR
> in the web.
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/two-structures-in-solr-tp3905143p3905143.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr 3.5 taking long to index

2012-04-12 Thread Shawn Heisey

On 4/12/2012 12:42 PM, Rohit wrote:

Thanks for pointing these out, but I still have one concern, why is the
Virtual Memory running in 300g+?


Solr 3.5 uses MMapDirectoryFactory by default to read the index.  This 
does an mmap on the files that make up your index, so their entire 
contents are simply accessible to the application as virtual memory 
(over 300GB in your case), the OS automatically takes care of swapping 
disk pages in and out of real RAM as required.  This approach has less 
overhead and tends to make better use of the OS disk cache than other 
methods.  It does lead to confused questions and scary numbers in memory 
usage reporting, though.


You have mentioned that you are giving 36GB of RAM to Solr.  How much 
total RAM does the machine have?


Thanks,
Shawn



Re: Dismax request handler differences Between Solr Version 3.5 and 1.4

2012-04-12 Thread Erick Erickson
Then I suspect your solrconfig is different or you're using a *slightly*
different URL. When you specify defType=dismax, you're NOT going
to the "dismax"  requestHandler. You're specifying a "dismax" style
parser, and Solr expects that you're going to provide all the parameters
on the URL. To whit: qf. If you add "&qf=field1 field2 field3..." you'll
see output.

I found this extremely confusing when I started using Solr. If you use
&qt=dismax, _then_ you're specifying that you should use the
requestHandler defined in your solrconfig.xml _named_ "dismax".

And this kind of thing was changed because it was so confusing, but
I suspect your 3.5 installation is not quite the same URL. I think 3.5
was changed to use the default field in this case.

BTW, 3.6 has just been released, if you're upgrading anyway you
might want to jump to 3.6

Best
Erick
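
Something along these lines should start returning hits (the qf field names here are
placeholders for whatever fields actually need to be searched):

  http://local:8983/solr/core2/select/?q=Bank&defType=dismax&qf=name+description&debugQuery=on

With qf present, the parsedquery in the debug output should show clauses against
those fields instead of the empty "+() ()".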

On Thu, Apr 12, 2012 at 6:08 AM, mechravi25  wrote:
> Hi,
>
> We are currently using solr (version 1.4.0.2010.01.13.08.09.44). we have a
> strange situation in dismax request handler. when we search for a keyword
> and append qt=dismax, we are not getting the any results. The solr request
> is as follows:
> http://local:8983/solr/core2/select/?q=Bank&version=2.2&start=0&rows=10&indent=on&defType=dismax&debugQuery=on
>
> The Response is as follows :
>
>   rawquerystring: Bank
>   querystring: Bank
>   parsedquery: +() ()
>   parsedquery_toString: +() ()
>   QParser: DisMaxQParser
>   (timing section omitted; all entries 0.0)
>
>
> We are currently testing Solr version 3.5, but the same is working fine
> in that version.
>
> Also the query alternative params are not working properly in Solr 1.5 when
> compared with version 3.5. The request seems to be the same, but I don't know
> where the issue is. Please help me out. Thanks in advance.
>
> Regards,
> Sivaganesh
> 
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Dismax-request-handler-differences-Between-Solr-Version-3-5-and-1-4-tp3905192p3905192.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Further questions about behavior in ReversedWildcardFilterFactory

2012-04-12 Thread Erick Erickson
There is special handling built into Solr (but not Lucene, I don't think)
that deals with the reversed case; that's probably the source of your
differences.

Leading wildcards are extremely painful if you don't do some trick
like Solr does with the reversed stuff. In order to run, you have to
spin through _every_ term in the field to see which ones match. It
won't be performant on any very large index.

So I would stick with using the Solr stuff unless you have a specific
need to do things at the Lucene level. In which case I'd look carefully
at the Solr implementation to see what I could glean from that
implementation.

Best
Erick
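
For reference, the reversal is an index-time filter; a sketch of a field type using
it (attribute values are just examples):

  <fieldType name="text_rev" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
              maxPosAsterisk="2" maxPosQuestion="1" maxFractionAsterisk="0.33"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>

Because both the original and the reversed form of each term are indexed
(withOriginal="true"), Solr's query parser can rewrite a leading-wildcard query like
*MNO into a trailing-wildcard query against the reversed terms, which is why the
leading wildcard suddenly becomes legal and fast once the filter is in the index
chain.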

On Thu, Apr 12, 2012 at 8:01 AM, neosky  wrote:
> I ask the question in
> http://lucene.472066.n3.nabble.com/A-little-onfusion-with-maxPosAsterisk-tt3889226.html
> However, when I do some implementation, I get further questions.
> 1. Suppose I don't use ReversedWildcardFilterFactory at index time; it
> seems that Solr doesn't allow the leading wildcard search, it will return
> the error:
> org.apache.lucene.queryParser.ParseException: Cannot parse
> 'sequence:*A*': '*' or '?' not allowed as first character in
> WildcardQuery
> But when I use the ReversedWildcardFilterFactory, I can use the *A* in
> the query. But as I know, the ReversedWildcardFilterFactory should work in
> the index part, should not affect the query behavior. If it is true, how
> does this happen?
> 2.Based on the question above
> suppose I have those tokens in index.
> 1.AB/MNO/UUFI
> 2.BC/MNO/IUYT
> 3.D/MNO/QEWA
> 4./MNO/KGJGLI
> 5.QOEOEF/MNO/
> suppose I use the lucene, I can set the QueryParser with
> AllowLeadingWildcard(true), to search *MNO*
> it should return the tokens above(1-5)
> But in solr, when I conduct the *MNO* with the ReversedWildcardFilterFactory
> in the index, but use the StandardAnalyzer in the query, I don't know what
> happens here.
> The leading *MNO should be fast to match the 5 with
> ReversedWildcardFilterFactory
> The trailing MNO* should be fast to match 4
> But what about *MNO*?
> Thanks!
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Further-questions-about-behavior-in-ReversedWildcardFilterFactory-tp3905416p3905416.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Suggester not working for digit starting terms

2012-04-12 Thread Robert Muir
On Thu, Apr 12, 2012 at 3:52 PM, jmlucjav  wrote:
> Well now I am really lost...
>
> 1. yes I want to suggest whole sentences too, I want the tokenizer to be
> taken into account, and apparently it is working for me in 3.5.0?? I get
> suggestions that are like "foo bar abc".  Maybe what you mention is only for
> file based dictionaries? I am using the field itself.

it doesn't use *JUST* your tokenizer. It splits and applies identifier
rules. Such identifier rules include things like 'cannot start with a
digit'.

That's why i recommend you configure a SuggestQueryConverter so you
have complete control of what is going on rather than dealing with the
spellchecking one.

>
> Moving to 3.6.0 is not a problem (I had already downloaded the rc actually)
> but I still see weird things here.
>

installing 3.6 isn't going to do anything magical: as mentioned above,
you have to configure the SuggestQueryConverter like the example in
the link if you want total control over how the input is treated
before going to the suggester.

-- 
lucidimagination.com
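
The converter Robert mentions is registered in solrconfig.xml alongside the
spellcheck/suggest setup; a sketch (the package shown is an assumption based on
where SpellingQueryConverter lives):

  <queryConverter name="queryConverter"
                  class="org.apache.solr.spelling.SuggestQueryConverter"/>

Unlike the default spellcheck converter, this one hands the whole input to the
field's analyzer instead of applying the identifier rules first, so a digit-leading
term like 500 reaches the suggester intact.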


Re: Import null values from XML file

2012-04-12 Thread Erick Erickson
What does "treated as null" mean? Deleted from the doc?
The problem here is that null-ness is kind of tricky. What
behaviors do you want out of Solr in the NULL case?

You can drop this out of the document by writing a custom
updateHandler. It's actually quite simple to do.

Best
Erick
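
A minimal sketch of that idea as an update request processor (class name is
hypothetical; written from memory against the 3.x update processor API, so check the
package names against your Solr version). It drops any field whose value is the
literal string "NULL" before the document is indexed:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.response.SolrQueryResponse;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;
  import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

  public class DropNullStringProcessorFactory extends UpdateRequestProcessorFactory {
      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req,
              SolrQueryResponse rsp, UpdateRequestProcessor next) {
          return new UpdateRequestProcessor(next) {
              @Override
              public void processAdd(AddUpdateCommand cmd) throws IOException {
                  SolrInputDocument doc = cmd.getSolrInputDocument();
                  // copy the names first so fields can be removed while looping
                  List<String> names = new ArrayList<String>(doc.getFieldNames());
                  for (String name : names) {
                      Object value = doc.getFieldValue(name);
                      if ("NULL".equals(value)) {
                          doc.removeField(name);
                      }
                  }
                  super.processAdd(cmd);
              }
          };
      }
  }

It would be wired into an updateRequestProcessorChain in solrconfig.xml and selected
with the update.chain parameter on the update request.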

On Thu, Apr 12, 2012 at 9:14 AM, randolf.julian
 wrote:
> We import an XML file directly to SOLR using a the script called post.sh in
> the exampledocs. This is the script:
>
> FILES=$*
> URL=http://localhost:8983/solr/update
>
> for f in $FILES; do
>  echo Posting file $f to $URL
>  curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8'
>  echo
> done
>
> #send the commit command to make sure all the changes are flushed and
> visible
> curl $URL --data-binary '' -H 'Content-type:text/xml;
> charset=utf-8'
> echo
>
> Our XML file looks something like this:
>
> 
>  
>    D22BF0B9-EE3A-49AC-A4D6-000B07CDA18A
>    D22BF0B9-EE3A-49AC-A4D6-000B07CDA18A
>    1000
>    CK4475
>    CK4475
>    NULL
>    NULL
>    840655037330
>    NULL
>    EBC CLUTCH KIT
>    EBC CLUTCH KIT
>  
> 
>
> How can I tell solr that the "NULL" value should be treated as null?
>
> Thanks,
> Randolf
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Import-null-values-from-XML-file-tp3905600p3905600.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: codecs for sorted indexes

2012-04-12 Thread Robert Muir
On Thu, Apr 12, 2012 at 6:35 PM, Carlos Gonzalez-Cadenas
 wrote:
> Hello Michael,
>
> Yes, we are pre-sorting the documents before adding them to the index. We
> have a score associated to every document (not an IR score but a
> document-related score that reflects its "importance"). Therefore, the
> document with the biggest score will have the lowest docid (we add it first
> to the index). We do this in order to apply early termination effectively.
> With the actual coded, we haven't seen much of a difference in terms of
> space when we have the index sorted vs not sorted.

I wouldn't expect that you will see space savings when you sort this way.

The techniques I was mentioning involve sorting documents by other
factors instead (such as grouping related documents from the same
website together: idea being they probably share many of the same
terms): this hopefully creates smaller document deltas that require
less bits to represent.

-- 
lucidimagination.com


Re: searching across multiple fields using edismax - am i setting this up right?

2012-04-12 Thread Erick Erickson
Looks good on a quick glance. There are a couple of things...

1> there's no need for the "qt" param _if_ you specify the name
as "/partItemNoSearch", just use
blahblah/solr/partItemNoSearch
There's a JIRA about when/if you need qt. Either will do; it's
up to you which you prefer.

2> I'd consider moving the sort from the "appends" section to the
"defaults" section on the theory that you may want to override sorting
sometime.

3> Simple way to see the effects of this is to simply append
&debugQuery=on to your URL. You'll see the results of
the query, including the parsed results. It's a little hard to read,
but you should be seeing your search terms spread across
all three fields.

Best
Erick
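
Putting 1> and 2> together, the handler might look roughly like this (parameter
names are inferred: defType, echoParams, rows, qf, q.alt; sort moved into defaults):

  <requestHandler name="/partItemNoSearch" class="solr.SearchHandler" default="false">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="echoParams">all</str>
      <int name="rows">5</int>
      <str name="qf">itemNo^1.0 productType^0.8 brand^0.5</str>
      <str name="q.alt">*:*</str>
      <str name="sort">rankNo asc, score desc</str>
    </lst>
  </requestHandler>

A query like q=dishwasher 123-xyz then gets spread across itemNo, productType and
brand, which &debugQuery=on will confirm in the parsed query.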

On Thu, Apr 12, 2012 at 2:06 PM, geeky2  wrote:
> hello all,
>
> i just want to check to make sure i have this right.
>
> i was reading on this page: http://wiki.apache.org/solr/ExtendedDisMax,
> thanks to shawn for educating me.
>
> *i want the user to be able to fire a requestHandler but search across
> multiple fields (itemNo, productType and brand) WITHOUT them having to
> specify in the query url what fields they want / need to search on*
>
> this is what i have in my request handler
>
>
>   default="false">
>    
>      edismax
>      all
>      5
>      *itemNo^1.0 productType^.8 brand^.5*
>      *:*
>    
>    
>      rankNo asc, score desc
>    
>    
>      false
>    
>  
>
> this would be an example of a single term search going against all three of
> the fields
>
> http://bogus:bogus/somecore/select?qt=partItemNoSearch&q=*dishwasher*&debugQuery=on&rows=100
>
> this would be an example of a multiple term search across all three of the
> fields
>
> http://bogus:bogus/somecore/select?qt=partItemNoSearch&q=*dishwasher
> 123-xyz*&debugQuery=on&rows=100
>
>
> do i understand this correctly?
>
> thank you,
> mark
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/searching-across-multiple-fields-using-edismax-am-i-setting-this-up-right-tp3906334p3906334.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Scoring

2012-04-12 Thread Erick Erickson
GAH! I had my head in "make this happen in one field" when I wrote my
response, without being explicit. Of course Walter's solution is pretty
much the standard way to deal with this.

Best
Erick

On Thu, Apr 12, 2012 at 5:38 PM, Walter Underwood  wrote:
> It is easy. Create two fields, text_exact and text_stem. Don't use the 
> stemmer in the first chain, do use the stemmer in the second. Give the 
> text_exact a bigger weight than text_stem.
>
> wunder
>
> On Apr 12, 2012, at 4:34 PM, Erick Erickson wrote:
>
>> No, I don't think there's an OOB way to make this happen. It's
>> a recurring theme, "make exact matches score higher than
>> stemmed matches".
>>
>> Best
>> Erick
>>
>> On Thu, Apr 12, 2012 at 5:18 AM, Kissue Kissue  wrote:
>>> Hi,
>>>
>>> I have a field in my index called itemDesc which i am applying
>>> EnglishMinimalStemFilterFactory to. So if i index a value to this field
>>> containing "Edges", the EnglishMinimalStemFilterFactory applies stemming
>>> and "Edges" becomes "Edge". Now when i search for "Edges", documents with
>>> "Edge" score better than documents with the actual search word - "Edges".
>>> Is there a way i can make documents with the actual search word in this
>>> case "Edges" score better than document with "Edge"?
>>>
>>> I am using Solr 3.5. My field definition is shown below:
>>>
>>> 
>>>      
>>>        
>>>               >> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>>             >>                ignoreCase="true"
>>>                words="stopwords_en.txt"
>>>                enablePositionIncrements="true"
>>>             
>>>    
>>>        
>>>      
>>>      
>>>        
>>>        >> ignoreCase="true" expand="true"/>
>>>        >>                ignoreCase="true"
>>>                words="stopwords_en.txt"
>>>                enablePositionIncrements="true"
>>>                />
>>>        
>>>    
>>>        >> protected="protwords.txt"/>
>>>        
>>>      
>>>    
>>>
>>> Thanks.
>
>
>
>
>


Re: solr hangs

2012-04-12 Thread Peter Markey
Thanks for the response. I have given a size of 8gb to the instance and it
has only around a few thousand documents (with 15 fields each holding a
small amount of data). Apparently the problem is that the process (solr jetty
instance) is consuming lots of threads. One time it consumed around 50k
threads and the process maxed out the number of threads allowed by the OS
(centos) for the process, and in the admin page I see tons of threads
under Thread Dump. It's like solr is waiting for something. I have two
leader and replica cores/shards in two instances, and I send the documents
to one of the shards through the csv update handler...

On Wed, Apr 11, 2012 at 7:39 AM, Pawel Rog  wrote:

> You wrote that you can see such an error: "OutOfMemoryError". I had such
> problems when my caches were too big. It means that there is no more free
> memory in the JVM and probably full gc starts running. How big is your Java
> heap? Maybe the cache sizes in your solr are too big for your JVM
> settings.
>
> --
> Regards,
> Pawel
>
> On Tue, Apr 10, 2012 at 9:51 PM, Peter Markey  wrote:
>
> > Hello,
> >
> > I have a solr cloud setup based on a blog (
> > http://outerthought.org/blog/491-ot.html) and am able to bring up the
> > instances and cores. But when I start indexing data (through csv update),
> > the core throws a out of memory exception
> (null:java.lang.RuntimeException:
> > java.lang.OutOfMemoryError: unable to create new native thread). The
> thread
> > dump from new solr ui is below:
> >
> > cmdDistribExecutor-8-thread-777 (827)
> >
> >
> >
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1bd11b79
> >
> >   - sun.misc.Unsafe.park​(Native Method)
> >   - java.util.concurrent.locks.LockSupport.park​(LockSupport.java:186)
> >   -
> >
> >
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await
> > (AbstractQueuedSynchronizer.java:2043)
> >   -
> >
> >
> org.apache.http.impl.conn.tsccm.WaitingThread.await​(WaitingThread.java:158)
> >   -
> >   org.apache.http.impl.conn.tsccm.ConnPoolByRoute.getEntryBlocking
> > (ConnPoolByRoute.java:403)
> >   -
> >   org.apache.http.impl.conn.tsccm.ConnPoolByRoute$1.getPoolEntry
> > (ConnPoolByRoute.java:300)
> >   -
> >
> >
> org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager$1.getConnection
> > (ThreadSafeClientConnManager.java:224)
> >   -
> >   org.apache.http.impl.client.DefaultRequestDirector.execute
> > (DefaultRequestDirector.java:401)
> >   -
> >   org.apache.http.impl.client.AbstractHttpClient.execute
> > (AbstractHttpClient.java:820)
> >   -
> >   org.apache.http.impl.client.AbstractHttpClient.execute
> > (AbstractHttpClient.java:754)
> >   -
> >   org.apache.http.impl.client.AbstractHttpClient.execute
> > (AbstractHttpClient.java:732)
> >   -
> >   org.apache.solr.client.solrj.impl.HttpSolrServer.request
> > (HttpSolrServer.java:304)
> >   -
> >   org.apache.solr.client.solrj.impl.HttpSolrServer.request
> > (HttpSolrServer.java:209)
> >   -
> >   org.apache.solr.update.SolrCmdDistributor$1.call
> > (SolrCmdDistributor.java:320)
> >   -
> >   org.apache.solr.update.SolrCmdDistributor$1.call
> > (SolrCmdDistributor.java:301)
> >   - java.util.concurrent.FutureTask$Sync.innerRun​(FutureTask.java:334)
> >   - java.util.concurrent.FutureTask.run​(FutureTask.java:166)
> >   -
> >
> java.util.concurrent.Executors$RunnableAdapter.call​(Executors.java:471)
> >   - java.util.concurrent.FutureTask$Sync.innerRun​(FutureTask.java:334)
> >   - java.util.concurrent.FutureTask.run​(FutureTask.java:166)
> >   -
> >   java.util.concurrent.ThreadPoolExecutor.runWorker
> > (ThreadPoolExecutor.java:1110)
> >   -
> >   java.util.concurrent.ThreadPoolExecutor$Worker.run
> > (ThreadPoolExecutor.java:603)
> >   - java.lang.Thread.run​(Thread.java:679)
> >
> >
> >
> > Apparently I do see lots of threads like the above in the thread dump. I'm
> > using the latest build from the trunk (Apr 10th). Any insights into this
> > issue would be really helpful. Thanks a lot.
> >
>


Re: Solr Http Caching

2012-04-12 Thread Chris Hostetter

: Are any of you using Solr Http caching? I am interested to see how people
: use this functionality. I have an index that basically changes once a day
: at midnight. Is it okay to enable Solr Http caching for such an index and
: set the max age to 1 day? Any potential issues?
: 
: I am using solr 3.5 with SolrJ.

in a past life i put squid in front of solr as an accelerator.  i didn't 
bother configuring solr to output expiration info in the Cache-Control 
header, i just took advantage of the etag generated from the index 
version (as well as lastModifiedFrom="openTime") to ensure that Solr would 
short-circuit and return a 304 w/o doing any processing (or wasting a lot 
of bandwidth returning data) anytime it got an If-Modified-Since or 
If-None-Match request indicating that the cache already had a current 
copy.

If you know your index only changes every 24 hours, then setting a max-age 
would probably make sense, to eliminate even those conditional requests, 
but i wouldn't set it to 24H (what if a request happens 1 minute before 
your daily rebuild?); set it to the longest amount of time you are 
willing to serve stale results.
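
For reference, a minimal solrconfig.xml sketch of that setup (attribute names
follow the stock example config; the one-hour max-age is only an illustrative
value, not a recommendation):

  <requestDispatcher>
    <!-- Derive Last-Modified/ETag from the index version so conditional
         requests (If-Modified-Since / If-None-Match) can be answered with
         a 304 Not Modified without re-executing the query. -->
    <httpCaching lastModifiedFrom="openTime" etagSeed="Solr">
      <!-- Optional: let caches reuse responses without revalidating,
           for at most the staleness you can tolerate. -->
      <cacheControl>max-age=3600, public</cacheControl>
    </httpCaching>
  </requestDispatcher>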

-Hoss


Re: Does the lucene can read the index file from solr?

2012-04-12 Thread a sd
hi neosky, how did you do it? i need to do this too. thanks

On Thu, Apr 12, 2012 at 9:35 PM, neosky  wrote:

> Thanks! I will try again
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Does-the-lucene-can-read-the-index-file-from-solr-tp3902927p3905364.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-12 Thread Otis Gospodnetic
Hello Ali,

> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure

> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
> crawled + indexed every *4 weeks, *with a search latency of less than 0.5
> seconds.


That's fine.  Whether it's doable with any tech will depend on how much 
hardware you give it, among other things.

> Needless to mention, the search index needs to scale to 5Billion pages. It
> is also possible that I might need to store multiple indexes -- one for
> crawled content, and one for ancillary data that is also very large. Each
> of these indices would likely require a logically distributed and
> replicated index.


Yup, OK.

> However, I would like for such a system to be homogenous with the Hadoop
> infrastructure that is already installed on the cluster (for the crawl). In
> other words, I would much prefer if the replication and distribution of the
> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
> using another scalability framework (such as SolrCloud). In addition, it
> would be ideal if this environment was flexible enough to be dynamically
> scaled based on the size requirements of the index and the search traffic
> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
> enough to automatically provision additional processing power into the
> cluster without requiring server re-starts).


There is no such thing just yet.
There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
automatically index HBase content, but that was either not completed or not 
committed into HBase.

> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
> mature enough and would be the right architectural choice to go along with
> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
> above.


Here is a summary on all of them:
* Search on HBase - I assume you are referring to the same thing I mentioned 
above.  Not ready.
* Solandra - uses Cassandra+Solr, plus DataStax now has a different 
(commercial) offering that combines search and Cassandra.  Looks good.
* Lily - data stored in HBase cluster gets indexed to a separate Solr 
instance(s)  on the side.  Not really integrated the way you want it to be.
* ElasticSearch - solid at this point, the most dynamic solution today, can 
scale well (we are working on a many-billion document index and hundreds of nodes 
with ElasticSearch right now), etc.  But again, not integrated with Hadoop the 
way you want it.
* IndexTank - has some technical weaknesses, not integrated with Hadoop, not 
sure about its future considering LinkedIn uses Zoie and Sensei already.
* And there is SolrCloud, which is coming soon and will be solid, but is again 
not integrated.

If I were you and I had to pick today - I'd pick ElasticSearch if I were 
completely open.  If I had a Solr bias I'd give SolrCloud a try first.

> Lastly, how much hardware (assuming a medium sized EC2 instance) would you
> estimate my needing with this setup, for regular web-data (HTML text) at
> this scale?

I don't know off the top of my head, but I'm guessing several hundred instances for 
serving search requests.

HTH,

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html

Scalable Performance Monitoring - http://sematext.com/spm/index.html


> Any architectural guidance would be greatly appreciated. The more details
> provided, the wider my grin :).
> 
> Many many thanks in advance.
> 
> Thanks,
> Safdar
>


Re: term frequency outweighs exact phrase match

2012-04-12 Thread Chris Hostetter

: I use solr 3.5 with edismax. I have the following issue with phrase 
: search. For example if I have three documents with content like
: 
: 1.apache apache
: 2. solr solr
: 3.apache solr
: 
: then search for apache solr displays documents in the order 1, 2, 3 
: instead of 3, 2, 1 because term frequency in the first and second 
: documents is higher than in the third document. We want results be 
: displayed in the order as 3,2,1 since the third document has exact 
: match.

you need to give us a lot more info, like what other data is in the 
various fields for those documents, exactly what your query URL looks 
like, and what debugQuery=true gives you back in terms of score 
explanations for each document, because if that sample content is the only 
thing you've got indexed (even if it's in multiple fields), then documents 
#1 and #2 shouldn't even match your query using the mm you've specified...

: 2<-1 5<-2 6<90%

...because docs #1 and #2 each match only one of the query clauses.

Otherwise it should work fine.

I used the example 3.5 schema, and created 3 docs matching what you 
described. (with name copyfield'ed into text)...


<doc><field name="id">1</field><field name="name">apache apache</field></doc>
<doc><field name="id">2</field><field name="name">solr solr</field></doc>
<doc><field name="id">3</field><field name="name">apache solr</field></doc>


...and then used this similar query (note mm=1) to get the results you 
would expect...

http://localhost:8983/solr/select/?fl=name,score&debugQuery=true&defType=edismax&qf=name+text&pf=name^10+text^5&q=apache%20solr&mm=1



<result name="response" numFound="3" start="0">
  <doc>
    <float name="score">1.309231</float>
    <str name="name">apache solr</str>
  </doc>
  <doc>
    <float name="score">0.022042051</float>
    <str name="name">apache apache</str>
  </doc>
  <doc>
    <float name="score">0.022042051</float>
    <str name="name">solr solr</str>
  </doc>
</result>




-Hoss


RE: solr 3.5 taking long to index

2012-04-12 Thread Rohit
The machine has a total RAM of around 46GB. My biggest concern is the Solr index 
time gradually increasing and then the commit stops because of timeouts; our 
commit rate is very high, but I am not able to find the root cause of the issue.

Regards,
Rohit
Mobile: +91-9901768202
About Me: http://about.me/rohitg

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: 13 April 2012 05:15
To: solr-user@lucene.apache.org
Subject: Re: solr 3.5 taking long to index

On 4/12/2012 12:42 PM, Rohit wrote:
> Thanks for pointing these out, but I still have one concern: why is 
> the virtual memory running at 300GB+?

Solr 3.5 uses MMapDirectoryFactory by default to read the index.  This does an 
mmap on the files that make up your index, so their entire contents are simply 
accessible to the application as virtual memory (over 300GB in your case); the 
OS automatically takes care of swapping disk pages in and out of real RAM as 
required.  This approach has less overhead and tends to make better use of the 
OS disk cache than other methods.  It does lead to confused questions and scary 
numbers in memory usage reporting, though.
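
If the factory choice itself ever needs to change, it is set in solrconfig.xml; 
a minimal sketch (only MMapDirectoryFactory is shown, since that is what the 
3.5 default resolves to on 64-bit JVMs; other FSDirectory-based factories can 
be substituted if your version provides them):

  <directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>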

You have mentioned that you are giving 36GB of RAM to Solr.  How much total RAM 
does the machine have?

Thanks,
Shawn




Re: solr 3.5 taking long to index

2012-04-12 Thread Shawn Heisey

On 4/12/2012 8:42 PM, Rohit wrote:

The machine has a total RAM of around 46GB. My biggest concern is the Solr index 
time gradually increasing and then the commit stops because of timeouts; our 
commit rate is very high, but I am not able to find the root cause of the issue.


For good performance, Solr relies on the OS having enough free RAM to 
keep critical portions of the index in the disk cache.  Some numbers 
that I have collected from your information so far are listed below.  
Please let me know if I've got any of this wrong:


46GB total RAM
36GB RAM allocated to Solr
300GB total index size

This leaves only 10GB of RAM free to cache 300GB of index, assuming that 
this server is dedicated to Solr.  The critical portions of your index 
are very likely considerably larger than 10GB, which causes constant 
reading from the disk for queries and updates.  With a high commit rate 
and a relatively low mergeFactor of 10, your index will be doing a lot 
of merging during updates, and some of those merges are likely to be 
quite large, further complicating the I/O situation.
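
Purely to illustrate the knob being discussed, mergeFactor is set in 
solrconfig.xml (the value shown is just the example default; the right number 
depends on your indexing/search balance):

  <mainIndex>
    <!-- How many segments accumulate before a merge is triggered; higher
         values mean less frequent (but larger) merges during indexing, at
         the cost of more segments to search. -->
    <mergeFactor>10</mergeFactor>
  </mainIndex>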


Another thing that can lead to increasing index update times is cache 
warming, also greatly affected by high I/O levels.  If you visit the 
/solr/corename/admin/stats.jsp#cache URL, you can see the warmupTime for 
each cache in milliseconds.


Adding more memory to the server would probably help things.  You'll 
want to carefully check all the server and Solr statistics you can to 
make sure that memory is the root of the problem, before you actually spend 
the money.  At the server level, look for things like a high iowait CPU 
percentage.  For Solr, you can turn the logging level up to INFO in the 
admin interface as well as turn on the infostream in solrconfig.xml for 
extensive debugging.
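
For the infostream piece, a sketch of the relevant solrconfig.xml entry (in 
the 3.x example config it sits with the main index settings and ships 
commented out; the file name is arbitrary):

  <mainIndex>
    <!-- Ask the underlying IndexWriter to log detailed flush/merge activity
         to a file; useful when chasing slow commits. -->
    <infoStream file="INFOSTREAM.txt">true</infoStream>
  </mainIndex>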


I hope this is helpful.  If not, I can try to come up with more specific 
things you can look at.


Thanks,
Shawn



Re: EmbeddedSolrServer and StreamingUpdateSolrServer

2012-04-12 Thread pcrao
Hi Shawn,

Thanks for sharing your opinion.

Mikhail Khludnev, what do you think of Shawn's opinion?

Thanks,
PC Rao.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/EmbeddedSolrServer-and-StreamingUpdateSolrServer-tp3889073p3907223.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Trouble handling Unit symbol

2012-04-12 Thread Rajani Maski
Hi All,

   I tried indexing with UTF-8 encoding but the issue is still not fixed.
Please see my inputs below.

*Indexed XML:*


  
0.100
µ
  


*Search Query - * BODY:µ

numFound: 0 results obtained.

*What can be the reason for this? How do I need to form the search query so
that the above document is found?*


Thanks & Regards

Regards
Rajani



2012/4/2 Rajani Maski 

> Thank you for the reply.
>
>
>
> On Sat, Mar 31, 2012 at 3:38 AM, Chris Hostetter  > wrote:
>
>>
>> : We have data having such symbols like :  µ
>> : Indexed data has  -Dose:"0 µL"
>> : Now , when  it is searched as  - Dose:"0 µL"
>>...
>> : Query Q value observed  : S257:"0 µL/injection"
>>
>> First off: your "when searched as" example does not match up to your
>> "Query Q" observed value (ie: field queries, extra "/injection" text at
>> the end) suggesting that you maybe cut/paste something you didn't mean to
>> -- so take the rest of this advice with a grain of salt.
>>
>> If i ignore your "when it is searched as" example and focus entirely on
>> what you say you've indexed the data as, and the Q value you are seeing (in
>> what looks like the echoParams output), then the first thing that jumps out
>> at me is that it looks like your servlet container (or perhaps your web
>> browser if that's where you tested this) is not dealing with the unicode
>> correctly -- because although i see a "µ" in the first three lines i
>> quoted above (UTF8: 0xC2 0xB5), in your observed value i'm seeing it
>> preceded by a "Â" (UTF8: 0xC3 0x82) ... suggesting that perhaps the "µ"
>> did not get URL encoded properly when the request was made to your servlet
>> container?
>>
>> In particular, you might want to take a look at...
>>
>>
>> https://wiki.apache.org/solr/FAQ#Why_don.27t_International_Characters_Work.3F
>> http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config
>> The example/exampledocs/test_utf8.sh script included with solr
>>
>>
>>
>>
>> -Hoss
>
>
>
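
For completeness, the SolrTomcat wiki page linked above boils down to one 
attribute on the HTTP connector; a sketch, assuming Tomcat is the servlet 
container (port and the other attributes will vary per install):

  <!-- conf/server.xml -->
  <Connector port="8080" protocol="HTTP/1.1"
             connectionTimeout="20000"
             URIEncoding="UTF-8"/>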

