Re: Greater-than and less-than in data import SQL queries

2009-11-02 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Mon, Nov 2, 2009 at 11:34 AM, Amit Nithian  wrote:
> A thought I had on this from a DIH design perspective. Would it be better to
> have the SQL queries stored in an element rather than an attribute, so that
> you can wrap them in a CDATA block without having to mess up the look of the
> query with &lt; and &gt; entities? Makes debugging easier (I know find and
> replace is trivial but it can be annoying when debugging SQL issues :-)).

Actually, most parsers are forgiving in this respect. I mean '<'
and '>' are OK in the XML parser shipped with the JDK.
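
(For reference: per the XML spec a raw '<' is not allowed inside an attribute
value, while '>' is tolerated, which is why the greater-than form works. The
usual fix is to escape it. A minimal data-config.xml sketch, with the entity
name here being a placeholder, would be:

  <entity name="cathnode" query="select *, title as keywords
                                 from cathnode_text where node_depth &lt; 4">
    ...
  </entity>

A CDATA section would only help if the query moved into element content, as
suggested above, since CDATA is not allowed inside attribute values.)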

>
> On Wed, Oct 28, 2009 at 5:15 PM, Lance Norskog  wrote:
>
>> It is easier to put SQL select statements in a view, and just use that
>> view from the DIH configuration file.
>>
>> On Tue, Oct 27, 2009 at 12:30 PM, Andrew Clegg 
>> wrote:
>> >
>> >
>> > Heh, eventually I decided
>> >
>> > "where 4 > node_depth"
>> >
>> > was the most pleasing (if slightly WTF-ish) way of writing it...
>> >
>> > Cheers,
>> >
>> > Andrew.
>> >
>> >
>> > Erik Hatcher-4 wrote:
>> >>
>> >> Use &lt; instead of < in that attribute.  That should fix the issue.
>> >> Remember, it's an XML file, so it has to obey XML encoding rules which
>> >> make it ugly but whatcha gonna do?
>> >>
>> >>       Erik
>> >>
>> >> On Oct 27, 2009, at 11:50 AM, Andrew Clegg wrote:
>> >>
>> >>>
>> >>> Hi,
>> >>>
>> >>> If I have a DataImportHandler query with a greater-than sign in,
>> >>> like this:
>> >>>
>> >>>        > >>> query="select *,
>> >>> title as keywords from cathnode_text where node_depth > 4">
>> >>>
>> >>> Everything's fine. However, if it contains a less-than sign:
>> >>>
>> >>>        > >>> query="select *,
>> >>> title as keywords from cathnode_text where node_depth < 4">
>> >>>
>> >>> I get this exception:
>> >>>
>> >>> INFO: Processing configuration from solrconfig.xml: {config=dataconfig.xml}
>> >>> [Fatal Error] :240:129: The value of attribute "query" associated with an
>> >>> element type "null" must not contain the '<' character.
>> >>> 27-Oct-2009 15:30:49 org.apache.solr.handler.dataimport.DataImportHandler inform
>> >>> SEVERE: Exception while loading DataImporter
>> >>> org.apache.solr.handler.dataimport.DataImportHandlerException: Exception
>> >>> occurred while initializing context
>> >>>        at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:184)
>> >>>        at org.apache.solr.handler.dataimport.DataImporter.<init>(DataImporter.java:101)
>> >>>        at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:113)
>> >>>        at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:424)
>> >>>        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:588)
>> >>>        at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
>> >>>        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
>> >>>        at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
>> >>>        at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
>> >>>        at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:108)
>> >>>        at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709)
>> >>>        at org.apache.catalina.core.StandardContext.start(StandardContext.java:4356)
>> >>>        at org.apache.catalina.manager.ManagerServlet.start(ManagerServlet.java:1244)
>> >>>        at org.apache.catalina.manager.HTMLManagerServlet.start(HTMLManagerServlet.java:604)
>> >>>        at org.apache.catalina.manager.HTMLManagerServlet.doGet(HTMLManagerServlet.java:129)
>> >>>        at javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
>> >>>        at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
>> >>>        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
>> >>>        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>> >>>        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>> >>>        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
>> >>>        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:525)
>> >>>        at org.apach

Problems downloading lucene 2.9.1

2009-11-02 Thread Licinio Fernández Maurelo
Hi folks,

as we are using a snapshot dependency on solr1.4, today we are getting
problems when Maven tries to download lucene 2.9.1 (there isn't any 2.9.1
there).

Which repository can I use to download it?

Thx

-- 
Lici


RE: CPU utilization and query time high on Solr slave when snapshot install

2009-11-02 Thread biku...@sapient.com
Hi Solr Gurus,

We have Solr in a 1 master, 2 slave configuration. A snapshot is created post
commit and post optimization. We have autocommit after 50 documents or 5 minutes.
The snapshot puller runs as a cron job every 10 minutes. What we have observed is that
whenever a snapshot is installed on a slave, the solrj client used to query the
slave gets timed out and there is high CPU usage/load average on the slave
server. If we stop the snapshot puller, the slaves work with no issues. The system
has been running for 2 months and this issue has started to occur only now,
when load on the website is increasing.

Following are some details:

Solr Details:
apache-solr Version: 1.3.0
Lucene - 2.4-dev

Master/Slave configurations:

Master:
- for indexing, data HTTP requests are made to the Solr server
- the autocommit feature is enabled for 50 docs and 5 minutes
- caching params are disabled for this server
- a mergeFactor of 10 is set
- we were running the optimize script every 2 hours, but have now reduced it to
twice a day; the issue still persists

Slave1/Slave2:
- standard requestHandler is being used
- default values of caching are set
Machine Specifications:

Master:
- 4GB RAM
- 1GB JVM Heap memory is allocated to Solr

Slave1/Slave2:
- 4GB RAM
- 2GB JVM Heap memory is allocated to Solr

Master and Slave1 (solr1) are on a single box and Slave2 (solr2) is on a different box.
We use HAProxy to load balance query requests between the 2 slaves. The master is only
used for indexing.
Please let us know if somebody has ever faced a similar kind of issue or has some
insight into it, as we are literally stuck at the moment with a very
unstable production environment.

As a workaround, we have started running optimize on the master every 7 minutes.
This seems to have reduced the severity of the problem, but the issue still occurs
every 2 days now. Please suggest what could be the root cause of this.

Thanks,
Bipul






Re: Indexing multiple entities

2009-11-02 Thread Chantal Ackermann

I'm using a code generator for my entities, and I cannot modify the generation.
I need to work out another option :(


Shouldn't code generators help development rather than make it more complex
and difficult? oO


(sorry, off topic)

chantal


Re: StreamingUpdateSolrServer - indexing process stops in a couple of hours

2009-11-02 Thread Shalin Shekhar Mangar
I'm able to reproduce this issue consistently using JDK 1.6.0_16

After an optimize is called, only one thread keeps adding documents and the
rest wait on StreamingUpdateSolrServer line 196.

On Sun, Oct 25, 2009 at 8:03 AM, Dadasheva, Olga  wrote:

> I am using java 1.6.0_05
>
> To illustrate what is happening I wrote this test program that has 10
> threads adding a collection of documents and one thread optimizing the index
> every 10 sec.
>
> I am seeing that after the first optimize there is only one thread that
> keeps adding documents. The other ones are locked.
>
> In the real code I ended up adding synchronized around add on optimize to
> avoid this.
>
> public static void main(String[] args) {
>
>     final JettySolrRunner jetty = new JettySolrRunner("/solr", 8983);
>     try {
>         jetty.start();
>         // setup the server...
>         String url = "http://localhost:8983/solr";
>         final StreamingUpdateSolrServer server = new StreamingUpdateSolrServer(url, 2, 5) {
>             @Override
>             public void handleError(Throwable ex) {
>                 // do something...
>             }
>         };
>         server.setConnectionTimeout(1000);
>         server.setDefaultMaxConnectionsPerHost(100);
>         server.setMaxTotalConnections(100);
>         int i = 0;
>         while (i++ < 10) {
>             new Thread("add-thread" + i) {
>                 public void run() {
>                     int j = 0;
>                     while (true) {
>                         try {
>                             List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
>                             for (int n = 0; n < 50; n++) {
>                                 SolrInputDocument doc = new SolrInputDocument();
>                                 String docID = this.getName() + "_doc_" + j++;
>                                 doc.addField("id", docID);
>                                 doc.addField("content", "document_" + docID);
>                                 docs.add(doc);
>                             }
>                             server.add(docs);
>                             System.out.println(this.getName() + " added " + docs.size() + " documents");
>                             Thread.sleep(100);
>                         } catch (Exception e) {
>                             e.printStackTrace();
>                             System.err.println(this.getName() + " " + e.getLocalizedMessage());
>                             System.exit(0);
>                         }
>                     }
>                 }
>             }.start();
>         }
>
>         new Thread("optimizer-thread") {
>             public void run() {
>                 while (true) {
>                     try {
>                         Thread.sleep(1);
>                         server.optimize();
>                         System.out.println(this.getName() + " optimized");
>                     } catch (Exception e) {
>                         e.printStackTrace();
>                         System.err.println("optimizer " + e.getLocalizedMessage());
>                         System.exit(0);
>                     }
>                 }
>             }
>         }.start();
>
>     } catch (Exception e) {
>         e.printStackTrace();
>     }
> }
> -Original Message-
> From: Lance Norskog [mailto:goks...@gmail.com]
> Sent: Tuesday, October 13, 2009 8:59 PM
> To: solr-user@lucene.apache.org
> Subject: Re: StreamingUpdateSolrServer - indexing process stops in a couple
> of hours
>
> Which Java release is this?  There are known thread-blocking problems in
> Java 1.5.
>
> Also, what sockets are used during this time? Try 'netstat -s | fgrep 8983'
> (or your Solr URL port #) and watch the active, TIME_WAIT, CLOSE_WAIT
> sockets build up. This may give a hint.
>
> On Tue, Oct 13, 2009 at 8:47 AM, Dadasheva, Olga <
> olga_dadash...@harvard.edu> wrote:
> > Hi,
> >
> > I am indexing documents using StreamingUpdateSolrServer. My 'setup'
> > code is almost a copy of the junit test of the Solr trunk.
> >
> >try {
> >StreamingUpdateSolrServer streamingServer = new
> > StreamingUpdateSolrServer( url, 2, 5 ) {
> >@Override
> >public void handleError(Throwable ex) {
> >System.out.p

Lock problems: Lock obtain timed out

2009-11-02 Thread Jérôme Etévé
Hi,

  I've got a few machines that post documents concurrently to a Solr
instance. They do not issue the commit themselves; instead, I've got
autocommit set up on the Solr server side:
   
  5 
  6 


This usually works fine, but sometimes the server goes into a deadlock
state. Here are the errors I get from the log (these go on forever
until I delete the index and restart from zero):

02-Nov-2009 10:35:27 org.apache.solr.update.SolrIndexWriter finalize
SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates
a bug -- POSSIBLE RESOURCE LEAK!!!
...
[ multiple messages like this ]
...
02-Nov-2009 10:35:27 org.apache.solr.common.SolrException log
SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain
timed out: 
NativeFSLock@/home/solrdata/jobs/index/lucene-703db99881e56205cb910a2e5fd816d3-write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:85)
at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1538)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1395)
at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:190)
at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98)
at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:220)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)


I'm wondering what could be the reason for this (if a commit takes
more than 60 seconds, for instance?), and whether I should use better
locking or autocommitting options.

Here's the locking conf I've got at the moment:
   1000
1
   native
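
(For reference, the autocommit and locking settings discussed here live in
solrconfig.xml and typically look something like the sketch below; the values
are illustrative defaults, not necessarily the poster's actual settings.)

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>5000</maxDocs>
      <maxTime>60000</maxTime> <!-- milliseconds -->
    </autoCommit>
  </updateHandler>

  <indexDefaults>
    <writeLockTimeout>1000</writeLockTimeout>
    <lockType>native</lockType>
  </indexDefaults>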

I'm using solr trunk from 12 oct 2009 within tomcat.

Thanks for any help.

Jerome.

-- 
Jerome Eteve.
http://www.eteve.net
jer...@eteve.net


Re: Spell check suggestion and correct way of implementation and some Questions

2009-11-02 Thread Shalin Shekhar Mangar
On Wed, Oct 28, 2009 at 8:57 PM, darniz  wrote:

>
> Question: Should I build the dictionary only once, and after that, as new
> words are indexed, will the dictionary be updated? Or do I have to do that
> manually at certain intervals?
>
>
No. The dictionary is built only when spellcheck.build=true is specified as
a request parameter. You will need to explicitly send spellcheck.build=true
again when the documents change, or you can use the buildOnCommit or
buildOnOptimize parameters to re-build the spellcheck index automatically.

http://wiki.apache.org/solr/SpellCheckComponent#Building_on_Commits
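
For example, a minimal sketch of such a component in solrconfig.xml, assuming a
dictionary named mySpellChecker built from a hypothetical "spell" field, might be:

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">mySpellChecker</str>
      <str name="field">spell</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>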


>
> add the spellcheck component to the handler, in my case as of now the standard
> request handler. I might also start adding some more dismax handlers
> depending on my requirements
>  
>
> 
>   explicit
>   
> 
> 
>spellcheck
> 
>  
>
> run the query with the parameter spell.check=true, and also specify against
> which dictionary you want to run spell check; in my case my
> spellcheck.dictionary parameter is mySpellChecker.
>
>
The parameter is spellcheck=true not spell.check=true. If you do not give a
name to your dictionary then you do not need to add the
spellcheck.dictionary parameter.
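
A request that builds and then uses that dictionary might look roughly like this
(host, query and dictionary name are illustrative):

http://localhost:8983/solr/select?q=hurrican&spellcheck=true&spellcheck.dictionary=mySpellChecker&spellcheck.build=true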

-- 
Regards,
Shalin Shekhar Mangar.


tracking solr response time

2009-11-02 Thread bharath venkatesh
Hi,

We are using Solr for many of our products and it is doing quite well.
But since the number of hits is becoming high, we are experiencing latency
in certain requests; about 15% of our requests are suffering a latency.
We are trying to identify the problem. It may be due to a network
issue, or the Solr server may be taking time to process the request. Other
than QTime, which is returned along with the response, is there any
other way to track the Solr server's performance? How is QTime calculated?
Is it the total time from when the Solr server got the request till it
gave the response? Can we do some extra logging to track the Solr server's
performance? Ideally I would want to pass some log id along with the
request (query) to the Solr server, and have the Solr server log the
response time along with that log id.

Thanks in advance ..
Bharath


Re: Problems downloading lucene 2.9.1

2009-11-02 Thread Grant Ingersoll


On Nov 2, 2009, at 12:12 AM, Licinio Fernández Maurelo wrote:


Hi folks,

as we are using an snapshot dependecy to solr1.4, today we are getting
problems when maven try to download lucene 2.9.1 (there isn't a any  
2.9.1

there).

Which repository can i use to download it?


They won't be there until 2.9.1 is officially released.  We are trying  
to speed up the Solr release by piggybacking on the Lucene release,  
but this little bit is the one downside.



-Grant

NullPointerException with TermVectorComponent

2009-11-02 Thread Andrew Clegg

Hi,

I've recently added the TermVectorComponent as a separate handler, following
the example in the supplied config file, i.e.:

  <searchComponent name="tvComponent" class="org.apache.solr.handler.component.TermVectorComponent"/>

  <requestHandler name="/tvrh" class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
      <bool name="tv">true</bool>
    </lst>
    <arr name="last-components">
      <str>tvComponent</str>
    </arr>
  </requestHandler>

It works, but with one quirk. When you use tv.all=true, you get the tf*idf
scores in the output just fine (along with tf and df). But if you use
tv.tf_idf=true you get an NPE:

http://server:8080/solr/tvrh/?q=1cuk&version=2.2&indent=on&tv.tf_idf=true

HTTP Status 500 - null java.lang.NullPointerException at
org.apache.solr.handler.component.TermVectorComponent$TVMapper.getDocFreq(TermVectorComponent.java:253)
at
org.apache.solr.handler.component.TermVectorComponent$TVMapper.map(TermVectorComponent.java:245)
at
org.apache.lucene.index.TermVectorsReader.readTermVector(TermVectorsReader.java:522)
at
org.apache.lucene.index.TermVectorsReader.readTermVectors(TermVectorsReader.java:401)
at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:378)
at
org.apache.lucene.index.SegmentReader.getTermFreqVector(SegmentReader.java:1253)
at
org.apache.lucene.index.DirectoryReader.getTermFreqVector(DirectoryReader.java:474)
at
org.apache.solr.search.SolrIndexReader.getTermFreqVector(SolrIndexReader.java:244)
at
org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:125)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at 
(etc.)

Is this a bug, or am I doing it wrong?

Cheers,

Andrew.

-- 
View this message in context: 
http://old.nabble.com/NullPointerException-with-TermVectorComponent-tp26156903p26156903.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: tracking solr response time

2009-11-02 Thread Yonik Seeley
On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh
 wrote:
>    We are using solr for many of ur products  it is doing quite well
> .  But since no of hits are becoming high we are experiencing latency
> in certain requests ,about 15% of our requests are suffering a latency

How much of a latency compared to normal, and what version of Solr are
you using?

>  . We are trying to identify  the problem .  It may be due to  network
> issue or solr server is taking time to process the request  .   other
> than  qtime which is returned along with the response is there any
> other way to track solr servers performance ?
> how is qtime calculated
> , is it the total time from when solr server got the request till it
> gave the response ?

QTime is the time spent in generating the in-memory representation for
the response before the response writer starts streaming it back in
whatever format was requested.  The stored fields of returned
documents are also loaded at this point (to enable handling of huge
response lists w/o storing all in memory).

There are normally servlet container logs that can be configured to
spit out the real total request time.

> can we do some extra logging to track solr servers
> performance . ideally I would want to pass some log id along with the
> request (query ) to  solr server  and solr server must log the
> response time along with that log id .

Yep - Solr isn't bothered by params it doesn't know about, so just put
logid=xxx and it should also be logged with the other request
params.
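
For example, a request like the following (logid is an arbitrary, made-up
parameter name) would show up in the request log with the extra parameter intact:

http://localhost:8983/solr/select?q=ipod&logid=req-12345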

-Yonik
http://www.lucidimagination.com


Re: tracking solr response time

2009-11-02 Thread Israel Ekpo
On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley wrote:

> On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh
>  wrote:
> >We are using solr for many of ur products  it is doing quite well
> > .  But since no of hits are becoming high we are experiencing latency
> > in certain requests ,about 15% of our requests are suffering a latency
>
> How much of a latency compared to normal, and what version of Solr are
> you using?
>
> >  . We are trying to identify  the problem .  It may be due to  network
> > issue or solr server is taking time to process the request  .   other
> > than  qtime which is returned along with the response is there any
> > other way to track solr servers performance ?
> > how is qtime calculated
> > , is it the total time from when solr server got the request till it
> > gave the response ?
>
> QTime is the time spent in generating the in-memory representation for
> the response before the response writer starts streaming it back in
> whatever format was requested.  The stored fields of returned
> documents are also loaded at this point (to enable handling of huge
> response lists w/o storing all in memory).
>
> There are normally servlet container logs that can be configured to
> spit out the real total request time.
>
> > can we do some extra logging to track solr servers
> > performance . ideally I would want to pass some log id along with the
> > request (query ) to  solr server  and solr server must log the
> > response time along with that log id .
>
> Yep - Solr isn't bothered by params it doesn't know about, so just put
> logid=xxx and it should also be logged with the other request
> params.
>
> -Yonik
> http://www.lucidimagination.com
>



If you are not using Java then you may have to track the elapsed time
manually.

If you are using the SolrJ Java client you may have the following options:

There is a method called getElapsedTime() in
org.apache.solr.client.solrj.response.SolrResponseBase which is available to
all the subclasses

I have not used it personally but I think this should return the time spent
on the client side for that request.

The QTime is not the time on the client side but the time spent internally
at the Solr server to process the request.

http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/SolrResponseBase.html

http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/QueryResponse.html

Most likely it is a result of an internal network issue between the
two servers, or the Solr server is competing with other applications for
resources.

What operating system is the Solr server running on? Is your client
application connecting to a Solr server on the same network or over the
internet? Are there other applications like database servers etc. running on
the same machine? If so, then the DB server (or any other application) and
the Solr server could be competing for resources like CPU, memory etc.

If you are using Tomcat, you can take a look in
$CATALINA_HOME/logs/catalina.out, there are timestamps there that can also
guide you.

-- 
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.


Re: tracking solr response time

2009-11-02 Thread Grant Ingersoll


On Nov 2, 2009, at 5:41 AM, Yonik Seeley wrote:


QTime is the time spent in generating the in-memory representation for
the response before the response writer starts streaming it back in
whatever format was requested.  The stored fields of returned
documents are also loaded at this point (to enable handling of huge
response lists w/o storing all in memory).

There are normally servlet container logs that can be configured to
spit out the real total request time.


It might be nice to add a flag to DebugComponent to spit out timings
only.  Thus, one could skip the explains, etc. and just see the
timings.  Seems like that would have pretty low overhead and still see
the timings.

Re: NullPointerException with TermVectorComponent

2009-11-02 Thread david.stu...@progressivealliance.co.uk
I think it might be due to the library itself.

I downloaded semanticvectors-1.22 and compiled it from source. Then I created a demo
corpus by running java org.apache.lucene.demo.IndexFiles against the Lucene src directory.
I then ran java pitt.search.semanticvectors.BuildIndex against the index and
got the following:

Seedlength = 10
Dimension = 200
Minimum frequency = 0
Number non-alphabet characters = 0
Contents fields are: [contents]
Creating semantic term vectors ...
Populating basic sparse doc vector store, number of vectors: 774
Creating store of sparse vectors  ...
Created 774 sparse random vectors.
Creating term vectors ...
There are 36881 terms (and 774 docs)
0 ... 1000 ... 2000 ... 3000 ... 4000 ... Exception in thread "main" java.lang.NullPointerException
    at org.apache.lucene.index.DirectoryReader$MultiTermDocs.freq(DirectoryReader.java:1068)
    at pitt.search.semanticvectors.LuceneUtils.getGlobalTermFreq(LuceneUtils.java:70)
    at pitt.search.semanticvectors.LuceneUtils.termFilter(LuceneUtils.java:187)
    at pitt.search.semanticvectors.TermVectorsFromLucene.<init>(TermVectorsFromLucene.java:163)
    at pitt.search.semanticvectors.BuildIndex.main(BuildIndex.java:138)
I am still digging, but when you look at the source code it references Lucene
calls dating back to Lucene 2.4, a lot of which are deprecated; it might need
some refreshing.

Cheers,

Dave

 
On 02 November 2009 at 14:40 Andrew Clegg  wrote:

> 
> Hi,
> 
> I've recently added the TermVectorComponent as a separate handler, following
> the example in the supplied config file, i.e.:
> 
>    class="org.apache.solr.handler.component.TermVectorComponent"/>
> 
>    class="org.apache.solr.handler.component.SearchHandler">
>           
>                   true
>           
>           
>                   tvComponent
>           
>   
> 
> It works, but with one quirk. When you use tf.all=true, you get the tf*idf
> scores in the output, just fine (along with tf and df). But if you use
> tv.tf_idf=true you get an NPE:
> 
> http://server:8080/solr/tvrh/?q=1cuk&version=2.2&indent=on&tv.tf_idf=true
> 
> HTTP Status 500 - null java.lang.NullPointerException at
> org.apache.solr.handler.component.TermVectorComponent$TVMapper.getDocFreq(Term
> VectorComponent.java:253)
> at
> org.apache.solr.handler.component.TermVectorComponent$TVMapper.map(TermVectorC
> omponent.java:245)
> at
> org.apache.lucene.index.TermVectorsReader.readTermVector(TermVectorsReader.jav
> a:522)
> at
> org.apache.lucene.index.TermVectorsReader.readTermVectors(TermVectorsReader.ja
> va:401)
> at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:378)
> at
> org.apache.lucene.index.SegmentReader.getTermFreqVector(SegmentReader.java:125
> 3)
> at
> org.apache.lucene.index.DirectoryReader.getTermFreqVector(DirectoryReader.java
> :474)
> at
> org.apache.solr.search.SolrIndexReader.getTermFreqVector(SolrIndexReader.java:
> 244)
> at
> org.apache.solr.handler.component.TermVectorComponent.process(TermVectorCompon
> ent.java:125)
> at
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandle
> r.java:195)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.ja
> va:131)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338
> )
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:24
> 1)
> at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFi
> lterChain.java:235)
> at 
> (etc.)
> 
> Is this a bug, or am I doing it wrong?
> 
> Cheers,
> 
> Andrew.
> 
> -- 
> View this message in context:
> http://old.nabble.com/NullPointerException-with-TermVectorComponent-tp26156903p26156903.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Problems downloading lucene 2.9.1

2009-11-02 Thread Ryan McKinley


On Nov 2, 2009, at 8:29 AM, Grant Ingersoll wrote:



On Nov 2, 2009, at 12:12 AM, Licinio Fernández Maurelo wrote:


Hi folks,

as we are using an snapshot dependecy to solr1.4, today we are  
getting
problems when maven try to download lucene 2.9.1 (there isn't a any  
2.9.1

there).

Which repository can i use to download it?


They won't be there until 2.9.1 is officially released.  We are  
trying to speed up the Solr release by piggybacking on the Lucene  
release, but this little bit is the one downside.


Until then, you can add a repo to:

http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/
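
In a pom.xml that would be something like (the repository id is arbitrary):

  <repositories>
    <repository>
      <id>lucene-2.9.1-staging</id>
      <url>http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/</url>
    </repository>
  </repositories>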




Re: adding and updating a lot of document to Solr, metadata extraction etc

2009-11-02 Thread Alexey Serba
Hi Eugene,

> - ability to iterate over all documents, returned in search, as Lucene does
>  provide within a HitCollector instance. We would need to extract and
>  aggregate various fields, stored in index, to group results and aggregate 
> them
>  in some way.
> 
> Also I did not find any way in the tutorial to access the search results with
> all fields to be processed by our application.
>
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr
Check out faceted search; you can probably achieve your goal by using the
Facet Component.

There's also Field Collapsing patch
http://wiki.apache.org/solr/FieldCollapsing


Alex


RE: Solr YUI autocomplete

2009-11-02 Thread Ankit Bhatnagar


Hey Amit,

My index (i.e. Solr) was on a different domain, so I can't use XHR (XHR does not
work with cross-domain proxyless data fetching).

I tried using YUI's DS_ScriptNode but it didn't work.

I completed my task by using jQuery and it worked well with Solr.

-Ankit

-Original Message-
From: Amit Nithian [mailto:anith...@gmail.com] 
Sent: Monday, November 02, 2009 1:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr YUI autocomplete

I've used the YUI autocomplete (albeit not with Solr, which shouldn't matter
here) and it should work with JSON. I did one that simply made XHR calls
over to a method on my server which returned pipe-delimited text, and that
worked fine.

Are you using the XHR data source, and if so, what type are you telling it to
expect? One of the examples on the YUI site is text-based, and I'm sure you
can specify TYPE_JSON or JS_ARRAY too.

- Amit

On Fri, Oct 30, 2009 at 7:04 AM, Ankit Bhatnagar wrote:

>
> Does Solr supports JSONP (JSON with Padding) in the response?
>
> -Ankit
>
>
>
> -Original Message-
> From: Ankit Bhatnagar [mailto:abhatna...@vantage.com]
> Sent: Friday, October 30, 2009 10:27 AM
> To: 'solr-user@lucene.apache.org'
> Subject: Solr YUI autocomplete
>
> Hi Guys,
>
> I have question regarding - how to specify the
>
> I am using YUI autocomplete widget and it expects the JSONP response.
>
>
> http://localhost:8983/solr/select/?q=monitor&version=2.2&start=0&rows=10&indent=on&wt=json&json.wrf=
>
> I am not sure how should I specify the json.wrf=function
>
> Thanks
> Ankit
>


question about collapse.type = adjacent

2009-11-02 Thread michael8

Hi,

I would like to confirm whether 'adjacent' in collapse.type means the documents
(with the same collapse field value) are considered adjacent *after* the
'sort' param from the query has been applied, or *before*?  I would think it
would be *after*, since the collapse feature is primarily meant for presentation
use.

Thanks,
Michael
-- 
View this message in context: 
http://old.nabble.com/question-about-collapse.type-%3D-adjacent-tp26157114p26157114.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: tracking solr response time

2009-11-02 Thread bharath venkatesh
Thanks for the quick response
@yonik

>How much of a latency compared to normal, and what version of Solr are
you using?

Latency is usually around 2-4 secs (sometimes it goes higher than that),
which happens to only 15-20% of the requests; the other 80-85% of
requests are very fast, in milliseconds (around 200,000 requests
happen every day).

@Israel: we are not using the Java client; we are using Python at the
client, with the response formatted in JSON.

@Yonik @Israel: does QTime measure the total time taken at the Solr
server? I am already measuring the time to get the response at the
client end. I would want a means to know how much time the Solr
server is taking to respond (process) once it gets the request, so
that I could identify whether it is a Solr server issue or an internal
network issue.


@Israel: we are using RHEL Server 5 on both client and server. We
have 6 Solr servers; one is acting as master. Both the client and the Solr
servers are on the same network. Those servers are dedicated Solr
servers, except 2 servers which have DB and memcache running; we have
adjusted the load accordingly.







On 11/2/09, Israel Ekpo  wrote:
> On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley
> wrote:
>
>> On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh
>>  wrote:
>> >We are using solr for many of ur products  it is doing quite well
>> > .  But since no of hits are becoming high we are experiencing latency
>> > in certain requests ,about 15% of our requests are suffering a latency
>>
>> How much of a latency compared to normal, and what version of Solr are
>> you using?
>>
>> >  . We are trying to identify  the problem .  It may be due to  network
>> > issue or solr server is taking time to process the request  .   other
>> > than  qtime which is returned along with the response is there any
>> > other way to track solr servers performance ?
>> > how is qtime calculated
>> > , is it the total time from when solr server got the request till it
>> > gave the response ?
>>
>> QTime is the time spent in generating the in-memory representation for
>> the response before the response writer starts streaming it back in
>> whatever format was requested.  The stored fields of returned
>> documents are also loaded at this point (to enable handling of huge
>> response lists w/o storing all in memory).
>>
>> There are normally servlet container logs that can be configured to
>> spit out the real total request time.
>>
>> > can we do some extra logging to track solr servers
>> > performance . ideally I would want to pass some log id along with the
>> > request (query ) to  solr server  and solr server must log the
>> > response time along with that log id .
>>
>> Yep - Solr isn't bothered by params it doesn't know about, so just put
>> logid=xxx and it should also be logged with the other request
>> params.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>
>
>
> If you are not using Java then you may have to track the elapsed time
> manually.
>
> If you are using the SolrJ Java client you may have the following options:
>
> There is a method called getElapsedTime() in
> org.apache.solr.client.solrj.response.SolrResponseBase which is available to
> all the subclasses
>
> I have not used it personally but I think this should return the time spent
> on the client side for that request.
>
> The QTime is not the time on the client side but the time spent internally
> at the Solr server to process the request.
>
> http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/SolrResponseBase.html
>
> http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/QueryResponse.html
>
> Most likely it could be as a result of an internal network issue between the
> two servers or the Solr server is competing with other applications for
> resources.
>
> What operating system is the Solr server running on? Is you client
> application connection to a Solr server on the same network or over the
> internet? Are there other applications like database servers etc running on
> the same machine? If so, then the DB server (or any other application) and
> the Solr server could be competing for resources like CPU, memory etc.
>
> If you are using Tomcat, you can take a look in
> $CATALINA_HOME/logs/catalina.out, there are timestamps there that can also
> guide you.
>
> --
> "Good Enough" is not good enough.
> To give anything less than your best is to sacrifice the gift.
> Quality First. Measure Twice. Cut Once.
>


Re: Solr YUI autocomplete

2009-11-02 Thread Eric Pugh

It does, have you looked at
http://wiki.apache.org/solr/SolJSON?highlight=%28json%29#Using_Solr.27s_JSON_output_for_AJAX.
 
Also, in my book on Solr, there is an example, but using the jquery
autocomplete, which I think was answered earlier on the thread!  Hope that
helps.
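
For example, with a hypothetical callback name the request would look like:

http://localhost:8983/solr/select?q=monitor&wt=json&json.wrf=handleResults

and the response body comes back wrapped as handleResults({...}), ready for
script-tag loading.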



ANKITBHATNAGAR wrote:
> 
> 
> Does Solr supports JSONP (JSON with Padding) in the response?
> 
> -Ankit
>  
> 
> 
> -Original Message-
> From: Ankit Bhatnagar [mailto:abhatna...@vantage.com] 
> Sent: Friday, October 30, 2009 10:27 AM
> To: 'solr-user@lucene.apache.org'
> Subject: Solr YUI autocomplete
> 
> Hi Guys,
> 
> I have question regarding - how to specify the 
> 
> I am using YUI autocomplete widget and it expects the JSONP response.
> 
> http://localhost:8983/solr/select/?q=monitor&version=2.2&start=0&rows=10&indent=on&wt=json&json.wrf=
> 
> I am not sure how should I specify the json.wrf=function
> 
> Thanks
> Ankit
> 
> 

-- 
View this message in context: 
http://old.nabble.com/JQuery-and-autosuggest-tp26130209p26157130.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Cell on web-based files?

2009-11-02 Thread Alexey Serba
> e.g (doesn't work)
> curl http://localhost:8983/solr/update/extract?extractOnly=true
> --data-binary @http://myweb.com/mylocalfile.htm -H "Content-type:text/html"

> You might try remote streaming with Solr (see
> http://wiki.apache.org/solr/SolrConfigXml).

Yes, curl example

curl 
'http://localhost:8080/solr/main_index/extract/?extractOnly=true&indent=on&resource.name=lecture12&stream.url=http%3A//myweb.com/lecture12.ppt'

It works great for me.
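
Note that stream.url only works when remote streaming is enabled in
solrconfig.xml, along the lines of:

  <requestDispatcher handleSelect="true">
    <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />
  </requestDispatcher>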

Alex


RE: Solr YUI autocomplete

2009-11-02 Thread Ankit Bhatnagar

Hey Eric,

That's correct; however, it didn't work with the YUI widget.

I changed my approach to use jQuery for now.
 


-Ankit

-Original Message-
From: Eric Pugh [mailto:ep...@opensourceconnections.com] 
Sent: Monday, November 02, 2009 10:20 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr YUI autocomplete


It does, have you looked at
http://wiki.apache.org/solr/SolJSON?highlight=%28json%29#Using_Solr.27s_JSON_output_for_AJAX.
 
Also, in my book on Solr, there is an example, but using the jquery
autocomplete, which I think was answered earlier on the thread!  Hope that
helps.



ANKITBHATNAGAR wrote:
> 
> 
> Does Solr supports JSONP (JSON with Padding) in the response?
> 
> -Ankit
>  
> 
> 
> -Original Message-
> From: Ankit Bhatnagar [mailto:abhatna...@vantage.com] 
> Sent: Friday, October 30, 2009 10:27 AM
> To: 'solr-user@lucene.apache.org'
> Subject: Solr YUI autocomplete
> 
> Hi Guys,
> 
> I have question regarding - how to specify the 
> 
> I am using YUI autocomplete widget and it expects the JSONP response.
> 
> http://localhost:8983/solr/select/?q=monitor&version=2.2&start=0&rows=10&indent=on&wt=json&json.wrf=
> 
> I am not sure how should I specify the json.wrf=function
> 
> Thanks
> Ankit
> 
> 

-- 
View this message in context: 
http://old.nabble.com/JQuery-and-autosuggest-tp26130209p26157130.html
Sent from the Solr - User mailing list archive at Nabble.com.



storing other files in index directory

2009-11-02 Thread Paul Rosen
Are there any pitfalls to storing an arbitrary text file in the same 
directory as the solr index?


We're slinging different versions of the index around while we're 
testing and it's hard to keep them straight.


I'd like to put a readme.txt file in the directory that contains some 
history about how that index came to be. Is that harmless? Will it be 
ignored by solr, including during optimizations and any other operation, 
and will solr not delete it?


Re: tracking solr response time

2009-11-02 Thread Erick Erickson
Also, how about a sample of a fast and slow query? And is a slow
query only slow the first time it's executed or every time?

Best
Erick

On Mon, Nov 2, 2009 at 9:52 AM, bharath venkatesh <
bharathv6.proj...@gmail.com> wrote:

> Thanks for the quick response
> @yonik
>
> >How much of a latency compared to normal, and what version of Solr are
> you using?
>
> latency is usually around 2-4 secs (some times it goes more than that
> )  which happens  to  only 15-20%  of the request  other  80-85% of
> request are very fast it is in  milli secs ( around 200,000 requests
> happens every day )
>
> @Israel  we are not using java client ..  we  r using  python at the
> client with response formatted in json
>
> @yonikn @Israel   does qtime measure the total time taken at the solr
> server ? I am already measuring the time to get the response  at
> client  end . I would want  a means to know how much time the solr
> server is taking to respond (process ) once it gets the request  . so
> that I could identify whether it is a solr server issue or internal
> network issue
>
>
> @Israel  we are using rhel server  5 on both client and server .. we
> have 6 solr sever . one is acting as master . both client and solr
> sever are on the same network . those servers are dedicated solr
> server except 2 severs which have DB and memcahce running .. we have
> adjusted the load accordingly
>
>
>
>
>
>
>
> On 11/2/09, Israel Ekpo  wrote:
> > On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley
> > wrote:
> >
> >> On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh
> >>  wrote:
> >> >We are using solr for many of ur products  it is doing quite well
> >> > .  But since no of hits are becoming high we are experiencing latency
> >> > in certain requests ,about 15% of our requests are suffering a latency
> >>
> >> How much of a latency compared to normal, and what version of Solr are
> >> you using?
> >>
> >> >  . We are trying to identify  the problem .  It may be due to  network
> >> > issue or solr server is taking time to process the request  .   other
> >> > than  qtime which is returned along with the response is there any
> >> > other way to track solr servers performance ?
> >> > how is qtime calculated
> >> > , is it the total time from when solr server got the request till it
> >> > gave the response ?
> >>
> >> QTime is the time spent in generating the in-memory representation for
> >> the response before the response writer starts streaming it back in
> >> whatever format was requested.  The stored fields of returned
> >> documents are also loaded at this point (to enable handling of huge
> >> response lists w/o storing all in memory).
> >>
> >> There are normally servlet container logs that can be configured to
> >> spit out the real total request time.
> >>
> >> > can we do some extra logging to track solr servers
> >> > performance . ideally I would want to pass some log id along with the
> >> > request (query ) to  solr server  and solr server must log the
> >> > response time along with that log id .
> >>
> >> Yep - Solr isn't bothered by params it doesn't know about, so just put
> >> logid=xxx and it should also be logged with the other request
> >> params.
> >>
> >> -Yonik
> >> http://www.lucidimagination.com
> >>
> >
> >
> >
> > If you are not using Java then you may have to track the elapsed time
> > manually.
> >
> > If you are using the SolrJ Java client you may have the following
> options:
> >
> > There is a method called getElapsedTime() in
> > org.apache.solr.client.solrj.response.SolrResponseBase which is available
> to
> > all the subclasses
> >
> > I have not used it personally but I think this should return the time
> spent
> > on the client side for that request.
> >
> > The QTime is not the time on the client side but the time spent
> internally
> > at the Solr server to process the request.
> >
> >
> http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/SolrResponseBase.html
> >
> >
> http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/QueryResponse.html
> >
> > Most likely it could be as a result of an internal network issue between
> the
> > two servers or the Solr server is competing with other applications for
> > resources.
> >
> > What operating system is the Solr server running on? Is you client
> > application connection to a Solr server on the same network or over the
> > internet? Are there other applications like database servers etc running
> on
> > the same machine? If so, then the DB server (or any other application)
> and
> > the Solr server could be competing for resources like CPU, memory etc.
> >
> > If you are using Tomcat, you can take a look in
> > $CATALINA_HOME/logs/catalina.out, there are timestamps there that can
> also
> > guide you.
> >
> > --
> > "Good Enough" is not good enough.
> > To give anything less than your best is to sacrifice the gift.
> > Quality First. Measure Twice. Cut Once.
> >
>


tokenize after filters

2009-11-02 Thread Joe Calderon
 Is it possible to tokenize a field on whitespace after some filters
have been applied?

For example, "A + W Root Beer":
the field uses a keyword tokenizer to keep the string together, then
it gets converted to "aw root beer" by a custom filter I've made. I
now want to split that up into 3 tokens (aw, root, beer), but it seems
like you can't use a tokenizer after a filter, so what's the best way
of accomplishing this?

thx much

--joe


Re: Annotations and reference types

2009-11-02 Thread Shalin Shekhar Mangar
On Thu, Oct 29, 2009 at 7:57 PM, M. Tinnemeyer  wrote:

> Dear listusers,
>
> Is there a way to store an instance of class A (including the fields from
> "myB") via solr using annotations ?
> The index should look like : id; name; b_id; b_name
>
> --
> Class A {
>
> @Field
> private String id;
> @Field
> private String name;
> @Field
> private B myB;
> }
>
> --
> Class B {
>
> @Field("b_id")
> private String id;
> @Field("B_name")
> private String name;
> }
>
>
No.

I guess you want to represent certain fields in class B and have them as an
attribute in class A (but all fields belong to the same schema). That could
be a worthwhile addition to Solrj. Can you open an issue? A patch would be
even better :)

-- 
Regards,
Shalin Shekhar Mangar.


Re: Question about DIH execution order

2009-11-02 Thread Bertie Shen
Hi Noble,

   I tried to understand your suggestions and played with different variations
according to your reply, but none of them worked. Can you explain it in more
detail?
   Thanks a lot!




BTW, do you mean your solution as follows?


   
   
 
   
 
  
 

 But
   1) There is no TmpCourseId field column.
   2) Can we put two name CourseId and id in the same map? It seems not.





2009/11/1 Noble Paul നോബിള്‍ नोब्ळ् 

> On Sun, Nov 1, 2009 at 11:59 PM, Bertie Shen 
> wrote:
> > Hi folks,
> >
> >  I have the following data-config.xml. Is there a way to
> > let transformation take place after executing SQL "select comment from
> > Rating where Rating.CourseId = ${Course.CourseId}"?  In MySQL database,
> > column CourseId in table Course is integer 1, 2, etc;
> > template transformation will make them like Course:1, Course:2; column
> > CourseId in table Rating is also integer 1, 2, etc.
> >
> >  If transformation happens before executing "select comment from Rating
> > where Rating.CourseId = ${Course.CourseId}", then there will no match for
> > the SQL statement execution.
> >
> >  
> > 
> >   > column="CourseId" template="Course:${Course.CourseId}" name="id"/>
> >  
> >
> >  
> >
> >  
> >
>
> keep the field as follows
>   column="TmpCourseId" name="CourseId"
> template="Course:${Course.CourseId}" name="id"/>
>
>
>
>
> --
> -
> Noble Paul | Principal Engineer| AOL | http://aol.com
>


Re: tracking solr response time

2009-11-02 Thread Israel Ekpo
On Mon, Nov 2, 2009 at 9:52 AM, bharath venkatesh <
bharathv6.proj...@gmail.com> wrote:

> Thanks for the quick response
> @yonik
>
> >How much of a latency compared to normal, and what version of Solr are
> you using?
>
> latency is usually around 2-4 secs (some times it goes more than that
> )  which happens  to  only 15-20%  of the request  other  80-85% of
> request are very fast it is in  milli secs ( around 200,000 requests
> happens every day )
>
> @Israel  we are not using java client ..  we  r using  python at the
> client with response formatted in json
>
> @yonikn @Israel   does qtime measure the total time taken at the solr
> server ? I am already measuring the time to get the response  at
> client  end . I would want  a means to know how much time the solr
> server is taking to respond (process ) once it gets the request  . so
> that I could identify whether it is a solr server issue or internal
> network issue
>

It is the time spent at the Solr server.

I think Yonik already answered this part in his response to your thread :

This is what he said :

QTime is the time spent in generating the in-memory representation for
the response before the response writer starts streaming it back in
whatever format was requested.  The stored fields of returned
documents are also loaded at this point (to enable handling of huge
response lists w/o storing all in memory).


>
> @Israel  we are using rhel server  5 on both client and server .. we
> have 6 solr sever . one is acting as master . both client and solr
> sever are on the same network . those servers are dedicated solr
> server except 2 severs which have DB and memcahce running .. we have
> adjusted the load accordingly
>
>
>
>
>
>
>
> On 11/2/09, Israel Ekpo  wrote:
> > On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley
> > wrote:
> >
> >> On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh
> >>  wrote:
> >> >We are using solr for many of ur products  it is doing quite well
> >> > .  But since no of hits are becoming high we are experiencing latency
> >> > in certain requests ,about 15% of our requests are suffering a latency
> >>
> >> How much of a latency compared to normal, and what version of Solr are
> >> you using?
> >>
> >> >  . We are trying to identify  the problem .  It may be due to  network
> >> > issue or solr server is taking time to process the request  .   other
> >> > than  qtime which is returned along with the response is there any
> >> > other way to track solr servers performance ?
> >> > how is qtime calculated
> >> > , is it the total time from when solr server got the request till it
> >> > gave the response ?
> >>
> >> QTime is the time spent in generating the in-memory representation for
> >> the response before the response writer starts streaming it back in
> >> whatever format was requested.  The stored fields of returned
> >> documents are also loaded at this point (to enable handling of huge
> >> response lists w/o storing all in memory).
> >>
> >> There are normally servlet container logs that can be configured to
> >> spit out the real total request time.
> >>
> >> > can we do some extra logging to track solr servers
> >> > performance . ideally I would want to pass some log id along with the
> >> > request (query ) to  solr server  and solr server must log the
> >> > response time along with that log id .
> >>
> >> Yep - Solr isn't bothered by params it doesn't know about, so just put
> >> logid=xxx and it should also be logged with the other request
> >> params.
> >>
> >> -Yonik
> >> http://www.lucidimagination.com
> >>
> >
> >
> >
> > If you are not using Java then you may have to track the elapsed time
> > manually.
> >
> > If you are using the SolrJ Java client you may have the following
> options:
> >
> > There is a method called getElapsedTime() in
> > org.apache.solr.client.solrj.response.SolrResponseBase which is available
> to
> > all the subclasses
> >
> > I have not used it personally but I think this should return the time
> spent
> > on the client side for that request.
> >
> > The QTime is not the time on the client side but the time spent
> internally
> > at the Solr server to process the request.
> >
> >
> http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/SolrResponseBase.html
> >
> >
> http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/QueryResponse.html
> >
> > Most likely it could be as a result of an internal network issue between
> the
> > two servers or the Solr server is competing with other applications for
> > resources.
> >
> > What operating system is the Solr server running on? Is you client
> > application connection to a Solr server on the same network or over the
> > internet? Are there other applications like database servers etc running
> on
> > the same machine? If so, then the DB server (or any other application)
> and
> > the Solr server could be competing for resources like CPU, memory etc.
> >
> > If you are using Tomca

Re: CPU utilization and query time high on Solr slave when snapshot install

2009-11-02 Thread Walter Underwood
If you are going to pull a new index every 10 minutes, try turning off  
cache autowarming.


Your caches are never more than 10 minutes old, so spending a minute  
warming each new cache is a waste of CPU. Autowarm submits queries to  
the new Searcher before putting it in service. This will create a  
burst of query load on the new Searcher, often keeping one CPU pretty  
busy for several seconds.


In solrconfig.xml, set autowarmCount to 0.
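
A minimal sketch for the main caches (the sizes shown are just the stock
example values):

  <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>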

Also, if you want the slaves to always have an optimized index, create  
the snapshot only in post-optimize. If you create snapshots in both  
post-commit and post-optimize, you are creating a non-optimized index  
(post-commit), then replacing it with an optimized one a few minutes  
later. A slave might get a non-optimized index one time, then an  
optimized one the next.
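
With the 1.3-style scripts that means keeping only a postOptimize listener for
snapshooter, roughly like the stock solrconfig.xml example (paths here are
illustrative):

  <listener event="postOptimize" class="solr.RunExecutableListener">
    <str name="exe">solr/bin/snapshooter</str>
    <str name="dir">.</str>
    <bool name="wait">true</bool>
  </listener>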


wunder

On Nov 2, 2009, at 1:45 AM, biku...@sapient.com wrote:


Hi Solr Gurus,

We have solr in 1 master, 2 slave configuration. Snapshot is created  
post commit, post optimization. We have autocommit after 50  
documents or 5 minutes. Snapshot puller runs as a cron every 10  
minutes. What we have observed is that whenever snapshot is  
installed on the slave, we see solrj client used to query slave  
solr, gets timedout and there is high CPU usage/load avg. on slave  
server. If we stop snapshot puller, then slaves work with no issues.  
The system has been running since 2 months and this issue has  
started to occur only now  when load on website is increasing.


Following are some details:

Solr Details:
apache-solr Version: 1.3.0
Lucene - 2.4-dev

Master/Slave configurations:

Master:
- for indexing data HTTPRequests are made on Solr server.
- autocommit feature is enabled for 50 docs and 5 minutes
- caching params are disable for this server
- mergeFactor of 10 is set
- we were running optimize script after every 2 hours, but now have  
reduced the duration to twice a day but issue still persists


Slave1/Slave2:
- standard requestHandler is being used
- default values of caching are set
Machine Specifications:

Master:
- 4GB RAM
- 1GB JVM Heap memory is allocated to Solr

Slave1/Slave2:
- 4GB RAM
- 2GB JVM Heap memory is allocated to Solr

Master and Slave1 (solr1)are on single box and Slave2(solr2) on  
different box. We use HAProxy to load balance query requests between  
2 slaves. Master is only used for indexing.
Please let us know if somebody has ever faced similar kind of issue  
or has some insight into it as we guys are literally struck at the  
moment with a very unstable production environment.


As a workaround, we have started running optimize on master every 7  
minutes. This seems to have reduced the severity of the problem but  
still issue occurs every 2days now. please suggest what could be the  
root cause of this.


Thanks,
Bipul








Re: tracking solr response time

2009-11-02 Thread bharath venkatesh
@Israel: yes, I got the point which Yonik mentioned, but is QTime the
total time taken by the Solr server for that request, or is it part of the time
taken by Solr for that request (is there anything that the Solr server
does for that particular request which is not included in that QTime
bracket)? I am sorry for dragging on about QTime; I just want to be
sure, as we have observed many times that there is a huge mismatch between QTime and
the time measured at the client for the response (does this imply it is due to
an internal network issue?).

@Erick: yes, many times a query is slow the first time it's executed. Is there any
solution to improve upon this factor? For querying we use the
DisMaxRequestHandler; queries are quite long, with many faceting parameters.


On Mon, Nov 2, 2009 at 10:46 PM, Israel Ekpo  wrote:

> On Mon, Nov 2, 2009 at 9:52 AM, bharath venkatesh <
> bharathv6.proj...@gmail.com> wrote:
>
> > Thanks for the quick response
> > @yonik
> >
> > >How much of a latency compared to normal, and what version of Solr are
> > you using?
> >
> > latency is usually around 2-4 secs (some times it goes more than that
> > )  which happens  to  only 15-20%  of the request  other  80-85% of
> > request are very fast it is in  milli secs ( around 200,000 requests
> > happens every day )
> >
> > @Israel  we are not using java client ..  we  r using  python at the
> > client with response formatted in json
> >
> > @yonikn @Israel   does qtime measure the total time taken at the solr
> > server ? I am already measuring the time to get the response  at
> > client  end . I would want  a means to know how much time the solr
> > server is taking to respond (process ) once it gets the request  . so
> > that I could identify whether it is a solr server issue or internal
> > network issue
> >
>
> It is the time spent at the Solr server.
>
> I think Yonik already answered this part in his response to your thread :
>
> This is what he said :
>
> QTime is the time spent in generating the in-memory representation for
> the response before the response writer starts streaming it back in
> whatever format was requested.  The stored fields of returned
> documents are also loaded at this point (to enable handling of huge
> response lists w/o storing all in memory).
>
>
> >
> > @Israel  we are using rhel server  5 on both client and server .. we
> > have 6 solr sever . one is acting as master . both client and solr
> > sever are on the same network . those servers are dedicated solr
> > server except 2 severs which have DB and memcahce running .. we have
> > adjusted the load accordingly
> >
> >
> >
> >
> >
> >
> >
> > On 11/2/09, Israel Ekpo  wrote:
> > > On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley
> > > wrote:
> > >
> > >> On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh
> > >>  wrote:
> > >> >We are using solr for many of ur products  it is doing quite well
> > >> > .  But since no of hits are becoming high we are experiencing
> latency
> > >> > in certain requests ,about 15% of our requests are suffering a
> latency
> > >>
> > >> How much of a latency compared to normal, and what version of Solr are
> > >> you using?
> > >>
> > >> >  . We are trying to identify  the problem .  It may be due to
>  network
> > >> > issue or solr server is taking time to process the request  .
> other
> > >> > than  qtime which is returned along with the response is there any
> > >> > other way to track solr servers performance ?
> > >> > how is qtime calculated
> > >> > , is it the total time from when solr server got the request till it
> > >> > gave the response ?
> > >>
> > >> QTime is the time spent in generating the in-memory representation for
> > >> the response before the response writer starts streaming it back in
> > >> whatever format was requested.  The stored fields of returned
> > >> documents are also loaded at this point (to enable handling of huge
> > >> response lists w/o storing all in memory).
> > >>
> > >> There are normally servlet container logs that can be configured to
> > >> spit out the real total request time.
> > >>
> > >> > can we do some extra logging to track solr servers
> > >> > performance . ideally I would want to pass some log id along with
> the
> > >> > request (query ) to  solr server  and solr server must log the
> > >> > response time along with that log id .
> > >>
> > >> Yep - Solr isn't bothered by params it doesn't know about, so just put
> > >> logid=xxx and it should also be logged with the other request
> > >> params.
> > >>
> > >> -Yonik
> > >> http://www.lucidimagination.com
> > >>
> > >
> > >
> > >
> > > If you are not using Java then you may have to track the elapsed time
> > > manually.
> > >
> > > If you are using the SolrJ Java client you may have the following
> > options:
> > >
> > > There is a method called getElapsedTime() in
> > > org.apache.solr.client.solrj.response.SolrResponseBase which is
> available
> > to
> > > all the subclasses
> > >
> > > I have not used it personally but I think

RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Any thoughts regarding the subject? I hope FieldCache doesn't use more than
6 bytes per document-field instance... I am too lazy to research Lucene
source code, I hope someone can provide exact answer... Thanks


> Subject: Lucene FieldCache memory requirements
> 
> Hi,
> 
> 
> Can anyone confirm Lucene FieldCache memory requirements? I have 100
> millions docs with non-tokenized field "country" (10 different countries);
I
> expect it requires array of ("int", "long"), size of array 100,000,000,
> without any impact of "country" field length;
> 
> it requires 600,000,000 bytes: "int" is pointer to document (Lucene
document
> ID),  and "long" is pointer to String value...
> 
> Am I right, is it 600Mb just for this "country" (indexed, non-tokenized,
> non-boolean) field and 100 millions docs? I need to calculate exact
minimum RAM
> requirements...
> 
> I believe it shouldn't depend on cardinality (distribution) of field...
> 
> Thanks,
> Fuad
> 
> 
> 
> 





Re: Lucene FieldCache memory requirements

2009-11-02 Thread Michael McCandless
Which FieldCache API are you using?  getStrings?  or getStringIndex
(which is used, under the hood, if you sort by this field).

Mike

On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi  wrote:
> Any thoughts regarding the subject? I hope FieldCache doesn't use more than
> 6 bytes per document-field instance... I am too lazy to research Lucene
> source code, I hope someone can provide exact answer... Thanks
>
>
>> Subject: Lucene FieldCache memory requirements
>>
>> Hi,
>>
>>
>> Can anyone confirm Lucene FieldCache memory requirements? I have 100
>> millions docs with non-tokenized field "country" (10 different countries);
> I
>> expect it requires array of ("int", "long"), size of array 100,000,000,
>> without any impact of "country" field length;
>>
>> it requires 600,000,000 bytes: "int" is pointer to document (Lucene
> document
>> ID),  and "long" is pointer to String value...
>>
>> Am I right, is it 600Mb just for this "country" (indexed, non-tokenized,
>> non-boolean) field and 100 millions docs? I need to calculate exact
> minimum RAM
>> requirements...
>>
>> I believe it shouldn't depend on cardinality (distribution) of field...
>>
>> Thanks,
>> Fuad
>>
>>
>>
>>
>
>
>
>


LocalSolr, Maven, build files and release candidates (Just for info) and spatial radius (A question)

2009-11-02 Thread Ian Ibbotson
Hallo All. I've been trying to prepare a project using localsolr for the
impending (I hope) arrival of solr 1.4 and Lucene 2.9.1. Here are some
notes in case anyone else is suffering similarly. Obviously everything here
may change by next week.

First problem has been the lack of any stable maven based lucene and solr
artifacts to wire into my poms. Because of that, and as an interim only
measure, I've built the latest branches of the lucene 2.9.1 and solr 1.4
trees and made them into a *temporary* maven repository at
http://developer.k-int.com/m2snapshots/. In there you can find all the jar
artifacts tagged as xxx-ki-rc1 (For solr) and xxx-ki-rc3 (For lucene) and
finally, a localsolr.localsolr build tagged as 1.5.2-rc1. Sorry for the
naming, but I don't want these artifacts to clash with the real ones when
they come along. This is really just for my own use, but I've seen messages
and spoken to people who are really struggling to get their maven deps
right, if this helps anyone, please feel free to use these until the real
apache artifacts appear. I can't take any responsibility for their quality.
All the poms have been altered to look for the correct dependent artifacts
in the same repository, adding the stanza

  <repositories>
    <repository>
      <id>k-int-m2-snapshots</id>
      <name>K-int M2 Snapshots</name>
      <url>http://developer.k-int.com/m2snapshots</url>
      <snapshots>
        <enabled>true</enabled>
      </snapshots>
    </repository>
  </repositories>

to your pom will let you use these deps temporarily until we see an official
build. If you're a maven developer and I've gone way around the houses with
this, please tell me of an easier solution :) This repo *will* go away when
the real builds turn up.

The localsolr in this repo also contains the patches I've submitted (a good
while ago) to the localsolr project to make it build with the lucene 2.9.1
rc3, as the downloadable dist is currently built against an older 2.9 release
that had a different API (i.e. it won't work with the new lucene and solr).

All this means that there is a working localsolr build.

Second up, I've also seen emails (And seen the exception myself) around
asking about the following when trying to get all these revisions working
together.

java.lang.NumberFormatException: Invalid shift value in prefixCoded string
(is encoded value really a LONG?)

There are some threads out there telling you that the Lucene indexes are not
binary compatible between versions, but if you're using localsolr, what you
really need to know is:

1) Make sure that your schema.xml contains at least the following fieldType
defs

   

2) Convert your old solr sdouble fields to tdoubles:

  
  
  

Pretty sure you would need to rebuild your indexes.
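
In case the XML above doesn't come through, roughly what 1) and 2) end up
looking like (the tdouble type is the stock Solr 1.4 Trie double type; the
lat/lng and _local* names are just the ones my schema uses, yours may differ):

  <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8"
             omitNorms="true" positionIncrementGap="0"/>

  <field name="lat" type="tdouble" indexed="true" stored="true"/>
  <field name="lng" type="tdouble" indexed="true" stored="true"/>
  <dynamicField name="_local*" type="tdouble" indexed="true" stored="true"/>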

Ok, with those changes I managed to get a working spatial search.

My only problem now is that the radius param on the command line seems to
need to be way bigger than it should be in order to find anything.
Specifically, if I search with a radius of 220 I get a record back which
marks its geo_distance as "83.76888211666025". Shuffling the radius around
shows that a radius of 205 returns that doc, while at 204 it's filtered out. I'm
going to dig into this now, but if anyone knows about this I'd really
appreciate any help.

Cheers all, hope this is of use to someone out there, if anyone has
corrections/comments I'd really appreciate any info.

Best,
Ian.


Re: question about collapse.type = adjacent

2009-11-02 Thread Martijn v Groningen
Hi Michael,

Field collapsing is basically done in two steps. The first step is to
get the uncollapsed, sorted (whether by score or by a field value)
documents, and the second step is to apply the collapse algorithm on
the uncollapsed documents. So yes, when specifying
collapse.type=adjacent the documents can get collapsed after the sort
has been applied, but this is also the case when not specifying
collapse.type=adjacent.
I hope this answers your question.

Cheers,

Martijn

2009/11/2 michael8 :
>
> Hi,
>
> I would like to confirm if 'adjacent' in collapse.type means the documents
> (with the same collapse field value) are considered adjacent *after* the
> 'sort' param from the query has been applied, or *before*?  I would think it
> would be *after* since collapse feature primarily is meant for presentation
> use.
>
> Thanks,
> Michael
> --
> View this message in context: 
> http://old.nabble.com/question-about-collapse.type-%3D-adjacent-tp26157114p26157114.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
Met vriendelijke groet,

Martijn van Groningen


apply a patch on solr

2009-11-02 Thread michael8

Hi,

First, I'd like to beg pardon for my novice question on patching solr (1.4).  What I'd
like to know is: given a patch, like the one for collapse field, how would
one go about knowing what solr source that patch is meant for, since this is
a source-level patch?  Wouldn't the exact versions of the java files to
be patched be critical for the patch to work properly?

So far what I have done is to pull the latest collapse field patch down from
http://issues.apache.org/jira/browse/SOLR-236 (field-collapse-5.patch), and
then svn up the latest trunk from
http://svn.apache.org/repos/asf/lucene/solr/trunk/, then patch and build. 
Intuitively I was thinking I should be doing svn up to a specific
revision/tag instead of just latest.  So far everything seems fine, but I
just want to make sure I'm doing the right thing and not just being lucky.

Thanks,
Michael
-- 
View this message in context: 
http://old.nabble.com/apply-a-patch-on-solr-tp26157826p26157826.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
I am not using Lucene API directly; I am using SOLR which uses Lucene
FieldCache for faceting on non-tokenized fields...
I think this cache will be lazily loaded, until user executes sorted (by
this field) SOLR query for all documents *:* - in this case it will be fully
populated...


> Subject: Re: Lucene FieldCache memory requirements
> 
> Which FieldCache API are you using?  getStrings?  or getStringIndex
> (which is used, under the hood, if you sort by this field).
> 
> Mike
> 
> On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi  wrote:
> > Any thoughts regarding the subject? I hope FieldCache doesn't use more
than
> > 6 bytes per document-field instance... I am too lazy to research Lucene
> > source code, I hope someone can provide exact answer... Thanks
> >
> >
> >> Subject: Lucene FieldCache memory requirements
> >>
> >> Hi,
> >>
> >>
> >> Can anyone confirm Lucene FieldCache memory requirements? I have 100
> >> millions docs with non-tokenized field "country" (10 different
countries);
> > I
> >> expect it requires array of ("int", "long"), size of array 100,000,000,
> >> without any impact of "country" field length;
> >>
> >> it requires 600,000,000 bytes: "int" is pointer to document (Lucene
> > document
> >> ID),  and "long" is pointer to String value...
> >>
> >> Am I right, is it 600Mb just for this "country" (indexed,
non-tokenized,
> >> non-boolean) field and 100 millions docs? I need to calculate exact
> > minimum RAM
> >> requirements...
> >>
> >> I believe it shouldn't depend on cardinality (distribution) of field...
> >>
> >> Thanks,
> >> Fuad
> >>
> >>
> >>
> >>
> >
> >
> >
> >




Dismax and Standard Queries together

2009-11-02 Thread ram_sj

Hi,

I have three fields, business_name, category_name, sub_category_name in my
solrconfig file.

my query = "pet clinic"

example sub_category_names: Veterinarians, Kennels, Veterinary Clinics &
Hospitals, Pet Grooming, Pet Stores, Clinics

my ideal requirement is dismax searching on:

a. dismax over three or two fields
b. followed by a Boolean match over any one of the fields, which is acceptable.

I played around with the minimum-match attribute, but it doesn't seem to be
helpful; I guess dismax requires at least two fields.

The nested queries take only one qf field, so they don't help much either.

Any suggestions will be helpful.

Thanks
Ram
-- 
View this message in context: 
http://old.nabble.com/Dismax-and-Standard-Queries-together-tp26157830p26157830.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: tokenize after filters

2009-11-02 Thread Steven A Rowe
I think you want Koji Sekiguchi's Char Filters:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?highlight=char+filters#Char_Filters
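
Roughly, the idea is to move the normalization in front of the tokenizer, e.g.
(a sketch; the fieldType name and mapping file are placeholders, and the custom
filter logic may need to be reimplemented as a custom CharFilter rather than a
mapping file):

  <fieldType name="text_mapped" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- rewrites the raw text before tokenization, e.g. "A + W Root Beer" -> "aw root beer" -->
      <charFilter class="solr.MappingCharFilterFactory" mapping="my-mappings.txt"/>
      <!-- the whitespace tokenizer then sees the rewritten text and splits it into tokens -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>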

Steve

> -Original Message-
> From: Joe Calderon [mailto:calderon@gmail.com]
> Sent: Monday, November 02, 2009 11:25 AM
> To: solr-user@lucene.apache.org
> Subject: tokenize after filters
> 
>  is it possible to tokenize a field on whitespace after some filters
> have been applied:
> 
> ex: "A + W Root Beer"
> the field uses a keyword tokenizer to keep the string together, then
> it will get converted to "aw root beer" by a custom filter ive made, i
> now want to split that up into 3 tokens (aw, root, beer), but seems
> like you cant use a tokenizer after a filter ... so whats the best way
> of accomplishing this?
> 
> thx much
> 
> --joe


field queries seem slow

2009-11-02 Thread mike anderson
I took a look through my Solr logs this weekend and noticed that the longest
queries were on particular fields, like "author:albert einstein". Is this a
result consistent with other setups out there? If not, Is there a trick to
make these go faster? I've read up on filter queries and use those when
applicable, but they don't really solve all my problems.

If anybody wants to take a shot at it but needs to see my solrconfig, etc
just let me know.

Cheers,
Mike


manually creating indices to speed up indexing with app-knowledge

2009-11-02 Thread Britske

This may seem like a strange question, but here it goes anyway. 

I'm considering the possibility of constructing, at a low level, the indices for about
20.000 indexed fields (type sInt), if at all possible. (With indices in this
context I mean the inverted indices from term to document id, just to be 100%
complete.)
These indices have to be recreated each night, along with the normal
reindex. 

Globally it should go something like this (each night) : 
 - documents (consisting of about 20 stored fields and about 10 stored &
indexed fields) are indexed through the normal 'code-path' (solrJ in my
case) 
- After all docs are persisted (max 200.000) I want to extract the mapping
from 'lucene docid' --> 'stored/indexed product key'
I believe this should work, because after all docs are persisted the
internal docids aren't altered, so the relationship between 'lucene docid'
--> 'stored/indexed product key' is invariant from that point forward.
(please correct if wrong) 
- construct the 20.000 inverted indices on such a low enough level that I do
not have to go through IndexWriter if possible, so  I do not need to
construct Documents, I only need to construct the native format of the
indices themselves. Ideally this should work on multiple servers so that the
indices can be created in parallel and the index-files later simply copied
to the index-directory of the master. 

Basically what it boils down to is that indexing time (a reindex should be
done each night) is a big show-stopper at the moment, although we've tried
and tested all the more standard optimization tricks & techniques, as well
as having built a home-grown shard-like indexing strategy which uses 20
pretty big servers in parallel. The 20.000 indexed fields are still simply
killing us.

At the same time the app has a lot of knowledge of the 20.000 indices. 
- All indices consist of prices (ints) between 0 and 10.000
- and most important: as part of the document construction process the
ordering of each of the 20.000 indices is known for all documents that are
processed by the document-construction server in question. (This part is
needed, and is already performing at light speed) 

for sake of argument say we have 5 document-construction servers. Each
server processes 40.000 documents. Each server has 20.000 ordered indices in
its own format readily available for the 40.000 documents it's processing. 
Something like: LinkedHashMap> --> 


Say we have 20 indexing servers. Each server has to calculate 1.000 indices
(totalling the 20.000) 
We have the 5 doc-construction servers distribute the ordered sub-indices to
the correct servers. 
Each server constructs an index from 5 ordered sub-indices coming from 5
different construction-servers. This can be done efficiently using a
mergesort (since the sub-indices are already sorted) 

All that is missing (oversimplifying here) is going from the ordered
indices in application format to the index format of Lucene (substituting
the product ids with the Lucene docids along the way) and streaming it to disk.
I believe this would quite possibly give a really big indexing improvement.

Is my thinking correct in the steps involved?
Do you believe that this would indeed give a big speedup for this specific
situation?
Where would I hook into the Solr / Lucene code to construct the native format?


Thanks in advance (and for making it to here) 

Geert-Jan

-- 
View this message in context: 
http://old.nabble.com/manually-creating-indices-to-speed-up-indexing-with-app-knowledge-tp26157851p26157851.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: apply a patch on solr

2009-11-02 Thread mike anderson
You can see what revision the patch was written for at the top of the patch,
it will look like this:

Index: org/apache/solr/handler/MoreLikeThisHandler.java
===
--- org/apache/solr/handler/MoreLikeThisHandler.java (revision 772437)
+++ org/apache/solr/handler/MoreLikeThisHandler.java (working copy)

now check out revision 772437 using the --revision switch in svn, patch
away, and then svn up to make sure everything merges cleanly.  This is a
good guide to follow as well:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg10189.html
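
Roughly, using the revision from the example header above and the collapse
patch as the file name (run patch from the directory the paths in the patch
are relative to):

  svn checkout -r 772437 http://svn.apache.org/repos/asf/lucene/solr/trunk solr-trunk
  cd solr-trunk
  patch -p0 < field-collapse-5.patch    # apply against the revision the patch was made for
  svn up                                # then merge up to the latest trunk
  ant dist                              # and rebuild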

cheers,
-mike

On Mon, Nov 2, 2009 at 3:55 PM, michael8  wrote:

>
> Hi,
>
> First I like to pardon my novice question on patching solr (1.4).  What I
> like to know is, given a patch, like the one for collapse field, how would
> one go about knowing what solr source that patch is meant for since this is
> a source level patch?  Wouldn't the exact versions of a set of java files
> to
> be patched critical for the patch to work properly?
>
> So far what I have done is to pull the latest collapse field patch down
> from
> http://issues.apache.org/jira/browse/SOLR-236 (field-collapse-5.patch),
> and
> then svn up the latest trunk from
> http://svn.apache.org/repos/asf/lucene/solr/trunk/, then patch and build.
> Intuitively I was thinking I should be doing svn up to a specific
> revision/tag instead of just latest.  So far everything seems fine, but I
> just want to make sure I'm doing the right thing and not just being lucky.
>
> Thanks,
> Michael
> --
> View this message in context:
> http://old.nabble.com/apply-a-patch-on-solr-tp26157827p26157827.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


highlighting error using 1.4rc

2009-11-02 Thread Jake Brownell
Hi,

I've tried installing the latest (3rd) RC for Solr 1.4 and Lucene 2.9.1. One of 
our integration tests, which runs against an embedded server, appears to be 
failing on highlighting. I've included the stack trace and the configuration 
from solrconf. I'd appreciate any insights. Please let me know what additional 
information would be useful.


Caused by: org.apache.solr.client.solrj.SolrServerException: 
org.apache.solr.client.solrj.SolrServerException: java.lang.ClassCastException: 
org.apache.lucene.search.spans.SpanOrQuery cannot be cast to 
org.apache.lucene.search.spans.SpanNearQuery
at 
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:153)
at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
at 
org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
at 
org.bookshare.search.solr.SolrSearchServerWrapper.query(SolrSearchServerWrapper.java:96)
... 29 more
Caused by: org.apache.solr.client.solrj.SolrServerException: 
java.lang.ClassCastException: org.apache.lucene.search.spans.SpanOrQuery cannot 
be cast to org.apache.lucene.search.spans.SpanNearQuery
at 
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:141)
... 32 more
Caused by: java.lang.ClassCastException: 
org.apache.lucene.search.spans.SpanOrQuery cannot be cast to 
org.apache.lucene.search.spans.SpanNearQuery
at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields(WeightedSpanTermExtractor.java:489)
at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields(WeightedSpanTermExtractor.java:484)
at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms(WeightedSpanTermExtractor.java:249)
at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:230)
at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158)
at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:414)
at 
org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:216)
at 
org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:184)
at 
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:226)
at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:335)
at 
org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:203)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)
... 32 more

I see in our solrconf the following for highlighting.

  <highlighting>
   <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
    <lst name="defaults">
     <int name="hl.fragsize">100</int>
    </lst>
   </fragmenter>

   <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
     <int name="hl.fragsize">70</int>
     <float name="hl.regex.slop">0.5</float>
     <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
    </lst>
   </fragmenter>

   <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
    <lst name="defaults">
     <str name="hl.simple.pre"><![CDATA[<em>]]></str>
     <str name="hl.simple.post"><![CDATA[</em>]]></str>
    </lst>
   </formatter>
  </highlighting>



Thanks,
Jake


Question regarding snapinstaller

2009-11-02 Thread Prasanna Ranganathan

 It looks like the snapinstaller script does an atomic remove and replace of
the entire solr_home/data_dir/index folder with the contents of the new
snapshot before issuing a commit command. I am trying to understand the
implication of the same.

 What happens to queries that come during the time interval between the
instant the existing directory is removed and the commit command gets
finalized? Does a currently running instance of Solr not need the files in
the index folder to serve the query results? Are all the contents of the
index folder loaded into memory?
 
 Thanks in advance for any help.

Regards,

Prasanna.


Re: tracking solr response time

2009-11-02 Thread Erick Erickson
So I need someone with better knowledge to chime in here with an opinion
on whether autowarming would help since the whole faceting thing is
something
I'm not very comfortable with...



Erick

On Mon, Nov 2, 2009 at 2:21 PM, bharath venkatesh <
bharathv6.proj...@gmail.com> wrote:

> @Israel: yes I got that point which yonik mentioned .. but is qtime the
> total time taken by solr server for that request or  is it  part of time
> taken by the solr for that request ( is there any thing that a solr server
> does for that particulcar request which is not included in that qtime
> bracket ) ?  I am sorry for dragging in to this qtime. I just want to be
> sure, as we observed many times there is huge mismatch between qtime and
> time measured at the client for the response ( does this imply it is due to
> internal network issue )
>
> @Erick: yes, many times query is slow first time its executed is there any
> solution to improve upon this factor .. for querying we use
> DisMaxRequestHandler , queries are quite long with many faceting parameters
> .
>
>
> On Mon, Nov 2, 2009 at 10:46 PM, Israel Ekpo  wrote:
>
> > On Mon, Nov 2, 2009 at 9:52 AM, bharath venkatesh <
> > bharathv6.proj...@gmail.com> wrote:
> >
> > > Thanks for the quick response
> > > @yonik
> > >
> > > >How much of a latency compared to normal, and what version of Solr are
> > > you using?
> > >
> > > latency is usually around 2-4 secs (some times it goes more than that
> > > )  which happens  to  only 15-20%  of the request  other  80-85% of
> > > request are very fast it is in  milli secs ( around 200,000 requests
> > > happens every day )
> > >
> > > @Israel  we are not using java client ..  we  r using  python at the
> > > client with response formatted in json
> > >
> > > @yonikn @Israel   does qtime measure the total time taken at the solr
> > > server ? I am already measuring the time to get the response  at
> > > client  end . I would want  a means to know how much time the solr
> > > server is taking to respond (process ) once it gets the request  . so
> > > that I could identify whether it is a solr server issue or internal
> > > network issue
> > >
> >
> > It is the time spent at the Solr server.
> >
> > I think Yonik already answered this part in his response to your thread :
> >
> > This is what he said :
> >
> > QTime is the time spent in generating the in-memory representation for
> > the response before the response writer starts streaming it back in
> > whatever format was requested.  The stored fields of returned
> > documents are also loaded at this point (to enable handling of huge
> > response lists w/o storing all in memory).
> >
> >
> > >
> > > @Israel  we are using rhel server  5 on both client and server .. we
> > > have 6 solr sever . one is acting as master . both client and solr
> > > sever are on the same network . those servers are dedicated solr
> > > server except 2 severs which have DB and memcahce running .. we have
> > > adjusted the load accordingly
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On 11/2/09, Israel Ekpo  wrote:
> > > > On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley
> > > > wrote:
> > > >
> > > >> On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh
> > > >>  wrote:
> > > >> >We are using solr for many of ur products  it is doing quite
> well
> > > >> > .  But since no of hits are becoming high we are experiencing
> > latency
> > > >> > in certain requests ,about 15% of our requests are suffering a
> > latency
> > > >>
> > > >> How much of a latency compared to normal, and what version of Solr
> are
> > > >> you using?
> > > >>
> > > >> >  . We are trying to identify  the problem .  It may be due to
> >  network
> > > >> > issue or solr server is taking time to process the request  .
> > other
> > > >> > than  qtime which is returned along with the response is there any
> > > >> > other way to track solr servers performance ?
> > > >> > how is qtime calculated
> > > >> > , is it the total time from when solr server got the request till
> it
> > > >> > gave the response ?
> > > >>
> > > >> QTime is the time spent in generating the in-memory representation
> for
> > > >> the response before the response writer starts streaming it back in
> > > >> whatever format was requested.  The stored fields of returned
> > > >> documents are also loaded at this point (to enable handling of huge
> > > >> response lists w/o storing all in memory).
> > > >>
> > > >> There are normally servlet container logs that can be configured to
> > > >> spit out the real total request time.
> > > >>
> > > >> > can we do some extra logging to track solr servers
> > > >> > performance . ideally I would want to pass some log id along with
> > the
> > > >> > request (query ) to  solr server  and solr server must log the
> > > >> > response time along with that log id .
> > > >>
> > > >> Yep - Solr isn't bothered by params it doesn't know about, so just
> put
> > > >> logid=xxx and it should also be logged with the other reques

Re: field queries seem slow

2009-11-02 Thread Erick Erickson
Hmmm, are you sorting? And have your readers been reopened? Is the
second query of that sort also slow? If the answer to this last question is
"no",
have you tried some autowarming queries?

Best
Erick

On Mon, Nov 2, 2009 at 4:34 PM, mike anderson wrote:

> I took a look through my Solr logs this weekend and noticed that the
> longest
> queries were on particular fields, like "author:albert einstein". Is this a
> result consistent with other setups out there? If not, Is there a trick to
> make these go faster? I've read up on filter queries and use those when
> applicable, but they don't really solve all my problems.
>
> If anybody wants to take a shot at it but needs to see my solrconfig, etc
> just let me know.
>
> Cheers,
> Mike
>


Re: Question about DIH execution order

2009-11-02 Thread Fergus McMenemie
Bertie,

Not sure what you are trying to do; we need a clearer description of
what "select *" returns and what you want to end up in the index. But 
to answer your question: the transformations happen after DIH has
performed the SQL statement. In fact, the rows output by the SQL
command are assigned to the DIH fields and then any transformations
are applied. The examples in 
http://wiki.apache.org/solr/DataImportHandler
are quite good.
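
In data-config.xml terms the ordering looks roughly like this (a sketch using
the tables from your earlier messages; the queries and field names are only
illustrative):

  <document>
    <!-- the child entity's SQL sees the raw CourseId value from the parent row;
         the template only rewrites the "id" field after the row has been fetched -->
    <entity name="Course" query="select * from Course" transformer="TemplateTransformer">
      <field column="id" template="Course:${Course.CourseId}"/>
      <entity name="Rating"
              query="select comment from Rating where Rating.CourseId = ${Course.CourseId}">
        <field column="comment" name="comment"/>
      </entity>
    </entity>
  </document>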

>Hi Noble,
>
>   I tried to understand your suggestions and played different variations
>according to your reply.  But none of them work. Can you explain it in  more
>details?
>   Thanks a lot!
>
>
>
>
>BTW, do you mean your solution as follows?
>
>
>   
>   template="Course:${Course.CourseId}" name="id"/>
> 
>   
> 
>  
> 
>
> But
>   1) There is no TmpCourseId field column.
>   2) Can we put two name CourseId and id in the same map? It seems not.
>
>
>
>
>
>2009/11/1 Noble Paul:
>
>> On Sun, Nov 1, 2009 at 11:59 PM, Bertie Shen 
>> wrote:
>> > Hi folks,
>> >
>> >  I have the following data-config.xml. Is there a way to
>> > let transformation take place after executing SQL "select comment from
>> > Rating where Rating.CourseId = ${Course.CourseId}"?  In MySQL database,
>> > column CourseId in table Course is integer 1, 2, etc;
>> > template transformation will make them like Course:1, Course:2; column
>> > CourseId in table Rating is also integer 1, 2, etc.
>> >
>> >  If transformation happens before executing "select comment from Rating
>> > where Rating.CourseId = ${Course.CourseId}", then there will no match for
>> > the SQL statement execution.
>> >
>> >  
>> > 
>> >  > > column="CourseId" template="Course:${Course.CourseId}" name="id"/>
>> >  
>> >
>> >  
>> >
>> >  
>> >
>>
>> keep the field as follows
>>  > column="TmpCourseId" name="CourseId"
>> template="Course:${Course.CourseId}" name="id"/>
>>
>>
>>
>>
>> --
>> -
>> Noble Paul | Principal Engineer| AOL | http://aol.com
>>

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Lucene FieldCache memory requirements

2009-11-02 Thread Michael McCandless
OK I think someone who knows how Solr uses the fieldCache for this
type of field will have to pipe up.

For Lucene directly, simple strings would consume a pointer (4 or 8
bytes depending on whether your JRE is 64bit) per doc, and the string
index would consume an int (4 bytes) per doc.  (Each also consumes
negligible (for your case) memory to hold the actual string values.)

Note that for your use case, this is exceptionally wasteful.  If
Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
then it'd take much fewer bits to reference the values, since you have
only 10 unique string values.
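
As a back-of-the-envelope check for the case in this thread (100 million docs,
~10 unique values), ignoring per-object JVM overhead and any temporary arrays
built while the cache loads:

  public class FieldCacheSizing {
    public static void main(String[] args) {
      long maxDoc = 100000000L;   // 100 million documents
      int bytesPerRef = 8;        // object reference on a 64-bit JVM (4 on 32-bit)

      // getStringIndex(): StringIndex.order is an int[maxDoc] (4 bytes per doc);
      // StringIndex.lookup holds one String per unique term (~10 countries, negligible)
      long stringIndexBytes = 4L * maxDoc;

      // getStrings(): one String reference per document
      long getStringsBytes = (long) bytesPerRef * maxDoc;

      System.out.println("getStringIndex order[]: " + stringIndexBytes / (1024 * 1024) + " MB");
      System.out.println("getStrings references: " + getStringsBytes / (1024 * 1024) + " MB");
    }
  }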

Mike

On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi  wrote:
> I am not using Lucene API directly; I am using SOLR which uses Lucene
> FieldCache for faceting on non-tokenized fields...
> I think this cache will be lazily loaded, until user executes sorted (by
> this field) SOLR query for all documents *:* - in this case it will be fully
> populated...
>
>
>> Subject: Re: Lucene FieldCache memory requirements
>>
>> Which FieldCache API are you using?  getStrings?  or getStringIndex
>> (which is used, under the hood, if you sort by this field).
>>
>> Mike
>>
>> On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi  wrote:
>> > Any thoughts regarding the subject? I hope FieldCache doesn't use more
> than
>> > 6 bytes per document-field instance... I am too lazy to research Lucene
>> > source code, I hope someone can provide exact answer... Thanks
>> >
>> >
>> >> Subject: Lucene FieldCache memory requirements
>> >>
>> >> Hi,
>> >>
>> >>
>> >> Can anyone confirm Lucene FieldCache memory requirements? I have 100
>> >> millions docs with non-tokenized field "country" (10 different
> countries);
>> > I
>> >> expect it requires array of ("int", "long"), size of array 100,000,000,
>> >> without any impact of "country" field length;
>> >>
>> >> it requires 600,000,000 bytes: "int" is pointer to document (Lucene
>> > document
>> >> ID),  and "long" is pointer to String value...
>> >>
>> >> Am I right, is it 600Mb just for this "country" (indexed,
> non-tokenized,
>> >> non-boolean) field and 100 millions docs? I need to calculate exact
>> > minimum RAM
>> >> requirements...
>> >>
>> >> I believe it shouldn't depend on cardinality (distribution) of field...
>> >>
>> >> Thanks,
>> >> Fuad
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>> >
>> >
>
>
>


Re: highlighting error using 1.4rc

2009-11-02 Thread Mark Miller
Umm - crap. This looks like a bug in a fix that just went in. My  
fault on the review. I'll fix it tonight when I get home -  
unfortunately, both lucene and solr are about to be released...


- Mark

http://www.lucidimagination.com (mobile)

On Nov 2, 2009, at 5:17 PM, Jake Brownell  wrote:


Hi,

I've tried installing the latest (3rd) RC for Solr 1.4 and Lucene  
2.9.1. One of our integration tests, which runs against and embedded  
server appears to be failing on highlighting. I've included the  
stack trace and the configuration from solrconf. I'd appreciate any  
insights. Please let me know what additional information would be  
useful.



Caused by: org.apache.solr.client.solrj.SolrServerException:  
org.apache.solr.client.solrj.SolrServerException:  
java.lang.ClassCastException:  
org.apache.lucene.search.spans.SpanOrQuery cannot be cast to  
org.apache.lucene.search.spans.SpanNearQuery
   at  
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request 
(EmbeddedSolrServer.java:153)
   at  
org.apache.solr.client.solrj.request.QueryRequest.process 
(QueryRequest.java:89)
   at org.apache.solr.client.solrj.SolrServer.query 
(SolrServer.java:118)
   at org.bookshare.search.solr.SolrSearchServerWrapper.query 
(SolrSearchServerWrapper.java:96)

   ... 29 more
Caused by: org.apache.solr.client.solrj.SolrServerException:  
java.lang.ClassCastException:  
org.apache.lucene.search.spans.SpanOrQuery cannot be cast to  
org.apache.lucene.search.spans.SpanNearQuery
   at  
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request 
(EmbeddedSolrServer.java:141)

   ... 32 more
Caused by: java.lang.ClassCastException:  
org.apache.lucene.search.spans.SpanOrQuery cannot be cast to  
org.apache.lucene.search.spans.SpanNearQuery
   at  
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields( 
WeightedSpanTermExtractor.java:489)
   at  
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields( 
WeightedSpanTermExtractor.java:484)
   at  
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms( 
WeightedSpanTermExtractor.java:249)
   at  
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract 
(WeightedSpanTermExtractor.java:230)
   at  
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract 
(WeightedSpanTermExtractor.java:158)
   at  
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms( 
WeightedSpanTermExtractor.java:414)
   at  
org.apache.lucene.search.highlight.QueryScorer.initExtractor 
(QueryScorer.java:216)
   at org.apache.lucene.search.highlight.QueryScorer.init 
(QueryScorer.java:184)
   at  
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments 
(Highlighter.java:226)
   at  
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting 
(DefaultSolrHighlighter.java:335)
   at  
org.apache.solr.handler.component.HighlightComponent.process 
(HighlightComponent.java:89)
   at  
org.apache.solr.handler.component.SearchHandler.handleRequestBody 
(SearchHandler.java:203)
   at  
org.apache.solr.handler.RequestHandlerBase.handleRequest 
(RequestHandlerBase.java:131)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java: 
1316)
   at  
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request 
(EmbeddedSolrServer.java:139)

   ... 32 more

I see in our solrconf the following for highlighting.

  <highlighting>
   <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
    <lst name="defaults">
     <int name="hl.fragsize">100</int>
    </lst>
   </fragmenter>

   <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
     <int name="hl.fragsize">70</int>
     <float name="hl.regex.slop">0.5</float>
     <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
    </lst>
   </fragmenter>

   <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
    <lst name="defaults">
     <str name="hl.simple.pre"><![CDATA[<em>]]></str>
     <str name="hl.simple.post"><![CDATA[</em>]]></str>
    </lst>
   </formatter>
  </highlighting>



Thanks,
Jake


Re: Spell check suggestion and correct way of implementation and some Questions

2009-11-02 Thread darniz

Hello everybody,
I am able to use the spell checker, but I have some questions if someone can
answer them.
If I search the free-text word waranty, then I get back the suggestion warranty,
which is fine.
But if I do a search on a field, for example description:waranty, the output
collation element is description:warranty, which I don't want; I want to get
back only the text, i.e. warranty.

We are using collation to return the results, since if a user types
three words we use the collation in the response element to display the
spelling suggestion.

Any advice

darniz



-- 
View this message in context: 
http://old.nabble.com/Spell-check-suggestion-and-correct-way-of-implementation-and-some-Questions-tp26096664p26157893.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: CPU utilization and query time high on Solr slave when snapshot install

2009-11-02 Thread Mark Miller
Hmm...I think you have to setup warming queries yourself and that  
autowarm just copies entries from the old cache to the new cache,  
rather than issuing queries - the value is how many entries it will  
copy. Though that's still going to take CPU and time.


- Mark

http://www.lucidimagination.com (mobile)

On Nov 2, 2009, at 12:47 PM, Walter Underwood   
wrote:


If you are going to pull a new index every 10 minutes, try turning  
off cache autowarming.


Your caches are never more than 10 minutes old, so spending a minute  
warming each new cache is a waste of CPU. Autowarm submits queries  
to the new Searcher before putting it in service. This will create a  
burst of query load on the new Searcher, often keeping one CPU  
pretty busy for several seconds.


In solrconfig.xml, set autowarmCount to 0.

Also, if you want the slaves to always have an optimized index,  
create the snapshot only in post-optimize. If you create snapshots  
in both post-commit and post-optimize, you are creating a non- 
optimized index (post-commit), then replacing it with an optimized  
one a few minutes later. A slave might get a non-optimized index one  
time, then an optimized one the next.


wunder

On Nov 2, 2009, at 1:45 AM, biku...@sapient.com wrote:


Hi Solr Gurus,

We have solr in 1 master, 2 slave configuration. Snapshot is  
created post commit, post optimization. We have autocommit after 50  
documents or 5 minutes. Snapshot puller runs as a cron every 10  
minutes. What we have observed is that whenever snapshot is  
installed on the slave, we see solrj client used to query slave  
solr, gets timedout and there is high CPU usage/load avg. on slave  
server. If we stop snapshot puller, then slaves work with no  
issues. The system has been running since 2 months and this issue  
has started to occur only now  when load on website is increasing.


Following are some details:

Solr Details:
apache-solr Version: 1.3.0
Lucene - 2.4-dev

Master/Slave configurations:

Master:
- for indexing data HTTPRequests are made on Solr server.
- autocommit feature is enabled for 50 docs and 5 minutes
- caching params are disable for this server
- mergeFactor of 10 is set
- we were running optimize script after every 2 hours, but now have  
reduced the duration to twice a day but issue still persists


Slave1/Slave2:
- standard requestHandler is being used
- default values of caching are set
Machine Specifications:

Master:
- 4GB RAM
- 1GB JVM Heap memory is allocated to Solr

Slave1/Slave2:
- 4GB RAM
- 2GB JVM Heap memory is allocated to Solr

Master and Slave1 (solr1)are on single box and Slave2(solr2) on  
different box. We use HAProxy to load balance query requests  
between 2 slaves. Master is only used for indexing.
Please let us know if somebody has ever faced similar kind of issue  
or has some insight into it as we guys are literally struck at the  
moment with a very unstable production environment.


As a workaround, we have started running optimize on master every 7  
minutes. This seems to have reduced the severity of the problem but  
still issue occurs every 2days now. please suggest what could be  
the root cause of this.


Thanks,
Bipul








Re: solr search

2009-11-02 Thread Lance Norskog
The problem is in db-dataconfig.xml. You should start with the example
DataImportHandler configuration fles.

The structure is wrong. First there is a datasource, then there are
'entities' which fetch a document's fields from the datasource.
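
A minimal sketch of that shape (the driver/url are the ones from your message,
written in the usual MS JDBC ";databaseName=" form; table, column and field
names are just examples):

  <dataConfig>
    <dataSource type="JdbcDataSource"
                driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://servername:1433;databaseName=databasename"
                user="sa" password="..."/>
    <document>
      <entity name="item" query="select id, name from item">
        <field column="id" name="id"/>
        <field column="name" name="name"/>
      </entity>
    </document>
  </dataConfig>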

On Fri, Oct 30, 2009 at 9:03 PM, manishkbawne  wrote:
>
> Hi,
> I have made following changes in solrconfig.xml
>
>    class="org.apache.solr.handler.dataimport.DataImportHandler">
>    
>         name="config">C:/Apache-Tomcat/apache-tomcat-6.0.20/solr/conf/db-data-config.xml
>    
>  
>
>
> in db-dataconfig.xml
> 
>        
>                 driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
>                url="jdbc:sqlserver://servername:1433/databasename" user="sa"
> password="p...@123"/>
>                        
>                         
>                        
>        
> 
>
> in schema.xml files
> 
>
> Please suggest me the possible cause of error??
>
>
>
>
> Lance Norskog-2 wrote:
>>
>> Please post your dataimporthandler configuration file.
>>
>> On Fri, Oct 30, 2009 at 4:17 AM, manishkbawne 
>> wrote:
>>>
>>> Thanks for your reply .. I am trying to use the database for solr search
>>> but
>>> getting this error..
>>>
>>> false in null
>>> -
>>> java.lang.NullPointerException at
>>> org.apache.solr.handler.dataimport.DataImporter.(DataImporter.java:95)
>>> at
>>> org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:106)
>>> at org.apache.solr.core.SolrResourceLoader
>>>
>>> Can you please suggest me some possible solution?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Karsten F. wrote:

 hi manishkbawne,

 unspecific ideas of search improvements are her:
 http://wiki.apache.org/solr/SolrPerformanceFactors

 I really like the last idea in
 http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
 :
 Use a profiler and ask a more specific question in this forum.

 Best regards
   Karsten



 manishkbawne wrote:
>
> I am using solr search to search through xml files. As I am working on
> millions of data, the result output is slower. Can anyone please
> suggest
> me some way, by which I can increase the search result output?
>


>>>
>>> --
>>> View this message in context:
>>> http://old.nabble.com/solr-search-tp26125183p26128341.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
>>
>
> --
> View this message in context: 
> http://old.nabble.com/solr-search-tp26125183p26139946.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: solr web ui

2009-11-02 Thread Lance Norskog
This is what I meant to mention - Uri's GWT browser, not the Velocity toolkit.

On Fri, Oct 30, 2009 at 1:20 PM, Grant Ingersoll  wrote:
> There is also a GWT contribution in JIRA that is pretty handy and will
> likely be added in 1.5.  See http://issues.apache.org/jira/browse/SOLR-1163
>
> -Grant
> On Oct 29, 2009, at 9:17 PM, scabbage wrote:
>
>>
>> Hi,
>>
>> I'm a new solr user. I would like to know if there are any easy to setup
>> web
>> UIs for solr. It can be as simple as a search box, term highlighting and
>> basic faceting. Basically I'm using solr to store all our automation
>> testing
>> logs and would like to have a simple searchable UI. I don't wanna spent
>> too
>> much time writing my own.
>>
>> Thanks.
>> --
>> View this message in context:
>> http://www.nabble.com/solr-web-ui-tp26123604p26123604.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>
>



-- 
Lance Norskog
goks...@gmail.com


RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi

Thank you very much Mike,

I found it:
org.apache.solr.request.SimpleFacets
...
// TODO: future logic could use filters instead of the fieldcache if
// the number of terms in the field is small enough.
counts = getFieldCacheCounts(searcher, base, field, offset,limit,
mincount, missing, sort, prefix);
...
FieldCache.StringIndex si =
FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
final String[] terms = si.lookup;
final int[] termNum = si.order;
...


So that 64-bit requires more memory :)


Mike, am I right here?
[(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)]
(64-bit JVM)
1.2Gb RAM for this...

Or, may be I am wrong:
> For Lucene directly, simple strings would consume an pointer (4 or 8
> bytes depending on whether your JRE is 64bit) per doc, and the string
> index would consume an int (4 bytes) per doc.

[8 bytes (64bit)] x [number of documents (100mlns)]? 
0.8Gb

Kind of Map between String and DocSet, saving 4 bytes... "Key" is String,
and "Value" is array of 64-bit pointers to Document. Why 64-bit (for 64-bit
JVM)? I always thought it is (int) documentId...

Am I right?


Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990!

>> Note that for your use case, this is exceptionally wasteful.  
This is probably very common case... I think it should be confirmed by
Lucene developers too... FieldCache is warmed anyway, even when we don't use
SOLR...

 
-Fuad







> -Original Message-
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: November-02-09 6:00 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Lucene FieldCache memory requirements
> 
> OK I think someone who knows how Solr uses the fieldCache for this
> type of field will have to pipe up.
> 
> For Lucene directly, simple strings would consume an pointer (4 or 8
> bytes depending on whether your JRE is 64bit) per doc, and the string
> index would consume an int (4 bytes) per doc.  (Each also consume
> negligible (for your case) memory to hold the actual string values).
> 
> Note that for your use case, this is exceptionally wasteful.  If
> Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
> then it'd take much fewer bits to reference the values, since you have
> only 10 unique string values.
> 
> Mike
> 
> On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi  wrote:
> > I am not using Lucene API directly; I am using SOLR which uses Lucene
> > FieldCache for faceting on non-tokenized fields...
> > I think this cache will be lazily loaded, until user executes sorted (by
> > this field) SOLR query for all documents *:* - in this case it will be
fully
> > populated...
> >
> >
> >> Subject: Re: Lucene FieldCache memory requirements
> >>
> >> Which FieldCache API are you using?  getStrings?  or getStringIndex
> >> (which is used, under the hood, if you sort by this field).
> >>
> >> Mike
> >>
> >> On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi  wrote:
> >> > Any thoughts regarding the subject? I hope FieldCache doesn't use
more
> > than
> >> > 6 bytes per document-field instance... I am too lazy to research
Lucene
> >> > source code, I hope someone can provide exact answer... Thanks
> >> >
> >> >
> >> >> Subject: Lucene FieldCache memory requirements
> >> >>
> >> >> Hi,
> >> >>
> >> >>
> >> >> Can anyone confirm Lucene FieldCache memory requirements? I have 100
> >> >> millions docs with non-tokenized field "country" (10 different
> > countries);
> >> > I
> >> >> expect it requires array of ("int", "long"), size of array
100,000,000,
> >> >> without any impact of "country" field length;
> >> >>
> >> >> it requires 600,000,000 bytes: "int" is pointer to document (Lucene
> >> > document
> >> >> ID),  and "long" is pointer to String value...
> >> >>
> >> >> Am I right, is it 600Mb just for this "country" (indexed,
> > non-tokenized,
> >> >> non-boolean) field and 100 millions docs? I need to calculate exact
> >> > minimum RAM
> >> >> requirements...
> >> >>
> >> >> I believe it shouldn't depend on cardinality (distribution) of
field...
> >> >>
> >> >> Thanks,
> >> >> Fuad
> >> >>
> >> >>
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >> >
> >
> >
> >




Re: CPU utilization and query time high on Solr slave when snapshot install

2009-11-02 Thread Jay Hill
So assuming you set up a few sample sort queries to run in the firstSearcher
config, and had very low query volume during that ten minutes so that there
were no evictions before a new Searcher was loaded, would those queries run
by the firstSearcher be passed along to the cache for the next Searcher as
part of the autowarm? If so, it seems like you might want to load a few sort
queries for the firstSearcher, but might not need any included in the
newSearcher?

-Jay


On Mon, Nov 2, 2009 at 4:26 PM, Mark Miller  wrote:

> Hmm...I think you have to setup warming queries yourself and that autowarm
> just copies entries from the old cache to the new cache, rather than issuing
> queries - the value is how many entries it will copy. Though that's still
> going to take CPU and time.
>
> - Mark
>
> http://www.lucidimagination.com (mobile)
>
>
> On Nov 2, 2009, at 12:47 PM, Walter Underwood 
> wrote:
>
>  If you are going to pull a new index every 10 minutes, try turning off
>> cache autowarming.
>>
>> Your caches are never more than 10 minutes old, so spending a minute
>> warming each new cache is a waste of CPU. Autowarm submits queries to the
>> new Searcher before putting it in service. This will create a burst of query
>> load on the new Searcher, often keeping one CPU pretty busy for several
>> seconds.
>>
>> In solrconfig.xml, set autowarmCount to 0.
>>
>> Also, if you want the slaves to always have an optimized index, create the
>> snapshot only in post-optimize. If you create snapshots in both post-commit
>> and post-optimize, you are creating a non-optimized index (post-commit),
>> then replacing it with an optimized one a few minutes later. A slave might
>> get a non-optimized index one time, then an optimized one the next.
>>
>> wunder
>>
>> On Nov 2, 2009, at 1:45 AM, biku...@sapient.com wrote:
>>
>>  Hi Solr Gurus,
>>>
>>> We have solr in 1 master, 2 slave configuration. Snapshot is created post
>>> commit, post optimization. We have autocommit after 50 documents or 5
>>> minutes. Snapshot puller runs as a cron every 10 minutes. What we have
>>> observed is that whenever snapshot is installed on the slave, we see solrj
>>> client used to query slave solr, gets timedout and there is high CPU
>>> usage/load avg. on slave server. If we stop snapshot puller, then slaves
>>> work with no issues. The system has been running since 2 months and this
>>> issue has started to occur only now  when load on website is increasing.
>>>
>>> Following are some details:
>>>
>>> Solr Details:
>>> apache-solr Version: 1.3.0
>>> Lucene - 2.4-dev
>>>
>>> Master/Slave configurations:
>>>
>>> Master:
>>> - for indexing data HTTPRequests are made on Solr server.
>>> - autocommit feature is enabled for 50 docs and 5 minutes
>>> - caching params are disable for this server
>>> - mergeFactor of 10 is set
>>> - we were running optimize script after every 2 hours, but now have
>>> reduced the duration to twice a day but issue still persists
>>>
>>> Slave1/Slave2:
>>> - standard requestHandler is being used
>>> - default values of caching are set
>>> Machine Specifications:
>>>
>>> Master:
>>> - 4GB RAM
>>> - 1GB JVM Heap memory is allocated to Solr
>>>
>>> Slave1/Slave2:
>>> - 4GB RAM
>>> - 2GB JVM Heap memory is allocated to Solr
>>>
>>> Master and Slave1 (solr1) are on a single box and Slave2 (solr2) on a different
>>> box. We use HAProxy to load balance query requests between the 2 slaves. Master
>>> is only used for indexing.
>>> Please let us know if somebody has ever faced a similar kind of issue or
>>> has some insight into it, as we are literally stuck at the moment with
>>> a very unstable production environment.
>>>
>>> As a workaround, we have started running optimize on master every 7
>>> minutes. This seems to have reduced the severity of the problem but the
>>> issue still occurs every 2 days now. Please suggest what could be the root cause of
>>> this.
>>>
>>> Thanks,
>>> Bipul
>>>
>>>
>>>
>>>
>>>
>>

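A minimal sketch of the solrconfig.xml pieces discussed in this thread; the sort field and warming query below are placeholders, not values from anyone's actual setup:

   <!-- static warming queries, run when a Searcher is opened -->
   <listener event="firstSearcher" class="solr.QuerySenderListener">
     <arr name="queries">
       <lst><str name="q">*:*</str><str name="sort">some_sort_field asc</str></lst>
     </arr>
   </listener>
   <listener event="newSearcher" class="solr.QuerySenderListener">
     <arr name="queries">
       <!-- can stay empty if cache autowarming (or nothing at all) is preferred here -->
     </arr>
   </listener>

   <!-- Walter's advice: with an index that turns over every few minutes,
        autowarmCount="0" avoids spending CPU rewarming soon-to-be-discarded caches -->
   <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
   <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>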

Re: Lucene FieldCache memory requirements

2009-11-02 Thread Mark Miller
It also briefly requires more memory than just that - it allocates an
array the size of maxdoc+1 to hold the unique terms - and then sizes down.

Possibly we can use the getUniqueTermCount method in the flexible
indexing branch to get rid of that - which is why I was thinking it
might be a good idea to drop the unsupported exception in that method
for things like multi reader and just do the work to get the right
number (currently there is a comment that the user should do that work
if necessary, making the call unreliable for this).

Fuad Efendi wrote:
> Thank you very much Mike,
>
> I found it:
> org.apache.solr.request.SimpleFacets
> ...
> // TODO: future logic could use filters instead of the fieldcache if
> // the number of terms in the field is small enough.
> counts = getFieldCacheCounts(searcher, base, field, offset,limit,
> mincount, missing, sort, prefix);
> ...
> FieldCache.StringIndex si =
> FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
> final String[] terms = si.lookup;
> final int[] termNum = si.order;
> ...
>
>
> So that 64-bit requires more memory :)
>
>
> Mike, am I right here?
> [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)]
> (64-bit JVM)
> 1.2Gb RAM for this...
>
> Or, may be I am wrong:
>   
>> For Lucene directly, simple strings would consume a pointer (4 or 8
>> bytes depending on whether your JRE is 64bit) per doc, and the string
>> index would consume an int (4 bytes) per doc.
>> 
>
> [8 bytes (64bit)] x [number of documents (100mlns)]? 
> 0.8Gb
>
> Kind of Map between String and DocSet, saving 4 bytes... "Key" is String,
> and "Value" is array of 64-bit pointers to Document. Why 64-bit (for 64-bit
> JVM)? I always thought it is (int) documentId...
>
> Am I right?
>
>
> Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990!
>
>   
>>> Note that for your use case, this is exceptionally wasteful.  
>>>   
> This is probably very common case... I think it should be confirmed by
> Lucene developers too... FieldCache is warmed anyway, even when we don't use
> SOLR...
>
>  
> -Fuad
>
>
>
>
>
>
>
>   
>> -Original Message-
>> From: Michael McCandless [mailto:luc...@mikemccandless.com]
>> Sent: November-02-09 6:00 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Lucene FieldCache memory requirements
>>
>> OK I think someone who knows how Solr uses the fieldCache for this
>> type of field will have to pipe up.
>>
>> For Lucene directly, simple strings would consume a pointer (4 or 8
>> bytes depending on whether your JRE is 64bit) per doc, and the string
>> index would consume an int (4 bytes) per doc.  (Each also consume
>> negligible (for your case) memory to hold the actual string values).
>>
>> Note that for your use case, this is exceptionally wasteful.  If
>> Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
>> then it'd take much fewer bits to reference the values, since you have
>> only 10 unique string values.
>>
>> Mike
>>
>> On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi  wrote:
>> 
>>> I am not using Lucene API directly; I am using SOLR which uses Lucene
>>> FieldCache for faceting on non-tokenized fields...
>>> I think this cache will be lazily loaded, until user executes sorted (by
>>> this field) SOLR query for all documents *:* - in this case it will be
>>>   
> fully
>   
>>> populated...
>>>
>>>
>>>   
 Subject: Re: Lucene FieldCache memory requirements

 Which FieldCache API are you using?  getStrings?  or getStringIndex
 (which is used, under the hood, if you sort by this field).

 Mike

 On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi  wrote:
 
> Any thoughts regarding the subject? I hope FieldCache doesn't use
>   
> more
>   
>>> than
>>>   
> 6 bytes per document-field instance... I am too lazy to research
>   
> Lucene
>   
> source code, I hope someone can provide exact answer... Thanks
>
>
>   
>> Subject: Lucene FieldCache memory requirements
>>
>> Hi,
>>
>>
>> Can anyone confirm Lucene FieldCache memory requirements? I have 100
>> millions docs with non-tokenized field "country" (10 different
>> 
>>> countries);
>>>   
> I
>   
>> expect it requires array of ("int", "long"), size of array
>> 
> 100,000,000,
>   
>> without any impact of "country" field length;
>>
>> it requires 600,000,000 bytes: "int" is pointer to document (Lucene
>> 
> document
>   
>> ID),  and "long" is pointer to String value...
>>
>> Am I right, is it 600Mb just for this "country" (indexed,
>> 
>>> non-tokenized,
>>>   
>> non-boolean) field and 100 millions docs? I need to calculate exact
>> 
> minimum RAM
>   
>> requirements...
>>
>>

Why does BinaryRequestWriter force the path to be base URL + "/update/javabin"

2009-11-02 Thread Stuart Tettemer
Hi folks,
First of all, thanks for Solr.  It is a great piece of work.

I have a question about BinaryRequestWriter in the solrj project.  Why does
it force the path of UpdateRequests to be "/update/javabin" (see
BinaryRequestWriter.getPath(String) starting on line 109)?

I am extending BinaryRequestWriter specifically to remove this requirement
and am interested to know the reasoning behind the initial choice.

Thanks for your time,
Stuart


Re: highlighting error using 1.4rc

2009-11-02 Thread Mark Miller
Sorry - it was a bug in the backport from trunk to 2.9.1 - didn't
realize that code didn't get hit because we didn't pass a null field -
else the tests would have caught it. Fix has been committed but I don't
know whether it will make 2.9.1 or 1.4 because both have gotten the
votes and time needed for release.

Mark Miller wrote:
> Umm - crap. This looks like a bug in a fix that just went in. My
> fault on the review. I'll fix it tonight when I get home -
> unfortunately, both lucene and solr are about to be released...
>
> - Mark
>
> http://www.lucidimagination.com (mobile)
>
> On Nov 2, 2009, at 5:17 PM, Jake Brownell  wrote:
>
>> Hi,
>>
>> I've tried installing the latest (3rd) RC for Solr 1.4 and Lucene
>> 2.9.1. One of our integration tests, which runs against an embedded
>> server, appears to be failing on highlighting. I've included the stack
>> trace and the configuration from solrconf. I'd appreciate any
>> insights. Please let me know what additional information would be
>> useful.
>>
>>
>> Caused by: org.apache.solr.client.solrj.SolrServerException:
>> org.apache.solr.client.solrj.SolrServerException:
>> java.lang.ClassCastException:
>> org.apache.lucene.search.spans.SpanOrQuery cannot be cast to
>> org.apache.lucene.search.spans.SpanNearQuery
>>at
>> org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:153)
>>
>>at
>> org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
>>
>>at
>> org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
>>at
>> org.bookshare.search.solr.SolrSearchServerWrapper.query(SolrSearchServerWrapper.java:96)
>>
>>... 29 more
>> Caused by: org.apache.solr.client.solrj.SolrServerException:
>> java.lang.ClassCastException:
>> org.apache.lucene.search.spans.SpanOrQuery cannot be cast to
>> org.apache.lucene.search.spans.SpanNearQuery
>>at
>> org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:141)
>>
>>... 32 more
>> Caused by: java.lang.ClassCastException:
>> org.apache.lucene.search.spans.SpanOrQuery cannot be cast to
>> org.apache.lucene.search.spans.SpanNearQuery
>>at
>> org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields(WeightedSpanTermExtractor.java:489)
>>
>>at
>> org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields(WeightedSpanTermExtractor.java:484)
>>
>>at
>> org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms(WeightedSpanTermExtractor.java:249)
>>
>>at
>> org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:230)
>>
>>at
>> org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158)
>>
>>at
>> org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:414)
>>
>>at
>> org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:216)
>>
>>at
>> org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:184)
>>
>>at
>> org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:226)
>>
>>at
>> org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:335)
>>
>>at
>> org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89)
>>
>>at
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:203)
>>
>>at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>>
>>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>>at
>> org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)
>>
>>... 32 more
>>
>> I see in our solrconf the following for highlighting.
>>
>>  <highlighting>
>>   <fragmenter name="gap"
>>    class="org.apache.solr.highlight.GapFragmenter" default="true">
>>    <lst name="defaults">
>>     <int name="hl.fragsize">100</int>
>>    </lst>
>>   </fragmenter>
>>
>>   <fragmenter name="regex"
>>    class="org.apache.solr.highlight.RegexFragmenter">
>>    <lst name="defaults">
>>     <int name="hl.fragsize">70</int>
>>     <float name="hl.regex.slop">0.5</float>
>>     <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
>>    </lst>
>>   </fragmenter>
>>
>>   <formatter name="html"
>>    class="org.apache.solr.highlight.HtmlFormatter" default="true">
>>    <lst name="defaults">
>>     <str name="hl.simple.pre"><![CDATA[<em>]]></str>
>>     <str name="hl.simple.post"><![CDATA[</em>]]></str>
>>    </lst>
>>   </formatter>
>>  </highlighting>
>>
>>
>>
>> Thanks,
>> Jake


-- 
- Mark

http://www.lucidimagination.com





Re: Programmatically configuring SLF4J for Solr 1.4?

2009-11-02 Thread Don Werve
2009/11/1 Ryan McKinley 

> I'm sure it is possible to configure JDK logging (java.util.logging)
> programmatically... but I have never had much luck with it.
>
> It is very easy to configure log4j programmatically, and this works great
> with solr.
>

Don't suppose I could trouble you for an example?  I'm not terribly familiar
with Java logging frameworks just yet.

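For what it's worth, a minimal sketch of configuring log4j programmatically. It assumes the slf4j-log4j12 binding jar is on the classpath in place of slf4j-jdk14 (so Solr's SLF4J output is routed to log4j); the pattern and levels are just examples:

   import org.apache.log4j.ConsoleAppender;
   import org.apache.log4j.Level;
   import org.apache.log4j.Logger;
   import org.apache.log4j.PatternLayout;

   public class LoggingSetup {
     // call once, before the first Solr/SolrJ call that logs anything
     public static void configure() {
       Logger root = Logger.getRootLogger();
       root.removeAllAppenders();                       // drop any earlier configuration
       root.addAppender(new ConsoleAppender(
           new PatternLayout("%d{ISO8601} %-5p [%c{1}] %m%n")));
       root.setLevel(Level.INFO);
       // dial individual packages up or down as needed
       Logger.getLogger("org.apache.solr").setLevel(Level.WARN);
     }
   }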

RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi

Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM... no
difference between maxdoc and maxdoc + 1 for such estimate... difference is
between 0.4Gb and 1.2Gb...


So, let's vote ;)

A. [maxdoc] x [8 bytes ~ pointer to String object]

B. [maxdoc] x [8 bytes ~ pointer to Document object]

C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID] 
- same as [String1_Document_Count + ... + String10_Document_Count] x [4
bytes ~ DocumentID]

D. [maxdoc] x [4 bytes + 8 bytes ~ my initial naive thinking...]


Please confirm that it is Pointer to Object and not Lucene Document ID... I
hope it is (int) Document ID...





> -Original Message-
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: November-02-09 6:52 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Lucene FieldCache memory requirements
> 
> It also briefly requires more memory than just that - it allocates an
> array the size of maxdoc+1 to hold the unique terms - and then sizes down.
> 
> Possibly we can use the getUnuiqeTermCount method in the flexible
> indexing branch to get rid of that - which is why I was thinking it
> might be a good idea to drop the unsupported exception in that method
> for things like multi reader and just do the work to get the right
> number (currently there is a comment that the user should do that work
> if necessary, making the call unreliable for this).
> 
> Fuad Efendi wrote:
> > Thank you very much Mike,
> >
> > I found it:
> > org.apache.solr.request.SimpleFacets
> > ...
> > // TODO: future logic could use filters instead of the
fieldcache if
> > // the number of terms in the field is small enough.
> > counts = getFieldCacheCounts(searcher, base, field,
offset,limit,
> > mincount, missing, sort, prefix);
> > ...
> > FieldCache.StringIndex si =
> > FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
> > final String[] terms = si.lookup;
> > final int[] termNum = si.order;
> > ...
> >
> >
> > So that 64-bit requires more memory :)
> >
> >
> > Mike, am I right here?
> > [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)]
> > (64-bit JVM)
> > 1.2Gb RAM for this...
> >
> > Or, may be I am wrong:
> >
> >> For Lucene directly, simple strings would consume an pointer (4 or 8
> >> bytes depending on whether your JRE is 64bit) per doc, and the string
> >> index would consume an int (4 bytes) per doc.
> >>
> >
> > [8 bytes (64bit)] x [number of documents (100mlns)]?
> > 0.8Gb
> >
> > Kind of Map between String and DocSet, saving 4 bytes... "Key" is
String,
> > and "Value" is array of 64-bit pointers to Document. Why 64-bit (for
64-bit
> > JVM)? I always thought it is (int) documentId...
> >
> > Am I right?
> >
> >
> > Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990!
> >
> >
> >>> Note that for your use case, this is exceptionally wasteful.
> >>>
> > This is probably very common case... I think it should be confirmed by
> > Lucene developers too... FieldCache is warmed anyway, even when we don't
use
> > SOLR...
> >
> >
> > -Fuad
> >
> >
> >
> >
> >
> >
> >
> >
> >> -Original Message-
> >> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> >> Sent: November-02-09 6:00 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Lucene FieldCache memory requirements
> >>
> >> OK I think someone who knows how Solr uses the fieldCache for this
> >> type of field will have to pipe up.
> >>
> >> For Lucene directly, simple strings would consume an pointer (4 or 8
> >> bytes depending on whether your JRE is 64bit) per doc, and the string
> >> index would consume an int (4 bytes) per doc.  (Each also consume
> >> negligible (for your case) memory to hold the actual string values).
> >>
> >> Note that for your use case, this is exceptionally wasteful.  If
> >> Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
> >> then it'd take much fewer bits to reference the values, since you have
> >> only 10 unique string values.
> >>
> >> Mike
> >>
> >> On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi  wrote:
> >>
> >>> I am not using Lucene API directly; I am using SOLR which uses Lucene
> >>> FieldCache for faceting on non-tokenized fields...
> >>> I think this cache will be lazily loaded, until user executes sorted
(by
> >>> this field) SOLR query for all documents *:* - in this case it will be
> >>>
> > fully
> >
> >>> populated...
> >>>
> >>>
> >>>
>  Subject: Re: Lucene FieldCache memory requirements
> 
>  Which FieldCache API are you using?  getStrings?  or getStringIndex
>  (which is used, under the hood, if you sort by this field).
> 
>  Mike
> 
>  On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi  wrote:
> 
> > Any thoughts regarding the subject? I hope FieldCache doesn't use
> >
> > more
> >
> >>> than
> >>>
> > 6 bytes per document-field instance... I am too lazy to research
> >
> > Lucene
> >
> > source code, I hope someone can provide

Getting update/extract RequestHandler to work under Tomcat

2009-11-02 Thread Glock, Thomas

Hoping someone might help with getting /update/extract RequestHandler to
work under Tomcat.

Error 500 happens when trying to access
http://localhost:8080/apache-solr-1.4-dev/update/extract/  (see below)

Note /update/extract DOES work correctly under the Jetty provided
example.

I think I must have a directory path incorrectly specified but not sure
where.

No errors in the Catalina log on startup - only this: 

Nov 2, 2009 7:10:49 PM org.apache.solr.core.RequestHandlers
initHandlersFromConfig
INFO: created /update/extract:
org.apache.solr.handler.extraction.ExtractingRequestHandler

Solrconfig.xml under Tomcat is slightly changed from the example with
regards to the <lib> elements:

  <lib dir="..." />
  <lib dir="..." />

The \contrib and \dist directories were copied directly below the
"webapps\apache-solr-1.4-dev" unchanged from the example.

In the catalina log I see all the "Adding specified lib dirs..." added
without error:

INFO: Adding specified lib dirs to ClassLoader
Nov 2, 2009 7:31:20 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader
INFO: Adding
'file:/C:/Program%20Files/Apache%20Software%20Foundation/Tomcat%206.0/we
bapps/apache-solr-1.4-dev/contrib/extraction/lib/asm-3.1.jar' to
classloader
Nov 2, 2009 7:31:20 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader
INFO: Adding
'file:/C:/Program%20Files/Apache%20Software%20Foundation/Tomcat%206.0/we
bapps/apache-solr-1.4-dev/contrib/extraction/lib/bcmail-jdk14-136.jar'
to classloader
Nov 2, 2009 7:31:20 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader
INFO: Adding
'file:/C:/Program%20Files/Apache%20Software%20Foundation/Tomcat%206.0/we
bapps/apache-solr-1.4-dev/contrib/extraction/lib/bcprov-jdk14-136.jar'
to classloader

(...many more...)

Solr Home is mapped to:

INFO: SolrDispatchFilter.init()
Nov 2, 2009 7:10:47 PM org.apache.solr.core.SolrResourceLoader
locateSolrHome
INFO: Using JNDI solr.home: .\webapps\apache-solr-1.4-dev\solr
Nov 2, 2009 7:10:47 PM
org.apache.solr.core.CoreContainer$Initializer initialize
INFO: looking for solr.xml: C:\Program Files\Apache Software
Foundation\Tomcat 6.0\.\webapps\apache-solr-1.4-dev\solr\solr.xml
Nov 2, 2009 7:10:47 PM org.apache.solr.core.SolrResourceLoader

INFO: Solr home set to '.\webapps\apache-solr-1.4-dev\solr\' 

500 Error:

HTTP Status 500 - lazy loading error
org.apache.solr.common.SolrException: lazy loading error at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappe
dHandler(RequestHandlers.java:249) at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleReq
uest(RequestHandlers.java:231) at
org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.ja
va:338) at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.j
ava:241) at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Applica
tionFilterChain.java:235) at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilt
erChain.java:206) at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValv
e.java:233) at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValv
e.java:191) at
org.apache.catalina.authenticator.AuthenticatorBase.invoke(Authenticator
Base.java:433) at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java
:128) at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java
:102) at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.
java:109) at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:2
93) at
org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.j
ava:859) at
org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.proce
ss(Http11AprProtocol.java:574) at
org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1527)
at java.lang.Thread.run(Unknown Source) Caused by:
org.apache.solr.common.SolrException: Error loading class
'org.apache.solr.handler.extraction.ExtractingRequestHandler' at
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.jav
a:373) at
org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413) at
org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:449) at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappe
dHandler(RequestHandlers.java:240) ... 17 more Caused by:
java.lang.ClassNotFoundException:
org.apache.solr.handler.extraction.ExtractingRequestHandler at
java.net.URLClassLoader$1.run(Unknown Source) at
java.security.AccessController.doPrivileged(Native Method) at
java.net.URLClassLoader.findClass(Unknown Source) at
java.lang.ClassLoader.loadClass(Unknown Source) at
java.net.FactoryURLClassLoader.loadClass(Unknown Source) at
java.lang.ClassLoader.loadClass(Unknown Source) at
java.lang.ClassLoader.loadClassInte

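For comparison, the <lib> directives in the stock example solrconfig.xml look roughly like the sketch below. The dir values here are illustrative, not the poster's actual ones; whatever paths are used have to resolve, relative to the core's instance directory, to the folders that really contain the extraction jars, otherwise ExtractingRequestHandler stays invisible to the classloader:

   <lib dir="../../contrib/extraction/lib" />
   <lib dir="../../dist/" />
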
Re: Lucene FieldCache memory requirements

2009-11-02 Thread Mark Miller
Fuad Efendi wrote:
> Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM... no
> difference between maxdoc and maxdoc + 1 for such estimate... difference is
> between 0.4Gb and 1.2Gb...
>
>   
I'm not sure I understand - but I didn't mean to imply the +1 on maxdoc
meant anything. The issue is that in the end, it only needs a String
array the size of String[UniqueTerms] - but because it can't easily
figure out that number, it first creates an array of String[MaxDoc+1] -
so with a ton of docs and a few uniques, you get a temp boost in the RAM
reqs until it sizes it down. A pointer for each doc.

-- 
- Mark

http://www.lucidimagination.com





SolrJ looping until I get all the results

2009-11-02 Thread Paul Tomblin
If I want to do a query and only return X number of rows at a time,
but I want to keep querying until I get all the row, how do I do that?
 Can I just keep advancing query.setStart(...) and then checking if
server.query(query) returns any rows?  Or is there a better way?

Here's what I'm thinking

final static int MAX_ROWS = 100;
int start = 0;
query.setRows(MAX_ROWS);
while (true)
{
   QueryResponse resp = solrChunkServer.query(query);
   SolrDocumentList docs = resp.getResults();
   if (docs.size() == 0)
 break;
   
  start += MAX_ROWS;
  query.setStart(start);
}



-- 
http://www.linkedin.com/in/paultomblin
http://careers.stackoverflow.com/ptomblin


RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
I just did some tests in a completely new index (Slave), sort by
low-distributed non-tokenized Field (such as Country) takes milliseconds,
but sort (ascending) on tokenized field with heavy distribution took 30
seconds (initially). Second sort (descending) took milliseconds. Generic
query *:*; FieldCache is not used for tokenized fields... how is it sorted
:)
Fortunately, no OOM.
-Fuad




RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Mark,

I don't understand this: 
> so with a ton of docs and a few uniques, you get a temp boost in the RAM
> reqs until it sizes it down.

Sizes down??? Why is it called Cache indeed? And how SOLR uses it if it is
not cache?


And this:
> A pointer for each doc.

Why can't we use (int) DocumentID? For me, it is natural; 64-bit pointer to
an object in RAM is not natural (in Lucene world)...


So, is it [maxdoc]x[4-bytes], or [maxdoc]x[8-bytes]?... 
-Fuad







RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Ok, my "naive" thinking about FieldCache: for each Term we can quickly
retrieve DocSet. What are memory requirements? Theoretically,
[maxdoc]x[4-bytes DocumentID], plus some (small) array to store terms
pointing to (large) arrays of DocumentIDs.

Mike suggested http://issues.apache.org/jira/browse/LUCENE-1990 to make this
memory requirement even lower... but please correct me if I am wrong with
formula, and I am unsure how it is currently implemented...


Thanks,
Fuad


> -Original Message-
> From: Fuad Efendi [mailto:f...@efendi.ca]
> Sent: November-02-09 8:21 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Lucene FieldCache memory requirements
> 
> Mark,
> 
> I don't understand this:
> > so with a ton of docs and a few uniques, you get a temp boost in the RAM
> > reqs until it sizes it down.
> 
> Sizes down??? Why is it called Cache indeed? And how SOLR uses it if it is
> not cache?
> 
> 
> And this:
> > A pointer for each doc.
> 
> Why can't we use (int) DocumentID? For me, it is natural; 64-bit pointer
to
> an object in RAM is not natural (in Lucene world)...
> 
> 
> So, is it [maxdoc]x[4-bytes], or [maxdoc]x[8-bytes]?...
> -Fuad
> 
> 
> 
> 





Re: SolrJ looping until I get all the results

2009-11-02 Thread Avlesh Singh
>
> final static int MAX_ROWS = 100;
> int start = 0;
> query.setRows(MAX_ROWS);
> while (true)
> {
>   QueryResponse resp = solrChunkServer.query(query);
>   SolrDocumentList docs = resp.getResults();
>   if (docs.size() == 0)
> break;
>   
>  start += MAX_ROWS;
>  query.setStart(start);
> }
>
Yes. It will work as you think. But are you sure that you want to do this?
How many documents do you have in the index? If the number is in an
acceptable range, why not simply do a query.setRows(Integer.MAX_VALUE) once?


Cheers
Avlesh

On Tue, Nov 3, 2009 at 6:19 AM, Paul Tomblin  wrote:

> If I want to do a query and only return X number of rows at a time,
> but I want to keep querying until I get all the row, how do I do that?
>  Can I just keep advancing query.setStart(...) and then checking if
> server.query(query) returns any rows?  Or is there a better way?
>
> Here's what I'm thinking
>
> final static int MAX_ROWS = 100;
> int start = 0;
> query.setRows(MAX_ROWS);
> while (true)
> {
>   QueryResponse resp = solrChunkServer.query(query);
>   SolrDocumentList docs = resp.getResults();
>   if (docs.size() == 0)
> break;
>   
>  start += MAX_ROWS;
>  query.setStart(start);
> }
>
>
>
> --
> http://www.linkedin.com/in/paultomblin
> http://careers.stackoverflow.com/ptomblin
>


RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
To be correct, I analyzed FieldCache awhile ago and I believed it never
"sizes down"...

/**
 * Expert: The default cache implementation, storing all values in memory.
 * A WeakHashMap is used for storage.
 *
 * Created: May 19, 2004 4:40:36 PM
 *
 * @since   lucene 1.4
 */


Will it size down? Only if we are not faceting (as in SOLR v.1.3)...

And I am still unsure, Document ID vs. Object Pointer.




> 
> I don't understand this:
> > so with a ton of docs and a few uniques, you get a temp boost in the RAM
> > reqs until it sizes it down.
> 
> Sizes down??? Why is it called Cache indeed? And how SOLR uses it if it is
> not cache?
> 




Re: SolrJ looping until I get all the results

2009-11-02 Thread Paul Tomblin
On Mon, Nov 2, 2009 at 8:40 PM, Avlesh Singh  wrote:
>>
>> final static int MAX_ROWS = 100;
>> int start = 0;
>> query.setRows(MAX_ROWS);
>> while (true)
>> {
>>   QueryResponse resp = solrChunkServer.query(query);
>>   SolrDocumentList docs = resp.getResults();
>>   if (docs.size() == 0)
>>     break;
>>   
>>  start += MAX_ROWS;
>>  query.setStart(start);
>> }
>>
> Yes. It will work as you think. But are you sure that you want to do this?
> How many documents do you have in the index? If the number is in an
> acceptable range, why not simply do a query.setRows(Integer.MAX_VALUE) once?

I was doing it that way, but what I'm doing with the documents is
some manipulation, putting new objects into a different list.
Because I basically have two times the number of documents in lists,
I'm running out of memory.  So I figured if I do it 1000 documents at
a time, the SolrDocumentList will get garbage collected at least.



-- 
http://www.linkedin.com/in/paultomblin
http://careers.stackoverflow.com/ptomblin


Re: adding and updating a lot of document to Solr, metadata extraction etc

2009-11-02 Thread Lance Norskog
About large XML files and http overhead: you can tell solr to load the
file directly from a file system. This will stream thousands of
documents in one XML file without loading everything in memory at
once.

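A sketch of what that can look like, assuming remote streaming has been switched on via the requestParsers element inside requestDispatcher in solrconfig.xml (the file path is made up):

   <requestParsers enableRemoteStreaming="true" />

   curl 'http://localhost:8983/solr/update?commit=true&stream.file=/path/to/docs.xml&stream.contentType=text/xml;charset=utf-8'
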
This is a new book on Solr. It will help you through this early learning phase.

http://www.packtpub.com/solr-1-4-enterprise-search-server

On Mon, Nov 2, 2009 at 6:24 AM, Alexey Serba  wrote:
> Hi Eugene,
>
>> - ability to iterate over all documents, returned in search, as Lucene does
>>  provide within a HitCollector instance. We would need to extract and
>>  aggregate various fields, stored in index, to group results and aggregate 
>> them
>>  in some way.
>> 
>> Also I did not find any way in the tutorial to access the search results with
>> all fields to be processed by our application.
>>
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr
> Check out Faceted Search, probably you can achieve your goal by using
> Facet Component
>
> There's also Field Collapsing patch
> http://wiki.apache.org/solr/FieldCollapsing
>
>
> Alex
>



-- 
Lance Norskog
goks...@gmail.com


Re: SolrJ looping until I get all the results

2009-11-02 Thread Avlesh Singh
>
> I was doing it that way, but what I'm doing with the documents is do
> some manipulation and put the new classes into a different list.
> Because I basically have two times the number of documents in lists,
> I'm running out of memory.  So I figured if I do it 1000 documents at
> a time, the SolrDocumentList will get garbage collected at least.
>
You are right w.r.t to all that but I am surprised that you would need ALL
the documents from the index for a search requirement.

Cheers
Avlesh

On Tue, Nov 3, 2009 at 7:13 AM, Paul Tomblin  wrote:

> On Mon, Nov 2, 2009 at 8:40 PM, Avlesh Singh  wrote:
> >>
> >> final static int MAX_ROWS = 100;
> >> int start = 0;
> >> query.setRows(MAX_ROWS);
> >> while (true)
> >> {
> >>   QueryResponse resp = solrChunkServer.query(query);
> >>   SolrDocumentList docs = resp.getResults();
> >>   if (docs.size() == 0)
> >> break;
> >>   
> >>  start += MAX_ROWS;
> >>  query.setStart(start);
> >> }
> >>
> > Yes. It will work as you think. But are you sure that you want to do
> this?
> > How many documents do you have in the index? If the number is in an
> > acceptable range, why not simply do a query.setRows(Integer.MAX_VALUE)
> once?
>
> I was doing it that way, but what I'm doing with the documents is do
> some manipulation and put the new classes into a different list.
> Because I basically have two times the number of documents in lists,
> I'm running out of memory.  So I figured if I do it 1000 documents at
> a time, the SolrDocumentList will get garbage collected at least.
>
>
>
> --
> http://www.linkedin.com/in/paultomblin
> http://careers.stackoverflow.com/ptomblin
>


Re: Lucene FieldCache memory requirements

2009-11-02 Thread Mark Miller
 static final class StringIndexCache extends Cache {
StringIndexCache(FieldCache wrapper) {
  super(wrapper);
}

@Override
protected Object createValue(IndexReader reader, Entry entryKey)
throws IOException {
  String field = StringHelper.intern(entryKey.field);
  final int[] retArray = new int[reader.maxDoc()];
  String[] mterms = new String[reader.maxDoc()+1];
  TermDocs termDocs = reader.termDocs();
  TermEnum termEnum = reader.terms (new Term (field));
  int t = 0;  // current term number

  // an entry for documents that have no terms in this field
  // should a document with no terms be at top or bottom?
  // this puts them at the top - if it is changed,
FieldDocSortedHitQueue
  // needs to change as well.
  mterms[t++] = null;

  try {
do {
  Term term = termEnum.term();
  if (term==null || term.field() != field) break;

  // store term text
  // we expect that there is at most one term per document
  if (t >= mterms.length) throw new RuntimeException ("there are
more terms than " +
  "documents in field \"" + field + "\", but it's
impossible to sort on " +
  "tokenized fields");
  mterms[t] = term.text();

  termDocs.seek (termEnum);
  while (termDocs.next()) {
retArray[termDocs.doc()] = t;
  }

  t++;
} while (termEnum.next());
  } finally {
termDocs.close();
termEnum.close();
  }

  if (t == 0) {
// if there are no terms, make the term array
// have a single null entry
mterms = new String[1];
  } else if (t < mterms.length) {
// if there are less terms than documents,
// trim off the dead array space
String[] terms = new String[t];
System.arraycopy (mterms, 0, terms, 0, t);
mterms = terms;
  }

  StringIndex value = new StringIndex (retArray, mterms);
  return value;
}
  };

The formula for a String Index fieldcache is essentially the String
array of unique terms (which does indeed "size down" at the bottom) and
the int array indexing into the String array.

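Plugging the numbers from this thread into that layout (100 million docs, roughly 10 unique country values) gives a rough sketch of the per-field cost; the point is that the big array holds 4-byte ints, not 64-bit object pointers:

   public class FieldCacheEstimate {
     public static void main(String[] args) {
       // order[]  : one int per document, indexing into lookup[]
       // lookup[] : one String per unique term, plus the leading null slot
       long maxDoc      = 100000000L;   // figure quoted in this thread
       long orderBytes  = maxDoc * 4L;  // ~400MB, independent of JVM pointer size
       long lookupBytes = 11L * 64L;    // ~10 short country strings: negligible
       System.out.println((orderBytes + lookupBytes) / (1024 * 1024) + " MB"); // ~381
     }
   }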

Fuad Efendi wrote:
> To be correct, I analyzed FieldCache awhile ago and I believed it never
> "sizes down"...
>
> /**
>  * Expert: The default cache implementation, storing all values in memory.
>  * A WeakHashMap is used for storage.
>  *
>  * Created: May 19, 2004 4:40:36 PM
>  *
>  * @since   lucene 1.4
>  */
>
>
> Will it size down? Only if we are not faceting (as in SOLR v.1.3)...
>
> And I am still unsure, Document ID vs. Object Pointer.
>
>
>
>
>   
>> I don't understand this:
>> 
>>> so with a ton of docs and a few uniques, you get a temp boost in the RAM
>>> reqs until it sizes it down.
>>>   
>> Sizes down??? Why is it called Cache indeed? And how SOLR uses it if it is
>> not cache?
>>
>> 
>
>
>   


-- 
- Mark

http://www.lucidimagination.com





Re: SolrJ looping until I get all the results

2009-11-02 Thread Paul Tomblin
On Mon, Nov 2, 2009 at 8:47 PM, Avlesh Singh  wrote:
>>
>> I was doing it that way, but what I'm doing with the documents is do
>> some manipulation and put the new classes into a different list.
>> Because I basically have two times the number of documents in lists,
>> I'm running out of memory.  So I figured if I do it 1000 documents at
>> a time, the SolrDocumentList will get garbage collected at least.
>>
> You are right w.r.t to all that but I am surprised that you would need ALL
> the documents from the index for a search requirement.

This isn't a search, this is a search and destroy.  Basically I need
the file names of all the documents that I've indexed in Solr so that
I can delete them.

-- 
http://www.linkedin.com/in/paultomblin
http://careers.stackoverflow.com/ptomblin


Re: SolrJ looping until I get all the results

2009-11-02 Thread Avlesh Singh
>
> This isn't a search, this is a search and destroy.  Basically I need the
> file names of all the documents that I've indexed in Solr so that I can
> delete them.
>
Okay. I am sure you are aware of the "fl" parameter, which restricts the
fields returned with a response. If you need limited info, it
might be a good idea to use this parameter.

Cheers
Avlesh

On Tue, Nov 3, 2009 at 7:23 AM, Paul Tomblin  wrote:

> On Mon, Nov 2, 2009 at 8:47 PM, Avlesh Singh  wrote:
> >>
> >> I was doing it that way, but what I'm doing with the documents is do
> >> some manipulation and put the new classes into a different list.
> >> Because I basically have two times the number of documents in lists,
> >> I'm running out of memory.  So I figured if I do it 1000 documents at
> >> a time, the SolrDocumentList will get garbage collected at least.
> >>
> > You are right w.r.t to all that but I am surprised that you would need
> ALL
> > the documents from the index for a search requirement.
>
> This isn't a search, this is a search and destroy.  Basically I need
> the file names of all the documents that I've indexed in Solr so that
> I can delete them.
>
> --
> http://www.linkedin.com/in/paultomblin
> http://careers.stackoverflow.com/ptomblin
>

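Pulling the suggestions in this thread together, a sketch of the paging loop with the fl restriction and a numFound stop condition; the field name "filename", the server URL and the page size are assumptions, not anything from the original setup:

   import org.apache.solr.client.solrj.SolrQuery;
   import org.apache.solr.client.solrj.SolrServer;
   import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
   import org.apache.solr.common.SolrDocument;
   import org.apache.solr.common.SolrDocumentList;

   public class CollectFilenames {
     public static void main(String[] args) throws Exception {
       SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
       SolrQuery query = new SolrQuery("*:*");
       query.setFields("filename");   // the "fl" parameter: fetch only what is needed
       query.setRows(1000);
       int start = 0;
       while (true) {
         query.setStart(start);
         SolrDocumentList docs = server.query(query).getResults();
         if (docs.isEmpty()) break;
         for (SolrDocument doc : docs) {
           String filename = (String) doc.getFieldValue("filename");
           // ... delete the file, then let the document go out of scope ...
         }
         start += docs.size();
         if (start >= docs.getNumFound()) break;   // no more pages
       }
     }
   }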

RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
I believe this is correct estimate:

> C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID]
>
>   same as 
> [String1_Document_Count + ... + String10_Document_Count + ...] 
> x [4 bytes per DocumentID]


So, for 100 million docs we need 400Mb for each(!) non-tokenized field.
Although FieldCacheImpl is based on WeakHashMap (somewhere...), we can't
rely on "sizing down" with SOLR faceting features.


I think I finally found the answer...

  /** Expert: Stores term text values and document ordering data. */
  public static class StringIndex {
...   
/** All the term values, in natural order. */
public final String[] lookup;

/** For each document, an index into the lookup array. */
public final int[] order;
...
  }



Another API:
  /** Checks the internal cache for an appropriate entry, and if none
   * is found, reads the term values in field and returns an
array
   * of size reader.maxDoc() containing the value each document
   * has in the given field.
   * @param reader  Used to get field values.
   * @param field   Which field contains the strings.
   * @return The values in the given field for each document.
   * @throws IOException  If any error occurs.
   */
  public String[] getStrings (IndexReader reader, String field)
  throws IOException;


Looks similar; cache size is [maxdoc]; however values stored are 8-byte
pointers for 64-bit JVM.


  private Map,Cache> caches;
  private synchronized void init() {
caches = new HashMap,Cache>(7);
...
caches.put(String.class, new StringCache(this));
caches.put(StringIndex.class, new StringIndexCache(this));
...
  }


StringCache and StringIndexCache use WeakHashMap internally... but objects
won't ever be garbage collected in a "faceted" production system...

SOLR SimpleFacets doesn't use the "getStrings" API, so the hope is that memory
requirements are minimized.


However, Lucene may use it internally for some queries (or, for instance, to
get access to a nontokenized cached field without reading the index)... to be
safe, use this in your basic memory estimates:


[512Mb ~ 1Gb] + [non_tokenized_fields_count] x [maxdoc] x [8 bytes]


-Fuad



> -Original Message-
> From: Fuad Efendi [mailto:f...@efendi.ca]
> Sent: November-02-09 7:37 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Lucene FieldCache memory requirements
> 
> 
> Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM... no
> difference between maxdoc and maxdoc + 1 for such estimate... difference
is
> between 0.4Gb and 1.2Gb...
> 
> 
> So, let's vote ;)
> 
> A. [maxdoc] x [8 bytes ~ pointer to String object]
> 
> B. [maxdoc] x [8 bytes ~ pointer to Document object]
> 
> C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID]
> - same as [String1_Document_Count + ... + String10_Document_Count] x [4
> bytes ~ DocumentID]
> 
> D. [maxdoc] x [4 bytes + 8 bytes ~ my initial naive thinking...]
> 
> 
> Please confirm that it is Pointer to Object and not Lucene Document ID...
I
> hope it is (int) Document ID...
> 
> 
> 
> 
> 
> > -Original Message-
> > From: Mark Miller [mailto:markrmil...@gmail.com]
> > Sent: November-02-09 6:52 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Lucene FieldCache memory requirements
> >
> > It also briefly requires more memory than just that - it allocates an
> > array the size of maxdoc+1 to hold the unique terms - and then sizes
down.
> >
> > Possibly we can use the getUnuiqeTermCount method in the flexible
> > indexing branch to get rid of that - which is why I was thinking it
> > might be a good idea to drop the unsupported exception in that method
> > for things like multi reader and just do the work to get the right
> > number (currently there is a comment that the user should do that work
> > if necessary, making the call unreliable for this).
> >
> > Fuad Efendi wrote:
> > > Thank you very much Mike,
> > >
> > > I found it:
> > > org.apache.solr.request.SimpleFacets
> > > ...
> > > // TODO: future logic could use filters instead of the
> fieldcache if
> > > // the number of terms in the field is small enough.
> > > counts = getFieldCacheCounts(searcher, base, field,
> offset,limit,
> > > mincount, missing, sort, prefix);
> > > ...
> > > FieldCache.StringIndex si =
> > > FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
> > > final String[] terms = si.lookup;
> > > final int[] termNum = si.order;
> > > ...
> > >
> > >
> > > So that 64-bit requires more memory :)
> > >
> > >
> > > Mike, am I right here?
> > > [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents
(100mlns)]
> > > (64-bit JVM)
> > > 1.2Gb RAM for this...
> > >
> > > Or, may be I am wrong:
> > >
> > >> For Lucene directly, simple strings would consume an pointer (4 or 8
> > >> bytes depending on whether your JRE is 64bit) per doc, and the string
> > >> index would consume an int (4 bytes) per doc.
> > >>
> > >
> > > [8 bytes (64bit)] x [number of documents (100mlns)]?
> > > 0.8

RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Hi Mark,

Yes, I understand it now; however, how will StringIndexCache size down in a
production system faceting by Country on a homepage? This is SOLR
specific...


Lucene specific: Lucene doesn't read from disk if it can retrieve field
value for a specific document ID from cache. How will it size down in purely
Lucene-based heavy-loaded production system? Especially if this cache is
used for query optimizations.



> -Original Message-
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: November-02-09 8:53 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Lucene FieldCache memory requirements
> 
>  static final class StringIndexCache extends Cache {
> StringIndexCache(FieldCache wrapper) {
>   super(wrapper);
> }
> 
> @Override
> protected Object createValue(IndexReader reader, Entry entryKey)
> throws IOException {
>   String field = StringHelper.intern(entryKey.field);
>   final int[] retArray = new int[reader.maxDoc()];
>   String[] mterms = new String[reader.maxDoc()+1];
>   TermDocs termDocs = reader.termDocs();
>   TermEnum termEnum = reader.terms (new Term (field));
>   int t = 0;  // current term number
> 
>   // an entry for documents that have no terms in this field
>   // should a document with no terms be at top or bottom?
>   // this puts them at the top - if it is changed,
> FieldDocSortedHitQueue
>   // needs to change as well.
>   mterms[t++] = null;
> 
>   try {
> do {
>   Term term = termEnum.term();
>   if (term==null || term.field() != field) break;
> 
>   // store term text
>   // we expect that there is at most one term per document
>   if (t >= mterms.length) throw new RuntimeException ("there are
> more terms than " +
>   "documents in field \"" + field + "\", but it's
> impossible to sort on " +
>   "tokenized fields");
>   mterms[t] = term.text();
> 
>   termDocs.seek (termEnum);
>   while (termDocs.next()) {
> retArray[termDocs.doc()] = t;
>   }
> 
>   t++;
> } while (termEnum.next());
>   } finally {
> termDocs.close();
> termEnum.close();
>   }
> 
>   if (t == 0) {
> // if there are no terms, make the term array
> // have a single null entry
> mterms = new String[1];
>   } else if (t < mterms.length) {
> // if there are less terms than documents,
> // trim off the dead array space
> String[] terms = new String[t];
> System.arraycopy (mterms, 0, terms, 0, t);
> mterms = terms;
>   }
> 
>   StringIndex value = new StringIndex (retArray, mterms);
>   return value;
> }
>   };
> 
> The formula for a String Index fieldcache is essentially the String
> array of unique terms (which does indeed "size down" at the bottom) and
> the int array indexing into the String array.
> 
> 
> Fuad Efendi wrote:
> > To be correct, I analyzed FieldCache awhile ago and I believed it never
> > "sizes down"...
> >
> > /**
> >  * Expert: The default cache implementation, storing all values in
memory.
> >  * A WeakHashMap is used for storage.
> >  *
> >  * Created: May 19, 2004 4:40:36 PM
> >  *
> >  * @since   lucene 1.4
> >  */
> >
> >
> > Will it size down? Only if we are not faceting (as in SOLR v.1.3)...
> >
> > And I am still unsure, Document ID vs. Object Pointer.
> >
> >
> >
> >
> >
> >> I don't understand this:
> >>
> >>> so with a ton of docs and a few uniques, you get a temp boost in the
RAM
> >>> reqs until it sizes it down.
> >>>
> >> Sizes down??? Why is it called Cache indeed? And how SOLR uses it if it
is
> >> not cache?
> >>
> >>
> >
> >
> >
> 
> 
> --
> - Mark
> 
> http://www.lucidimagination.com
> 
> 





RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Even in simplistic scenario, when it is Garbage Collected, we still
_need_to_be_able_ to allocate enough RAM to FieldCache on demand... linear
dependency on document count...


> 
> Hi Mark,
> 
> Yes, I understand it now; however, how will StringIndexCache size down in
a
> production system faceting by Country on a homepage? This is SOLR
> specific...
> 
> 
> Lucene specific: Lucene doesn't read from disk if it can retrieve field
> value for a specific document ID from cache. How will it size down in
purely
> Lucene-based heavy-loaded production system? Especially if this cache is
> used for query optimizations.
> 




Re: Why does BinaryRequestWriter force the path to be base URL + "/update/javabin"

2009-11-02 Thread Noble Paul നോബിള്‍ नोब्ळ्
yup, that can be relaxed. It was just a convention.

On Tue, Nov 3, 2009 at 5:24 AM, Stuart Tettemer  wrote:
> Hi folks,
> First of all, thanks for Solr.  It is a great piece of work.
>
> I have a question about BinaryRequestWriter in the solrj project.  Why does
> it force the path of UpdateRequests to have be "/update/javabin" (see
> BinaryRequestWriter.getPath(String) starting on line 109)?
>
> I am extending BinaryRequestWriter specifically to remove this requirement
> and am interested to know the reasoning behind in the inital choice.
>
> Thanks for your time,
> Stuart
>



-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: Question regarding snapinstaller

2009-11-02 Thread Lance Norskog
In Posix-compliant systems (basically Unix system calls) a file exists
independently of its file names, and there can be multiple names for a file.
If a program has a file open, that file can be deleted but it will
still exist until the program closes it (or the program exits).

In the snapinstaller cycle, Solr holds the old index files open while
snapinstaller swaps in the new set. The 'commit' operation causes Solr
to (eventually) close all of the old index files and at that point
they will go away.

On Mon, Nov 2, 2009 at 1:26 PM, Prasanna Ranganathan
 wrote:
>
>  It looks like the snapinstaller script does an atomic remove and replace of
> the entire solr_home/data_dir/index folder with the contents of the new
> snapshot before issuing a commit command. I am trying to understand the
> implication of the same.
>
>  What happens to queries that come during the time interval between the
> instant the existing directory is removed and the commit command gets
> finalized? Does a currently running instance of Solr not need the files in
> the index folder to serve the query results? Are all the contents of the
> index folder loaded into memory?
>
>  Thanks in advance for any help.
>
> Regards,
>
> Prasanna.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Annotations and reference types

2009-11-02 Thread Noble Paul നോബിള്‍ नोब्ळ्
I guess this is not a very good idea.

The document itself is a flat data structure. It is hard to see it as a
nested data structure. If allowed, how deep would we wish to make
it?

The simple solution would be to write setters for "b_id" and "b_name" in class A
and the setters can inject values into B.

On Mon, Nov 2, 2009 at 10:05 PM, Shalin Shekhar Mangar
 wrote:
> On Thu, Oct 29, 2009 at 7:57 PM, M. Tinnemeyer  wrote:
>
>> Dear listusers,
>>
>> Is there a way to store an instance of class A (including the fields from
>> "myB") via solr using annotations ?
>> The index should look like : id; name; b_id; b_name
>>
>> --
>> Class A {
>>
>> @Field
>> private String id;
>> @Field
>> private String name;
>> @Field
>> private B myB;
>> }
>>
>> --
>> Class B {
>>
>> @Field("b_id")
>> private String id;
>> @Field("B_name")
>> private String name;
>> }
>>
>>
> No.
>
> I guess you want to represent certain fields in class B and have them as an
> attribute in Class A (but all fields belong to the same schema), then it can
> be a worthwhile addition to Solrj. Can you open an issue? A patch would be
> even better :)
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com

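A sketch of that setter approach; the class and field names follow the example quoted above, and this is only meant to illustrate the wiring, not tested code:

   import org.apache.solr.client.solrj.beans.Field;

   class B {
     private String id;
     private String name;
     public void setId(String id) { this.id = id; }
     public void setName(String name) { this.name = name; }
   }

   class A {
     @Field private String id;
     @Field private String name;
     private B myB = new B();

     // the flat schema fields b_id / b_name are forwarded into the nested B
     @Field("b_id")
     public void setBId(String bId) { myB.setId(bId); }

     @Field("b_name")
     public void setBName(String bName) { myB.setName(bName); }
   }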

Re: field queries seem slow

2009-11-02 Thread Lance Norskog
This searches author:albert and (default text field): einstein. This
may not be what you expect?

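(For comparison, standard Lucene query syntax for keeping both terms on the author field; nothing here is specific to this particular index:)

   author:albert einstein          ->  author:albert, plus einstein against the default field
   author:"albert einstein"        ->  phrase query, both words on the author field
   author:(albert AND einstein)    ->  both terms required on the author field, any order
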
On Mon, Nov 2, 2009 at 2:30 PM, Erick Erickson  wrote:
> H, are you sorting? And has your readers been reopened? Is the
> second query of that sort also slow? If the answer to this last question is
> "no",
> have you tried some autowarming queries?
>
> Best
> Erick
>
> On Mon, Nov 2, 2009 at 4:34 PM, mike anderson wrote:
>
>> I took a look through my Solr logs this weekend and noticed that the
>> longest
>> queries were on particular fields, like "author:albert einstein". Is this a
>> result consistent with other setups out there? If not, Is there a trick to
>> make these go faster? I've read up on filter queries and use those when
>> applicable, but they don't really solve all my problems.
>>
>> If anybody wants to take a shot at it but needs to see my solrconfig, etc
>> just let me know.
>>
>> Cheers,
>> Mike
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: tracking solr response time

2009-11-02 Thread Yonik Seeley
On Mon, Nov 2, 2009 at 2:21 PM, bharath venkatesh
 wrote:
> we observed many times there is huge mismatch between qtime and
> time measured at the client for the response

Long times to stream back the result to the client could be due to
 - client not reading fast enough
 - network congestion
 - reading the stored fields takes a long time
- this can happen with really big indexes that can't all fit in
memory, and stored fields tend to not be cached well by the OS
(essentially random access patterns over a huge area).  This ends up
causing a disk seek per document being
streamed back.
 - locking contention for reading the index (under Solr 1.3, but not
under 1.4 on non-windows platforms)

I didn't see where you said what Solr version you were using.  There
are some pretty big concurrency differences between 1.3 and 1.4 too
(if your tests involve many concurrent requests).

-Yonik
http://www.lucidimagination.com


RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
FieldCache internally uses a WeakHashMap... nothing wrong with that, but no amount of
Garbage Collection tuning will help if the allocated RAM is not enough
for replacing Weak** with Strong**, especially for SOLR faceting... 10%-15%
CPU taken by GC has been reported...
-Fuad





solrj query size limit?

2009-11-02 Thread Gregg Horan
I'm constructing a query using solrj that has a fairly large number of 'OR'
clauses.  I'm just adding it as a big string to setQuery(), in the format
"accountId:(this OR that OR yada)".

This works all day long with 300 values.  When I push it up to 350-400
values, I get a "Bad Request" SolrServerException.  It appears to just be a
client error - nothing reaching the server logs.  Very repeatable... dial it
back down and it goes through again fine.

The total string length of the query (including a handful of other faceting
entries) is about 9500chars.   I do have the maxBooleanClauses jacked up to
2048.  Using javabin.  1.4-dev.

Are there any other options or settings I might be overlooking?

-Gregg


Proper way to set up Multi Core / Core admin

2009-11-02 Thread Jonathan Hendler
Getting started with multi core setup following http://wiki.apache.org/solr/CoreAdmin 
 and the book. Generally everything makes sense, but I have one  
question.


Here's how easy it was:

1. place the solr.war into the server
2. create your core directories in the newly created solr/ directory
3. set up solr.xml, the config files for a data import handler, the
   [core]/conf/solrconfig.xml, [core]/conf/schema.xml, etc.
4. copy the /admin directory present in /solr into each /solr/[core]
   directory


Is step 4 a correct step in the setting up of a multi core environment?

TIA

Re: Proper way to set up Multi Core / Core admin

2009-11-02 Thread Jonathan Hendler

Sorry for the confusion - step four is to be avoided, obviously.


On Nov 2, 2009, at 11:46 PM, Jonathan Hendler wrote:

Getting started with multi core setup following http://wiki.apache.org/solr/CoreAdmin 
 and the book. Generally everything makes sense, but I have one  
question.


Here's how easy it was:

1. place the solr.war into the server
2. create your core directories in the newly created solr/ directory
3. set up solr.xml, the config files for a data import handler, the
   [core]/conf/solrconfig.xml, [core]/conf/schema.xml, etc.
4. copy the /admin directory present in /solr into each /solr/[core]
   directory


Is step 4 a correct step in the setting up of a multi core  
environment?


TIA

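For reference, a minimal solr.xml sketch matching the CoreAdmin layout; the core names and instanceDir values are placeholders. As the follow-up below notes, step 4 isn't needed, since the admin pages are served by the webapp itself:

   <solr persistent="true">
     <cores adminPath="/admin/cores">
       <core name="core0" instanceDir="core0" />
       <core name="core1" instanceDir="core1" />
     </cores>
   </solr>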



Re: Match all terms in doc

2009-11-02 Thread Shalin Shekhar Mangar
On Sun, Nov 1, 2009 at 3:33 AM, Magnus Eklund wrote:

> Hi
>
> How do I restrict hits to documents containing all words (regardless of
> order) of a query in particular field?
>
> Suppose I have two documents with a field called name in my index:
>
> doc1 => name: Pink
> doc2 => name: Pink Floyd
>
> When querying for "Pink" I want only doc1 and when querying for "Pink
> Floyd" or "Floyd Pink" I want doc2.
>
>
You can query like:
+name:Floyd +name:Pink

The + character means a must have condition. This will match documents which
have Floyd as well as Pink in any order.
-- 
Regards,
Shalin Shekhar Mangar.


Re: SpellCheckComponent suggestions and case

2009-11-02 Thread Shalin Shekhar Mangar
On Sat, Oct 31, 2009 at 2:51 AM, Acadaca  wrote:

>
> I am having great difficulty getting SpellCheckComponent to ignore case.
>
> Given a search of Glod, the suggestion is wood
> Given a search of glod, the suggestion is gold
>
> I am using LowerCaseTokenizerFactory for both query and index, so as I
> understand it Glod and glod should be treated the same. If not, how can I
> truly ignore case?
>
>
What parameters are you specifying for a spell check query? If you use
spellcheck.q then the field's query analyzer is used; otherwise (if you use the q
parameter), it just uses a whitespace tokenizer and case can matter.

-- 
Regards,
Shalin Shekhar Mangar.


Re: json.wrf parameter

2009-11-02 Thread Shalin Shekhar Mangar
On Sun, Nov 1, 2009 at 5:55 AM, Ankit Bhatnagar wrote:

> Hi Yonik,
>
> I have a question regarding json.wrf parameter that you introduced in Solr
> query.
>
> I am using YUi Datasource widget and it accepts JSONP format.
>
> Could you tell me if I specify json.wrf in the query will solr return the
> response enclosed in () which is essentially JSONP format.
>
>
Yes, using json.wrf will return the response wrapped in a call to the function
name you pass, which is the JSONP format.

-- 
Regards,
Shalin Shekhar Mangar.

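A quick illustration; wt=json and json.wrf are the actual parameter names, while the callback name handleResults is arbitrary:

   request : /solr/select?q=pink&wt=json&json.wrf=handleResults
   response: handleResults({"responseHeader":{...},"response":{...}})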

Re: solrj query size limit?

2009-11-02 Thread Avlesh Singh
Did you hit the limit for maximum number of characters in a GET request?

Cheers
Avlesh

On Tue, Nov 3, 2009 at 9:36 AM, Gregg Horan  wrote:

> I'm constructing a query using solrj that has a fairly large number of 'OR'
> clauses.  I'm just adding it as a big string to setQuery(), in the format
> "accountId:(this OR that OR yada)".
>
> This works all day long with 300 values.  When I push it up to 350-400
> values, I get a "Bad Request" SolrServerException.  It appears to just be a
> client error - nothing reaching the server logs.  Very repeatable... dial
> it
> back down and it goes through again fine.
>
> The total string length of the query (including a handful of other faceting
> entries) is about 9500chars.   I do have the maxBooleanClauses jacked up to
> 2048.  Using javabin.  1.4-dev.
>
> Are there any other options or settings I might be overlooking?
>
> -Gregg
>

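If the "Bad Request" is the servlet container rejecting an oversized GET URL, one workaround (a sketch against the SolrJ 1.4 API) is to send the query as a POST so the parameters travel in the request body instead of the URL:

   // org.apache.solr.client.solrj.request.QueryRequest, SolrRequest.METHOD
   // "server" and "solrQuery" are the existing SolrServer / SolrQuery instances
   QueryRequest req = new QueryRequest(solrQuery, SolrRequest.METHOD.POST);
   QueryResponse resp = req.process(server);

(The container limit itself, e.g. maxHttpHeaderSize on a Tomcat connector, can also be raised, but POST sidesteps the issue entirely.)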

Re: Problems downloading lucene 2.9.1

2009-11-02 Thread Licinio Fernández Maurelo
Thanks guys !!!

2009/11/2 Ryan McKinley 

>
> On Nov 2, 2009, at 8:29 AM, Grant Ingersoll wrote:
>
>
>> On Nov 2, 2009, at 12:12 AM, Licinio Fernández Maurelo wrote:
>>
>>  Hi folks,
>>>
>>> as we are using a snapshot dependency to solr1.4, today we are getting
>>> problems when maven tries to download lucene 2.9.1 (there isn't any 2.9.1
>>> there).
>>>
>>> Which repository can i use to download it?
>>>
>>
>> They won't be there until 2.9.1 is officially released.  We are trying to
>> speed up the Solr release by piggybacking on the Lucene release, but this
>> little bit is the one downside.
>>
>
> Until then, you can add a repo to:
>
> http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/
>
>
>


-- 
Lici


Re: Problems downloading lucene 2.9.1

2009-11-02 Thread Licinio Fernández Maurelo
Well, I've solved this problem by executing  mvn install:install-file
-DgroupId=org.apache.lucene -DartifactId=lucene-analyzers -Dversion=2.9.1
-Dpackaging=jar -Dfile= for each lucene-* artifact.

I think there must be an easier way to do this, am I wrong?

Hope it helps

Thx

El 3 de noviembre de 2009 08:03, Licinio Fernández Maurelo <
licinio.fernan...@gmail.com> escribió:

> Thanks guys !!!
>
> 2009/11/2 Ryan McKinley 
>
>
>> On Nov 2, 2009, at 8:29 AM, Grant Ingersoll wrote:
>>
>>
>>> On Nov 2, 2009, at 12:12 AM, Licinio Fernández Maurelo wrote:
>>>
>>>  Hi folks,

 as we are using an snapshot dependecy to solr1.4, today we are getting
 problems when maven try to download lucene 2.9.1 (there isn't a any
 2.9.1
 there).

 Which repository can i use to download it?

>>>
>>> They won't be there until 2.9.1 is officially released.  We are trying to
>>> speed up the Solr release by piggybacking on the Lucene release, but this
>>> little bit is the one downside.
>>>
>>
>> Until then, you can add a repo to:
>>
>> http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/
>>
>>
>>
>
>
> --
> Lici
>



-- 
Lici