Re: Does Solr fit my needs?

2012-04-30 Thread G.Long

Hi :)

Thank you all for your answers. I'll try these solutions :)

Kind regards,

Gary

Le 27/04/2012 16:31, G.Long a écrit :

Hi there :)

I'm looking for a way to save XML files into some sort of database, and
I'm wondering if Solr would fit my needs.
The XML files I want to save have many child nodes, which in turn contain
child nodes with multiple values. The nesting depth can be more than 10
levels.

After indexing the files, I would like to be able to query for subparts of
those XML files and to reconstruct them as XML, with all their children
included. Is it possible, with a Solr/Lucene index, to keep or easily
recover the structure of my XML data?


Thanks for your help,

Regards,

Gary




Re: Java out of memory - with fieldcache faceting

2012-04-30 Thread Dan Tuffery
You need to add more memory to the JVM that is running Solr:

http://wiki.apache.org/solr/SolrPerformanceFactors#OutOfMemoryErrors
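
For example, with the example Jetty distribution, something like this
(the heap sizes are placeholders; pick values that fit your machine):

    java -Xms512m -Xmx1g -jar start.jar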

Dan

On Mon, Apr 30, 2012 at 9:43 AM, Yuval Dotan  wrote:

> Hi Guys
> I have a problem and I need your assistance
> I get an exception when doing field cache faceting (the enum method works
> perfectly):
>
> */solr/select?q=*:*&facet=true&facet.field=src_ip_str&facet.limit=10*
>
> java.lang.OutOfMemoryError: Java heap space
>
> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
>   at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:449)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:277)
>   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
>   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
>   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
>   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
>   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
>   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
>   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
>   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
>   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
>   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
>   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
>   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
>   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
>   at org.eclipse.jetty.server.Server.handle(Server.java:351)
>   at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
>   at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
>   at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
>   at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
>   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
>   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
>   at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
>   at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
>   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
>   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
>   at java.lang.Thread.run(Thread.java:679)
> Caused by: java.lang.OutOfMemoryError: Java heap space
>   at org.apache.lucene.util.packed.Direct16.<init>(Direct16.java:38)
>   at org.apache.lucene.util.packed.PackedInts.getMutable(PackedInts.java:267)
>   at org.apache.lucene.util.packed.GrowableWriter.set(GrowableWriter.java:81)
>   at org.apache.lucene.search.FieldCacheImpl$DocTermsIndexCache.createValue(FieldCacheImpl.java:1178)
>   at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:248)
>   at org.apache.lucene.search.FieldCacheImpl.getTermsIndex(FieldCacheImpl.java:1081)
>   at org.apache.lucene.search.FieldCacheImpl.getTermsIndex(FieldCacheImpl.java:1077)
>   at org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:459)
>   at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:310)
>   at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:396)
>   at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:205)
>   at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:81)
>   at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1541)
>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
>   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
>   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
>   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
>   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
>   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
>   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
>   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)

Lucene FieldCache - Out of memory exception

2012-04-30 Thread Rahul R
Hello,
I am using Solr 1.3 with JDK 1.5.0_14 and WebLogic 10 MP1 as the application
server on Solaris. I use the embedded Solr server. More details:
Number of docs in the Solr index : 1.4 million
Physical size of index : 640 MB
Total number of fields in the index : 700 (99% of these are dynamic fields)
Total number of fields enabled for faceting : 440
Avg number of facet fields participating in a faceted query : 50-70
Total RAM allocated to the WebLogic app server : 3 GB (max possible)

In a multi-user environment with 3 users using this application for around
40 minutes, the application runs out of memory. Analysis of the heap dump
shows that almost 85% of the memory is retained by the FieldCache. Now I
understand that the FieldCache is outside our control, but I would
appreciate some suggestions on how to handle this issue.

Some questions on this front:
- Some mail threads on this forum seem to indicate that there could be some
connection between having dynamic fields and usage of the FieldCache. Is
this true? Most of the fields in my index are dynamic fields.
- As mentioned above, most of my faceted queries could have around 50-70
facet fields (I would do SolrQuery.addFacetField() for around 50-70 fields
per query). Could this be the source of the problem? Is this too high for
Solr to support?
- Initially, I had a facet.sort defined in solrconfig.xml. Since the
FieldCache builds up on sorting, I removed the facet.sort and tried again,
but no respite; the behavior is the same as before.
- The document id that I have for each document is quite big (around 50
characters on average). Can this be a problem? I reduced it to around 15
characters and tried, but still there is no improvement.
- Can the size of the data be a problem? On this forum I see many users
talking of more than 100 million documents in their index; I have only 1.4
million with a physical size of 640 MB. The physical server on which this
application runs has sufficient RAM and CPU.
- What gets stored in the FieldCache? Is it the entire document or just
the document id?
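
(A rough back-of-the-envelope estimate, assuming Lucene's StringIndex
layout of one 32-bit ord per document per cached field and ignoring the
term values themselves:

    440 fields x 1,400,000 docs x 4 bytes ~= 2.3 GB

so if all facet-enabled fields end up cached, the ord arrays alone would
nearly exhaust a 3 GB heap.)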


Any help is much appreciated. Thank you.

regards
Rahul


Re: Weird query results with edismax and boolean operator +

2012-04-30 Thread Vadim Kisselmann
Hi Jan,
thanks for your response!

My "qf" parameter for edismax is: "title". My
"defaultSearchField=text" in schema.xml.
In my app i generate a query with "qf=title,text", so i think the
default parameters in config/schema should bei overridden, right?

I found eventually 2 reasons for this behavior.
1. "mm"-parameter in solrconfig.xml for edismax is 0. 0 stands for
"OR", but it should be an "AND" => 100%.
2. I suppose that my app does not override my "default-qf".
I test it today and report, with my parsed query and all params.
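
For reference, a minimal sketch of where that mm fix would live
(solrconfig.xml; the handler name and qf values are only illustrative):

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="qf">title text</str>
        <str name="mm">100%</str>
      </lst>
    </requestHandler>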

Best regards
Vadim




2012/4/29 Jan Høydahl :
> Hi,
>
> What is your "qf" parameter?
> Can you run the three queries with debugQuery=true&echoParams=all and attach 
> parsed query and all params? It will probably explain what is happening.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 27. apr. 2012, at 11:21, Vadim Kisselmann wrote:
>
>> Hi folks,
>>
>> I use Solr 4.0 from trunk, and edismax as the standard query handler.
>> In my schema I defined this: [tag stripped by the list archive]
>>
>> I have this simple problem:
>>
>> nascar +author:serg* (3500 matches)
>>
>> +nascar +author:serg* (1 match)
>>
>> nascar author:serg* (5200 matches)
>>
>> nascar AND author:serg* (1 match)
>>
>> I think I understand the query syntax, but this behavior confuses me.
>> Why these match differences?
>>
>> By the way, in all matches I get at least one of my terms,
>> but not always both.
>>
>> Best regards
>> Vadim
>


Re: Dynamic creation of cores for this use case.

2012-04-30 Thread pprabhcisco123


Thanks kuli, for your response. We tried to implement it as per the
instructions. But the problem again is how to create an index for every
thirty customers separately. Is there any programmatic way to do this, or
do we need to create a query in the configuration file?
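
If "programmatic" means creating cores on the fly, a minimal SolrJ sketch
(the URL, core name, and instanceDir are placeholders; the instanceDir must
already contain a conf/ directory, and solr.xml needs persistent="true" for
the core to survive restarts):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;

    public class CreateCustomerCore {
        public static void main(String[] args) throws Exception {
            // talk to the CoreAdmin handler of a running multicore Solr
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            // one core per customer; the name and dir are placeholders
            CoreAdminRequest.createCore("customer42", "/path/to/solr/customer42", server);
        }
    }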

Thanks 
Prabakarab.P

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dynamic-creation-of-cores-for-this-use-case-tp3937696p3950352.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How do create dynamic core using SOLRJ

2012-04-30 Thread ayyappan
It seems to be working fine,
but I have a few questions about indexing:

1) I want to index each customer as well as each partner.
2) How do I create an index for each partner (30 customers)?

As of now I index all customers using data-config.xml
[data-config.xml snippet stripped by the list archive]

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-do-create-dynamic-core-using-SOLRJ-tp3943530p3950398.html
Sent from the Solr - User mailing list archive at Nabble.com.


Saravanan Chinnadurai/Actionimages is out of the office.

2012-04-30 Thread Saravanan . Chinnadurai
I will be out of the office starting  30/04/2012 and will not return until
01/05/2012.

Please email itsta...@actionimages.com for any urgent issues.


Action Images is a division of Reuters Limited and your data will therefore be 
protected
in accordance with the Reuters Group Privacy / Data Protection notice which is 
available
in the privacy footer at www.reuters.com
Registered in England No. 145516   VAT REG: 397000555


Re: Java out of memory - with fieldcache faceting

2012-04-30 Thread Yuval Dotan
Thanks for the fast answer.
One more question:
Is there a way to know (some formula) how much memory I need for
these actions?

Thanks
Yuval

On Mon, Apr 30, 2012 at 11:50, Dan Tuffery  wrote:

> You need to add more memory to the JVM that is running Solr:
>
> http://wiki.apache.org/solr/SolrPerformanceFactors#OutOfMemoryErrors
>
> Dan
>
> On Mon, Apr 30, 2012 at 9:43 AM, Yuval Dotan  wrote:
>
> > Hi Guys
> > I have a problem and I need your assistance
> > I get an exception when doing field cache faceting (the enum method works
> > perfectly):
> >
> > */solr/select?q=*:*&facet=true&facet.field=src_ip_str&facet.limit=10*
> >
> > java.lang.OutOfMemoryError: Java heap space
> > [full stack trace identical to the one quoted earlier in this thread; trimmed]

FW: Unsubscribe does not appear to be working

2012-04-30 Thread Kevin Bootz
I continue to receive posts from the solr group even after submitting an
unsubscribe per the instructions from the ezmlm app. Is there perhaps a delay
after I confirm the unsubscribe request? 14 posts received so far today. At
this point I have a delete rule to auto-trash the unwanted list traffic on
our server.

Thanks


-Original Message-
From: Kevin Bootz 
Sent: Sunday, April 29, 2012 11:13 AM
To: 'solr-user@lucene.apache.org'
Subject: RE: Unsubscribe does not appear to be working

Thanks all. Second unsubscribe confirmation email sent to the ezmlm app. 
Perhaps it will take this time...
"

Hi! This is the ezmlm program. I'm managing the solr-user@lucene.apache.org 
mailing list.

I'm working for my owner, who can be reached at 
solr-user-ow...@lucene.apache.org.

To confirm that you would like

   myemail

removed from the solr-user mailing list, please send a short reply to this 
address:

   solr-user-uc.1335712063.gfmnpbjnkpamcooicane-myem...@lucene.apache.org

...
"


-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Friday, April 27, 2012 12:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Unsubscribe does not appear to be working


: There is no such thing as a 'solr forum' or a 'solr forum account.'
: 
: If you are subscribed to this list, an email to the unsubscribe
: address will unsubscribe you. If some intermediary or third party is
: forwarding email from this list to you, no one here can help you.

And more specifically:

* sending email to solr-user-help@lucene will generate an automated reply with
details about how to unsubscribe and even how to tell what address you are
subscribed with.

* if the automated system isn't working for you, please send all of the
important details (who you are, what you've tried, what automated responses
you've gotten) to the solr-user-owner@lucene alias so the moderators can try to
help you.



-Hoss


FW: unsubscribe

2012-04-30 Thread Kevin Bootz

BTW,
the first request to unsubscribe was sent in February, if that helps track
this down.

Thx

From: Kevin Bootz
Sent: Friday, February 24, 2012 7:55 AM
To: 
'solr-user-uc.1330079879.acnmkgjcnnlfgdhmmlkn-kbootz=caci@lucene.apache.org'
Subject: unsubscribe




Re: Java out of memory - with fieldcache faceting

2012-04-30 Thread Dan Tuffery
There's a Lucene/Solr memory size estimator spreadsheet in SVN:

http://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/size-estimator-lucene-solr.xls

Dan

On Mon, Apr 30, 2012 at 11:39 AM, Yuval Dotan  wrote:

> Thanks for the fast answer
> One more question:
> Is there a way to know (some formula) what is the size of memory i need for
> these actions?
>
> Thanks
> Yuval
>
> On Mon, Apr 30, 2012 at 11:50, Dan Tuffery  wrote:
>
> > You need to add more memory to the JVM that is running Solr:
> >
> > http://wiki.apache.org/solr/SolrPerformanceFactors#OutOfMemoryErrors
> >
> > Dan
> >
> > On Mon, Apr 30, 2012 at 9:43 AM, Yuval Dotan 
> wrote:
> >
> > > Hi Guys
> > > I have a problem and I need your assistance
> > > I get an exception when doing field cache faceting (the enum method
> > > works perfectly):
> > >
> > > */solr/select?q=*:*&facet=true&facet.field=src_ip_str&facet.limit=10*
> > >
> > > java.lang.OutOfMemoryError: Java heap space
> > > [full stack trace identical to the one quoted earlier in this thread; trimmed]

Re: Weird query results with edismax and boolean operator +

2012-04-30 Thread Vadim Kisselmann
I tested it.
With the default "qf=title text" in solrconfig and "mm=100%"
I get the same result (1 match) for "nascar AND author:serg*" and "+nascar
+author:serg*", great.
With "nascar +author:serg*" I get 3500 matches; in this case the
mm parameter does not seem to apply.

Here are my debug params for "nascar AND author:serg*":

rawquerystring: nascar AND author:serg*
parsedquery: (+(+DisjunctionMaxQuery((text:nascar |
title:nascar)~0.01) +author:serg*))/no_coord
parsedquery_toString: +(+(text:nascar | title:nascar)~0.01
+author:serg*)
8.235954 = (MATCH) sum of:
  8.10929 = (MATCH) max plus 0.01 times others of:
8.031613 = (MATCH) weight(text:nascar in 0) [DefaultSimilarity], result of:
  8.031613 = score(doc=0,freq=2.0 = termFreq=2.0
), product of:
0.84814763 = queryWeight, product of:
  6.6960144 = idf(docFreq=27, maxDocs=8335)
  0.12666455 = queryNorm
9.469594 = fieldWeight in 0, product of:
  1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
  6.6960144 = idf(docFreq=27, maxDocs=8335)
  1.0 = fieldNorm(doc=0)
7.7676363 = (MATCH) weight(title:nascar in 0) [DefaultSimilarity],
result of:
  7.7676363 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
0.9919093 = queryWeight, product of:
  7.830994 = idf(docFreq=8, maxDocs=8335)
  0.12666455 = queryNorm
7.830994 = fieldWeight in 0, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
  7.830994 = idf(docFreq=8, maxDocs=8335)
  1.0 = fieldNorm(doc=0)
  0.12666455 = (MATCH) ConstantScore(author:serg*), product of:
1.0 = boost
0.12666455 = queryNorm



And here for  "nascar +author:serg*":
nascar +author:serg*
(+(DisjunctionMaxQuery((text:nascar |
title:nascar)~0.01) +author:serg*))/no_coord
+((text:nascar | title:nascar)~0.01
+author:serg*)
8.235954 = (MATCH) sum of:
  8.10929 = (MATCH) max plus 0.01 times others of:
8.031613 = (MATCH) weight(text:nascar in 0) [DefaultSimilarity], result of:
  8.031613 = score(doc=0,freq=2.0 = termFreq=2.0
), product of:
0.84814763 = queryWeight, product of:
  6.6960144 = idf(docFreq=27, maxDocs=8335)
  0.12666455 = queryNorm
9.469594 = fieldWeight in 0, product of:
  1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
  6.6960144 = idf(docFreq=27, maxDocs=8335)
  1.0 = fieldNorm(doc=0)
7.7676363 = (MATCH) weight(title:nascar in 0) [DefaultSimilarity],
result of:
  7.7676363 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
0.9919093 = queryWeight, product of:
  7.830994 = idf(docFreq=8, maxDocs=8335)
  0.12666455 = queryNorm
7.830994 = fieldWeight in 0, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
  7.830994 = idf(docFreq=8, maxDocs=8335)
  1.0 = fieldNorm(doc=0)
  0.12666455 = (MATCH) ConstantScore(author:serg*), product of:
1.0 = boost
0.12666455 = queryNorm


explain (a second matching doc):
0.063332275 = (MATCH) product of:
  0.12666455 = (MATCH) sum of:
0.12666455 = (MATCH) ConstantScore(author:serg*), product of:
  1.0 = boost
  0.12666455 = queryNorm
  0.5 = coord(1/2)



You can see, that for first doc in "nascar +author:serg*" all
query-params match, but in the second doc only
"ConstantScore(author:serg*)".
But with an "mm=100%" all query-params should match.
http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/
http://lucene.apache.org/solr/api/org/apache/solr/util/doc-files/min-should-match.html

Best regards
Vadim



2012/4/30 Vadim Kisselmann :
> Hi Jan,
> thanks for your response!
>
> My "qf" parameter for edismax is: "title". My
> "defaultSearchField=text" in schema.xml.
> In my app i generate a query with "qf=title,text", so i think the
> default parameters in config/schema should bei overridden, right?
>
> I found eventually 2 reasons for this behavior.
> 1. "mm"-parameter in solrconfig.xml for edismax is 0. 0 stands for
> "OR", but it should be an "AND" => 100%.
> 2. I suppose that my app does not override my "default-qf".
> I test it today and report, with my parsed query and all params.
>
> Best regards
> Vadim
>
>
>
>
> 2012/4/29 Jan Høydahl :
>> Hi,
>>
>> What is your "qf" parameter?
>> Can you run the three queries with debugQuery=true&echoParams=all and attach 
>> parsed query and all params? It will probably explain what is happening.
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Solr Training - www.solrtraining.com
>>
>> On 27. apr. 2012, at 11:21, Vadim Kisselmann wrote:
>>
>>> [original question quoted in full earlier in this thread; trimmed]

Re: commit fail

2012-04-30 Thread Erick Erickson
In the 3.6 world, LukeRequestHandler does some...er...really expensive
things when you click into the admin/schema browser. This is _much_
better in trunk BTW.

So, as Yonik says, LukeRequestHandler probably accounts for
one of the threads.

Does this occur when nobody is playing around with the admin
handler?

Erick

On Sat, Apr 28, 2012 at 10:03 AM, Yonik Seeley
 wrote:
> On Sat, Apr 28, 2012 at 7:02 AM, mav.p...@holidaylettings.co.uk
>  wrote:
>> Hi,
>>
>> This is what the thread dump looks like.
>>
>> Any ideas?
>
> Looks like the thread taking up CPU is in LukeRequestHandler
>
> >> 1062730578@qtp-1535043768-5' Id=16, RUNNABLE on lock=, total cpu
> >> time=16156160.ms user time=16153110.ms at
> >> org.apache.solr.handler.admin.LukeRequestHandler.getIndexedFieldsInfo(LukeRequestHandler.java:320)
>
> That probably accounts for the 1 CPU doing things... but it's not
> clear at all why commits are failing.
>
> Perhaps the commit is succeeding, but the client is just not waiting
> long enough for it to complete?
>
> -Yonik
> lucenerevolution.com - Lucene/Solr Open Source Search Conference.
> Boston May 7-10


Re: change index/store at indexing time

2012-04-30 Thread Erick Erickson
Your idea of using a Transformer will work just fine, you have a lot more
flexibility in a custom Transformer, see:
http://wiki.apache.org/solr/DIHCustomTransformer

You could also write a custom update handler that examined the
document on the server side and implemented your logic, or even
just add an element into an "update chain".

None of these is very hard, but they can be a bit intimidating
when starting from scratch.
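
For instance, a minimal sketch of a custom DIH transformer (the class,
package, and field names here are illustrative only):

    package com.example;

    import java.util.Map;
    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.Transformer;

    public class ConditionalFieldTransformer extends Transformer {
        @Override
        public Object transformRow(Map<String, Object> row, Context context) {
            // example rule: copy "geoid" into a second, stored-only field
            // when some condition on the row holds
            Object geoId = row.get("geoid");
            if (geoId != null && someCondition(row)) {
                row.put("geoids_stored", geoId);
            }
            return row;
        }

        private boolean someCondition(Map<String, Object> row) {
            return true; // placeholder for the real business logic
        }
    }

It is then referenced from the entity in data-config.xml via
transformer="com.example.ConditionalFieldTransformer".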

Best
Erick

On Sun, Apr 29, 2012 at 12:53 PM, Vazquez, Maria (STM)
 wrote:
> Thanks for your response.
> That's what I need: changes at indexing time. Dynamic fields are not what I
> need because the field name is the same; I just need to change whether it
> is indexed/stored based on some logic.
> It's so easily achieved with the Lucene API that I was sure there was a way
> to do the same in Solr.
>
>
> On Apr 28, 2012, at 22:34, "Jeevanandam"  wrote:
>
>> Maria,
>>
>> thanks for detailed explanation.
>> as per schema.xml; stored or indexed should be defined at design-time. Per 
>> my understanding defining at runtime is not feasible.
>> BTW, you can have multiValued="true" attribute for dynamic fields too.
>>
>> - Jeevanandam
>>
>> On 29-04-2012 2:06 am, Vazquez, Maria (STM) wrote:
>>> Thanks Jeevanandam.
>>> That still doesn't have the same behavior as Lucene since multiple
>>> fields with different names have to be created.
>>> What I want is this exactly (multi-value field)
>>>
>>> document.add(new Field("geoids", geoId, Field.Store.YES,
>>> Field.Index.NOT_ANALYZED_NO_NORMS));
>>>
>>> document.add(new Field("geoids", geoId, Field.Store.NO,
>>> Field.Index.NOT_ANALYZED_NO_NORMS));
>>>
>>> In Lucene I can save geoids first as stored and in the next line as
>>> not stored and it will do exactly that. I want to duplicate this
>>> behavior in Solr but I can't do it having only one field in the schema
>>> called geoids that I an manipulate at inde time whether to store or
>>> not depending on a condition.
>>>
>>> Thanks again for the help, hope this explanation makes it more clear
>>> in what I'm trying to do.
>>>
>>> Maria
>>>
>>> On Apr 28, 2012, at 11:49 AM, "Jeevanandam"
>>> mailto:je...@myjeeva.com>> wrote:
>>>
>>> Maria,
>>>
>>> For your need please define unique pattern using dynamic field in schema.xml
>>>
>>> Please have a look http://wiki.apache.org/solr/SchemaXml#Dynamic_fields
>>>
>>> Hope that helps!
>>>
>>> -Jeevanandam
>>>
>>> Technology keeps you connected!
>>>
>>> On Apr 28, 2012, at 10:33 PM, "Vazquez, Maria (STM)"
>>> mailto:maria.vazq...@dexone.com>> wrote:
>>>
>>> I can call a script for the logic part but what I want to figure out
>>> is how to save the same field sometimes as stored and indexed,
>>> sometimes as stored not indexed, etc. From a transformer or a script I
>>> didn't see anything where I can modify that at indexing time.
>>> Thanks a lot,
>>> Maria
>>>
>>>
>>> On Apr 27, 2012, at 18:38, "Bill Bell"
>>> mailto:billnb...@gmail.com>> wrote:
>>>
>>> Yes you can. Just use a script that is called for each row.
>>>
>>> Bill Bell
>>> Sent from mobile
>>>
>>>
>>> On Apr 27, 2012, at 6:38 PM, "Vazquez, Maria (STM)"
>>> mailto:maria.vazq...@dexone.com>> wrote:
>>>
>>> Hi,
>>> I'm migrating a project from Lucene 2.9 to Solr 3.4.
>>> There is a special case in the code that indexes the same field in
>>> two different ways, which is completely legal in Lucene directly but I
>>> don't know how to duplicate this same behavior in Solr:
>>>
>>> if (isFirstGeo) {
>>>     document.add(new Field("geoids", geoId, Field.Store.YES,
>>>         Field.Index.NOT_ANALYZED_NO_NORMS));
>>>     isFirstGeo = false;
>>> } else {
>>>     if (countProducts < 100)
>>>         document.add(new Field("geoids", geoId, Field.Store.NO,
>>>             Field.Index.NOT_ANALYZED_NO_NORMS));
>>>     else
>>>         document.add(new Field("geoids", geoId, Field.Store.YES,
>>>             Field.Index.NO));
>>> }
>>>
>>> Is there any way to do this in Solr in a Tranformer? I'm using the
>>> DIH to index and I can't see a way to do this other than having three
>>> fields in the schema like geoids_store_index, geoids_nostore_index,
>>> and geoids_store_noindex.
>>>
>>> Thanks a lot in advance.
>>> Maria
>>


Re: Scaling Solr - Suggestions !!

2012-04-30 Thread Erick Erickson
I'd get to the root of why the indexes are corrupt! This should
be very unusual. If you're seeing it at all frequently,
it indicates something is very wrong, and starting up bunches
of JVMs is a band-aid over a much more serious
problem.

Are you, by chance, doing a kill -9 or other hard abort?
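
If you need to confirm the corruption, Lucene's CheckIndex tool is one way
(the jar name and index path below are placeholders for your install):

    java -cp lucene-core-3.6.0.jar org.apache.lucene.index.CheckIndex /path/to/solr/data/index

Run it read-only first; its -fix option permanently drops unreadable segments.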

Best
Erick

On Mon, Apr 30, 2012 at 12:22 AM, Sujatha Arun  wrote:
> Now the reason I have used different webapps instead of a single one for
> the cores is that, while prototyping, I discovered that when one core's
> index is corrupt, the entire webapp does not start up, and the same must be
> true of "too many open files" etc.; that is to say, if there is an issue
> with any one core [schema/index], the entire webapp does not start up.
>
> Thanks for your suggestion.
>
>
> Regards
> Sujatha
>
>
>
>
>
>
>
> On Sat, Apr 28, 2012 at 6:49 PM, Michael Della Bitta <
> michael.della.bi...@appinions.com> wrote:
>
>> Just my opinion, but I'm not sure I see the value in deploying the cores
>> to different webapps in a single container on a single machine to avoid
>> a single point of failure... You still have a single point of failure at
>> the process level down to the hardware, which when you think about it,
>> is mostly everything. But perhaps you're at least using more than one
>> container.
>>
>> It sounds to me that the easiest route to scalability for you would be
>> to add more machines. Unless your cores are particularly complex or your
>> traffic is heavy, a 3GB core should be no match for a single machine.
>> And the traffic problem can be solved by replication and load balancing.
>>
>> Michael
>>
>> On Sat, 2012-04-28 at 13:24 +0530, Sujatha Arun wrote:
>> > Hello,
>> >
>> > *Background* :For each of our  customers, we create 3 solr webapps with
>> > different search  schema's,serving different search requirements and we
>> > have about 70 customers.So we have about 210 webapps curently .
>> >
>> > *Hardware*: Single Server , one JVM , Heap memory 19GB ,Total Ram :32GB ,
>> > Permgen initally 1GB ,now increased to 2GB.
>> >
>> > *Solr Indexes* : Most are the order of a few MB ,about 2  big index of
>> > about 3GB  each
>> >
>> > *Scaling Step 1*: We saw the permgen value go up to nearly 850 MB when
>> > we created so many webapps; hence we are now moving to Solr cores, and
>> > we are going to have about 50 cores per webapp, bringing the number of
>> > webapps down to about 5. We want to distribute the cores across multiple
>> > webapps to avoid a single point of failure.
>> >
>> >
>> > *Requirement* :
>> >
>> >
>> >    -   We need to only scale the cores horizontally ,whose index sizes
>> are
>> >    big.
>> >    -   We also require permission based search for each webapp ,would
>> solr
>> >    NRT fit our needs ,where we can index the permission into the document
>> >    ,which would mean   there would be frequent addition and deletion of
>> >    permissions to the documents across cores.
>> >    -   We also require  automatic fail over
>> >
>> > What technology would be ideal fit given Solr Cloud ,Katta , Solandra
>> > ,Lily,Elastic Search etc [Preferably Open source] [ We would be required
>> to
>> > maintain many webapps with multicores ] and what about the commercial
>> > offering given out use case
>> >
>> > Thanks.
>> >
>> > Regards,
>> > Sujatha
>>
>>
>>


Re: Using Customized sorting in Solr

2012-04-30 Thread Erick Erickson
Consider writing a custom sort method or a custom function
that you use for sorting. Be _very_ careful that anything you
do here is very efficient; it'll be called a _lot_.
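
If the rank of a listing within its advertiser's listings can be computed
at index time, the whole ordering reduces to a two-key sort. A minimal
sketch of that ordering in plain Java (the field names are illustrative,
not Solr API):

    import java.util.Comparator;

    class Listing {
        int bucketCount;    // importance bucket: 0, 1, 2, 3
        int advertiserRank; // 1 for an advertiser's 1st listing, 2 for its 2nd, ...
    }

    class BucketThenRankComparator implements Comparator<Listing> {
        public int compare(Listing a, Listing b) {
            // primary key: ascending importance bucket
            if (a.bucketCount != b.bucketCount) {
                return a.bucketCount - b.bucketCount;
            }
            // secondary key: cycle advertisers, all 1st listings before 2nd listings
            return a.advertiserRank - b.advertiserRank;
        }
    }

Indexing that rank as a field might then let a plain
sort=bucket_count asc,advertiser_rank asc express the whole ordering.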

Best
Erick

On Mon, Apr 30, 2012 at 2:10 AM, solr user  wrote:
> Hi,
>
> Any suggestions,
>
> Am I trying to do too much with solr? Is there any other search engine,
> which should be used here?
>
> I am looking into solr codebase and planning to modify QueryComponent. Will
> this be the right approach?
>
> Regards,
>
> Shivam
>
> On Fri, Apr 27, 2012 at 10:48 AM, solr user  wrote:
>
>> Jan,
>>
>> Thanks for the response,
>>
>> I thought of using it, but it would be suboptimal in the scenario I have.
>> I guess I have to explain the scenario better; let me try again:
>>
>> 1. I have importance based buckets in the system, this is implemented
>> using a variable named bucket_count having integer values 0,1,2,3, and I
>> have to show results in order of bucket_count i.e. results from 0th bucket
>> at top, then results from 1st bucket and so on. That is done by doing a asc
>> sort on this variable.
>> 2. Now *within these buckets* I need to ensure that 1st listing of every
>> advertiser comes at top, then 2nd listing from every advertiser and so on.
>>
>> Now if I go with the grouping on advertiserId and and use the
>> group.offset, then probably I also need to do additive filtering on
>> bucket_count. To explain it better pseudo algorithm will be like
>>
>> 1. query solr with group.offset 0 and bucket count 0
>> 2. if results more than zero in step1 then increase group offset and
>> follow step 1 again
>> 3. else increase bucket count with group offset zero and start from step 1.
>>
>> With this logic, in the worst case I need to query Solr (number of
>> importance buckets) * (max number of listings by an advertiser) times,
>> which could be a very high number of Solr queries for a single user query.
>> Please suggest a more optimal way; I am also open to making modifications
>> in solr/lucene code if needed.
>>
>> Regards,
>> BC Rathore
>>
>>
>>
>> On Fri, Apr 27, 2012 at 4:09 AM, Jan Høydahl wrote:
>>
>>> Hi,
>>>
>>> How about trying grouping with paging?
>>> First you do
>>> group=true&group.field=advertiserId&group.limit=1&group.offset=0&group.main=true&sort=something&group.sort=how-much-paid
>>> desc
>>>
>>> That gives you one listing per advertiser, sorted the way you like.
>>> Then to grab the next batch of ads, you go group.offset=1 etc etc.
>>>
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>> Solr Training - www.solrtraining.com
>>>
>>> On 26. apr. 2012, at 08:10, solr user wrote:
>>>
>>> > Hi,
>>> >
>>> > We are planning to move the search of one of our listing-based portals
>>> > to the solr/lucene search server from the sphinx search server. But we
>>> > are facing a challenge in porting the customized sorting used in our
>>> > portal. We only have the last 60 days of data live. The algorithm is as
>>> > follows:
>>> >
>>> >   1.  Put all listings into 54 buckets – (Date bucket for 60 days)  i.e.
>>> >   buckets of 7day, 1 day, 1 day……
>>> >   2.  For each date bucket we make 2 buckets –(Paid / free bucket)
>>> >   3.  For each paid / free bucket cycle the advertisers on uniqueness
>>> basis
>>> >
>>> >                  i.e. inside a bucket the ordering should be 1st listing
>>> > of each advertiser, 2nd listing of each advertiser and so on
>>> >                  in other words within a *sub-bucket* second listing of
>>> an
>>> > advertiser will be displayed only after first listing of all advertiser
>>> has
>>> > been displayed.
>>> >
>>> > For taking care of point 1 and 2 we have created a field named
>>> bucket_index
>>> > at the time of indexing the data and get the results sorted by this
>>> index,
>>> > but we are not able to find a way to create a sort field at index time
>>> or
>>> > think of a sort function for the point no 3.  Please suggest if there
>>> is a
>>> > way to do so in solr.
>>> >
>>> > Tia,
>>> >
>>> > BC Rathore
>>>
>>>
>>


Re: solr.WordDelimiterFilterFactory query time

2012-04-30 Thread Erick Erickson
Try attaching &debugQuery=on to your query and seeing if that helps
you understand what's going on. If that doesn't help, also look at
admin/analysis. If all that doesn't help, post your schema definition
for the field type and the results of &debugQuery=on (you might
look at: http://wiki.apache.org/solr/UsingMailingLists).

But my first guess is that you have the stemmer filter in front of
your WDDF filter, so your input is stemmed to something like
blackberri at index time, but if your stemmer is after WDDF at
query time, you search for blackberry.

Or you have phrases enabled and it's looking for blackberry right
next to 9810.

But those are guesses...

Best
Erick

On Mon, Apr 30, 2012 at 2:13 AM, abhayd  wrote:
> hi
>
> I am using solr.WordDelimiterFilterFactory for a text_en field at query
> time.
>
> My title for the document is: blackberry torch 9810
> My query torch9810 works fine:
> it splits the alphanumeric token and finds the document.
>
> But when the query is blackberry9810 it splits to blackberry 9810, but I
> don't get the document I mentioned above.
> If I change the query to blackberry 9810 (two words) I get the document.
>
> Can anyone explain what I'm doing wrong? When I query blackberry9810 I
> would like to get the same results as for blackberry 9810.
>
> thanks
> abhay
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/solr-WordDelimiterFilterFactory-query-time-tp3950045.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Solr: extracting/indexing HTML via cURL

2012-04-30 Thread okayndc
Hello,

Over the weekend I experimented with extracting HTML content via cURL, and
I'm wondering why the extraction/indexing process does not include the HTML
tags. It seems the HTML tags are either ignored or stripped somewhere in the
pipeline. If this is the case, is it possible to include the HTML tags? I
would like to keep the formatted HTML intact.
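
For reference, the kind of request involved (a typical SolrCell extract
call; the id and file name are placeholders):

    curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@page.html"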

Any help is greatly appreciated.


Re: solr.WordDelimiterFilterFactory query time

2012-04-30 Thread Jack Krupansky
When WDF filters blackberry9810 it will treat it as a sequence of tokens but 
as if it were a phrase, like "blackberry 9810", with the two terms adjacent, 
at least with the edismax query parser. I'm not sure what the other query 
parsers do.


If you are using edismax, you can set the QS (query slop) request parameter 
to 1 (rather than 0), so that blackberry9810 will be treated as "blackberry 
9810"~1 which means that an additional term can (optionally) be matched by 
the phrase query.


In other words, "blackberry9810"~1 would match blackberry torch 9810 as well 
as blackberry 9810.
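
For example, as a request (the qf field and core URL are illustrative):

    /solr/select?defType=edismax&q=blackberry9810&qs=1&qf=title&debugQuery=on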


-- Jack Krupansky

-Original Message- 
From: abhayd

Sent: Monday, April 30, 2012 2:13 AM
To: solr-user@lucene.apache.org
Subject: solr.WordDelimiterFilterFactory query time

[original message quoted earlier in this thread; trimmed]



Re: Java out of memory - with fieldcache faceting

2012-04-30 Thread Otis Gospodnetic
Hi,

Tell us more about:

* what you facet on
* how many facet values are in each facet
* how much RAM you have
* 32 or 64 bit
* -Xmx you are using
* faceting method you are using
* ...

Otis 

Performance Monitoring for Solr - 
http://sematext.com/spm/solr-performance-monitoring





>
> From: Yuval Dotan 
>To: solr-user@lucene.apache.org 
>Sent: Monday, April 30, 2012 4:31 AM
>Subject: Java out of memory - with fieldcache faceting
>
>Hi Guys
>I have a problem and I need your assistance
>I get an exception when doing field cache faceting:
>
>/solr/select?q=*:*&facet=true&facet.field=src_ip_str&facet.limit=10
>
>java.lang.OutOfMemoryError: Java heap space
>[full stack trace identical to the one quoted earlier in this thread; trimmed]

Re: solr.WordDelimiterFilterFactory query time

2012-04-30 Thread abhayd
hi Erick,
autoGeneratePhraseQueries="false" is set for the field type, and it works
fine with the standard query parser.

The problem seems to appear when I start using dismax. As you suggested, I
checked the analysis tool, and even after the word delimiter is applied I
see the search term as "blackberry 9801", so I don't think it's the stemmer.

here is the debug output (partial only):
---
q: blackberry9801

rawquerystring: blackberry9801
querystring: blackberry9801

parsedquery:
DisjunctionMaxQuery((click_terms:"blackberry 9801"^5.0 |
description:"blackberry 9801"^3.0 | displayName:"blackberry 9801"^15.0 |
displayNameEscaped:"blackberry 9801"^15.0 | manufacturer:"blackberry
9801"^10.0 | text_all:"blackberry 9801" | title:"blackberry 9801"^5.0)~0.01)

parsedquery_toString:
(click_terms:"blackberry 9801"^5.0 | description:"blackberry 9801"^3.0 |
displayName:"blackberry 9801"^15.0 | displayNameEscaped:"blackberry
9801"^15.0 | manufacturer:"blackberry 9801"^10.0 | text_all:"blackberry
9801" | title:"blackberry 9801"^5.0)~0.01
---

field definition
[schema fragment stripped by the list archive; the surviving attributes show
positionIncrementGap="100" autoGeneratePhraseQueries="false" on the field
type, solr.WordDelimiterFilterFactory with generateNumberParts="1"
generateWordParts="1" splitOnCaseChange="1" in both the index and query
analyzers, and a query-time synonym filter with ignoreCase="true"
synonyms="synonyms.txt"]

--

--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-WordDelimiterFilterFactory-query-time-tp3950045p3950922.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr.WordDelimiterFilterFactory query time

2012-04-30 Thread Erick Erickson
See Jack's comments about phrases, all your parsed
queries are phrases, and your indexed terms aren't
next to each other.

Best
Erick

On Mon, Apr 30, 2012 at 10:54 AM, abhayd  wrote:
> hi Erick,
> [rest of the message quoted in full above; trimmed]


Newbie question on sorting

2012-04-30 Thread Jacek
Hello all,

I'm facing this simple problem, yet it has proven impossible for me to
resolve (I'm a newbie in Solr).
I need to sort the results by score (that part is simple, of course), but
then I need to take the top 10 results and re-order them (only those top 10
results) by a date field.
It's not the same as sort=score,creationdate

Any suggestions will be greatly appreciated!
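
One way to do the re-ordering client-side, since only 10 documents are
involved (a minimal SolrJ sketch; the URL, query, and creationdate field
name are assumptions):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.Date;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class TopTenByDate {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("some query");
            q.setRows(10); // the default sort is by score, so this is the top 10
            List<SolrDocument> top10 =
                new ArrayList<SolrDocument>(server.query(q).getResults());
            Collections.sort(top10, new Comparator<SolrDocument>() {
                public int compare(SolrDocument a, SolrDocument b) {
                    Date da = (Date) a.getFieldValue("creationdate");
                    Date db = (Date) b.getFieldValue("creationdate");
                    return db.compareTo(da); // newest first
                }
            });
            // top10 now holds the 10 most relevant docs, ordered by date
        }
    }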


Re: solr.WordDelimiterFilterFactory query time

2012-04-30 Thread Jack Krupansky
The &qs=1 request parameter should work for the dismax query parser as well 
as edismax.


-- Jack Krupansky

-Original Message- 
From: Erick Erickson

Sent: Monday, April 30, 2012 10:58 AM
To: solr-user@lucene.apache.org
Subject: Re: solr.WordDelimiterFilterFactory query time

See Jack's comments about phrases, all your parsed
queries are phrases, and your indexed terms aren't
next to each other.

Best
Erick

On Mon, Apr 30, 2012 at 10:54 AM, abhayd  wrote:

hi Erick,
[message quoted in full earlier in this thread; trimmed]




Re: solr.WordDelimiterFilterFactory query time

2012-04-30 Thread abhayd
hi jack & erick,
Thanks 
I do have qs set in the dismax query handler settings in solrconfig:

qs: 10

Still does not work

abhay

--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-WordDelimiterFilterFactory-query-time-tp3950045p3951038.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr: extracting/indexing HTML via cURL

2012-04-30 Thread Jack Krupansky
If by "extracting HTML content via cURL" you mean using SolrCell to parse 
html files, this seems to make sense. The sequence is that regardless of the 
file type, each file extraction "parser" will strip off all formatting and 
produce a raw text stream. Office, PDF, and HTML files are all treated the 
same in that way. Then, the unformatted text stream is sent through the 
field type analyzers to be tokenized into terms that Lucene can index. The 
input string to the field type analyzer is what gets stored for the field, 
but this occurs after the extraction file parser has already removed 
formatting.


No way for the formatting to be preserved in that case, other than to go 
back to the original input document before extraction parsing.


If you really do want to preserve full HTML formatted text, you would need 
to define a field whose field type uses the HTMLStripCharFilter and then 
directly add documents that direct the raw HTML to that field.


There may be some other way to hook into the update processing chain, but 
that may be too much effort compared to the HTML strip filter.
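For illustration, a minimal sketch of such a field type (the type and field names here are made up, not from this thread): the char filter strips tags only from the token stream that gets indexed, while the stored value keeps whatever raw HTML was sent in.

<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- strip markup before tokenizing; only the indexed terms are affected -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="body_html" type="text_html" indexed="true" stored="true"/>

Documents would then have to be posted directly (e.g. as <add><doc> XML) with the raw HTML in body_html, rather than going through the extraction handler.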


-- Jack Krupansky

-Original Message- 
From: okayndc

Sent: Monday, April 30, 2012 10:07 AM
To: solr-user@lucene.apache.org
Subject: Solr: extracting/indexing HTML via cURL

Hello,

Over the weekend I experimented with extracting HTML content via cURL, and
I'm wondering why the extraction/indexing process does not include the HTML
tags.
It seems as though the HTML tags are either being ignored or stripped
somewhere in the pipeline.
If this is the case, is it possible to include the HTML tags, as I would
like to keep the formatted HTML intact?

Any help is greatly appreciated. 



Re: solr.WordDelimiterFilterFactory query time

2012-04-30 Thread abhayd
hi jack,
tried &qs=10 but unfortunately it does not seem to help.

Not sure what else could be wrong

abhay




Re: solr.WordDelimiterFilterFactory query time

2012-04-30 Thread Jack Krupansky
Just to be clear, I used the Solr example schema and indexed two test 
documents, one with "Blackberry 9810" and one with "Blackberry torch 9810" 
in the sku field (which uses field type text_en_splitting_tight which uses 
WDF) and the following query returns both documents:


http://localhost:8983/solr/select/?q=blackberry9810&debug=true&qs=1&qf=sku&defType=dismax

You might try the same and at least verify that it is working for you.
With &qs=0 it returns only one document.


Assuming that the example works fine for you, that suggests that there is
something else going on with your analyzer. There might be a
difference between the WDF settings for the index and query analyzers. You
should attach your schema and config as well as the &debug output.


-- Jack Krupansky

-Original Message- 
From: abhayd

Sent: Monday, April 30, 2012 11:56 AM
To: solr-user@lucene.apache.org
Subject: Re: solr.WordDelimiterFilterFactory query time

hi jack,
tried &qs=10 but unfortunately it does not seem to help.

Not sure what else could be wrong

abhay




Re: change index/store at indexing time

2012-04-30 Thread Vazquez, Maria (STM)
Thanks Erick.
I'm not concerned about the logic, all I want to achieve is sometimes
storing/indexing a multi-valued field and sometimes not (same field with
same name) based on some logic. In a transformer I cannot change the
schema dynamically to do that, not that I know of at least.
So if I define that field in the schema.xml I guess I'm stuck with the
store/indexed true/false and cannot change that at indexing time.
In Lucene you can just do what I had in my original email and the field
will be sometimes stored/indexed and sometimes not without a problem.

Does that make sense?
Thanks,
Maria



Maria Vazquez  |  Manager Software Engineering  |  Dex One
Phone: 310.586.4157

www.DexOne.com   |  www.DexKnows.com

On 4/30/12 6:08 AM, "Erick Erickson"  wrote:

>Your idea of using a Transformer will work just fine, you have a lot more
>flexibility in a custom Transformer, see:
>http://wiki.apache.org/solr/DIHCustomTransformer
>
>You could also write a custom update handler that examined the
>document on the server side and implemented your logic, or even
>just add an element into an "update chain".
>
>None of these is very hard, but they can be a bit intimidating
>when starting from scratch.
>
>Best
>Erick
>
>On Sun, Apr 29, 2012 at 12:53 PM, Vazquez, Maria (STM)
> wrote:
>> Thanks for your response.
>> That's what I need, changes at indexing time. Dynamic fields are not
>>what I need because the field name is the same, I just need to change if
>>they indexed/stored based on some logic.
>> It's so easily achieved with the Lucene API, I was sure there was a way
>>to do the same in Solr.
>>
>>
>> On Apr 28, 2012, at 22:34, "Jeevanandam"  wrote:
>>
>>> Maria,
>>>
>>> thanks for detailed explanation.
>>> as per schema.xml; stored or indexed should be defined at design-time.
>>>Per my understanding defining at runtime is not feasible.
>>> BTW, you can have multiValued="true" attribute for dynamic fields too.
>>>
>>> - Jeevanandam
>>>
>>> On 29-04-2012 2:06 am, Vazquez, Maria (STM) wrote:
 Thanks Jeevanandam.
 That still doesn't have the same behavior as Lucene since multiple
 fields with different names have to be created.
 What I want is this exactly (multi-value field)

 document.add(new Field("geoids", geoId, Field.Store.YES,
 Field.Index.NOT_ANALYZED_NO_NORMS));

 document.add(new Field("geoids", geoId, Field.Store.NO,
 Field.Index.NOT_ANALYZED_NO_NORMS));

 In Lucene I can save geoids first as stored and in the next line as
 not stored and it will do exactly that. I want to duplicate this
 behavior in Solr but I can't do it having only one field in the schema
 called geoids that I can manipulate at index time whether to store or
 not depending on a condition.

 Thanks again for the help, hope this explanation makes it more clear
 in what I'm trying to do.

 Maria

 On Apr 28, 2012, at 11:49 AM, "Jeevanandam"
 mailto:je...@myjeeva.com>> wrote:

 Maria,

 For your need please define unique pattern using dynamic field in
schema.xml

 Please have a look
http://wiki.apache.org/solr/SchemaXml#Dynamic_fields

 Hope that helps!

 -Jeevanandam

 Technology keeps you connected!

 On Apr 28, 2012, at 10:33 PM, "Vazquez, Maria (STM)"
 mailto:maria.vazq...@dexone.com>> wrote:

 I can call a script for the logic part but what I want to figure out
 is how to save the same field sometimes as stored and indexed,
 sometimes as stored not indexed, etc. From a transformer or a script I
 didn't see anything where I can modify that at indexing time.
 Thanks a lot,
 Maria


 On Apr 27, 2012, at 18:38, "Bill Bell"
 mailto:billnb...@gmail.com>> wrote:

 Yes you can. Just use a script that is called for each row.

 Bill Bell
 Sent from mobile


 On Apr 27, 2012, at 6:38 PM, "Vazquez, Maria (STM)"
 mailto:maria.vazq...@dexone.com>> wrote:

 Hi,
 I'm migrating a project from Lucene 2.9 to Solr 3.4.
 There is a special case in the code that indexes the same field in
 two different ways, which is completely legal in Lucene directly but I
 don't know how to duplicate this same behavior in Solr:

 if (isFirstGeo) {
 document.add(new Field("geoids", geoId, Field.Store.YES,
 Field.Index.NOT_ANALYZED_NO_NORMS));
 isFirstGeo = false;
 } else {
 if (countProducts < 100)
  document.add(new Field("geoids", geoId, Field.Store.NO,
 Field.Index.NOT_ANALYZED_NO_NORMS));
 else
  document.add(new Field("geoids", geoId, Field.Store.YES,
 Field.Index.NO));
 }

 Is there any way to do this in Solr in a Tranformer? I'm using the
 DIH to index and I can't see a way to do this other than having three
 fields in the schema like geoids_s

Re: Scaling Solr - Suggestions !!

2012-04-30 Thread Sujatha Arun
I was copying the indexes from the webapp to the cores when this happened. It
could have been an error on my end, but I'm just worried that an issue with
one core would reflect on the whole webapp.

Regards
Sujatha

On Mon, Apr 30, 2012 at 7:20 PM, Erick Erickson wrote:

> I'd get to the root of why indexes are corrupt! This should
> be very unusual. If you're seeing this at all frequently,
> it indicates something is very wrong and starting bunches
> of JVMs up is a band-aid over a much more serious
> problem.
>
> Are you, by chance, doing a kill -9? or other hard-abort?
>
> Best
> Erick
>
> On Mon, Apr 30, 2012 at 12:22 AM, Sujatha Arun 
> wrote:
> > Now the reason I have used different webapps instead of a single one for
> > the cores is: while prototyping, I discovered that when one core's index
> > is corrupt, the entire webapp does not start up, and the same must be
> > true of "too many open files" etc. That is to say, if there is an issue
> > with any one core [schema/index], the entire webapp does not start up.
> >
> > Thanks for your suggestion.
> >
> >
> > Regards
> > Sujatha
> >
> >
> >
> >
> >
> >
> >
> > On Sat, Apr 28, 2012 at 6:49 PM, Michael Della Bitta <
> > michael.della.bi...@appinions.com> wrote:
> >
> >> Just my opinion, but I'm not sure I see the value in deploying the cores
> >> to different webapps in a single container on a single machine to avoid
> >> a single point of failure... You still have a single point of failure at
> >> the process level down to the hardware, which when you think about it,
> >> is mostly everything. But perhaps you're at least using more than one
> >> container.
> >>
> >> It sounds to me that the easiest route to scalability for you would be
> >> to add more machines. Unless your cores are particularly complex or your
> >> traffic is heavy, a 3GB core should be no match for a single machine.
> >> And the traffic problem can be solved by replication and load balancing.
> >>
> >> Michael
> >>
> >> On Sat, 2012-04-28 at 13:24 +0530, Sujatha Arun wrote:
> >> > Hello,
> >> >
> >> > *Background*: For each of our customers, we create 3 Solr webapps
> >> > with different search schemas, serving different search requirements,
> >> > and we have about 70 customers. So we have about 210 webapps currently.
> >> >
> >> > *Hardware*: single server, one JVM, heap memory 19GB, total RAM
> >> > 32GB, permgen initially 1GB, now increased to 2GB.
> >> >
> >> > *Solr Indexes*: most are on the order of a few MB, plus about 2 big
> >> > indexes of about 3GB each.
> >> >
> >> > *Scaling Step 1*: We saw the permgen value go up to nearly 850 MB
> >> > when we created so many webapps, hence we are now moving to Solr cores
> >> > and we are going to have about 50 cores per webapp, bringing the number
> >> > of webapps to about 5. We want to distribute the cores across multiple
> >> > webapps to avoid a single point of failure.
> >> >
> >> >
> >> > *Requirements*:
> >> >
> >> >   - We need to scale horizontally only the cores whose index sizes
> >> >     are big.
> >> >   - We also require permission-based search for each webapp. Would
> >> >     Solr NRT fit our needs, where we can index the permissions into
> >> >     the documents? That would mean frequent addition and deletion of
> >> >     permissions on documents across cores.
> >> >   - We also require automatic failover.
> >> >
> >> > What technology would be an ideal fit, given SolrCloud, Katta,
> >> > Solandra, Lily, ElasticSearch, etc. [preferably open source; we would
> >> > be required to maintain many webapps with multicores], and what about
> >> > the commercial offerings, given our use case?
> >> >
> >> > Thanks.
> >> >
> >> > Regards,
> >> > Sujatha
> >>
> >>
> >>
>


Solr logo for print

2012-04-30 Thread Otis Gospodnetic
Hi,

I'm trying to find a Solr logo in a vector or some other format suitable for 
print.  I found Lucene logo 
at http://svn.apache.org/repos/asf/lucene/site/publish/images/logo.eps , but 
can't find one for Solr.  Does anyone know where to find it?

At the bottom of  http://wiki.apache.org/solr/PublicServers I found a link 
to https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images/ ,
 but that leads to 404.


Can't find anything in svn either:

$ find . -name \*image\* | xargs ls -l | grep -i solr
-rw-r--r-- 1 otis otis 14993 2011-01-04 00:32 
./solr/src/site/build/site/images/solr-book-image.jpg
./solr/src/site/build/site/images:
-rw-r--r-- 1 otis otis 14993 2011-01-04 00:32 solr-book-image.jpg
-rw-r--r-- 1 otis otis 12719 2011-01-04 00:32 solr.jpg


Thanks,
Otis

Performance Monitoring for Solr - 
http://sematext.com/spm/solr-performance-monitoring


Re: change index/store at indexing time

2012-04-30 Thread Erick Erickson
OK, I took another look at what you were trying to
accomplish and I find the use case kind of hard to
figure out, but that's my problem.

But it is true that there's really no good way to _change_ the
way the field is analyzed in Solr. Of course since Solr is
built on Lucene, you could to a lot of work and bypass the
analysis chain, but that seems like more work than this is
really worth. So using multiple fields seems like the way
to go.

This is pretty unusual. You have a field that you can
> search and display the first value
> search but not display values 2-99
> display but not search values 100-*

Nobody's thought about how to specify this in a schema file,
and it appears to be an unusual enough use case that
nobody has proposed anything like it that I know of.

Best
Erick

On Mon, Apr 30, 2012 at 1:11 PM, Vazquez, Maria (STM)
 wrote:
> Thanks Erick.
> I'm not concerned about the logic, all I want to achieve is sometimes
> storing/indexing a multi-valued field and sometimes not (same field with
> same name) based on some logic. In a transformer I cannot change the
> schema dynamically to do that, not that I know of at least.
> So if I define that field in the schema.xml I guess I'm stuck with the
> store/indexed true/false and cannot change that at indexing time.
> In Lucene you can just do what I had in my original email and the field
> will be sometimes stored/indexed and sometimes not without a problem.
>
> Does that make sense?
> Thanks,
> Maria
>
>
>
> Maria Vazquez  |  Manager Software Engineering  |  Dex One
> Phone: 310.586.4157
>
> www.DexOne.com   |  www.DexKnows.com
>
> On 4/30/12 6:08 AM, "Erick Erickson"  wrote:
>
>>Your idea of using a Transformer will work just fine, you have a lot more
>>flexibility in a custom Transformer, see:
>>http://wiki.apache.org/solr/DIHCustomTransformer
>>
>>You could also write a custom update handler that examined the
>>document on the server side and implemented your logic, or even
>>just add an element into an "update chain".
>>
>>None of these is very hard, but they can be a bit intimidating
>>when starting from scratch.
>>
>>Best
>>Erick
>>
>>On Sun, Apr 29, 2012 at 12:53 PM, Vazquez, Maria (STM)
>> wrote:
>>> Thanks for your response.
>>> That's what I need, changes at indexing time. Dynamic fields are not
>>>what I need because the field name is the same, I just need to change if
>>>they indexed/stored based on some logic.
>>> It's so easily achieved with the Lucene API, I was sure there was a way
>>>to do the same in Solr.
>>>
>>>
>>> On Apr 28, 2012, at 22:34, "Jeevanandam"  wrote:
>>>
 Maria,

 thanks for detailed explanation.
 as per schema.xml; stored or indexed should be defined at design-time.
Per my understanding defining at runtime is not feasible.
 BTW, you can have multiValued="true" attribute for dynamic fields too.

 - Jeevanandam

 On 29-04-2012 2:06 am, Vazquez, Maria (STM) wrote:
> Thanks Jeevanandam.
> That still doesn't have the same behavior as Lucene since multiple
> fields with different names have to be created.
> What I want is this exactly (multi-value field)
>
> document.add(new Field("geoids", geoId, Field.Store.YES,
> Field.Index.NOT_ANALYZED_NO_NORMS));
>
> document.add(new Field("geoids", geoId, Field.Store.NO,
> Field.Index.NOT_ANALYZED_NO_NORMS));
>
> In Lucene I can save geoids first as stored and in the next line as
> not stored and it will do exactly that. I want to duplicate this
> behavior in Solr but I can't do it having only one field in the schema
> called geoids that I can manipulate at index time whether to store or
> not depending on a condition.
>
> Thanks again for the help, hope this explanation makes it more clear
> in what I'm trying to do.
>
> Maria
>
> On Apr 28, 2012, at 11:49 AM, "Jeevanandam"
> mailto:je...@myjeeva.com>> wrote:
>
> Maria,
>
> For your need please define unique pattern using dynamic field in
>schema.xml
>
> Please have a look
>http://wiki.apache.org/solr/SchemaXml#Dynamic_fields
>
> Hope that helps!
>
> -Jeevanandam
>
> Technology keeps you connected!
>
> On Apr 28, 2012, at 10:33 PM, "Vazquez, Maria (STM)"
> mailto:maria.vazq...@dexone.com>> wrote:
>
> I can call a script for the logic part but what I want to figure out
> is how to save the same field sometimes as stored and indexed,
> sometimes as stored not indexed, etc. From a transformer or a script I
> didn't see anything where I can modify that at indexing time.
> Thanks a lot,
> Maria
>
>
> On Apr 27, 2012, at 18:38, "Bill Bell"
> mailto:billnb...@gmail.com>> wrote:
>
> Yes you can. Just use a script that is called for each row.
>
> Bill Bell
> Sent from m

Re: Ampersand issue

2012-04-30 Thread William Bell
One idea was to wrap the field with CDATA. Or base64 encode it.
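As a sketch of the CDATA idea on the way in (field names here are hypothetical): CDATA only avoids escaping at index time; the stored value is plain characters, and the XML response writer will still entity-escape it on output.

<add>
  <doc>
    <field name="id">42</field>
    <field name="payload_xml"><![CDATA[<order><id>42</id><total>9.99</total></order>]]></field>
  </doc>
</add>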



On Fri, Apr 27, 2012 at 7:50 PM, Bill Bell  wrote:
> We are indexing a simple XML field from SQL Server into Solr as a stored
> field. We have noticed that the & is output as &amp; when using
> wt=XML. When using wt=JSON we get the normal &. Is there a way to
> indicate that we don't want to encode the field, since it is already XML,
> when using wt=XML?
>
> Bill Bell
> Sent from mobile
>



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: change index/store at indexing time

2012-04-30 Thread Lee Carroll
Vazquez,
Sorry I don't have an answer but I'd love to know what you need this for :-)

I think the logic is going to have to bleed into your search app. In
short: copy the field, and your app knows which one to search in.

lee c
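A rough sketch of that copy-field arrangement (the field names are hypothetical): one copy is searchable but not stored, the other stored but not searchable, and the application decides which one to use. Note this gives per-field rather than per-value control, which is why the remaining logic bleeds into the app.

<field name="geoids" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="geoids_display" type="string" indexed="false" stored="true" multiValued="true"/>
<copyField source="geoids" dest="geoids_display"/>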



On 30 April 2012 20:41, Erick Erickson  wrote:
> OK, I took another look at what you were trying to
> accomplish and I find the use case kind of hard to
> figure out, but that's my problem.
>
> But it is true that there's really no good way to _change_ the
> way the field is analyzed in Solr. Of course, since Solr is
> built on Lucene, you could do a lot of work and bypass the
> analysis chain, but that seems like more work than this is
> really worth. So using multiple fields seems like the way
> to go.
>
> This is pretty unusual. You have a field that you can
>> search and display the first value
>> search but not display values 2-99
>> display but not search values 100-*
>
> Nobody's thought about how to specify this in a schema file,
> and it appears to be an unusual enough use case that
> nobody has proposed anything like it that I know of.
>
> Best
> Erick
>
> On Mon, Apr 30, 2012 at 1:11 PM, Vazquez, Maria (STM)
>  wrote:
>> Thanks Erick.
>> I'm not concerned about the logic, all I want to achieve is sometimes
>> storing/indexing a multi-valued field and sometimes not (same field with
>> same name) based on some logic. In a transformer I cannot change the
>> schema dynamically to do that, not that I know of at least.
>> So if I define that field in the schema.xml I guess I'm stuck with the
>> store/indexed true/false and cannot change that at indexing time.
>> In Lucene you can just do what I had in my original email and the field
>> will be sometimes stored/indexed and sometimes not without a problem.
>>
>> Does that make sense?
>> Thanks,
>> Maria
>>
>>
>>
>> Maria Vazquez  |  Manager Software Engineering  |  Dex One
>> Phone: 310.586.4157
>>
>> www.DexOne.com   |  www.DexKnows.com
>>
>> On 4/30/12 6:08 AM, "Erick Erickson"  wrote:
>>
>>>Your idea of using a Transformer will work just fine, you have a lot more
>>>flexibility in a custom Transformer, see:
>>>http://wiki.apache.org/solr/DIHCustomTransformer
>>>
>>>You could also write a custom update handler that examined the
>>>document on the server side and implemented your logic, or even
>>>just add an element into an "update chain".
>>>
>>>None of these is very hard, but they can be a bit intimidating
>>>when starting from scratch.
>>>
>>>Best
>>>Erick
>>>
>>>On Sun, Apr 29, 2012 at 12:53 PM, Vazquez, Maria (STM)
>>> wrote:
 Thanks for your response.
 That's what I need, changes at indexing time. Dynamic fields are not
what I need because the field name is the same, I just need to change if
they indexed/stored based on some logic.
 It's so easily achieved with the Lucene API, I was sure there was a way
to do the same in Solr.


 On Apr 28, 2012, at 22:34, "Jeevanandam"  wrote:

> Maria,
>
> thanks for detailed explanation.
> as per schema.xml; stored or indexed should be defined at design-time.
>Per my understanding defining at runtime is not feasible.
> BTW, you can have multiValued="true" attribute for dynamic fields too.
>
> - Jeevanandam
>
> On 29-04-2012 2:06 am, Vazquez, Maria (STM) wrote:
>> Thanks Jeevanandam.
>> That still doesn't have the same behavior as Lucene since multiple
>> fields with different names have to be created.
>> What I want is this exactly (multi-value field)
>>
>> document.add(new Field("geoids", geoId, Field.Store.YES,
>> Field.Index.NOT_ANALYZED_NO_NORMS));
>>
>> document.add(new Field("geoids", geoId, Field.Store.NO,
>> Field.Index.NOT_ANALYZED_NO_NORMS));
>>
>> In Lucene I can save geoids first as stored and in the next line as
>> not stored and it will do exactly that. I want to duplicate this
>> behavior in Solr but I can't do it having only one field in the schema
>> called geoids that I can manipulate at index time whether to store or
>> not depending on a condition.
>>
>> Thanks again for the help, hope this explanation makes it more clear
>> in what I'm trying to do.
>>
>> Maria
>>
>> On Apr 28, 2012, at 11:49 AM, "Jeevanandam"
>> mailto:je...@myjeeva.com>> wrote:
>>
>> Maria,
>>
>> For your need please define unique pattern using dynamic field in
>>schema.xml
>>
>> Please have a look
>>http://wiki.apache.org/solr/SchemaXml#Dynamic_fields
>>
>> Hope that helps!
>>
>> -Jeevanandam
>>
>> Technology keeps you connected!
>>
>> On Apr 28, 2012, at 10:33 PM, "Vazquez, Maria (STM)"
>> mailto:maria.vazq...@dexone.com>> wrote:
>>
>> I can call a script for the logic part but what I want to figure out
>> is how to save the s

Re: Solr logo for print

2012-04-30 Thread Dan Tuffery
Try this one:

http://www.lucidimagination.com/sites/default/files/image/solr_logo_rgb.png

Dan

On Mon, Apr 30, 2012 at 8:38 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hi,
>
> I'm trying to find a Solr logo in a vector or some other format suitable
> for print.  I found Lucene logo at
> http://svn.apache.org/repos/asf/lucene/site/publish/images/logo.eps , but
> can't find one for Solr.  Does anyone know where to find it?
>
> At the bottom of  http://wiki.apache.org/solr/PublicServers I found a
> link to
> https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images/
>  ,
> but that leads to 404.
>
>
> Can't find anything in svn either:
>
> $ find . -name \*image\* | xargs ls -l | grep -i solr
> -rw-r--r-- 1 otis otis 14993 2011-01-04 00:32
> ./solr/src/site/build/site/images/solr-book-image.jpg
> ./solr/src/site/build/site/images:
> -rw-r--r-- 1 otis otis 14993 2011-01-04 00:32 solr-book-image.jpg
> -rw-r--r-- 1 otis otis 12719 2011-01-04 00:32 solr.jpg
>
>
> Thanks,
> Otis
> 
> Performance Monitoring for Solr -
> http://sematext.com/spm/solr-performance-monitoring
>


post.jar failing

2012-04-30 Thread William Bell
I am getting a post.jar failure when trying to post the following
CDATA field... It used to work on older versions. This is in Solr 3.6.



<add>
<doc>
  <field name="id">SP2514N</field>
  <field name="name">Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133</field>
  <field name="manu">Samsung Electronics Co. Ltd.</field>
  <field name="cat">electronics</field>
  <field name="cat">hard drive</field>
  <field name="features">7200RPM, 8MB cache, IDE Ultra ATA-133</field>
  <field name="features">NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB) motor</field>
  <field name="price">92</field>
  <field name="popularity">6</field>
  <field name="inStock">true</field>
  <field name="address_xml"><![CDATA[...]]></field>
  <field name="manufacturedate_dt">2006-02-13T15:26:37Z</field>
  <field name="store">35.0752,-97.032</field>
</doc>
</add>




Apr 30, 2012 1:53:49 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ERROR: [doc=SP2514N] Error adding
field 'address_xml'='

MEDSCH

UNIVERSITY OF COLORADO SCHOOL OF MEDICINE
1974
MD


'


-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Solr logo for print

2012-04-30 Thread Lukáš Vlček
Otis,

I think there was some JIRA ticket (the logo contest or something like that)
which might have all the logo proposals, including the winning one,
attached.

Regards,
Lukas

On Mon, Apr 30, 2012 at 9:38 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hi,
>
> I'm trying to find a Solr logo in a vector or some other format suitable
> for print.  I found Lucene logo at
> http://svn.apache.org/repos/asf/lucene/site/publish/images/logo.eps , but
> can't find one for Solr.  Does anyone know where to find it?
>
> At the bottom of  http://wiki.apache.org/solr/PublicServers I found a
> link to
> https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images/
>  ,
> but that leads to 404.
>
>
> Can't find anything in svn either:
>
> $ find . -name \*image\* | xargs ls -l | grep -i solr
> -rw-r--r-- 1 otis otis 14993 2011-01-04 00:32
> ./solr/src/site/build/site/images/solr-book-image.jpg
> ./solr/src/site/build/site/images:
> -rw-r--r-- 1 otis otis 14993 2011-01-04 00:32 solr-book-image.jpg
> -rw-r--r-- 1 otis otis 12719 2011-01-04 00:32 solr.jpg
>
>
> Thanks,
> Otis
> 
> Performance Monitoring for Solr -
> http://sematext.com/spm/solr-performance-monitoring
>


RE: CJKBigram filter questions: single character queries, bigrams created across script/character types

2012-04-30 Thread Burton-West, Tom
Thanks wunder and Lance,

In the discussions I've seen of Japanese IR in the English language IR 
literature, Hiragana is either removed or strings are segmented first by 
character class.  I'm interested in finding out more about why bigramming 
across classes is desirable.
Based on my limited understanding of Japanese, I can see how perhaps bigramming 
a Han and Hiragana character might make sense but what about Han and Katakana?

Lance, how did you weight the unigram vs bigram fields for CJK? or did you just 
OR them together assuming that idf will give the bigrams more weight?

Tom
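For readers following along: one way such a weighting is commonly expressed is through dismax qf boosts, e.g. in solrconfig.xml. The field names and boosts below are purely illustrative, not Lance's actual setup.

<requestHandler name="/cjk" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">text_cjk_bigrams^2.0 text_cjk_unigrams^0.5</str>
  </lst>
</requestHandler>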



correct XPATH syntax

2012-04-30 Thread Twomey, David

Is this possible in DataImportHandler

I want the following XML to all collapse into one Author field

<AuthorList>
 <Author>
  <LastName>Sørlie</LastName>
  <ForeName>T</ForeName>
  <Initials>T</Initials>
 </Author>
 <Author>
  <LastName>Perou</LastName>
  <ForeName>C M</ForeName>
  <Initials>CM</Initials>
 </Author>
 <Author>
  <LastName>Tibshirani</LastName>
  <ForeName>R</ForeName>
  <Initials>R</Initials>
 </Author>
...
</AuthorList>

So my XPATH is like:

<field ... xpath="/MedlineCitationSet/MedlineCitation/AuthorList/??"
commonField="true" />

Re: solr.WordDelimiterFilterFactory query time

2012-04-30 Thread abhayd
hi jack,

thanks, I figured out the issue. It was the settings at query and index time.
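For anyone who hits this later: the usual culprit is WordDelimiterFilterFactory parameters that differ between the index-time and query-time analyzers. A sketch of a consistent pair (the field type name and exact flags are illustrative, not abhay's actual schema): catenate at index time, split-only at query time.

<fieldType name="text_tight" class="solr.TextField" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>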





Re: correct XPATH syntax

2012-04-30 Thread Twomey, David
Sorry, hit send too soon. The email continues below.

On 4/30/12 4:46 PM, "Twomey, David"  wrote:

>
>Is this possible in DataImportHandler
>
>I want the following XML to all collapse into one multi-valued Author field
>
><AuthorList>
> <Author>
>  <LastName>Sørlie</LastName>
>  <ForeName>T</ForeName>
>  <Initials>T</Initials>
> </Author>
> <Author>
>  <LastName>Perou</LastName>
>  <ForeName>C M</ForeName>
>  <Initials>CM</Initials>
> </Author>
> <Author>
>  <LastName>Tibshirani</LastName>
>  <ForeName>R</ForeName>
>  <Initials>R</Initials>
> </Author>
>...
></AuthorList>
>
>So my XPATH is like:
><field ... xpath="/MedlineCitationSet/MedlineCitation/AuthorList/??"
>commonField="true" />

>



Upgrading to 3.6 broke cachedsqlentityprocessor

2012-04-30 Thread Brent Mills
I've read some things in JIRA about the new caching functionality in the
DIH, but I wouldn't think it should break the old behavior. It doesn't look
as though any errors are being thrown; it's just ignoring the caching part
and opening a ton of connections. Also, I cannot find any documentation on
the new functionality, so I'm not sure what syntax is valid and what's not.
Here is my entity that worked in 3.1 but no longer works in 3.6:

[entity definition stripped by the mailing list archive]

Re: Solr logo for print

2012-04-30 Thread Otis Gospodnetic
Thanks Lukas.  Yeah, I looked there, but as far as I can tell, all attachments 
are PNGs/GIFs/JPGs :(

Otis
 

Performance Monitoring for Solr - http://sematext.com/spm/index.html



>
> From: Lukáš Vlček 
>To: solr-user@lucene.apache.org; Otis Gospodnetic  
>Sent: Monday, April 30, 2012 4:23 PM
>Subject: Re: Solr logo for print
> 
>
>Otis,
>
>
>I think there was some JIRA ticket (the logo contest or something like that) 
>which might have all the logo proposals, including the winning one, attached.
>
>
>Regards,
>Lukas
>
>
>On Mon, Apr 30, 2012 at 9:38 PM, Otis Gospodnetic  
>wrote:
>
>Hi,
>>
>>I'm trying to find a Solr logo in a vector or some other format suitable for 
>>print.  I found Lucene logo 
>>at http://svn.apache.org/repos/asf/lucene/site/publish/images/logo.eps , but 
>>can't find one for Solr.  Does anyone know where to find it?
>>
>>At the bottom of  http://wiki.apache.org/solr/PublicServers I found a link 
>>to https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images/ ,
>> but that leads to 404.
>>
>>
>>Can't find anything in svn either:
>>
>>$ find . -name \*image\* | xargs ls -l | grep -i solr
>>-rw-r--r-- 1 otis otis 14993 2011-01-04 00:32 
>>./solr/src/site/build/site/images/solr-book-image.jpg
>>./solr/src/site/build/site/images:
>>-rw-r--r-- 1 otis otis 14993 2011-01-04 00:32 solr-book-image.jpg
>>-rw-r--r-- 1 otis otis 12719 2011-01-04 00:32 solr.jpg
>>
>>
>>Thanks,
>>Otis
>>
>>Performance Monitoring for Solr - 
>>http://sematext.com/spm/solr-performance-monitoring
>>
>
>
>

Re: CJKBigram filter questions: single character queries, bigrams created across script/character types

2012-04-30 Thread Walter Underwood
You'll see katakana used with kanji in noun compounds where one of the words is 
foreign.

In Japanese, "Rice University" is not written with the kanji word for "rice". 
They use katakana for "rice" and kanji for "university", like this: ライス大学.

This is very common. I expect that "President Obama" uses kanji for the title 
and katakana for "Obama".

Removing hiragana is a bad idea. There are some words that are only written in 
hiragana.

wunder

On Apr 30, 2012, at 1:27 PM, Burton-West, Tom wrote:

> Thanks wunder and Lance,
> 
> In the discussions I've seen of Japanese IR in the English language IR 
> literature, Hiragana is either removed or strings are segmented first by 
> character class.  I'm interested in finding out more about why bigramming 
> across classes is desirable.
> Based on my limited understanding of Japanese, I can see how perhaps 
> bigramming a Han and Hiragana character might make sense but what about Han 
> and Katakana?
> 
> Lance, how did you weight the unigram vs bigram fields for CJK? or did you 
> just OR them together assuming that idf will give the bigrams more weight?
> 
> Tom
> 





Re: solr.WordDelimiterFilterFactory query time

2012-04-30 Thread Jack Krupansky
Great. But could you tell us all what settings you had wrong and how you 
changed them so that somebody else with the problem searching the email 
archive will be able to see your solution? Thanks.


-- Jack Krupansky

-Original Message- 
From: abhayd

Sent: Monday, April 30, 2012 4:51 PM
To: solr-user@lucene.apache.org
Subject: Re: solr.WordDelimiterFilterFactory query time

hi jack,

thanks, i figured out the issue. It was settings during query and index time






Setting margin in SimpleFragListBuilder from solrconfig

2012-04-30 Thread tobias roth
Hi,

Can I set the constructor parameter "margin" of SimpleFragListBuilder
from within solrconfig.xml?

I would suspect that something has to be added to this configuration
element in solrconfig.xml:

<fragListBuilder name="simple" default="true"
                 class="solr.highlight.SimpleFragListBuilder"/>
But what and how exactly?
(I'm using solr 3.5 at the moment)

Thanks,
Tobi


Proper commit behavior in a multi-writer environment

2012-04-30 Thread Adam Fields
If I have 40 writers all feeding the same index, do they all have to commit, or 
just one of them?

Am I going to kill performance if they're all issuing individual commits, or 
would it be better to not have the individual writers commit at all and just 
have one process that does nothing but commit every minute or so?

What happens to docs that are added to a solr index but not committed? Are they 
just held in a queue until the next commit by anyone?


--
- Adam
--
If you liked this email, you might also like:
"Speed up the Dock animation in Mac OS" 
-- http://workstuff.tumblr.com/post/21802026016
"How I came to love Jamie Zawinski" 
-- http://www.aquick.org/blog/2011/11/29/how-i-came-to-love-jamie-zawinski/
"Hudson Streetlights" 
-- http://www.flickr.com/photos/fields/6763398935/
"fields: MillionShort is a search engine that removes the top most popular 
site..." 
-- http://twitter.com/fields/statuses/197057229403336704
--
** I design intricate-yet-elegant processes for user and machine problems.
** Custom development project broken? Contact me, I can help.
** Some of what I do: http://workstuff.tumblr.com/post/70505118/aboutworkstuff

[ http://www.adamfields.com/resume.html ].. Experience
[ http://www.morningside-analytics.com ] .. Latest Venture
[ http://www.confabb.com ]  Founder




Re: Solr: extracting/indexing HTML via cURL

2012-04-30 Thread okayndc
Great, thank you for the input.  My understanding of HTMLStripCharFilter is
that it strips HTML tags, which is not what I want ~ is this correct?  I
want to keep the HTML tags intact.

On Mon, Apr 30, 2012 at 11:55 AM, Jack Krupansky wrote:

> If by "extracting HTML content via cURL" you mean using SolrCell to parse
> html files, this seems to make sense. The sequence is that regardless of
> the file type, each file extraction "parser" will strip off all formatting
> and produce a raw text stream. Office, PDF, and HTML files are all treated
> the same in that way. Then, the unformatted text stream is sent through the
> field type analyzers to be tokenized into terms that Lucene can index. The
> input string to the field type analyzer is what gets stored for the field,
> but this occurs after the extraction file parser has already removed
> formatting.
>
> No way for the formatting to be preserved in that case, other than to go
> back to the original input document before extraction parsing.
>
> If you really do want to preserve full HTML formatted text, you would need
> to define a field whose field type uses the HTMLStripCharFilter and then
> directly add documents that direct the raw HTML to that field.
>
> There may be some other way to hook into the update processing chain, but
> that may be too much effort compared to the HTML strip filter.
>
> -- Jack Krupansky
>
> -Original Message- From: okayndc
> Sent: Monday, April 30, 2012 10:07 AM
> To: solr-user@lucene.apache.org
> Subject: Solr: extracting/indexing HTML via cURL
>
>
> Hello,
>
> Over the weekend I experimented with extracting HTML content via cURL and
> just
> wondering why the extraction/indexing process does not include the HTML
> tags.
> It seems as though the HTML tags either being ignored or stripped somewhere
> in the pipeline.
> If this is the case, is it possible to include the HTML tags, as I would
> like to keep the
> formatted HTML intact?
>
> Any help is greatly appreciated.
>


Re: Proper commit behavior in a multi-writer environment

2012-04-30 Thread Otis Gospodnetic
Adam,

This is where autocommit (see solrconfig.xml) comes in handy.  Don't have them 
all commit, no. :)

Otis 

Performance Monitoring for Solr - http://sematext.com/spm/index.html
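A sketch of what that looks like in solrconfig.xml (the thresholds below are illustrative); until some client, or the autocommit, issues a commit, added documents sit in the IndexWriter's buffer and are neither searchable nor durable:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs> <!-- commit after this many added docs -->
    <maxTime>60000</maxTime> <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>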





>
> From: Adam Fields 
>To: solr-user@lucene.apache.org 
>Sent: Monday, April 30, 2012 5:03 PM
>Subject: Proper commit behavior in a multi-writer environment
> 
>If I have 40 writers all feeding the same index, do they all have to commit, 
>or just one of them?
>
>Am I going to kill performance if they're all issuing individual commits, or 
>would it be better to not have the individual writers commit at all and just 
>have one process that does nothing but commit every minute or so?
>
>What happens to docs that are added to a solr index but not committed? Are 
>they just held in a queue until the next commit by anyone?
>
>
>--
>                - Adam
>--
>If you liked this email, you might also like:
>"Speed up the Dock animation in Mac OS" 
>-- http://workstuff.tumblr.com/post/21802026016
>"How I came to love Jamie Zawinski" 
>-- http://www.aquick.org/blog/2011/11/29/how-i-came-to-love-jamie-zawinski/
>"Hudson Streetlights" 
>-- http://www.flickr.com/photos/fields/6763398935/
>"fields: MillionShort is a search engine that removes the top most popular 
>site..." 
>-- http://twitter.com/fields/statuses/197057229403336704
>--
>** I design intricate-yet-elegant processes for user and machine problems.
>** Custom development project broken? Contact me, I can help.
>** Some of what I do: http://workstuff.tumblr.com/post/70505118/aboutworkstuff
>
>[ http://www.adamfields.com/resume.html ].. Experience
>[ http://www.morningside-analytics.com ] .. Latest Venture
>[ http://www.confabb.com ]  Founder
>
>
>
>
>

Re: correct XPATH syntax

2012-04-30 Thread Twomey, David
Answering my own question: I think I can do this by writing a script that
concatenates the LastName, ForeName and Initials and adds that to xpath =
/AuthorList/Author

Yes?
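A rough sketch of that script idea in DIH's data-config.xml (entity, column, and function names are hypothetical; this assumes every Author has all three sub-elements, since fields whose xpath matches several nodes per forEach row arrive as parallel lists):

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <script><![CDATA[
    function joinAuthors(row) {
      var lasts = row.get('last');   // one entry per Author element
      var fores = row.get('fore');
      var inits = row.get('init');
      var authors = new java.util.ArrayList();
      if (lasts != null) {
        for (var i = 0; i < lasts.size(); i++) {
          authors.add(lasts.get(i) + ' ' + fores.get(i) + ' ' + inits.get(i));
        }
      }
      row.put('author', authors);    // becomes the multi-valued Author field
      return row;
    }
  ]]></script>
  <document>
    <entity name="citation" processor="XPathEntityProcessor"
            url="medline.xml" forEach="/MedlineCitationSet/MedlineCitation"
            transformer="script:joinAuthors">
      <field column="last" xpath="/MedlineCitationSet/MedlineCitation/AuthorList/Author/LastName"/>
      <field column="fore" xpath="/MedlineCitationSet/MedlineCitation/AuthorList/Author/ForeName"/>
      <field column="init" xpath="/MedlineCitationSet/MedlineCitation/AuthorList/Author/Initials"/>
    </entity>
  </document>
</dataConfig>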

On 4/30/12 4:49 PM, "Twomey, David"  wrote:

>Sorry hit send too soon.  Continued the email below
>
>On 4/30/12 4:46 PM, "Twomey, David"  wrote:
>
>>
>>Is this possible in DataImportHandler
>>
>>I want the following XML to all collapse into one multi-valued Author
>>field
>>
>><AuthorList>
>> <Author>
>>  <LastName>Sørlie</LastName>
>>  <ForeName>T</ForeName>
>>  <Initials>T</Initials>
>> </Author>
>> <Author>
>>  <LastName>Perou</LastName>
>>  <ForeName>C M</ForeName>
>>  <Initials>CM</Initials>
>> </Author>
>> <Author>
>>  <LastName>Tibshirani</LastName>
>>  <ForeName>R</ForeName>
>>  <Initials>R</Initials>
>> </Author>
>>...
>></AuthorList>
>>
>>So my XPATH is like:
>><field ... xpath="/MedlineCitationSet/MedlineCitation/AuthorList/??"
>>commonField="true" />
>
>>
>



Re: extracting/indexing HTML via cURL

2012-04-30 Thread Jack Krupansky
I was thinking that you wanted to index the actual text from the HTML page, 
but have the stored field value still have the raw HTML with tags. If you 
just want to store only the raw HTML, a simple string field is sufficient, 
but then you can't easily do a text search on it.


Or, you can have two fields: one string field for the raw HTML (stored, but 
not indexed), and then do a CopyField to a text field that has the 
HTMLStripCharFilter to strip the HTML tags and index only the text (indexed, 
but not stored).
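A rough schema.xml sketch of that two-field arrangement (field names hypothetical; text_html stands for a TextField type whose analyzer begins with solr.HTMLStripCharFilterFactory):

<field name="html_raw" type="string" indexed="false" stored="true"/>
<field name="html_text" type="text_html" indexed="true" stored="false"/>
<copyField source="html_raw" dest="html_text"/>

Searches go against html_text; the unmodified markup comes back from html_raw.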


-- Jack Krupansky

-Original Message- 
From: okayndc

Sent: Monday, April 30, 2012 5:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr: extracting/indexing HTML via cURL

Great, thank you for the input.  My understanding of HTMLStripCharFilter is
that it strips HTML tags, which is not what I want ~ is this correct?  I
want to keep the HTML tags intact.

On Mon, Apr 30, 2012 at 11:55 AM, Jack Krupansky 
wrote:



If by "extracting HTML content via cURL" you mean using SolrCell to parse
html files, this seems to make sense. The sequence is that regardless of
the file type, each file extraction "parser" will strip off all formatting
and produce a raw text stream. Office, PDF, and HTML files are all treated
the same in that way. Then, the unformatted text stream is sent through 
the

field type analyzers to be tokenized into terms that Lucene can index. The
input string to the field type analyzer is what gets stored for the field,
but this occurs after the extraction file parser has already removed
formatting.

No way for the formatting to be preserved in that case, other than to go
back to the original input document before extraction parsing.

If you really do want to preserve full HTML formatted text, you would need
to define a field whose field type uses the HTMLStripCharFilter and then
directly add documents that direct the raw HTML to that field.

There may be some other way to hook into the update processing chain, but
that may be too much effort compared to the HTML strip filter.

-- Jack Krupansky

-Original Message- From: okayndc
Sent: Monday, April 30, 2012 10:07 AM
To: solr-user@lucene.apache.org
Subject: Solr: extracting/indexing HTML via cURL


Hello,

Over the weekend I experimented with extracting HTML content via cURL and
just
wondering why the extraction/indexing process does not include the HTML
tags.
It seems as though the HTML tags either being ignored or stripped 
somewhere

in the pipeline.
If this is the case, is it possible to include the HTML tags, as I would
like to keep the
formatted HTML intact?

Any help is greatly appreciated.





core sleep/wake

2012-04-30 Thread oferiko
I have a multicore Solr with a lot of cores that contain a lot of data (~50M
documents) but are rarely used.
Can I load a core from its configuration but keep it in a sleep mode, where
it has all the configuration available but hardly consumes any resources,
and, based on a query or an update, it will "come to life"?
Thanks





Re: dynamically create unique key

2012-04-30 Thread solr_noob
Hello Christopher,
I ran into the same problem. When I disable dedupe in the update handler,
things work fine. The problem is that when I enable dedupe I run into the
multiValued error. I'm also using SolrJ to add documents.

Were you able to resolve this?

If so, would you kindly post your solution?

thanks for your input/help





RE: CJKBigram filter questions: single character queries, bigrams created across script/character types

2012-04-30 Thread Burton-West, Tom
Thanks wunder,

I really appreciate the help.

Tom



Re: Solr logo for print

2012-04-30 Thread Chris Hostetter

http://svn.apache.org/viewvc?rev=1332444&view=rev

: At the bottom of  http://wiki.apache.org/solr/PublicServers I found a 
: link 
: 
to https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images/ ,
 
: but that leads to 404.

fixed.

-Hoss

Re: Weird query results with edismax and boolean operator +

2012-04-30 Thread Jan Høydahl
Hi,

I see that you have already commented on SOLR-2649 "MM ignored in edismax 
queries with operators". So let's continue the way towards resolution there...

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 30. apr. 2012, at 14:28, Vadim Kisselmann wrote:

> I tested it.
> With default "qf=title text" in solrconfig and "mm=100%"
> I get the same result (1) for "nascar AND author:serg*" and "+nascar
> +author:serg*", great.
> With "nascar +author:serg*" I get 3500 matches; in this case the
> mm parameter seems not to work.
> 
> Here are my debug params for "nascar AND author:serg*":
> 
> rawquerystring: nascar AND author:serg*
> parsedquery: (+(+DisjunctionMaxQuery((text:nascar |
> title:nascar)~0.01) +author:serg*))/no_coord
> parsedquery_toString: +(+(text:nascar | title:nascar)~0.01
> +author:serg*)
> explain for doc "com.bostonherald/news/international/europe/view/20120409russia_allows_anti-putin_demonstration_in_red_square":
> 8.235954 = (MATCH) sum of:
>  8.10929 = (MATCH) max plus 0.01 times others of:
>8.031613 = (MATCH) weight(text:nascar in 0) [DefaultSimilarity], result of:
>  8.031613 = score(doc=0,freq=2.0 = termFreq=2.0
> ), product of:
>0.84814763 = queryWeight, product of:
>  6.6960144 = idf(docFreq=27, maxDocs=8335)
>  0.12666455 = queryNorm
>9.469594 = fieldWeight in 0, product of:
>  1.4142135 = tf(freq=2.0), with freq of:
>2.0 = termFreq=2.0
>  6.6960144 = idf(docFreq=27, maxDocs=8335)
>  1.0 = fieldNorm(doc=0)
>7.7676363 = (MATCH) weight(title:nascar in 0) [DefaultSimilarity],
> result of:
>  7.7676363 = score(doc=0,freq=1.0 = termFreq=1.0
> ), product of:
>0.9919093 = queryWeight, product of:
>  7.830994 = idf(docFreq=8, maxDocs=8335)
>  0.12666455 = queryNorm
>7.830994 = fieldWeight in 0, product of:
>  1.0 = tf(freq=1.0), with freq of:
>1.0 = termFreq=1.0
>  7.830994 = idf(docFreq=8, maxDocs=8335)
>  1.0 = fieldNorm(doc=0)
>  0.12666455 = (MATCH) ConstantScore(author:serg*), product of:
>1.0 = boost
>0.12666455 = queryNorm
> 
> 
> 
> And here for "nascar +author:serg*":
> rawquerystring: nascar +author:serg*
> parsedquery: (+(DisjunctionMaxQuery((text:nascar |
> title:nascar)~0.01) +author:serg*))/no_coord
> parsedquery_toString: +((text:nascar | title:nascar)~0.01
> +author:serg*)
> explain for doc "com.bostonherald/news/international/europe/view/20120409russia_allows_anti-putin_demonstration_in_red_square":
> 8.235954 = (MATCH) sum of:
>  8.10929 = (MATCH) max plus 0.01 times others of:
>8.031613 = (MATCH) weight(text:nascar in 0) [DefaultSimilarity], result of:
>  8.031613 = score(doc=0,freq=2.0 = termFreq=2.0
> ), product of:
>0.84814763 = queryWeight, product of:
>  6.6960144 = idf(docFreq=27, maxDocs=8335)
>  0.12666455 = queryNorm
>9.469594 = fieldWeight in 0, product of:
>  1.4142135 = tf(freq=2.0), with freq of:
>2.0 = termFreq=2.0
>  6.6960144 = idf(docFreq=27, maxDocs=8335)
>  1.0 = fieldNorm(doc=0)
>7.7676363 = (MATCH) weight(title:nascar in 0) [DefaultSimilarity],
> result of:
>  7.7676363 = score(doc=0,freq=1.0 = termFreq=1.0
> ), product of:
>0.9919093 = queryWeight, product of:
>  7.830994 = idf(docFreq=8, maxDocs=8335)
>  0.12666455 = queryNorm
>7.830994 = fieldWeight in 0, product of:
>  1.0 = tf(freq=1.0), with freq of:
>1.0 = termFreq=1.0
>  7.830994 = idf(docFreq=8, maxDocs=8335)
>  1.0 = fieldNorm(doc=0)
>  0.12666455 = (MATCH) ConstantScore(author:serg*), product of:
>1.0 = boost
>0.12666455 = queryNorm
> 
> 
> 0.063332275 = (MATCH) product of:
>  0.12666455 = (MATCH) sum of:
>0.12666455 = (MATCH) ConstantScore(author:serg*), product of:
>  1.0 = boost
>  0.12666455 = queryNorm
>  0.5 = coord(1/2)
> 
> 
> 
> You can see that for the first doc in "nascar +author:serg*" all
> query params match, but in the second doc only
> "ConstantScore(author:serg*)" does.
> But with "mm=100%" all query params should match.
> http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/
> http://lucene.apache.org/solr/api/org/apache/solr/util/doc-files/min-should-match.html
> 
> Best regards
> Vadim
> 
> 
> 
> 2012/4/30 Vadim Kisselmann :
>> Hi Jan,
>> thanks for your response!
>> 
>> My "qf" parameter for edismax is: "title". My
>> "defaultSearchField=text" in schema.xml.
>> In my app I generate a query with "qf=title,text", so I think the
>> default parameters in config/schema should be overridden, right?
>> 
>> I eventually found 2 possible reasons for this behavior:
>> 1. The "mm" parameter in solrconfig.xml for edismax is 0; 0 stands for
>> "OR", but it should be an "AND" => 100%.
>> 2. I suppose that my app does not override my default "qf".
>> I'll test it today and report back with my parsed query and all params.
>> 
>> Best regards
>> Vadim
>> 
>> 
>> 
>> 
>> 2012/4/29 Jan Høydahl :
>>> Hi,
>>> 
>>> What is your "qf" para

Re: hierarchical faceting?

2012-04-30 Thread Chris Hostetter

: Is there a tokenizer that tokenizes the string as one token?

Using KeywordTokenizer at query time should do what you want.
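For example, a field type along these lines (a sketch with illustrative names) splits a path like a/b/c into its prefixes at index time but leaves the query string as a single token:

<fieldType name="descendent_path" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>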


-Hoss