Re: Spatial Solr (JTeam)
I have also moved the jar into the global core's lib directory, and I still have this issue. I am running Mac OS X Snow Leopard, java version "1.6.0_17" Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-10M3025) Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode). I really don't know where the issue comes from. On Mon, Dec 28, 2009 at 2:54 PM, Mauricio Scheffer < mauricioschef...@gmail.com> wrote: > Seems to work for me... (I mean, I don't get a NoClassDefFoundError but I > have other issues). > I just put spatial-solr-1.0-RC3.jar in the core's lib directory and it > worked. > > On Wed, Dec 23, 2009 at 8:25 PM, Thomas Rabaix wrote: > > Hello, > > I would like to set up the spatial solr plugin from > > http://www.jteam.nl/news/spatialsolr on solr 1.4. However I am getting an > > error message when solr starts. > > > > SEVERE: java.lang.NoClassDefFoundError: > > org/apache/solr/search/QParserPlugin > > > > I guess nl.jteam.search.solrext.spatial.SpatialTierQueryParserPlugin > > extends QParserPlugin. I have checked the solr.war file (the one > > provided on the solr download page) and the class is present. > > > > Do you know if the current version "SSP version 1.0-RC3" is compatible with > > solr 1.4? > > > > Thanks > > > > -- > > Thomas Rabaix > > -- Thomas Rabaix http://rabaix.net
Re: Remove the deleted docs from the Solr Index
On Wed, Dec 30, 2009 at 12:10 AM, Mohamed Parvez wrote: > Ditto. There should have been a DIH command to re-sync the Index with the > DB. But there is such a command; it is called full-import. -- Regards, Shalin Shekhar Mangar.
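For reference, assuming the DataImportHandler is registered at /dataimport (as in the example solrconfig), a full re-sync can be triggered with a request like:

  http://localhost:8983/solr/dataimport?command=full-import

A delta-import command also exists for incremental syncs driven by a delta query.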
Re: Search algorithm used in Solr
On Mon, Jan 4, 2010 at 11:39 AM, wrote: > Hello everyone, > > Is there an article which explains (on a high level) the algorithm of > search in Solr? > > How does Solr search approach compare to the "inverted index" technique? > > Solr uses Lucene. It is the same inverted index technique at work. -- Regards, Shalin Shekhar Mangar.
Re: Optimize not having any effect on my index
Hey, I managed to run it correctly after a few restarts. Don't really know what happened. Can't really see what this would have had to do with compound file format though? But no, I'm not using compound file format. Cheers and thanks for your replies, Aleks On Mon, Dec 21, 2009 at 8:27 AM, gurudev wrote: > > Hi, > > Are you using the compound file format? If yes, then have you set it properly > in solrconfig.xml? If not, then change to: > > <useCompoundFile>true</useCompoundFile> (this is by default 'false') under > the tags <indexDefaults> and <mainIndex> > > > > Aleksander Stensby wrote: > > > > Hey guys, > > I'm getting some strange behavior here, and I'm wondering if I'm doing > > anything wrong.. > > > > I've got an unoptimized index, and I'm trying to run the following > > command: > > http://server:8983/solr/update?optimize=true&maxSegments=10&waitFlush=false > > Tried it first directly in the browser; it obviously took quite a bit of > > time, but once it was finished I see no difference in my index. Same > > number > > of files, same size etc. > > So I tried with curl: > > curl http://server:8983/solr/update --data-binary '<optimize/>' -H > > 'Content-type:text/xml; charset=utf-8' > > > > No difference here either... Am I doing anything wrong? Do I need to issue > > a > > commit after the optimize? > > > > Any pointers would be greatly appreciated. > > > > Cheers, > > Aleks > > > > > > -- > View this message in context: > http://old.nabble.com/Optimize-not-having-any-effect-on-my-index-tp26843094p26870653.html > Sent from the Solr - User mailing list archive at Nabble.com. > >
Facets and distributed search
Hi everyone! I've posted a similar question earlier, but in a thread related to facets in general, so I thought I'd repost it here as a separate thread. I have a faceted search that is very fast when I execute the query on a single solr server, but is significantly slower when executed in a distributed environment. The bottleneck seems to be in the sharding of our data.. And that puzzles me a little bit... I can't really see why SOLR is so slow at doing this. The scenario: Let's say we have two servers (s1 and s2). If I query the following: q=threadid:33&facet=true&facet.field=author&limit=-1&facet.mincount=0&rows=0 directly on either server, the response is lightning fast. (<10ms) So, in theory I could query them directly, concat the results myself and get that done pretty fast. But if I introduce the shards parameter, the response time booms to between 15000ms and 2ms! shards=s1:8983/solr,s2:8983/solr My initial thought is that I MUST be doing something wrong here? So I try the following: Run the query on server s1, with the shards param shards=s1:8983/solr response time goes from sub 10ms to between 5000ms and 1ms! Same results if I run the query on s2, and same if I use shards=s2:8983/solr Is there really that much overhead in running a distributed facet field query with Solr? Anyone else experienced this? On the other hand, running regular queries without facets distributed is lightning fast... (so I can't really see that this is a network problem or anything either). - I tried running a facet query on s1 with s1 as the shards param, and that is still as slow as if the shards param was pointed to a different server... Any insight into this would be greatly appreciated! (Would like to avoid having to hack together our own solution concatenating results...) Cheers, Aleks
Re: Implementing Autocomplete/Query Suggest using Solr
On Wed, Dec 30, 2009 at 3:07 AM, Prasanna R wrote: > I looked into the Solr/Lucene classes and found the required information. > Am summarizing the same for the benefit of those that might refer to this > thread in the future. > > The change I had to make was very simple - make a call to getPrefixQuery > instead of getWildcardQuery in my custom-modified Solr dismax query parser > class. However, this will make a fairly significant difference in terms of > efficiency. The key difference between the lucene WildcardQuery and > PrefixQuery lies in their respective term enumerators, specifically in the > term comparators. The termCompare method for PrefixQuery is more > light-weight than that of WildcardQuery and is essentially an optimization > given that a prefix query is nothing but a specialized case of Wildcard > query. Also, this is why the lucene query parser automatically creates a > PrefixQuery for query terms of the form 'foo*' instead of a WildcardQuery. > > I don't understand this. There is nothing that one should need to do in Solr's code to make this work. Prefix queries are supported out of the box in Solr. > And one final request for Comment to Shalin on this topic - I am guessing > you ensured there were no duplicate terms in the field(s) used for > autocompletion. For our first version, I am thinking of eliminating the > duplicates outside of the results handler that gives suggestions since > duplicate suggestions originate only from different document IDs in our > system and we do want the list of document IDs matched. Is there a > better/different way of doing the same? > > No, I guess not. -- Regards, Shalin Shekhar Mangar.
Re: Solr Cell - PDFs plus literal metadata - GET or POST ?
On Wed, Dec 30, 2009 at 7:49 AM, Ross wrote: > Hi all > > I'm experimenting with Solr. I've successfully indexed some PDFs and > all looks good but now I want to index some PDFs with metadata pulled > from another source. I see this example in the docs. > > curl " > http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah > " > -F "tutori...@tutorial.pdf" > > I can write code to generate a script with those commands substituting > my own literal.whatever. My metadata could be up to a couple of KB in > size. Is there a way of making the literal a POST variable rather than > a GET? With Curl? Yes, see the man page. > Will Solr Cell accept it as a POST? Yes, it will. -- Regards, Shalin Shekhar Mangar.
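For reference, a sketch of the POST form: with curl, each -F flag becomes a multipart form field, and Solr should read the literal.* values from those fields just as it would from the query string. The id value and the file field name here are made up:

  curl "http://localhost:8983/solr/update/extract?captureAttr=true&defaultField=text" \
       -F "literal.id=doc4" \
       -F "literal.blah_s=Bah" \
       -F "myfile=@tutorial.pdf"

This keeps large metadata values out of the URL entirely.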
Re: performance question
On Jan 4, 2010, at 12:04 AM, A. Steven Anderson wrote: dynamic fields don't make it worse ... the number of actual field names you sort on makes it worse. If you sort on 100 fields, the cost is the same regardless of whether all 100 of those fields exist because of a single declaration, or 100 distinct declarations. Ahh...thanks for the clarification. So, in general, there is no *significant* performance difference with using dynamic fields. Correct? Correct. There's not even really an "insignificant" performance difference. A dynamic field is the same as a regular field in practically every way on the search side of things. Erik
Re: Search both diacritics and non-diacritics
On Sun, Jan 3, 2010 at 6:01 AM, Lance Norskog wrote: > The ASCIIFoldingFilter is a superset of the ISOLatin1Filter - > ISOLatin1 is deprecated. Here's the Javadoc from ASCIIFoldingFIlter. > You did not mention which language you want to search. > > Unforch, the ASCIIFoldingFilter is not mentioned on the Solr wiki. > > Thanks Lance. I've added it to the wiki at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters -- Regards, Shalin Shekhar Mangar.
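For anyone searching the archives later, a minimal fieldType sketch using the filter (the type name is made up; adjust the tokenizer to taste):

  <fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
    </analyzer>
  </fieldType>

Because the same analyzer runs at index and query time, both 'café' and 'cafe' fold to the same term, so either query form matches documents containing either form.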
Re: Configuring Solr to use RAMDirectory
On Thu, Dec 31, 2009 at 3:36 PM, dipti khullar wrote: > Hi > > Can somebody let me know if it's possible to configure RAMDirectory from > solrconfig.xml? Although it's clearly mentioned in > https://issues.apache.org/jira/browse/SOLR-465 by Mark that he has worked > on it, I still couldn't find any such property in the config file in the Solr > 1.4 latest download. > Maybe I am overlooking some simple property. Any help would be > appreciated. > > Note that there are things like replication which will not work if you are using a RAMDirectory. -- Regards, Shalin Shekhar Mangar.
Re: Rules engine and Solr
On Mon, Jan 4, 2010 at 10:24 AM, Avlesh Singh wrote: > I have a Solr (version 1.3) powered search server running in production. > Search is keyword driven and is supported using custom fields and tokenizers. > > I am planning to build a rules engine on top of search. The rules are database > driven and can't be stored inside solr indexes. These rules would > ultimately do two things - > > 1. Change the order of Lucene hits. > A Lucene FieldComparator is what you'd need. The QueryElevationComponent uses this technique. > 2. Add/remove some results to/from the Lucene hits. > > This is a bit more tricky. If you will always have a very limited number of docs to add or remove, it may be best to change the query itself to include or exclude them (i.e. add fq). Otherwise you'd need to write a custom Collector (see DocSetCollector) and change SolrIndexSearcher to use it. We are planning to modify SolrIndexSearcher to allow custom collectors soon for field collapsing but for now you will have to modify it. > What should be my starting point? Custom search handler? > > A custom SearchComponent which extends/overrides QueryComponent will do the job. -- Regards, Shalin Shekhar Mangar.
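A rough sketch of such a component; the class name is made up and the actual rule lookup and re-ranking logic (the hard part) is only indicated in comments:

  import java.io.IOException;
  import org.apache.solr.handler.component.QueryComponent;
  import org.apache.solr.handler.component.ResponseBuilder;

  public class RulesQueryComponent extends QueryComponent {
      @Override
      public void process(ResponseBuilder rb) throws IOException {
          // run the normal Lucene query first
          super.process(rb);
          // then inspect rb.getResults().docList and re-order or drop
          // hits according to the database-driven rules (omitted here)
      }
  }

It would be registered in solrconfig.xml with <searchComponent name="rules" class="com.example.RulesQueryComponent"/> (class path hypothetical) and swapped in for the stock query component in the handler's component list.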
Re: Invalid CRLF - StreamingUpdateSolrServer ?
Thank you Yonik for your answer. The platform encoding is "fr_FR.UTF-8", so it's still UTF-8; should it be "en_US.UTF-8", I guess? I've also tested LBHttpSolrServer (we wanted to have it as a "backup" for HAProxy) and it appears not to be thread safe (what is also curious about it is that there's no way to manage the connection pool). If you're interested in the logs, I can send those to you. *Will there be a Solr 1.4.1 that'll fix those problems?* Because using a SNAPSHOT doesn't seem a good idea to me. I have another question but I don't know if I have to make a new post: can I use "-Dmaster=disabled" in JAVA_OPTS for a server that is a slave and repeater? Patrick. Yonik Seeley wrote: It could be this bug, fixed in trunk: * SOLR-1595: StreamingUpdateSolrServer used the platform default character set when streaming updates, rather than using UTF-8 as the HTTP headers indicated, leading to an encoding mismatch. (hossman, yonik) Could you try a recent nightly build (or build your own from trunk) and see if it fixes it? -Yonik http://www.lucidimagination.com On Thu, Dec 31, 2009 at 5:07 AM, Patrick Sauts wrote: I'm using solr 1.4 on tomcat 5.0.28, with client StreamingUpdateSolrServer with 10 threads and xml communication via the POST method. Is there a way to avoid this error (data loss)? And is StreamingUpdateSolrServer reliable?

GRAVE: org.apache.solr.common.SolrException: Invalid CRLF
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:72)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
at java.lang.Thread.run(Thread.java:619)
Caused by: com.ctc.wstx.exc.WstxIOException: Invalid CRLF
Re: Invalid CRLF - StreamingUpdateSolrServer ?
On Mon, Jan 4, 2010 at 6:11 PM, Patrick Sauts wrote: > > I've also tested LBHttpSolrServer (We wanted to have it as a "backup" for > HAproxy) and it appears not to be thread safe ( what is also curious about > it, is that there's no way to manage the connections' pool ). If you're > interresting in the logs, I can send those to you. > What is the issue that you are facing? What is it exactly that you want to change? -- Regards, Shalin Shekhar Mangar.
Improvising solr queries
Hi We have tried out various configuration settings to improve the performance of the site, which relies heavily on Solr, but the throughput remains about 4-5 reqs/sec. We also did some performance tests on Solr 1.4 but there was only a very minute improvement in performance. Currently we are using Solr 1.3. So our last resort remains improving the queries. We are using SolrJ - CommonsHttpSolrServer. We are trying to tune up the Solr queries being used in our project. The following sample query takes about 6 secs to execute under normal traffic. At peak hours this often increases to 10-15 secs. (sitename:XYZ OR sitename:"All Sites") AND (localeid:1237400589415) AND ((assettype:Gallery)) AND (rbcategory:"ABC XYZ" ) AND (startdate:[* TO 2009-12-07T23:59:00Z] AND enddate:[2009-12-07T00:00:00Z TO *])&rows=9&start=63&sort=date desc&facet=true&facet.field=assettype&facet.mincount=1 Similar to this query we have several much more complex queries supporting all major landing pages of our application. Just want to confirm whether anyone can identify any major flaws or issues in the sample query? Thanks Dipti
Re: Invalid CRLF - StreamingUpdateSolrServer ?
The issue was sometimes a null result during facet navigation or simple search; results were back after a refresh. We tried to change the cache settings, but same behaviour. *My implementation was:* (maybe wrong?)

LBHttpSolrServer solrServer = new LBHttpSolrServer(new HttpClient(), new XMLResponseParser(), solrServerUrl.split(","));
solrServer.setConnectionManagerTimeout(CONNECTION_TIMEOUT);
solrServer.setConnectionTimeout(CONNECTION_TIMEOUT);
solrServer.setSoTimeout(READ_TIMEOUT);
solrServer.setAliveCheckInterval(CHECK_HEALTH_INTERVAL_MS);

*What I was suggesting:* As LBHttpSolrServer is a wrapper around CommonsHttpSolrServer:

CommonsHttpSolrServer search1 = new CommonsHttpSolrServer("http://mysearch1");
search1.setConnectionTimeout(CONNECTION_TIMEOUT);
search1.setSoTimeout(READ_TIMEOUT);
search1.setConnectionManagerTimeout(solr.CONNECTION_MANAGER_TIMEOUT);
search1.setDefaultMaxConnectionsPerHost(MAX_CONNECTIONS_PER_HOST1);
search1.setMaxTotalConnections(MAX_TOTAL_CONNECTIONS1);
search1.setParser(new XMLResponseParser());

CommonsHttpSolrServer search2 = new CommonsHttpSolrServer("http://mysearch2");
search2.setConnectionTimeout(CONNECTION_TIMEOUT);
search2.setSoTimeout(READ_TIMEOUT);
search2.setConnectionManagerTimeout(solr.CONNECTION_MANAGER_TIMEOUT);
search2.setDefaultMaxConnectionsPerHost(MAX_CONNECTIONS_PER_HOST2);
search2.setMaxTotalConnections(MAX_TOTAL_CONNECTIONS2);
search2.setParser(new XMLResponseParser());

*LBHttpSolrServer solrServers = new LBHttpSolrServer(search1, search2);*

So we can manage the parameters per server. Thank you for your time. Patrick. Shalin Shekhar Mangar wrote: On Mon, Jan 4, 2010 at 6:11 PM, Patrick Sauts wrote: I've also tested LBHttpSolrServer (we wanted to have it as a "backup" for HAproxy) and it appears not to be thread safe (what is also curious about it is that there's no way to manage the connection pool). If you're interested in the logs, I can send those to you. What is the issue that you are facing? What is it exactly that you want to change?
Re: Improvising solr queries
On Mon, Jan 4, 2010 at 6:39 PM, dipti khullar wrote: > We have tried out various configuration settings to improve the > performance of the site, which relies heavily on Solr, but the throughput > remains about 4-5 reqs/sec. We also did some performance tests on Solr > 1.4 but there was only a very minute improvement in performance. Currently > we are using Solr 1.3. > That is too slow. We need more information on your setup before we can help. What kind of hardware are you using? Which OS/JVM? How much memory have you allocated to the JVM? What does your solrconfig look like? How many documents are there in your index? What is the size of the index on disk? What are the field types of the fields you are searching on? Do you do highlighting on large fields? Can you paste the cache section on the statistics page of your Solr dashboard (preferably just after a peak load)? How frequently is your index changed (i.e. how frequently do you commit)? I'd recommend an upgrade to Solr 1.4 anyway since it has major performance improvements. > > So our last resort remains improving the queries. We are using SolrJ - > CommonsHttpSolrServer > > Actually that is one of the first things that you should look at. > We are trying to tune up the Solr queries being used in our project. > The following sample query takes about 6 secs to execute under normal traffic. > At peak hours this often increases to 10-15 secs. > > (sitename:XYZ OR sitename:"All Sites") AND (localeid:1237400589415) AND > ((assettype:Gallery)) AND (rbcategory:"ABC XYZ" ) AND (startdate:[* TO > 2009-12-07T23:59:00Z] AND enddate:[2009-12-07T00:00:00Z TO > *])&rows=9&start=63&sort=date > desc&facet=true&facet.field=assettype&facet.mincount=1 > > Similar to this query we have several much more complex queries supporting all > major landing pages of our application. > > Just want to confirm whether anyone can identify any major flaws or > issues in the sample query? > > Most of those AND conditions can be separate filter queries. Filter queries can be cached separately and can therefore be re-used. See http://wiki.apache.org/solr/FilterQueryGuidance -- Regards, Shalin Shekhar Mangar.
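A sketch of the same query restructured along those lines (everything except the free-text part moves into fq parameters, which hit the filterCache on repeat use; all other parameters unchanged):

  q=rbcategory:"ABC XYZ"
  &fq=sitename:XYZ OR sitename:"All Sites"
  &fq=localeid:1237400589415
  &fq=assettype:Gallery
  &fq=startdate:[* TO 2009-12-07T23:59:00Z]
  &fq=enddate:[2009-12-07T00:00:00Z TO *]
  &rows=9&start=63&sort=date desc
  &facet=true&facet.field=assettype&facet.mincount=1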
RE: Reverse sort facet query [SOLR-1672]
> Date: Sun, 3 Jan 2010 22:18:33 -0800 > From: hossman_luc...@fucit.org > To: solr-user@lucene.apache.org > Subject: RE: Reverse sort facet query [SOLR-1672] > > > : Yes, I thought about adding some 'new syntax', but I opted for a separate > 'facet.sortorder' parameter, > : > : mainly because I'm not familiar enough with the codebase to know what > effect this might have on > : > : backward compatibility. It would be easy enough to modify the patch I > created to do it this way. > > it shouldn't really affect anything -- it wouldn't really be new syntax, > just extending hte existing "sort" param syntax to apply to the > "facet.sort" param. The only back compat concern is making sure we > continue to support true/false as aliases, and having the default order > match the current bahvior if asc/desc aren't specified. > > > -Hoss > Yes, agreed. The current patch doesn't touch the b/w true/false aliasing, and any move to adding a new attr can keep all that intact. I've been using the current patch extensively in our testing, and that's working well. The only caveat to this is that the reverse sort results don't include 0-count facets (see notes in SOLR-1672), so reverse sort results start with the first count=1. This could be confusing as there could well be many facets whose count is 0, and it might be expected that these be returned in the first instance. >From my admittedly cursory look into the codebase regading this, I believe >patching to include 0 counts could open a can of worms in terms of b/w compat and performance, as 0 counts look to be skipped (by default). I could be wrong, and you may know better how changes to SimpleFacets/UnInvertedField would affect performance and compatibility. If there is indeed a performance optimization in facet counting iteration, it would, imo, be preferable to have the optimization, rather than the 0-counts. Would you like me to go ahead and amend the patch (w/o 0-counts) to define a new 'sort' parameter? For naming, I would propose an extension of FacetParams.FACET_SORT_COUNT ala: public static final String FACET_SORT_COUNT_REVERSE = "count.reverse"; I can then easily modify the patch to detect/use this value to invoke the new behaviour. Comments? Suggestions? Thanks, Peter _ Have more than one Hotmail account? Link them together to easily access both http://clk.atdmt.com/UKM/go/186394591/direct/01/
Re: Improvising solr queries
On Mon, Jan 4, 2010 at 7:25 PM, dipti khullar wrote: > Thanks Shalin. > > Following are the relevant details: > > There are 2 search servers in a virtualized VMware environment. Each has 2 > instances of Solr running on separate ports in tomcat. > Server 1: hosts 1 master (application 1), 1 slave (application 1) > Server 2: hosts 1 master (application 2), 1 slave (application 1) > > Have you tried a non-virtualized environment? Virtual instances are not that great for high I/O throughput environments. > Both servers have 4 CPUs and 4 GB RAM. > > Master > - 4GB RAM > - 1GB JVM heap memory is allocated to Solr > Slave1/Slave2: > - 4GB RAM > - 2GB JVM heap memory is allocated to Solr > > Solr Details: > apache-solr Version: 1.3.0 > Lucene - 2.4-dev > > - autocommit: 50 docs and 5 minutes > - optimize runs on master every 7 minutes > - using postOptimize, we execute snapshooter on master > - snappuller/snapinstaller on 2 slaves runs every 10 minutes > > You are committing every 5 minutes and optimizing every 7 minutes. Can you try committing less often? > Master and Slave1 (solr1) are on a single box and Slave2 (solr2) is on a different > box. We use HAProxy to load balance query requests between the > 2 slaves. Master is only used for indexing. > > The SolrJ client which queries Solr over HTTP gets timed out (10 sec is the timeout value), and there is high CPU usage/load avg., even though in the Solr tomcat access log we find all requests have 200 responses. The problem is reported on the slaves for application 1. > While requests time out, the load avg. of the server goes extremely high (10-20). > The issue gets resolved as soon as we optimize the slave index. In the solr > admin, it shows only 4 requests/sec handled with 400 ms response time. > > I am attaching solrconfig.xml for both master and slaves. > > There is no autowarming on the slaves, which is probably OK if you are committing so often. But do you really need to index new documents so often? -- Regards, Shalin Shekhar Mangar.
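For reference, autocommit is controlled in solrconfig.xml inside <updateHandler>; a sketch with placeholder values for a less aggressive schedule:

  <autoCommit>
    <maxDocs>10000</maxDocs>
    <!-- milliseconds -->
    <maxTime>600000</maxTime>
  </autoCommit>

Fewer commits mean fewer snapshots to pull and fewer searcher warm-ups on the slaves.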
High Availability
I'm kind of stuck and looking for suggestions for high availability options. I've figured out without much trouble how to get the master-slave replication working. This eliminates any single points of failure in the application in terms of the application's searching capability. I would set up a master which would create the index and several slaves to act as the search servers, and put them behind a load balancer to distribute the requests. This would ensure that if a slave node goes down, requests would continue to get serviced by the other nodes that are still up. The problem I have is that my particular application also has the capability to trigger index updates from the user interface. This means that the master now becomes a single point of failure for the user interface. The basic idea of the app is that there are multiple oracle instances contributing to a single document. The volume and organization of the data (database links, normalization, etc...) prevents any sort of fast querying via SQL to do querying of the documents. The solution is to build a lucene index (via solr), and use that for searching. When updates are made in the UI, we will also send the updates directly to the solr server as well (we don't want to wait some arbitrary interval for a delta query to run). So you can see the problem here is that if the master is down, the sending of the updates to the master solr server will fail, thus causing an application exception. I have tried configuring multiple solr servers which are both set up as masters and slaves to each other, but they keep clobbering each other's index updates and rolling back each other's delta updates. It seems that the replication doesn't take the generation # into account and check that the generation it's fetching is > the generation it already has before it applies it. I thought of maybe introducing a JMS queue to send my updates to and having the JMS message listener set to manually acknowledge the messages only after a successful application of the solrj api calls, but that seems kind of contrived, and is only a band-aid. Does anyone have any suggestions? mattin...@yahoo.com "Once you start down the dark path, forever will it dominate your destiny. Consume you it will " - Yoda
Re: High Availability
Have you looked into a basic floating IP setup? Have the master also replicate to another hot-spare master. Any downtime during an outage of the 'live' master would be minimal as the hot-spare takes up the floating IP. On Mon 04/01/10 16:13 , Matthew Inger wrote: > I'm kind of stuck and looking for suggestions for high availability > options. I've figured out without much trouble how to get the > master-slave replication working. This eliminates any single points > of failure in the application in terms of the application's searching > capability. > I would set up a master which would create the index and several > slaves to act as the search servers, and put them behind a load > balancer to distribute the requests. This would ensure that if a > slave node goes down, requests would continue to get serviced by the > other nodes that are still up. > The problem I have is that my particular application also has the > capability to trigger index updates from the user interface. This > means that the master now becomes a single point of failure for the > user interface. > The basic idea of the app is that there are multiple oracle > instances contributing to a single document. The volume and > organization of the data (database links, normalization, etc...) > prevents any sort of fast querying via SQL to do querying of the > documents. The solution is to build a lucene index (via solr), and > use that for searching. When updates are made in the UI, we will > also send the updates directly to the solr server as well (we don't > want to wait some arbitrary interval for a delta query to run). > So you can see the problem here is that if the master is down, the > sending of the updates to the master solr server will fail, thus > causing an application exception. > I have tried configuring multiple solr servers which are both setup > as masters and slaves to each other, but they keep clobbering each > other's index updates and rolling back each other's delta updates. > It seems that the replication doesn't take the generation # into > account and check that the generation it's fetching is > the > generation it already has before it applies it. > I thought of maybe introducing a JMS queue to send my updates to and > having the JMS message listener set to manually acknowledge the > messages only after a successful application of the solrj api calls, > but that seems kind of contrived, and is only a band-aid. > Does anyone have any suggestions?
Re: Facets and distributed search
Something looks wrong... that type of slowdown is certainly not expected. You should be able to see both the main query and a sub-query in the logs... could you post an actual example? -Yonik http://www.lucidimagination.com On Mon, Jan 4, 2010 at 4:15 AM, Aleksander Stensby wrote: > Hi everyone! I've posted a similar question earlier, but in a thread related > to facets in general, so I thought I'd repost it here as a separate thread. > > I have a faceted search that is very fast when I executed the query on a > single solr server, but is significantly slower when executed in a > distributed environment. > The set-back seem to be in the sharding of our data.. And that puzzles me a > little bit... I can't really see why SOLR is so slow at doing this. > > The scenario: > Let's say we have two servers (s1 and s2). > If i query > the following: > q=threadid:33&facet=true&facet.field=author&limit=-1&facet.mincount=0&rows=0 > directly on either server, the response is lightning fast. (<10ms) > > So, in theory I could query them directly, concat the result myself and get > that done pretty fast. > > But if I introduce the shards parameter, the response time booms to between > 15000ms and 2ms! > shards=s1:8983/solr,s2:8983/solr > > My initial thoughts is that I MUST be doing something wrong here? > > So I try the following: > Run the query on server s1, with the shards param shards=s1:8983/solr > response time goes from sub 10ms to between 5000ms and 1ms! > Same results if i run the query on s2, and same if i use shards=s2:8983/solr > > Is there really that much overhead in running a distributed facet field > query with Solr? Anyone else experienced this? > > On the other hand, running regular queries without facet distributed is > lightning fast... (so can't really see that this is a network problem or > anything either). - I tried running a facet query on s1 with s1 as the > shards param, and that is still as slow as if the shards param was pointed > to a different server... > > Any insight into this would be greatly appreciated! (Would like to avoid > having to hack together our own solution concatenating results...) > > Cheers, > Aleks >
Re: High Availability
So, when the masters switch back, does that mean we have to force a full delta update, correct? mattin...@yahoo.com "Once you start down the dark path, forever will it dominate your destiny. Consume you it will " - Yoda - Original Message From: "r...@intelcompute.com" To: solr-user@lucene.apache.org Sent: Mon, January 4, 2010 11:17:40 AM Subject: Re: High Availability Have you looked into a basic floating IP setup? Have the master also replicate to another hot-spare master. Any downtime during an outage of the 'live' master would be minimal as the hot-spare takes up the floating IP.
Re: High Availability
Even when Master 1 is alive again, it shouldn't get the floating IP until Master 2 actually fails. So you'd ideally want them replicating to each other, but since only one will be updated/live at a time, it shouldn't cause an issue with clobbering data (?). Just a suggestion tho, not done it myself on Solr, only with DB servers. On Mon 04/01/10 16:28 , Matthew Inger wrote: > So, when the masters switch back, does that mean we have to force a > full delta update, correct?
Re: High Availability
I'm also not sure what hooks you could put in upon the IP floating to the other machine, to start/stop replication - if it IS an issue anyway. On Mon 04/01/10 16:28 , Matthew Inger wrote: > So, when the masters switch back, does that mean we have to force a > full delta update, correct?
Re: Any way to modify result ranking using an integer field?
> Thanks Ahmet. > > Do I need to do anything to enable BoostQParserPlugin in > Solr, or is it already enabled? I just confirmed that it is already enabled. You can see the effect of it by appending &debugQuery=on to your search url.
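For example, assuming an indexed numeric field named 'popularity':

  http://localhost:8983/solr/select?q={!boost b=log(popularity)}ipod&debugQuery=on

The debug output shows the boost function's contribution to each document's score.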
Re: Implementing Autocomplete/Query Suggest using Solr
On Mon, Jan 4, 2010 at 1:20 AM, Shalin Shekhar Mangar < shalinman...@gmail.com> wrote: > On Wed, Dec 30, 2009 at 3:07 AM, Prasanna R wrote: > > > I looked into the Solr/Lucene classes and found the required > information. > > Am summarizing the same for the benefit of those that might refer to this > > thread in the future. > > > > The change I had to make was very simple - make a call to getPrefixQuery > > instead of getWildcardQuery in my custom-modified Solr dismax query > parser > > class. However, this will make a fairly significant difference in terms > of > > efficiency. The key difference between the lucene WildcardQuery and > > PrefixQuery lies in their respective term enumerators, specifically in > the > > term comparators. The termCompare method for PrefixQuery is more > > light-weight than that of WildcardQuery and is essentially an > optimization > > given that a prefix query is nothing but a specialized case of Wildcard > > query. Also, this is why the lucene query parser automatically creates a > > PrefixQuery for query terms of the form 'foo*' instead of a > WildcardQuery. > > > > > I don't understand this. There is nothing that one should need to do in > Solr's code to make this work. Prefix queries are supported out of the box > in Solr. > > I am using the dismax query parser and I match on multiple fields with different boosts. I run a prefix query on some fields in combination with a regular field query on other fields. I do not know of any way in which one could specify a prefix query on a particular field in your dismax query out of the box in Solr 1.4. I had to update Solr to support additional syntax in a dismax query that lets you choose to create a prefix query on a particular field. As part of parsing this custom syntax, I was making a call to the getWildcardQuery which I simply changed to getPrefixQuery. Prasanna.
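For anyone reading along, the Lucene calls being discussed look roughly like this (field and term are made up):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.PrefixQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.WildcardQuery;

  // both match title values starting with "foo", but PrefixQuery's
  // term enumeration is cheaper than WildcardQuery's pattern matching
  Query prefix = new PrefixQuery(new Term("title", "foo"));
  Query wild = new WildcardQuery(new Term("title", "foo*"));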
Phrase search issue with XMLPayload? Is it the better solution?
I have a project that involves words extracted by OCR; each page has words, and each word has its geometry, used to blink a highlight for the end user. I've been trying to represent this document structure in XML (the markup itself did not survive here; the example field holds the terms foo bar baz qux, each with a payload carrying its rectangle). Using the field 'fulltext_st', I can get all terms in my search result with their payloads. But if I search using a phrase query I can't fetch any result. Example: search?q=foo returns 1 result; search?q=foo+bar returns 1 result; search?q="foo bar" returns *nothing*. I was wondering if I could get your thoughts on whether XMLPayload supports this sort of thing (phrase search), or is there a better solution to index a doc with many pages and one rectangle (graphical word geometry) for each term? Thank you in advance -- View this message in context: http://old.nabble.com/Phrase-search-issue-with-XMLPayload--Is-it-the-better-solution--tp27018815p27018815.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Any way to modify result ranking using an integer field?
Thank you Ahmet. Is there any way I can configure Solr to always use {!boost b=log(popularity)} as the default for all queries? I'm using Solr through django-haystack, so all the Solr queries are actually generated by haystack. It'd be much cleaner if I could configure Solr to always use BoostQParserPlugin for all queries instead of manually modifying every single query generated by haystack. --- On Mon, 1/4/10, Ahmet Arslan wrote: From: Ahmet Arslan Subject: Re: Any way to modify result ranking using an integer field? To: solr-user@lucene.apache.org Date: Monday, January 4, 2010, 2:33 PM > Thanks Ahmet. > > Do I need to do anything to enable BoostQParserPlugin in > Solr, or is it already enabled? I just confirmed that it is already enabled. You can see the effect of it by appending &debugQuery=on to your search url.
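One approach that might work is to skip the boost parser and set a dismax boost function as a handler default in solrconfig.xml; note that bf is additive rather than multiplicative like {!boost}, so the effect is similar but not identical, and whether this plays well with haystack depends on whether its generated queries rely on full Lucene syntax, which dismax does not parse. A sketch (the qf value is a placeholder):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="qf">text</str>
      <str name="bf">log(popularity)</str>
    </lst>
  </requestHandler>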
Re: Improvising solr queries
Hi - Something doesn't make sense to me here: On Mon, Jan 4, 2010 at 5:55 AM, dipti khullar wrote: > - optimize runs on master every 7 minutes > - using postOptimize, we execute snapshooter on master > - snappuller/snapinstaller on 2 slaves runs every 10 minutes > > Why would you optimize every 7 minutes, and update the slaves every ten? After 70 minutes you'll be doing both at the same time. How about optimizing every ten minutes, at :00, :10, :20, :30, :40, :50 and then pulling every ten minutes at :01, :11, :21, :31, :41, :51 (assuming your optimize completes in one minute). Or did I misunderstand something? > The issue gets resolved as soon as we optimize the slave index. In the solr > admin, it shows only 4 requests/sec handled with 400 ms response time. > From your earlier description, it seems like you should only be distributing an optimized index, so optimizing the slave should be a no-op. Check to see what files you have on the slave after snappulling. Tom
Non-leading wildcard search
Hello, There are lots of questions and answers in the forum regarding varying wildcard behaviour, but I haven't been able to find any that address this particular behaviour. Perhaps someone could help? Problem: I have a fieldType that only goes through a KeywordTokenizer at index time, to ensure it stays 'verbatim' (i.e. it doesn't get split into any tokens, on whitespace or otherwise). Let's say there's some data stored in this field like this: 'Something', 'Something Else', 'Something Else Altogether'. When I query: "Something" or "Something Else" or "*thing" or "*omething*", I get back the expected results. If, however, I query: "Some*" or "S*" or "s*" etc, I get no results (although this type of non-leading wildcard works fine with other fieldType schema elements that don't use KeywordTokenizer). Is this something to do with KeywordTokenizer? Is there a better way to index data (preserving case) without splitting on whitespace or stemming etc. (i.e. no WhitespaceTokenizer or similar)? My fieldType schema looks like this: (I've tried a number of other combinations as well, including using class=solr.TextField.) I understand that wildcard queries don't go through analyzers, but why is it that 'tokenized' data matches on non-leading wildcard queries, whereas non-tokenized (or more specifically Keyword-Tokenized) data doesn't? The fieldType schema requires some tokenizer class, and it appears that KeywordTokenizer is the only one that tokenizes to a token count of 1 (i.e. the whole string). I'm sure I'm missing something that is probably reasonably obvious, but having tried myriad combinations, I thought it prudent to ask the experts in the forum. Many thanks for any insight you can provide on this. Peter
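For context, a fieldType of the kind described would look roughly like this sketch (the schema XML did not survive above; the type name is made up):

  <fieldType name="text_verbatim" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
  </fieldType>

KeywordTokenizerFactory emits the whole input as a single token, so case and whitespace are preserved verbatim.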
Re: Non-leading wildcard search
On Mon, Jan 4, 2010 at 5:38 PM, Peter S wrote: > When I query: "Something" or "Something Else" or "*thing" or "*omething*", > I get back the expected results. > If, however, I query: "Some*" or "S*" or "s*" etc, I get no results (although > this type of non-leading wildcard works fine with other fieldType schema > elements that don't use KeywordTokenizer). Is your query string actually in quotes? Wildcards aren't currently supported in quotes. So text_verbatim:Some* should work. -Yonik http://www.lucidimagination.com
RE: Non-leading wildcard search
Hi Yonik, Thanks for your quick reply. No, the queries themselves aren't in quotes. Since I sent the initial email, I have managed to get non-leading wildcard queries to work with this, but by unexpected means (for me at least :-). If I add a LowerCaseFilterFactory to the fieldType, queries like s* (or S*) work as expected. So the fieldType schema element now looks like: positionIncrementGap="100"> ... ignoreCase="true" expand="true"/> ... I wasn't expecting this, as I would have thought this would change only the case behaviour, not the wildcard behaviour (or at least not just the non-leading wildcard behaviour). Perhaps I'm just not understanding how the term (just one in this case, as it's not tokenized) is indexed and subsequently matched. What I've noticed is that with the LowerCaseFilterFactory in place, document queries return results with case intact, but facet queries show the results in lower-case (e.g. document->appname=Something, facet.field.appname=something). (I kind of expected the document->appname field to be lower case as well.) Does this sound like correct behaviour to you? If it's correct, that's ok, I'll manage to work 'round it (maybe there's a way to map the facet field back to the document field?), but if it sounds wrong, perhaps it warrants further investigation. Many thanks, Peter > Date: Mon, 4 Jan 2010 17:42:30 -0500 > Subject: Re: Non-leading wildcard search > From: yo...@lucidimagination.com > To: solr-user@lucene.apache.org > > On Mon, Jan 4, 2010 at 5:38 PM, Peter S wrote: > > When I query: "Something" or "Something Else" or "*thing" or > > "*omething*", I get back the expected results. > > If, however, I query: "Some*" or "S*" or "s*" etc, I get no results > > (although this type of non-leading wildcard works fine with other fieldType > > schema elements that don't use KeywordTokenizer). > > Is your query string actually in quotes? Wildcards aren't currently > supported in quotes. > So text_verbatim:Some* should work. > > -Yonik > http://www.lucidimagination.com
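Judging from the fragments that survive above (positionIncrementGap="100", a synonym filter with ignoreCase="true" expand="true"), the working version probably looks something like this sketch, with the lower-case filter added (the synonyms file name is an assumption):

  <fieldType name="text_verbatim" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

The facet values come back lower-cased because faceting reads the indexed terms, while stored field values are returned untouched; that explains the document/facet mismatch described above.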
RE: Non-leading wildcard search
FYI: I have found the root of this behaviour. It has to do with a test patch I've been working on for working 'round pre-SOLR-219 behaviour (case insensitive wildcard searching). With the test patch switched out, it works as expected, although the case insensitive wildcard search reverts to pre-SOLR-219 behaviour. I believe I can work 'round this by using a copyField that holds the lower-case text for wildcarding. Many thanks, Yonik, for your help. Peter > From: pete...@hotmail.com > To: solr-user@lucene.apache.org > Subject: RE: Non-leading wildcard search > Date: Mon, 4 Jan 2010 23:29:04 + > > Hi Yonik, > > Thanks for your quick reply. > > No, the queries themselves aren't in quotes.
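A sketch of that copyField arrangement (all names made up): the original field keeps its case for display and faceting, while the copy gets a lower-casing analyzer for wildcard matching:

  <field name="appname"    type="text_verbatim"    indexed="true" stored="true"/>
  <field name="appname_lc" type="text_verbatim_lc" indexed="true" stored="false"/>
  <copyField source="appname" dest="appname_lc"/>

Queries like appname_lc:some* would then match case-insensitively, while facet.field=appname still shows the original casing.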
Re: Indexing the latests MS Office documents
You must have been searching old documentation - I think Tika 0.3+ has support for the new MS formats. But don't take my word for it - why don't you build Tika and try it? -Peter On Sun, Jan 3, 2010 at 7:00 PM, Roland Villemoes wrote: > Hi All, > > Anyone who knows how to index the latest MS office documents like .docx and > .xlsx ? > > From searching it seems like Tika only supports the earlier formats .doc and > .xls > > > > med venlig hilsen/best regards > > Roland Villemoes > Tel: (+45) 22 69 59 62 > E-Mail: mailto:r...@alpha-solutions.dk > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Improvising solr queries
On 1/5/10 12:46 AM, Shalin Shekhar Mangar wrote: > (sitename:XYZ OR sitename:"All Sites") AND (localeid:1237400589415) AND > ((assettype:Gallery)) AND (rbcategory:"ABC XYZ" ) AND (startdate:[* TO > 2009-12-07T23:59:00Z] AND enddate:[2009-12-07T00:00:00Z TO > *])&rows=9&start=63&sort=date desc&facet=true&facet.field=assettype&facet.mincount=1 > > Similar to this query we have several much more complex queries supporting all > major landing pages of our application. > > Just want to confirm whether anyone can identify any major flaws or > issues in the sample query? I'm not the expert Shalin is, but I seem to remember sorting by date was pretty rough on CPU (this could have been resolved since I last looked at it). The other thing I'd question is the facet: it looks like you're only retrieving a single assettype (Gallery), so you will only get a single facet value back. If that's the case, wouldn't the rows returned (which is part of the response) give you the same answer? > Most of those AND conditions can be separate filter queries. Filter queries > can be cached separately and can therefore be re-used. See > http://wiki.apache.org/solr/FilterQueryGuidance
Listing Terms by Ascending IDF value . . ?
Hello, I am trying to get a list of highly unusual terms or phrases (for example a TF of 1 or 2) within an entire index (essentially this would be the inverse of how Luke gives 'top terms' on the 'Overview' tab). I see how I can do this within a specific query using the Term Vector Component (qt=tvrh). But do I have to write my own analyzer to get a list for the complete index in ascending order? Most grateful for any thoughts or insights, Christopher
Re: Rules engine and Solr
Thanks for the response, Shalin. I am still in two minds over doing it "inside" Solr versus "outside". I'll get back with more questions, if any. Cheers Avlesh On Mon, Jan 4, 2010 at 5:11 PM, Shalin Shekhar Mangar < shalinman...@gmail.com> wrote: > On Mon, Jan 4, 2010 at 10:24 AM, Avlesh Singh wrote: > > > I have a Solr (version 1.3) powered search server running in production. > > Search is keyword driven is supported using custom fields and tokenizers. > > > > I am planning to build a rules engine on top search. The rules are > database > > driven and can't be stored inside solr indexes. These rules would > > ultimately > > two do things - > > > > 1. Change the order of Lucene hits. > > > > A Lucene FieldComparator is what you'd need. The QueryElevationComponent > uses this technique. > > > > 2. Add/remove some results to/from the Lucene hits. > > > > > This is a bit more tricky. If you will always have a very limited number of > docs to add or remove, it may be best to change the query itself to include > or exclude them (i.e. add fq). Otherwise you'd need to write a custom > Collector (see DocSetCollector) and change SolrIndexSearcher to use it. We > are planning to modify SolrIndexSearcher to allow custom collectors soon > for > field collapsing but for now you will have to modify it. > > > > What should be my starting point? Custom search handler? > > > > > A custom SearchComponent which extends/overrides QueryComponent will do the > job. > > -- > Regards, > Shalin Shekhar Mangar. >
Re: Improvising solr queries
Hey Ian This assettype is variable. It can have around 6 values at a time. But this is true that we apply facet mostly on just one field - assettype. Any idea if the use of date range queries is expensive? Also if Shalin can put in some comments on "sorting by date was pretty rough on CPU", I can start analyzing sort by date specific queries. Will look into suggestions/queries by Tom and Shalin and then post the findings. Thanks Dipti On Tue, Jan 5, 2010 at 9:17 AM, Ian Holsman wrote: > On 1/5/10 12:46 AM, Shalin Shekhar Mangar wrote: > >> sitename:XYZ OR sitename:"All Sites") AND (localeid:1237400589415) AND >>> > ((assettype:Gallery)) AND (rbcategory:"ABC XYZ" ) AND (startdate:[* >>> TO >>> > 2009-12-07T23:59:00Z] AND enddate:[2009-12-07T00:00:00Z TO >>> > *])&rows=9&start=63&sort=date >>> > desc&facet=true&facet.field=assettype&facet.mincount=1 >>> > >>> > Similar to this query we have several much complex queries supporting >>> all >>> > major landing pages of our application. >>> > >>> > Just want to confirm that whether anyone can identify any major flaws >>> or >>> > issues in the sample query? >>> > >>> > >>> >>> >> I'm not the expert Shalin is, but I seem to remember sorting by date was > pretty rough on CPU. (this could have been resolved since I last looked at > it) > > the other thing I'd question is the facet. it looks like your only > retrieving a single assetType (Gallery). > so you will only get a single field back. if thats the case, wouldn't the > rows returned (which is part of the response) > give you the same answer ? > > > Most of those AND conditions can be separate filter queries. Filter >> queries >> can be cached separately and can therefore be re-used. See >> http://wiki.apache.org/solr/FilterQueryGuidance >> >> >> > >
Re: Improvising solr queries
On Tue, Jan 5, 2010 at 11:16 AM, dipti khullar wrote: > > This assettype is variable. It can have around 6 values at a time. > But this is true that we apply facet mostly on just one field - assettype. > > Ian has a good point. You are faceting on assettype and you are also filtering on it so you will get only one facet value "Gallery" with a count equal to numFound. > Any idea if the use of date range queries is expensive? Also if Shalin can > put in some comments on > "sorting by date was pretty rough on CPU", I can start analyzing sort by > date specific queries. > > This is a range search and not a sort. I don't know if range search on dates is especially costly compared to a range search on any other type. But I do know that trie fields in Solr 1.4 are much faster for range searches at the cost of more tokens in the index. With a date field, instead of using NOW, you should always try to round it down to the coarsest interval you can use. So if it is possible to use NOW/DAY instead of NOW, you should do that. The problem with querying on NOW is that it is always unique and therefore the query can never be cached (actually, it is cached but can never be hit). If you use NOW/DAY, the query can be cached for a day. -- Regards, Shalin Shekhar Mangar.
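For example, the date clauses from the query earlier in this thread could be written with rounded date math so they stay cacheable for a whole day (a sketch, assuming day granularity is acceptable):

  fq=startdate:[* TO NOW/DAY+1DAY]&fq=enddate:[NOW/DAY TO *]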
Re: Listing Terms by Ascending IDF value . . ?
On Tue, Jan 5, 2010 at 9:15 AM, Christopher Ball < christopher.b...@metaheuristica.com> wrote: > Hello, > > I am trying to get a list of highly unusual terms or phrases (for example a > TF of 1 or 2) within an entire index (essentially this would be the inverse > of how Luke gives 'top terms' on the 'Overview' tab). > > I see how I can do this within a specific query using the Term Vector > Component (qt=tvrh). > > Did you mean TermsComponent (qt=terms)? > But do I have to write my own analyzer to get a list for the complete index > in ascending order? > > No, you don't need a custom analyzer. But TermsComponent can only sort by frequency in descending order or by index order (lexicographical order). Perhaps the patch in SOLR-1672 is more suitable for your task. -- Regards, Shalin Shekhar Mangar.
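For instance, assuming a field named 'text' and the /terms handler from the example solrconfig, terms that occur at most twice can be listed in index order with a request like:

  http://localhost:8983/solr/terms?terms.fl=text&terms.mincount=1&terms.maxcount=2&terms.sort=index&terms.limit=100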