solr error when querying.

2012-05-23 Thread watson
Here is my query:
http://127.0.0.1:/solr/JOBS/select/??q=Apache&wt=xslt&tr=example.xslt

The response I get is the following.  I have example.xslt in the /conf/xslt
path.   What is wrong here?  Thanks!


HTTP ERROR 500

Problem accessing /solr/JOBS/select/. Reason:

getTransformer fails in getContentType

java.lang.RuntimeException: getTransformer fails in getContentType
    at org.apache.solr.response.XSLTResponseWriter.getContentType(XSLTResponseWriter.java:72)
    at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:326)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:261)
    at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:129)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:59)
    at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:122)
    at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:110)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.IOException: Unable to initialize Templates 'example.xslt'
    at org.apache.solr.util.xslt.TransformerProvider.getTemplates(TransformerProvider.java:117)
    at org.apache.solr.util.xslt.TransformerProvider.getTransformer(TransformerProvider.java:77)
    at org.apache.solr.response.XSLTResponseWriter.getTransformer(XSLTResponseWriter.java:130)
    at org.apache.solr.response.XSLTResponseWriter.getContentType(XSLTResponseWriter.java:69)
    ... 23 more
Caused by: javax.xml.transform.TransformerConfigurationException: Could not compile stylesheet
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl.newTemplates(Unknown Source)
    at org.apache.solr.util.xslt.TransformerProvider.getTemplates(TransformerProvider.java:110)
    ... 26 more
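
"Could not compile stylesheet" almost always means the XSLT itself has a
syntax problem (or references something that isn't available), so it is
worth checking the file outside Solr first, e.g. with xsltproc against a
saved response:

    xsltproc conf/xslt/example.xslt response.xml

For reference, a minimal conf/xslt/example.xslt that just copies the
matched docs through looks something like this (a sketch only, not the
stock file that ships with Solr):

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="xml" indent="yes"/>
      <!-- copy each matching doc from Solr's standard XML response -->
      <xsl:template match="/">
        <results>
          <xsl:for-each select="response/result/doc">
            <xsl:copy-of select="."/>
          </xsl:for-each>
        </results>
      </xsl:template>
    </xsl:stylesheet>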


--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-error-when-querying-tp3985677.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: getTransformer error

2012-05-23 Thread watson
Anyone found a solution to the getTransformer error. I am getting the same
error.

Here is my output:


Problem accessing /solr/JOBS/select/. Reason:

getTransformer fails in getContentType

java.lang.RuntimeException: getTransformer fails in getContentType
    at org.apache.solr.response.XSLTResponseWriter.getContentType(XSLTResponseWriter.java:72)
    at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:326)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:261)
    at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:129)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:59)
    at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:122)
    at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:110)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.IOException: Unable to initialize Templates 'example.xslt'
    at org.apache.solr.util.xslt.TransformerProvider.getTemplates(TransformerProvider.java:117)
    at org.apache.solr.util.xslt.TransformerProvider.getTransformer(TransformerProvider.java:77)
    at org.apache.solr.response.XSLTResponseWriter.getTransformer(XSLTResponseWriter.java:130)
    at org.apache.solr.response.XSLTResponseWriter.getContentType(XSLTResponseWriter.java:69)
    ... 23 more
Caused by: javax.xml.transform.TransformerConfigurationException: Could not compile stylesheet
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl.newTemplates(Unknown Source)
    at org.apache.solr.util.xslt.TransformerProvider.getTemplates(TransformerProvider.java:110)
    ... 26 more


--
View this message in context: 
http://lucene.472066.n3.nabble.com/getTransformer-error-tp3047726p3985687.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to manage resource out of index?

2010-07-06 Thread Rebecca Watson
hi li,

i looked at doing something similar - where we only index the text
but retrieve search results / highlight from files -- we ended up giving
up because of the amount of customisation required in solr -- mainly
because we wanted solr's distributed search functionality, which meant
making sure the original file ended up on the same file system (i.e. the
same machine) too.

we ended up just storing the main text field as well, even though that's
quite a bit of text -- in the end solr/lucene can handle the index size
fine, and disk space is cheaper than the man-hours to customise
solr/lucene to work in this way!
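
storing it just means the main text field is declared with stored="true"
in schema.xml -- something like this (field/type names are illustrative
only):

    <field name="text" type="text" indexed="true" stored="true"/>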

that was our conclusion anyway and it works fine -- we also have
separate index / search server(s) so we don't care about merge time
either -- and as i said above, we use distributed search so we don't tend
to need to merge very large indexes anyway.
when your system grows / you go into production you'll probably split
the indexes too and use solr's distributed search, for the sake of
query speed.

hope that helps,

bec :)

On 7 July 2010 14:07, Li Li  wrote:
> I used to store the full text in the lucene index. But I found it's very
> slow when merging the index, because when merging 2 segments it copies
> the fdt files into a new one. So I want to only index the full text. But
> when searching I need the full text for applications such as highlighting
> and viewing the full text. I can store the full text as url/full-text
> pairs in a database and load it into memory. And when I search in lucene
> (or solr), I retrieve the url of the doc first, then use the url to get
> the full text. But when they are stored separately, they are hard to
> manage. They may not be consistent with each other. Does lucene or solr
> provide any method to ease this problem? Or does anyone have some
> experience of this problem?
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Faceting unknown fields

2010-07-08 Thread Rebecca Watson
hi,

> So, can I index and facet these fields, without describing them in my schema?
>
> I will first try with dynamic fields, but I'm not sure it's going to work.

we do all our facet fields in this way, with just general string field
for single/multivalued
fields:
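
i.e. plain string dynamic fields for the single- and multi-valued cases,
roughly along these lines (the names here are only illustrative):

    <dynamicField name="*_facet"  type="string" indexed="true" stored="false"/>
    <dynamicField name="*_facets" type="string" indexed="true" stored="false"
                  multiValued="true"/>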

 
 

and faceting works...

but you will still need to know the specific name of the field(s) to use in the
facet.field URL parameter (i.e. as long as your UI knows!).

hope that helps

bec :)


Re: fq= "more then one" ?

2010-07-12 Thread Rebecca Watson
hi,

you shouldn't have two fq parameters -- some solr params work like
that, but fq doesn't

> http://172.20.1.33:8983/solr/select/?q=*:*&start=0&fq=EMAIL_HEADER_FROM:t...@mail.de&fq=EMAIL_HEADER_TO:t...@mail.de

you need to combine it into a single param i.e. try putting it as an
"OR" or "AND" if you're using the standard request handler:

fq=EMAIL_HEADER_FROM:t...@mail.de%20OR%20EMAIL_HEADER_TO:t...@mail.de

or put something like + if you're using dismax (i think but i don't use it :) )

hope that helps,

bec :)


Re: fq= "more then one" ?

2010-07-12 Thread Rebecca Watson
oops - i thought you couldn't put more than one - ignore my answer then :)

On 12 July 2010 17:20, Rebecca Watson  wrote:
> hi,
>
> you shouldn't have two fq parameters -- some solr params work like
> that, but fq doesn't
>
>> http://172.20.1.33:8983/solr/select/?q=*:*&start=0&fq=EMAIL_HEADER_FROM:t...@mail.de&fq=EMAIL_HEADER_TO:t...@mail.de
>
> you need to combine it into a single param i.e. try putting it as an
> "OR" or "AND" if you're using the standard request handler:
>
> fq=EMAIL_HEADER_FROM:t...@mail.de%20or%20email_header_to:t...@mail.de
>
> or put something like + if you're using dismax (i think but i don't use it :) 
> )
>
> hope that helps,
>
> bec :)
>
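
For reference, Solr does accept repeated fq parameters: each fq is applied
(and cached) as an independent filter and the final result set is their
intersection -- so the original URL with two fq params is fine as posted, e.g.:

    ...&q=*:*&fq=EMAIL_HEADER_FROM:t...@mail.de&fq=EMAIL_HEADER_TO:t...@mail.de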


Re: Problem with Wildcard searches in Solr

2010-07-13 Thread Rebecca Watson
Hi,

earlier this week i started messing with getting wildcard queries to
be analysed.

i've got some weird analysers doing stemming/lowercasing, and writing
the same rules into a custom queryparser didn't seem logical given
i just want the analysers to apply as they do at index time.

i came up with the hack below, which is just a modified version of
the LuceneQParserPlugin i.e. the solr default one which creates
a SolrQueryParser query parser.

in the SolrQueryParser I override the "getWildcardQuery" function so
that I insert a call to my method - "myWildcardQuery".

the myWildcardQuery method converts the wildcard term into an analysed
version which it returns (and at least lowercases the term if analysis
fails for some reason).

the myWildcardQuery method is just pulling in code from
lucene's QueryParser.getFieldQuery -- so all this code is a magical giant
cut and paste job right now (which you'll see when you look at the
lucene/solr classes involved!)

you use this custom queryparser in the usual way i.e.
by registering the queryparser in the solrconfig.xml file:
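
something like the following, assuming the plugin class shown below is on
solr's classpath (the name/class are taken from the attached code and the
handler config):

    <queryParser name="ilexirQparser" class="ilexirQParserPlugin"/>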

then call that queryparser in your request handler:


 
 ilexirQparser
   explicit
 10
  0
   *,score
  2.2
  standard
  on
 
 
 spellcheck
 tvComponent

  

i enable the leading wildcard queries using the reversedwildcard filter as per
previous email i.e. in index-time analyser add in:
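
e.g. (attribute values here are illustrative):

    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"/>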

(not at query time) -- then the lucene query parser picks up the use of this
filter and allows leading wildcard queries.

of course, none of this is going to sort out trying to match against the query
"co?mput?r" because you've probably stemmed "computer" to "comput" or something
at index time -- but if you add in a copyfield to an extra field that
isn't stemmed
at query time, then query both the original + the non-stemmed field (boost
accordingly -- i.e. you might want to boost the original non-stemmed field
higher!) you'll get the right match then :)

i'd be interested to hear from lucene/solr contributors why wildcards aren't
analysed in general anyway?

anyway hope that helps :)

bec

--



import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.reverse.ReverseStringFilter;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;
import org.apache.solr.analysis.ReversedWildcardFilterFactory;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.LuceneQParserPlugin;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QueryParsing;
import org.apache.solr.search.SolrQueryParser;

/**
 * modifies the code from LuceneQParserPlugin i.e. the default query parser
 * plugin used by solr.
 * @author bec
 */
public class ilexirQParserPlugin extends LuceneQParserPlugin {
    public static String NAME = "lucene";

    public void init(NamedList args) {
    }

    public QParser createParser(String qstr, SolrParams localParams,
            SolrParams params, SolrQueryRequest req) {
        return new ilexirQParser(qstr, localParams, params, req);
    }
}

class ilexirQParser extends QParser {
    String sortStr;
    SolrQueryParser lparser;

    public ilexirQParser(String qstr, SolrParams localParams,
            SolrParams params, SolrQueryRequest req) {
        super(qstr, localParams, params, req);
    }

    public Query parse() throws ParseException {
        String qstr = getString();

        String defaultField = getParam(CommonParams.DF);
        if (defaultField == null) {
            defaultField = getReq().getSchema().getDefaultSearchFieldName();
        }
        lparser = new SolrQueryParser(this, defaultField) {

            /**
             * adapted from lucene's QueryParser.getFieldQuery !!
             *
             * @param field
             * @param termStr
             */
            private String myWildcardQuery(String field, String termStr) {
                System.out.println("ILEXIR: ORIGINAL WILDCARD QUERY:" + termStr);
                // get the corresponding analyser - this one is

Re: Problem with Wildcard searches in Solr

2010-07-13 Thread Rebecca Watson
hi,

sorry realised i had a typo:

> of course, non of this is going to sort out trying to match against the query
> "co?mput?r" because you've probably stemmed "computer" to "comput" or 
> something
> at index time -- but if you add in a copyfield to an extra field that
> isn't stemmed
> at query time, then query both the original + the non-stemmed field (boost
> accordingly -- i.e. you might want to boost the original non-stemmed field
> higher!) you'll get the right match then :)
>

should read - "but if you add in a copyfield to an extra field that
isn't stemmed at index time"

bec :)


Re: Locked Index files

2010-07-13 Thread Rebecca Watson
shut down your solr server first... if it's not important! :)

On 13 July 2010 16:47, ZAROGKIKAS,GIORGOS  wrote:
> I found it but I can not delete
> Any suggestion???
>
> -Original Message-
> From: Yuval Feinstein [mailto:yuv...@answers.com]
> Sent: Tuesday, July 13, 2010 11:39 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Locked Index files
>
> Hi Giorgos.
> Try looking for write.lock files and deleting them.
> Cheers,
> Yuval
>
> -Original Message-
> From: ZAROGKIKAS,GIORGOS [mailto:g.zarogki...@multirama.gr]
> Sent: Tuesday, July 13, 2010 11:28 AM
> To: solr-user@lucene.apache.org
> Subject: Locked Index files
>
> Hi
>        My solr Index files are locked and I can’t index anything
>        How can I remove the lock file ?
>        I can’t delete it
>
>
>
>


faceting over field not in all documents

2010-07-13 Thread Rebecca Watson
hi,

has anyone had experience with faceting over a field where the field
is not present
in all documents within the index?

i'm hoping that -- faceting simply calculates+returns the counts for docs that
have the field present while results may still contain documents that don't
have the facet field (i.e. the field faceted on)?

thanks for any help, i guess if no one has tried this i'll let you know :)

bec :)


Re: faceting over field not in all documents

2010-07-13 Thread Rebecca Watson
brilliant! thanks very much for your help :)

On 13 July 2010 21:47, Jonathan Rochkind  wrote:
>> i'm hoping that -- faceting simply calculates+returns the counts for docs 
>> that
>> have the field present while results may still contain documents that don't
>> have the facet field (i.e. the field faceted on)?
>
> Yes, that's exactly what happens. You can use facet.missing to get a count 
> for documents with no value in the facet field too, if you want.
>
> Jonathan
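
(concretely: adding facet.missing=true to a request, e.g.
&facet=true&facet.field=category&facet.missing=true, returns one extra
unlabelled count for the documents that have no value in that field)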


Re: Error in building Solr-Cloud (ant example)

2010-07-15 Thread Rebecca Watson
hi mark,

jayf and i are working together :)

i tried to apply the patch to the trunk, but the ant tests failed...

i checked out the latest trunk:
svn checkout http://svn.apache.org/repos/asf/lucene/dev/trunk

patched it with SOLR-1873, and put the two JARs into trunk/solr/lib

ant compile in the top level trunk directory worked fine, but ant test
had a few errors.

the first error was:

[junit] Testsuite: org.apache.solr.cloud.BasicZkTest
[junit] Testcase: testBasic(org.apache.solr.cloud.BasicZkTest):  Caused an ERROR
[junit] maxClauseCount must be >= 1
[junit] java.lang.IllegalArgumentException: maxClauseCount must be >= 1
[junit]     at org.apache.lucene.search.BooleanQuery.setMaxClauseCount(BooleanQuery.java:62)
[junit]     at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:131)
[junit]     at org.apache.solr.util.AbstractSolrTestCase.tearDown(AbstractSolrTestCase.java:182)
[junit]     at org.apache.solr.cloud.AbstractZkTestCase.tearDown(AbstractZkTestCase.java:135)
[junit]     at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:277)
[junit]


after this, tests passed until there were a lot of errors with this output:

[junit] - Standard Error -
[junit] Jul 15, 2010 3:00:53 PM org.apache.solr.handler.SnapPuller
fetchLatestIndex
[junit] SEVERE: Master at:
http://localhost:TEST_PORT/solr/replication is not available. Index
fetch failed. Exception: Invalid uri
'http://localhost:TEST_PORT/solr/replication': invalid port number

followed by a final message:
[junit] SEVERE: Master at: http://localhost:57146/solr/replication is
not available. Index fetch failed. Exception: Connection refused

a few more tests passed... then at the end:

BUILD FAILED
/Users/iwatson/work/solr/trunk/build.xml:31: The following error
occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:395:
The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:477: Tests failed!
The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:477: Tests failed!
The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:477: Tests failed!
The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:477: Tests failed!
The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:477: Tests failed!
The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:477: Tests failed!
The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:477: Tests failed!
The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:477: Tests failed!

are these errors currently expected (i.e. issues being sorted) or
does it look like i'm doing something wrong/stupid!?

thanks for your help

bec :)

On 5 July 2010 04:34, Mark Miller  wrote:
> Hey jayf -
>
> Offhand I'm not sure why you are having these issues - last I knew, a
> couple people had had success with the cloud branch. Cloud has moved on
> from that branch really though - we probably should update the wiki
> about that. More important, though, is that I need to get Cloud
> committed to trunk!
>
> I've been saying it for a while, but I'm going to make a strong effort
> to wrap up the final unit test issue (apparently a testing issue, not
> cloud issue) and get this committed for further iterations.
>
> The way to follow along with the latest work is to go to :
> https://issues.apache.org/jira/browse/SOLR-1873
>
> The latest patch there should apply to recent trunk.
>
> I've scheduled a bit of time to work on getting this committed this
> week, fingers crossed.
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
> On 7/4/10 3:37 PM, jayf wrote:
>>
>> Hi there,
>>
>> I'm having a trouble installing Solr Cloud. I checked out the project, but
>> when compiling ("ant example" on OSX) I get compile a error (cannot find
>> symbol - pasted below).
>>
>> I also get a bunch of warnings:
>>     [javac] Note: Some input files use or override a deprecated API.
>>     [javac] Note: Recompile with -Xlint:deprecation for details.
>> I have tried both Java 1.5 and 1.6.
>>
>>
>> Before I got to this point, I was having problems with the included
>> ZooKeeper jar (java versioning issue) - so I had to download the source and
>> build this. Now 'ant' gets a bit further, to the stage listed above.
>>
>> Any idea of the problem??? THANKS!
>>
>>     [javac] Compiling 438 source files to
>> /Volumes/newpart/solrcloud/cloud/build/solr
>>     [javac]
>> /Volumes/newpart/solrcloud/cloud/src/java/org/apache/solr/cloud/ZkController.java:588:
>> cannot find symbol
>>     [javac] symbol  : method stringPropertyNames()
>>     [javac] location: class java.util.Properties

Re: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

2010-07-16 Thread Rebecca Watson
hi,

would faceting work?
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr

if you have a field for rootId that is multivalued + facet on it -- you'll get
value+count pairs back (top 100 i think by default)
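
i.e. something along these lines (parameter values are only illustrative):

    .../solr/select?q=<your query>&rows=0&facet=true&facet.field=rootId&facet.limit=-1

facet.limit=-1 returns every distinct value, so the number of value/count
pairs that come back is your distinct-rootId count.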

bec :)

On 16 July 2010 16:07, Ninad Raut  wrote:
> Hi,
>
> I have a scenario in which I have to find count of distinct unique IDs
> present in a field (rootId field in my case) for a particular query.
>
> I require this for pagination purpose.
>
> Is there a way in Solr to do something like this we do in SQL:
>
> select count(distinct(rootId))
> from table
> where (the query part).
>
>
> Regards,
> Ninad R
>


Indexing Hanging during GC?

2010-08-12 Thread Rebecca Watson
Hi,

When indexing large amounts of data I hit a problem whereby Solr
becomes unresponsive and doesn't recover (even when left overnight!).
I think i've hit some GC problems / some GC tuning is required, and I
wanted to know if anyone has ever hit this problem. I can replicate
this error (albeit taking longer to do so) using stock Solr/Lucene
analysers only, so I thought other people might have hit this issue
before over large data sets.

Background on my problem follows -- but I guess my main question is -- can Solr
become so overwhelmed by update posts that it becomes completely unresponsive??

Right now I think the problem is that the java GC is hanging but I've
been working
on this all week and it took a while to figure out it might be
GC-based / wasn't a
direct result of my custom analysers so i'd appreciate any advice anyone has
about indexing large document collections.

I also have a second questions for those in the know -- do we have a chance
of indexing/searching over our large dataset with what little hardware
we already
have available??

thanks in advance :)

bec

a bit of background: ---

I've got a large collection of articles we want to index/search over
-- about 180k
in total. Each article has say 500-1000 sentences and each sentence has about
15 fields, many of which are multi-valued and we store most fields as well for
display/highlighting purposes. So I'd guess over 100 million index documents.

In our small test collection of 700 articles this results in a single index of
about 13GB.

Our pipeline processes PDF files through to Solr native XML, which we call
"index.xml" files, i.e. in Solr's <add><doc>...</doc></add> format, ready
to post straight to Solr's update handler.

We create the index.xml files as we pull in information from
a few sources and creation of these files from their original PDF form is
farmed out across a grid and is quite time-consuming so we distribute this
process rather than creating index.xml files on the fly...

We do a lot of linguistic processing and to enable search functionality
of our resulting terms requires analysers that split terms/ join terms together
i.e. custom analysers that perform string operations and are quite
time-consuming/
have large overhead compared to most analysers (they take approx
20-30% more time
and use twice as many short-lived objects than the "text" field type).

Right now i'm working on my new Imac:
quad-core 2.8 GHz intel Core i7
16 GB 1067 MHz DDR3 RAM
2TB hard-drive (about half free)
Version 10.6.4 OSX

Production environment:
2 linux boxes each with:
8-core Intel(R) Xeon(R) CPU @ 2.00GHz
16GB RAM

I use java 1.6 and Solr version 1.4.1 with multi-cores (a single core
right now).

I setup Solr to use autocommit as we'll have several document collections / post
to Solr from different data sets:

 

  50 
  90 
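
i.e., assuming the usual autoCommit elements (element names reconstructed
here; the numeric values may also have been truncated along with the tags):

    <autoCommit>
      <maxDocs>50</maxDocs>
      <maxTime>90</maxTime>
    </autoCommit>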


I also have
  false
1024
10
-

*** First question:
Has anyone else found that Solr hangs/becomes unresponsive after too
many documents are indexed at once i.e. Solr can't keep up with the post rate?

I've got LCF crawling my local test set (file system connection
required only) and posting documents to Solr using 6GB of RAM. As I said
above, these documents are in native Solr XML format (<add> files) with
one file per article, so each file contains all the sentence-level
documents for the article.

With LCF I post about 2.5-3k articles (files) per hour -- so about
2.5k*500/3600 = ~350 docs per second post-rate -- is this normal/expected??

Eventually, after about 3000 files (an hour or so) Solr starts to hang/becomes
unresponsive and with Jconsole/GC logging I can see that the Old-Gen space is
about 90% full and the following is the end of the solr log file-- where you
can see GC has been called:
--
3012.290: [GC Before GC:
Statistics for BinaryTreeDictionary:

Total Free Space: 53349392
Max   Chunk Size: 3200168
Number of Blocks: 66
Av.  Block  Size: 808324
Tree  Height: 13
Before GC:
Statistics for BinaryTreeDictionary:

Total Free Space: 0
Max   Chunk Size: 0
Number of Blocks: 0
Tree  Height: 0
3012.290: [ParNew (promotion failed): 143071K->142663K(153344K),
0.0769802 secs]3012.367: [CMS
--

I can replicate this with Solr using "text" field types in place of
those that use my
custom analysers -- whereby Solr takes longer to become unresponsive (about
3 hours / 13k docs) but there is the same kind of GC message at the end
 of the log file / Jconsole shows that the Old-Gen space was almost full so was
due for a collection sweep.

I don't use any special GC settings but found an article here:
http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/

that suggests using particular GC settings for Solr -- I will try
these but thought
someone else could suggest anoth

Re: Analysing SOLR logfiles

2010-08-12 Thread Rebecca Watson
we've just started using awstats - as suggested by the solr 1.4 book.

it's open source!:
http://awstats.sourceforge.net/

On 12 August 2010 18:18, Jay Flattery  wrote:
> Thanks - splunk looks overkill.
> We're extremely small scale - were hoping for something open source :-)
>
>
> - Original Message 
> From: Jan Høydahl / Cominvent 
> To: solr-user@lucene.apache.org
> Sent: Wed, August 11, 2010 11:14:37 PM
> Subject: Re: Analysing SOLR logfiles
>
> Have a look at www.splunk.com
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 11. aug. 2010, at 19.34, Jay Flattery wrote:
>
>> Hi there,
>>
>>
>> Just wondering what tools people use to analyse SOLR log files.
>>
>> We're looking to do things like extracting common queries, calculating
>> average QTime and hits, returning particularly slow/expensive queries, etc.
>>
>> Would prefer not to code something (completely) from scratch.
>>
>> Thanks!
>>
>>
>>
>>
>
>
>
>
>


Re: Indexing Hanging during GC?

2010-08-12 Thread Rebecca Watson
sorry -- i used the term "documents" too loosely!

180k scientific articles with between 500-1000 sentences each
and we index sentence-level index documents
so i'm guessing about 100 million lucene index documents in total.

an update on my progress:

i used GC settings of:
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
-XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
-XX:CMSInitiatingOccupancyFraction=70

which allowed the indexing process to run to 11.5k articles and
for about 2 hours before I got the same kind of hanging/unresponsive Solr, with
this as the tail of the solr logs:

Before GC:
Statistics for BinaryTreeDictionary:

Total Free Space: 2416734
Max   Chunk Size: 2412032
Number of Blocks: 3
Av.  Block  Size: 805578
Tree  Height: 3
5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.193 secs]5980.480: [CMS

I also saw (in jconsole) that the number of threads rose from the
steady 32 used for the
2 hours to 72 before Solr finally became unresponsive...

i've got the following GC info params switched on (as many as i could find!):
-XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
-XX:PrintFLSStatistics=1

with 11.5k docs in about 2 hours this was 11.5k * 500 / 2 = 2.875
million fairly small
docs per hour!! this produced an index of about 40GB to give you an
idea of index
size...

because i've already got the documents in solr native xml format
i.e. one file per article, each containing all of that article's sentence
docs -- i.e. posting each set of sentence docs per article in every LCF
file post -- this means that LCF can throw documents at Solr very fast and
i think i'm breaking it GC-wise.

i'm going to try adding in System.gc() calls to see if this runs ok
(albeit slower)...
otherwise i'm pretty much at a loss as to what could be causing this GC issue/
solr hanging if it's not a GC issue...

thanks :)

bec

On 12 August 2010 21:42, dc tech  wrote:
> I am a little confused - how did 180k documents become 100m index documents?
> We use have over 20 indices (for different content sets), one with 5m
> documents (about a couple of pages each) and another with 100k+ docs.
> We can index the 5m collection in a couple of days (limitation is in
> the source) which is 100k documents an hour without breaking a sweat.
>
>
>
> On 8/12/10, Rebecca Watson  wrote:
>> Hi,
>>
>> When indexing large amounts of data I hit a problem whereby Solr
>> becomes unresponsive
>> and doesn't recover (even when left overnight!). I think i've hit some
>> GC problems/tuning
>> is required of GC and I wanted to know if anyone has ever hit this problem.
>> I can replicate this error (albeit taking longer to do so) using
>> Solr/Lucene analysers
>> only so I thought other people might have hit this issue before over
>> large data sets
>>
>> Background on my problem follows -- but I guess my main question is -- can
>> Solr
>> become so overwhelmed by update posts that it becomes completely
>> unresponsive??
>>
>> Right now I think the problem is that the java GC is hanging but I've
>> been working
>> on this all week and it took a while to figure out it might be
>> GC-based / wasn't a
>> direct result of my custom analysers so i'd appreciate any advice anyone has
>> about indexing large document collections.
>>
>> I also have a second questions for those in the know -- do we have a chance
>> of indexing/searching over our large dataset with what little hardware
>> we already
>> have available??
>>
>> thanks in advance :)
>>
>> bec
>>
>> a bit of background: ---
>>
>> I've got a large collection of articles we want to index/search over
>> -- about 180k
>> in total. Each article has say 500-1000 sentences and each sentence has
>> about
>> 15 fields, many of which are multi-valued and we store most fields as well
>> for
>> display/highlighting purposes. So I'd guess over 100 million index
>> documents.
>>
>> In our small test collection of 700 articles this results in a single index
>> of
>> about 13GB.
>>
>> Our pipeline processes PDF files through to Solr native xml which we call
>> "index.xml" files i.e. in ... format ready to post straight to
>> Solr's
>> update handler.
>>
>> We create the index.xml files as we pull in information from
>> a few sources and creation of these files from their original PDF form is
>> farmed out across a grid and is quite time-consuming so we distribute this
>>

Re: Indexing Hanging during GC?

2010-08-12 Thread Rebecca Watson
hi,

> 1) I assume you are doing batching interspersed with commits

as each file I crawl is article-level, each one contains all the
sentences for the article, so they are naturally batched into about
500 documents per post in LCF.

I use auto-commit in Solr:

 50 
 90 
   

> 2) Why do you need sentence level Lucene docs?

that's an application specific need due to linguistic info needed on a
per-sentence
basis.

> 3) Are your custom handlers/parsers a part of SOLR jvm? Would not be
> surprised if you a memory/connection leak their (or it is not
> releasing some resource explicitly)

I thought this could be the case too -- but if I replace the use of my custom
analysers and specify my fields are of type "text" instead (from the standard
schema.xml, i.e. using solr-based analysers) then I get this kind of hanging
too -- at least it did when I didn't have any explicit GC settings... it does
take longer to replicate as my analysers/field types are more complex than
the "text" field type.

i will try it again with the different GC settings tomorrow and post
the results.

> In general, we have NEVER had a problem in loading Solr.

i'm not sure if we would either if we posted as we created the
index.xml files...
but because we post 500+ documents at a time (one article file per LCF post)
and LCF can post these files quickly, i'm not sure if I need to try and slow
down the post rate!?

thanks for your replies,

bec :)

> On 8/12/10, Rebecca Watson  wrote:
>> sorry -- i used the term "documents" too loosely!
>>
>> 180k scientific articles with between 500-1000 sentences each
>> and we index sentence-level index documents
>> so i'm guessing about 100 million lucene index documents in total.
>>
>> an update on my progress:
>>
>> i used GC settings of:
>> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
>>       -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
>> -XX:CMSInitiatingOccupancyFraction=70
>>
>> which allowed the indexing process to run to 11.5k articles and
>> for about 2hours before I got the same kind of hanging/unresponsive Solr
>> with
>> this as the tail of the solr logs:
>>
>> Before GC:
>> Statistics for BinaryTreeDictionary:
>> 
>> Total Free Space: 2416734
>> Max   Chunk Size: 2412032
>> Number of Blocks: 3
>> Av.  Block  Size: 805578
>> Tree      Height: 3
>> 5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.193 secs]5980.480:
>> [CMS
>>
>> I also saw (in jconsole) that the number of threads rose from the
>> steady 32 used for the
>> 2 hours to 72 before Solr finally became unresponsive...
>>
>> i've got the following GC info params switched on (as many as i could
>> find!):
>> -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>>       -XX:+PrintGCApplicationConcurrentTime 
>> -XX:+PrintGCApplicationStoppedTime
>>       -XX:PrintFLSStatistics=1
>>
>> with 11.5k docs in about 2 hours this was 11.5k * 500 / 2 = 2.875
>> million fairly small
>> docs per hour!! this produced an index of about 40GB to give you an
>> idea of index
>> size...
>>
>> because i've already got the documents in solr native xml format
>> i.e. one file per article each with ...
>> i.e. posting each set of sentence docs per article in every LCF file post...
>> this means that LCF can throw documents at Solr very fast and i think
>> i'm
>> breaking it GC-wise.
>>
>> i'm going to try adding in System.gc() calls to see if this runs ok
>> (albeit slower)...
>> otherwise i'm pretty much at a loss as to what could be causing this GC
>> issue/
>> solr hanging if it's not a GC issue...
>>
>> thanks :)
>>
>> bec
>>
>> On 12 August 2010 21:42, dc tech  wrote:
>>> I am a little confused - how did 180k documents become 100m index
>>> documents?
>>> We use have over 20 indices (for different content sets), one with 5m
>>> documents (about a couple of pages each) and another with 100k+ docs.
>>> We can index the 5m collection in a couple of days (limitation is in
>>> the source) which is 100k documents an hour without breaking a sweat.
>>>
>>>
>>>
>>> On 8/12/10, Rebecca Watson  wrote:
>>>> Hi,
>>>>
>>>> When indexing large amounts of data I hit a problem whereby Solr
>>>> becomes unresponsive
>>>> and doesn't recover (even when left overnight!). I think i've hit some
>>>> GC pr

Re: Indexing Hanging during GC?

2010-08-13 Thread Rebecca Watson
hi,

ok I have a theory about the cause of my problem -- I think java's GC
failure is due to a solr memory leak caused by overlapping auto-commit
calls -- does that sound plausible?? (ducking for cover now...)

I watched the log files and noticed that when the threads start to increase
(from a stable 32 or so up to 72 before hanging!) there are two commit calls
too close to each other + it looked like the index is in the process
of merging at the time
of the first commit call -- i.e. first was a long commit call with
merge required
then before that one finished another commit call was issued.

i think this was due to the autocommit settings I had:

     50 
     90 
   

and eventually, it seems these two different auto-commit settings
would coincide!!
a few times this seems to happen and not cause a problem -- but I think two
eventually coincide where the first one is doing something heavy-duty
like a merge
over large index segments and so the system spirals downwards

combined with the fact I was posting to Solr as fast as possible (LCF
was waiting
for Solr) --> i think this causes java to keel over and die.

Two things were noticeable in Jconsole -
1) lots of threads were spawned with the two commit calls - the thread
spawning started after the first commit call, making me think it was a
commit requiring an index merge... overall, threads went from the stable
32 used during the prior 2 hours of indexing to 72 or so within 15
minutes after the two commit calls were made...

2) both Old-gen/survivor heaps were almost totally full! so i think a
memory leak
is happening with overlapping commit calls + heavy duty lucene index processing
behind solr (like index merge!?)

So if the overlapping commit call (second commit called before first
one finished)
caused a memory leak and with old-gen/survivor heaps full
at that point, Solr became unresponsive and never recovered.

is this expected when you use both autocommit settings / if concurrent commit
calls are issued to Solr?

This explains why it was happening even if without the use of my
custom analysers
("text" field type used in place of mine) but took longer to happen
--> my analysers
are more expensive CPU/RAM-wise so the overlapping commit calls were less likely
to be forgiven as my system was already using a lot of RAM...

Also, I played with the GC settings a bit where I could find settings
that helped
to postpone this issue as they were more forgiving to the increased
RAM usage during
overlapping commit calls (GC settings with increased eden heap space).

Solr was hanging after about 14k files (each one an article with a set
of <doc>s that are each sentences in the article) with a total of about
7 million index documents.

If i switch off both auto-commit settings I can get through my
smallish 20k file set (10 million index <doc>s) in 4 hours.

I'm trying to run now on 100k articles (50 million index <doc>s within
100k files), where I use LCF to crawl/post each file to Solr, so i'll
email an update about this.

if this works ok i'm then going to try using only one auto-commit
setting rather than two and see
if this works ok.

thanks :)

bec


On 13 August 2010 00:24, Rebecca Watson  wrote:
> hi,
>
>> 1) I assume you are doing batching interspersed with commits
>
> as each file I crawl for are article-level each  contains all the
> sentences for the article so they are naturally batched into the about
> 500 documents per post in LCF.
>
> I use auto-commit in Solr:
> 
>     50 
>     90 
>   
>
>> 2) Why do you need sentence level Lucene docs?
>
> that's an application specific need due to linguistic info needed on a
> per-sentence
> basis.
>
>> 3) Are your custom handlers/parsers a part of SOLR jvm? Would not be
>> surprised if you a memory/connection leak their (or it is not
>> releasing some resource explicitly)
>
> I thought this could be the case too -- but if I replace the use of my custom
> analysers and specify my fields are of type "text" instead (from standard
> solrconfig.xml i.e. using solr-based analysers) then I get this kind of 
> hanging
> too -- at least it did when I didn't have any explicit GC settings... it does
> take longer to replicate as my analysers/field types are more complex than
> "text" field type.
>
> i will try it again with the different GC settings tomorrow and post
> the results.
>
>> In general, we have NEVER had a problem in loading Solr.
>
> i'm not sure if we would either if we posted as we created the
> index.xml format...
> but because we post 500+ documents a time (one article file per LCF post) and
> LCF can post these files quickly i'm not sure if I need to try and slow down
> the post rate!?
>
> thanks for your replies,
>
> bec :)
>
>> On 8/12/10, Rebe

tii RAM usage on startup

2010-08-18 Thread Rebecca Watson
hi,

I am running solr 1.4.1 and java 1.6 with 6GB heap and the following
GC settings:
gc_args="-XX:+UseConcMarkSweepGC
-XX:+CMSClassUnloadingEnabled   -XX:NewSize=2g -XX:MaxNewSize=2g
-XX:CMSInitiatingOccupancyFraction=60"

So 6GB total heap and 2GB allocated to eden space.

I have caching, autocommit and auto-warming commented out of
solrconfig.xml

After I index 500k docs and call commit/optimize (via URL after indexing
has completed), my RAM usage is only about 1.5GB. But if I stop
and restart my Solr server over the same data, the RAM immediately
jumps to about 4GB, and I can't understand why there is a difference
here. As this is close to the old-gen limit, I quickly find that Solr
becomes unresponsive.

The following shows that tii files are being loaded from 26MB
files to consume over 200MB in RAM when I restart the server.

is this expected?

thanks for any help/advice in advance,

bec :)

-

Rebecca-Watsons-iMac:work iwatson$ jmap -histo:live 8992 | head -30

 num #instances #bytes  class name
--
   1:  18334714 1422732624  [C
   2:  18332491  733299640  java.lang.String
   3:   6104929  244197160  org.apache.lucene.index.TermInfo
   4:   6104929  244197160  org.apache.lucene.index.TermInfo
   5:   6104929  244197160  org.apache.lucene.index.TermInfo
   6:   6104921  195357472  org.apache.lucene.index.Term
   7:   6104921  195357472  org.apache.lucene.index.Term
   8:   6104921  195357472  org.apache.lucene.index.Term
   9:   224  146527408  [J
  10:        10   48839592  [Lorg.apache.lucene.index.TermInfo;
  11:        10   48839592  [Lorg.apache.lucene.index.Term;
  12:        10   48839592  [Lorg.apache.lucene.index.TermInfo;
  13:        10   48839592  [Lorg.apache.lucene.index.TermInfo;
  14:        10   48839592  [Lorg.apache.lucene.index.Term;
  15:        10   48839592  [Lorg.apache.lucene.index.Term;
  16: 416306264728  
  17: 416305005104  
  18:  40494596352  
  19:  40493049984  
  20:  31292580040  
  21: 497132418496  
  22:  49831067192  [B
  23:  4381 806104  java.lang.Class
  24:  5979 533064  [[I
  25:  6124 438080  [S
  26:  7951 381648  java.util.HashMap$Entry
  27:  2071 375744  [Ljava.util.HashMap$Entry;
Rebecca-Watsons-iMac:work iwatson$ ls ./mach-lcf/data/data-serv-lcf/artdoc1/index/*.tii
-rw-r--r--  1 iwatson  staff    26M 18 Aug 23:44 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_36.tii
-rw-r--r--  1 iwatson  staff    26M 19 Aug 00:06 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_69.tii
-rw-r--r--  1 iwatson  staff    25M 19 Aug 00:26 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_9d.tii
-rw-r--r--  1 iwatson  staff    24M 19 Aug 00:50 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_ch.tii
-rw-r--r--  1 iwatson  staff    25M 19 Aug 01:11 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_fj.tii
-rw-r--r--  1 iwatson  staff   3.1M 19 Aug 01:12 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_fq.tii
-rw-r--r--  1 iwatson  staff   3.1M 19 Aug 01:12 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_g1.tii
-rw-r--r--  1 iwatson  staff   167B 19 Aug 01:10 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_gb.tii
-rw-r--r--  1 iwatson  staff   3.1M 19 Aug 01:11 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_gc.tii
-rw-r--r--  1 iwatson  staff   223K 19 Aug 01:23 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_gd.tii


Re: Indexing Hanging during GC?

2010-08-18 Thread Rebecca Watson
hi all,

in case anyone is having similar issues now / in the future -- here's
what I think is at least part of the problem:

once I commit the index, the RAM requirement jumps because the .tii files
are loaded in at that point, and because i have a very large number of
unique terms I use 200MB+ of RAM for every tii file (even though they are
only about 25MB on disk, the number of unique terms results in a large
memory requirement when they are loaded in). (thanks to the people on the
solr-user list answering my question on this -- search for the subject
"tii RAM usage on startup").

so when I had auto-commit on, my RAM was slowly disappearing and eventually
Solr hangs because the tii files are too big to load into memory.

the suggestion from my other thread was to try solr/lucene trunk (as
i'm using solr 1.4.1 and they have reduced the memory footprint with
flexible indexing in Lucene) OR to increase the term index interval,
so I will try one or both of these and see if this means I can increase
the number of documents I can index given my current hardware (6GB RAM),
where these docs have a lot of unique terms!
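
(in solr 1.4 the interval is set in the <indexDefaults> section of
solrconfig.xml -- e.g., raising it from the default of 128:

    <indexDefaults>
      ...
      <termIndexInterval>512</termIndexInterval>
    </indexDefaults>

a larger interval means fewer terms from each .tii are held in RAM, at
the cost of slightly slower term lookups)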

thanks :)

bec

On 13 August 2010 19:15, Rebecca Watson  wrote:
> hi,
>
> ok I have a theory about the cause of my problem -- java's GC failure
> I think is due
> to a solr memory leak caused from overlapping auto-commit calls --
> does that sound
> plausible?? (ducking for cover now...)
>
> I watched the log files and noticed that when the threads start to increase
> (from a stable 32 or so up to 72 before hanging!) there are two commit calls
> too close to each other + it looked like the index is in the process
> of merging at the time
> of the first commit call -- i.e. first was a long commit call with
> merge required
> then before that one finished another commit call was issued.
>
> i think this was due to the autocommit settings I had:
> 
>      50 
>      90 
>    
>
> and eventually, it seems these two different auto-commit settings
> would coincide!!
> a few times this seems to happen and not cause a problem -- but I think two
> eventually coincide where the first one is doing something heavy-duty
> like a merge
> over large index segments and so the system spirals downwards
>
> combined with the fact I was posting to Solr as fast as possible (LCF
> was waiting
> for Solr) --> i think this causes java to keel over and die.
>
> Two things were noticeable in Jconsole -
> 1) lots of threads were spawned with the two commit calls - the thread
> spawing started
> after the first commit call making me think it was a commit requiring
> an index merge...
> whereby threads overall went from the stable 32 used during
> indexing for the 2 hours prior to 72 or so within 15 minutes after the
> two commit calls
> were made...
>
> 2) both Old-gen/survivor heaps were almost totally full! so i think a
> memory leak
> is happening with overlapping commit calls + heavy duty lucene index 
> processing
> behind solr (like index merge!?)
>
> So if the overlapping commit call (second commit called before first
> one finished)
> caused a memory leak and with old-gen/survivor heaps full
> at that point, Solr became unresponsive and never recovered.
>
> is this expected when you use both autocommit settings / if concurrent commit
> calls are issued to Solr?
>
> This explains why it was happening even if without the use of my
> custom analysers
> ("text" field type used in place of mine) but took longer to happen
> --> my analysers
> are more expensive CPU/RAM-wise so the overlapping commit calls were less 
> likely
> to be forgiven as my system was already using a lot of RAM...
>
> Also, I played with the GC settings a bit where I could find settings
> that helped
> to postpone this issue as they were more forgiving to the increased
> RAM usage during
> overlapping commit calls (GC settings with increased eden heap space).
>
> Solr was hanging after about 14k files (each one an article with a set
> of  that
> are each sentences in the article) with a total of about
> 7 million index documents.
>
> If i switch off both auto-commit settings I can get through my
> smallish 20k file set (10 million index
> s) in 4 hours.
>
> I'm Trying to run now on 100k articles (50 million index  within
> 100k files)
> where I use LCF to crawl/post each file to Solr so i'll email an
> update about this.
>
> if this works ok i'm then going to try using only one auto-commit
> setting rather than two and see
> if this works ok.
>
> thanks :)
>
> bec
>
>
> On 13 August 2010 00:24, Rebecca Watson  wrote:
>> hi,
>>
>>> 1) I assume you are doing batching intersp

Sorting Facets by First Occurrence

2009-11-30 Thread Cory Watson
I'm working on replacing a custom, internal search implementation with
Solr.  I'm having great success, with one small exception.

When implementing our version of faceting, one of our facets had a
peculiar sort order.  It was dictated by the order in which the field
occurred in the results.  The first time a value occurred it was added
to the list and regardless of the number of times it occurred, it
always stayed at the top.

For example, if a search yielded 10 results, 1 - 10, and hit 1 is in
category 'Toys', hit 2 through 9 are in 'Sports' and the last is in
'Household' then the facet would look like:

facet.fields -> category -> [ Toys: 1, Sports: 8 Household: 1 ]

The facet.sort only gives me the option to sort highest count first or
alphabetically.

So, the question I _really_ have is: how can I implement this feature?
 I could examine the results i'm returned and create my own facet
order from it, but I thought this might be useful for others.  I don't
know my way around Solr's source, so I thought dropping a note to the
list would be faster than code spelunking with no light.

-- 
Cory 'G' Watson
http://www.onemogin.com


Negative boosts

2006-12-14 Thread Derek Watson

Hello,

I have been developing a new search application based on Solr (Very
nice!) using dismax. We are using query-time boosts to provide better
search results for user queries and index-time boosts to promote
certain documents over others.

My question is about the latter: We have a "position" field available
at index time that is an integer value, 0 being the 1st position, 1
being the 2nd position, 99 being the hundredth, etc.  How do I get
Solr to return documents in that same order? Is it possible to apply a
negative boost?

...
...
...

I notice in the documentation for parseFieldBoosts that the routine
"Doesn't care if boost info is negative, you're on your own.", but
what does that mean?

There is a desire to preserve the order as 0..xx and not reverse it
(which would be the obvious choice - x becomes 0, 0 becomes x) because
we are adding multiple sets of positions to the index, and I want the
first position of each set to be equal (zero).

This weighting is important to us for user queries as well as filtered
views. Is there another way to get what I'm looking for?

Thanks,
Derek


Re: Negative boosts

2006-12-14 Thread Derek Watson


If you want documents returned in the same order as a field, it's
easy... you sort!
If you want the value of a field to influence a score, not determine
the exact sort order, you can use FunctionQuery (currently hacked into
the query parser as _val_:myfield)


That seems like what I want -- boosting and not sorting. Is there a
function that will give me a bigger boost for field values closer to
zero?
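
Solr's function queries include recip(x,m,a,b) = a/(m*x+b), which is
largest when x is 0 and falls off as x grows; with the _val_ hook
mentioned above that would look something like this (the constants are
only illustrative):

    q=ipod _val_:"recip(position,1,1000,1000)"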


Re: Sorting Facets by First Occurrence

2009-11-30 Thread Cory G Watson

On Nov 30, 2009, at 5:15 PM, Chris Hostetter wrote:
> All of Solr's existing faceting code is based on the DocSet which is an 
> unordered set of all matching documents -- i suspect your existing 
> application only reordered the facets based on their appearance in the 
> first N docs (possibly just the first page, but maybe more) so doing 
> something like that using the DocList would certainly be feasible.  if 
> your number of facet constraints is low enough that they are all returned 
> everytime then doing it in the client is probably the easiest -- but if 
> you have to worry about facet.limit preventing something from being 
> returned that might otherwise bubble to the top of your list when you 
> reorder it then you'll need to customise the FacetComponent.


You are right, I left out a few important bits there.  Tried to be brief and 
succeeded in being vague? :)

Effectively I was ordering the facet based on the N documents in the current 
"page".  My thought that this was a good feature for a facet now seems 
incorrect, as my needs are limited to the current page, not the whole set of 
results.

I'll probably elect to fetch data from the facets based on the page of 
documents I'm showing.  Thanks for the discussion, it helped! :)

Cory G Watson
http://www.onemogin.com