Re: Accented search

2008-06-24 Thread climbingrose
Here is how I did it (the code is from memory so it might not be 100%
correct):
private boolean hasAccents;
private Token filteredToken;

public final Token next() throws IOException {
  if (hasAccents) {
    // Emit the buffered accent-stripped clone right after the original.
    hasAccents = false;
    return filteredToken;
  }
  Token t = input.next();
  if (t == null) return null; // end of stream
  String filteredText = removeAccents(t.termText());
  if (filteredText.equals(t.termText())) { // no accents
    return t;
  } else {
    // Buffer a clone with the accents stripped; it is emitted on the
    // next call with a positionIncrement of 0 (same position).
    filteredToken = (Token) t.clone();
    filteredToken.setTermText(filteredText);
    filteredToken.setPositionIncrement(0);
    hasAccents = true;
  }
  return t;
}
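
removeAccents() is our own helper; a minimal sketch of one way to write it
(assuming Java 6's java.text.Normalizer is available, which is not
necessarily what we run) would be:

import java.text.Normalizer;

// Decompose to NFD, then strip the combining diacritical marks.
private static String removeAccents(String text) {
  String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
  return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}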

On Sat, Jun 21, 2008 at 2:37 AM, Phillip Farber <[EMAIL PROTECTED]> wrote:

> Regarding indexing words with accented and unaccented characters with
> positionIncrement zero:
>
> Chris Hostetter wrote:
>
>>
>> you don't really need a custom tokenizer -- just a buffered TokenFilter
>> that clones the original token if it contains accent chars, mutates the
>> clone, and then emits it next with a positionIncrement of 0.
>>
>>
> Could someone expand on how to implement this technique of buffering and
> cloning?
>
> Thanks,
>
> Phil
>



-- 
Regards,

Cuong Hoang


Suggestion for short text matching using dictionary

2008-06-26 Thread climbingrose
Firstly, my apologies for being off topic. I'm asking this question because
I think there are some machine learning and text processing experts on this
mailing list.

Basically, my task is to normalize a fairly unstructured set of short texts
using a dictionary. We have a pre-defined list of products and periodically
receive product feeds from various websites. Basically, our site is similar
to a shopping comparison engine but on a different domain. We would like to
normalize the product names in the feeds using our pre-defined list.
For example:

"Nokia N95 8GB Black" ---> "Nokia N95 8GB"
"Black Nokia N95, 8GB + Free bluetooth headset" --> "Nokia N95 8GB"

My original idea is to index the list of pre-defined names and then query
the index using the product's name. The highest scored result will be used
to normalize the product.

The problem with this is that sometimes you get wrong matches because of noise.
For example, "Black Nokia N95, 8GB + Free bluetooth headset" can match
"Nokia Bluetooth Headset", which is not desirable.

Is there a better solution for this problem? Thanks in advance.

-- 
Regards,

Cuong Hoang


Re: Suggestion for short text matching using dictionary

2008-06-27 Thread climbingrose
Thanks Grant. I did try Secondstring before and found that it wasn't
particularly good for doing a lot of text matching. I'm leaning toward a
combination of Lucene and Secondstring. Googling around a bit, I came across
this project: http://datamining.anu.edu.au/projects/linkage.html. Looks
interesting, but the implementation is in Python. I think they use a
Hidden Markov Model to label training data and then match records
probabilistically.

On Fri, Jun 27, 2008 at 10:12 PM, Grant Ingersoll <[EMAIL PROTECTED]>
wrote:

> below
>
>
>
> On Jun 27, 2008, at 1:18 AM, climbingrose wrote:
>
>  Firstly, my apologies for being off topic. I'm asking this question
>> because
>> I think there are some machine learning and text processing experts on
>> this
>> mailing list.
>>
>> Basically, my task is to normalize a fairly unstructured set of short
>> texts
>> using a dictionary. We have a pre-defined list of products and
>> periodically
>> receive product feeds from various websites. Basically, our site is
>> similar
>> to a shopping comparison engine but on a different domain. We would like
>> to
>> normalize the products' names in the feeds to using our pre-defined list.
>> For example:
>>
>> "Nokia N95 8GB Black" ---> "Nokia N95 8GB"
>> "Black Nokia N95, 8GB + Free bluetooth headset" --> "Nokia N95 8GB"
>>
>> My original idea is to index the list of pre-defined names and then query
>> the index using the product's name. The highest scored result will be used
>> to normalize the product.
>>
>> The problem with this is sometimes you get wrong matches because of noise.
>> For example, "Black Nokia N95, 8GB + Free bluetooth headset" can match
>> "Nokia Bluetooth Headset" which is desirable.
>>
>
>
> I assume you mean "not desirable" here given the context...
>
> Your approach is worth trying.  At a deeper level, you may want to look
> into a topic called "record linkage" and an open source project called
> Second String by William Cohen's group at Carnegie Mellon (
> http://secondstring.sourceforge.net/) which has a whole bunch of
> implementations of fuzzy string matching algorithms like Jaro-Winkler,
> Levenstein, etc. that can then be used to implement what you are after.
>
> You could potentially use the spell checking functionality to simulate some
> of this a bit better than just a pure vector match.  Index your dictionary
> into a spelling index (see SOLR-572) and then send in spell checking
> queries.  In fact, you probably could integrate Second String into the spell
> checker pretty easily since one can now plugin the distance measure into the
> spell checker.
>
> You may find some help on this by searching http://lucene.markmail.org for
> things like "record linkage" or "record matching" or various other related
> terms.
>
> Another option is to write up a NormalizingTokenFilter that analyzes the
> tokens as they come in to see if they match your dictionary list.
>
> As with all of these, there is going to be some trial and error here to
> come up with something that hits most of the time, as it will never be
> perfect.
>
> Good luck,
> Grant
>
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
>
>
>


-- 
Regards,

Cuong Hoang


Re: using dismax with additional query?

2008-06-29 Thread climbingrose
Hi Bram,
You can use a filter query (fq) to limit your results:

fq=tag:sometag&q=user_input_here


Have a look at dismax and standard query documentation on the wiki.
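
For example, a full request might look like this (a sketch, assuming the
stock example Jetty port and a dismax handler registered under qt=dismax):

http://localhost:8983/solr/select?qt=dismax&q=user+input+here&fq=tag:sometag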


On Sun, Jun 29, 2008 at 6:49 PM, Bram de Jong <[EMAIL PROTECTED]> wrote:

> hello all,
>
> I would like to use the DisMaxRequestHandler for processing user
> searches, but I would like to -additionally- add more query
> parameters.
>
> For example, the user wants to search inside all the documents tagged
> with one particular tag (a tag he clicked).
>
> So, I would like to define:
> regular_search_query=tag:sometag&q=user_input_here
>
> Is this possible?
>
>  - bram
>
> --
> http://www.freesound.org
> http://www.smartelectronix.com
> http://www.musicdsp.org
>



-- 
Regards,

Cuong Hoang


Limit Porter stemmer to plural stemming only?

2008-06-30 Thread climbingrose
Hi all,
The Porter stemmer in general is really good. However, there are some cases
where it doesn't work. For example, "accountant" matches "Accountant" as
well as "Account Manager", which isn't desirable. Is it possible to use this
analyser for plural words only? For example:
+Accountant -> accountant
+Accountants -> accountant
+Account -> account
+Accounts -> account

Thanks.

-- 
Regards,

Cuong Hoang


Re: Limit Porter stemmer to plural stemming only?

2008-06-30 Thread climbingrose
OK, it looks like step 1a in the Porter algorithm does what I need.
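
(For reference, step 1a is just the plural-handling rules: SSES -> SS,
IES -> I, SS -> SS, S -> "", e.g. caresses -> caress, ponies -> poni,
cats -> cat.)
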
On Mon, Jun 30, 2008 at 6:39 PM, climbingrose <[EMAIL PROTECTED]>
wrote:

> Hi all,
> Porter stemmer in general is really good. However, there are some cases
> where it doesn't work. For example, "accountant" matches "Accountant" as
> well as "Account Manager" which isn't desirable. Is it possible to use this
> analyser for plural words only? For example:
> +Accountant -> accountant
> +Accountants -> accountant
> +Account -> account
> +Accounts -> account
>
> Thanks.
>
> --
> Regards,
>
> Cuong Hoang
>



-- 
Regards,

Cuong Hoang


Re: Limit Porter stemmer to plural stemming only?

2008-06-30 Thread climbingrose
I modified the original English stemmer written in the Snowball language and
regenerated the Java implementation using the Snowball compiler. It's been
working for me so far. I can certainly share the modified Snowball English
stemmer if anyone wants to use it.

Cheers,
Cuong

On Tue, Jul 1, 2008 at 4:12 AM, Mike Klaas <[EMAIL PROTECTED]> wrote:

> If you find a solution that works well, I encourage you to contribute it
> back to Solr.  Plural-only stemming is probably a common need (I've
> definitely wanted to use it before).
>
> cheers,
> -Mike
>
>
> On 30-Jun-08, at 2:25 AM, climbingrose wrote:
>
>  Ok, it looks like step 1a in Porter algo does what I need.
>> On Mon, Jun 30, 2008 at 6:39 PM, climbingrose <[EMAIL PROTECTED]>
>> wrote:
>>
>>  Hi all,
>>> Porter stemmer in general is really good. However, there are some cases
>>> where it doesn't work. For example, "accountant" matches "Accountant" as
>>> well as "Account Manager" which isn't desirable. Is it possible to use
>>> this
>>> analyser for plural words only? For example:
>>> +Accountant -> accountant
>>> +Accountants -> accountant
>>> +Account -> account
>>> +Accounts -> account
>>>
>>> Thanks.
>>>
>>> --
>>> Regards,
>>>
>>> Cuong Hoang
>>>
>>>
>>
>>
>> --
>> Regards,
>>
>> Cuong Hoang
>>
>
>


-- 
Regards,

Cuong Hoang


Re: Limit Porter stemmer to plural stemming only?

2008-07-01 Thread climbingrose
Attached is the modified Snowball source code for a plural-only English
stemmer. You need to compile it to Java using the instructions here:
http://snowball.tartarus.org/runtime/use.html. Essentially, you need to:

1) Download the Snowball distribution (compiler, algorithms, and libstemmer
library) from http://snowball.tartarus.org/dist/snowball_code.tgz and compile
the Snowball compiler itself: gcc -O -o snowball compiler/*.c
2) Compile the attached file to Java:
./snowball stem_ISO_8859_1.sbl -java -o EnglishStemmer -name EnglishStemmer

You can change EnglishStemmer to whatever you like, for example,
PluralEnglishStemmer. After that, you need to modify the generated Java class
so that it references the appropriate classes in the net.sf.snowball.*
package instead of the ones from the Snowball website. I think the only two
classes you need to import are Among and SnowballProgram.

Once you have the new stemmer ready, write something similar to
EnglishPorterFilterFactory to use it within Solr.
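
A rough sketch of such a factory and filter (untested; assumes the generated
class was named PluralEnglishStemmer as above):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class PluralEnglishStemmerFilterFactory extends BaseTokenFilterFactory {
  public TokenStream create(TokenStream input) {
    return new PluralEnglishStemmerFilter(input);
  }
}

class PluralEnglishStemmerFilter extends TokenFilter {
  // PluralEnglishStemmer is the class generated by the Snowball compiler.
  private final PluralEnglishStemmer stemmer = new PluralEnglishStemmer();

  public PluralEnglishStemmerFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    Token t = input.next();
    if (t == null) return null;
    // Run the generated Snowball stemmer over each term in place.
    stemmer.setCurrent(t.termText());
    stemmer.stem();
    t.setTermText(stemmer.getCurrent());
    return t;
  }
}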

Hope this helps.

Cheers,
Cuong


On Tue, Jul 1, 2008 at 6:07 PM, Guillaume Smet <[EMAIL PROTECTED]>
wrote:

> Hi Cuong,
>
> On Tue, Jul 1, 2008 at 4:45 AM, climbingrose <[EMAIL PROTECTED]>
> wrote:
> > I modified the original English Stemmer written in Snowball language and
> > regenerate the Java implementation using Snowball compiler. It's been
> > working for me  so far. I certainly can share the modified Snowball
> English
> > Stemmer if anyone wants to use it.
>
> Yeah, it would be nice. A step by step explanation of how to
> regenerate the Java files would be nice too (or a pointer to such a
> documentation if you found one).
>
> Thanks,
>
> --
> Guillaume
>


Re: Do I need Searcher on indexing machine

2008-07-10 Thread climbingrose
You do, I think. Have a look at DirectUpdateHandler2 class.

On Thu, Jul 10, 2008 at 9:16 PM, Gudata <[EMAIL PROTECTED]> wrote:

>
> Hi,
> I want (if possible) to dedicate one machine only for indexing and to be
> optimized only for that.
>
> In solrconfig.xml, I have:
> - commented all cache statements
> - set to use cold searchers.
> - set 1
>
>
>
> In the log files I see this all the time:
>
> INFO: Registered new searcher [EMAIL PROTECTED] main
> Jul 10, 2008 12:49:59 PM org.apache.solr.search.SolrIndexSearcher close
> INFO: Closing [EMAIL PROTECTED] main
>
> Why Solr is registering new searcher all the time. Is this overhead, and if
> yes, how to stop it?
>
> --
> View this message in context:
> http://www.nabble.com/Do-I-need-Searcher-on-indexing-machine-tp18380669p18380669.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Regards,

Cuong Hoang


Document rating/popularity and scoring

2008-07-11 Thread climbingrose
Hi all,
Has anyone tried to factor rating/popularity into Solr scoring? For example,
I want documents with more page views to be ranked higher in the search
results. From what I can see, the most difficult thing is that we have to
update the number of page views for each document. With SOLR-139, documents
can be updated at the field level; however, Solr still has to retrieve the
document and then reindex it. With high-traffic sites, the overhead might
be too high.

I'm thinking of using a relational database to track page views / ratings and
then do a daily sync with Solr. Is there a way for Solr to retrieve data
from external sources (database server) and use the data for determining
document ranking?

Thanks.

-- 
Regards,

Cuong Hoang


Re: Document rating/popularity and scoring

2008-07-11 Thread climbingrose
Thanks Yonik. I will try it out. Btw, what cache should we use for
multivalued, untokenised fields with a large number of terms? Faceted search
on these fields seems noticeably slower even though I have allocated enough
filterCache. There seem to be a lot of cache lookups for each query.
On Sat, Jul 12, 2008 at 1:58 AM, Yonik Seeley <[EMAIL PROTECTED]> wrote:

> See ExternalFileField and BoostedQuery
>
> -Yonik
>
> On Fri, Jul 11, 2008 at 11:47 AM, climbingrose <[EMAIL PROTECTED]>
> wrote:
> > Hi all,
> > Has anyone tried to factor rating/popularity into Solr scoring? For
> example,
> > I want documents with more page views to be ranked higher in the search
> > results. From what I can see, the most difficult thing is that we have to
> > update the number of page views for each document. With Solr-139,
> document
> > can be updated at field level. However, it still have to retrieve the
> > document and then do a reindex. With high traffic sites, the overhead
> might
> > be too high.
> >
> > I'm thinking of using relational database to track page views / ratings
> and
> > then do a daily sync with Solr. Is there a way for Solr to retrieve data
> > from external sources (database server) and use the data for determining
> > document ranking?
> >
> > Thanks.
> >
> > --
> > Regards,
> >
> > Cuong Hoang
> >
>



-- 
Regards,

Cuong Hoang


Re: Document rating/popularity and scoring

2008-07-14 Thread climbingrose
Hi Yonik,

I have had a look at ExternalFileField. However, I couldn't figure out how
to include the externally referenced field in the search results. Also,
sorting on this type of field isn't possible, right?
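
For reference, here is roughly the setup I'm experimenting with (a sketch
along the lines of the example schema; the values live in a file named
external_popularity in the index data directory, one id=value pair per line,
e.g. "doc1=42.0"):

<fieldType name="file" keyField="id" defVal="0" stored="false"
           indexed="false" class="solr.ExternalFileField" valType="float"/>
<field name="popularity" type="file"/>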

Thanks.

On Sat, Jul 12, 2008 at 2:28 AM, climbingrose <[EMAIL PROTECTED]>
wrote:

> Thanks Yonik. I will try it out. Btw, what cache should we use for
> multivalued, untokenised fields with large number of terms? Faceted search
> on these fields seem to be noticeably slower even if I have allocated enough
> filterCache. There seems to be a lot of cache lookups for each query.
>
> On Sat, Jul 12, 2008 at 1:58 AM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
>> See ExternalFileField and BoostedQuery
>>
>> -Yonik
>>
>> On Fri, Jul 11, 2008 at 11:47 AM, climbingrose <[EMAIL PROTECTED]>
>> wrote:
>> > Hi all,
>> > Has anyone tried to factor rating/popularity into Solr scoring? For
>> example,
>> > I want documents with more page views to be ranked higher in the search
>> > results. From what I can see, the most difficult thing is that we have
>> to
>> > update the number of page views for each document. With Solr-139,
>> document
>> > can be updated at field level. However, it still have to retrieve the
>> > document and then do a reindex. With high traffic sites, the overhead
>> might
>> > be too high.
>> >
>> > I'm thinking of using relational database to track page views / ratings
>> and
>> > then do a daily sync with Solr. Is there a way for Solr to retrieve data
>> > from external sources (database server) and use the data for determining
>> > document ranking?
>> >
>> > Thanks.
>> >
>> > --
>> > Regards,
>> >
>> > Cuong Hoang
>> >
>>
>
>
>
> --
> Regards,
>
> Cuong Hoang
>


Best way to return ExternalFileField in the results

2008-07-15 Thread climbingrose
Hi all,
I've been trying to return a field of type ExternalFileField in the search
results. Upon examining the XMLWriter class, it seems like Solr can't do this
out of the box, so I've tried to hack Solr to enable this behaviour.
The goal is to call ExternalFileField.getValueSource(SchemaField
field, QParser parser) in the XMLWriter.writeDoc(String name, Document
document, ...) method. There are two issues with doing this:

1) I need to create an instance of QParser in the writeDoc method. What is
the best way to do this? And what kind of overhead does creating a new
QParser for every returned document add?

2) I have to modify the writeDoc method to include the internal Lucene
document id, because I need it to retrieve the ExternalFileField value:

fileField.getValueSource(schemaField,
qparser).getValues(request.getSearcher().getIndexReader()).floatVal(docId)

The immediate effect is that it breaks the writeVal() method (because that
method references writeDoc()).

Any comments?

Thanks in advance.


-- 
Regards,

Cuong Hoang


Re: changing fileds name

2008-07-24 Thread climbingrose
You would need to modify schema.xml to change these names.

On Thu, Jul 24, 2008 at 8:06 AM, anshuljohri <[EMAIL PROTECTED]> wrote:

>
> Hi,
>
> I need to change the filed names in schema.xml eg. default names are
> id,sku,name,text etc. But i want to use my own name instead of these names.
> Lets say i use title, desc, sub, cat respectively. Than where i have to put
> my changes. I see that these default names are used in solrconfig.xml also
> at many places.
>
> I tried a lot but couldn't do all the changes. Can anyone plz guide me. As
> i
> am new to Solr, plz help me.
> Is there any separate tutorial for this which can guide me?
>
> Thanks,
> Anshul Johri
> --
> View this message in context:
> http://www.nabble.com/changing-fileds-name-tp18620015p18620015.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Regards,

Cuong Hoang


CollapseFilter with the latest Solr in trunk

2009-04-16 Thread climbingrose
Hi all,

Has anyone tried to use CollapseFilter with the latest version of Solr in
trunk? It looks like Solr 1.4 doesn't allow calling setFilterList()
and setFilter() on the same instance of QueryCommand. I modified the code in
QueryCommand to allow this:

public QueryCommand setFilterList(Query f) {
  // if (filter != null) {
  //   throw new IllegalArgumentException("Either filter or filterList may be set in the QueryCommand, but not both.");
  // }
  filterList = null;
  if (f != null) {
    filterList = new ArrayList(2);
    filterList.add(f);
  }
  return this;
}

However, I still have a problem that prevents query filters from working
when used in conjunction with CollapseFilter. In other words, query filters
don't seem to have any effect on the result set when CollapseFilter is
used.

The other problem is related to OpenBitSet:

java.lang.ArrayIndexOutOfBoundsException: 2183
at org.apache.lucene.util.OpenBitSet.fastSet(OpenBitSet.java:242)
at org.apache.solr.search.CollapseFilter.addDoc(CollapseFilter.java:202)
at org.apache.solr.search.CollapseFilter.adjacentCollapse(CollapseFilter.java:161)
at org.apache.solr.search.CollapseFilter.<init>(CollapseFilter.java:141)
at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:217)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
at java.lang.Thread.run(Thread.java:619)

I think CollapseFilter is an important piece of functionality in Solr that
gets used quite frequently. Does anyone have a solution for this?

-- 
Regards,

Cuong Hoang


Re: CollapseFilter with the latest Solr in trunk

2009-04-19 Thread climbingrose
Ok, here is how I fixed this problem:

  public DocListAndSet getDocListAndSet(Query query, List filterList,
      DocSet docSet, Sort lsort, int offset, int len, int flags) throws IOException {
    //DocListAndSet ret = new DocListAndSet();
    //getDocListC(ret, query, filterList, docSet, lsort, offset, len, flags |= GET_DOCSET);
    DocSet theFilt = getDocSet(filterList);
    if (docSet != null) theFilt = (theFilt != null) ? theFilt.intersection(docSet) : docSet;
    QueryCommand qc = new QueryCommand();
    qc.setQuery(query).setFilter(theFilt);
    qc.setSort(lsort).setOffset(offset).setLen(len).setFlags(flags |= GET_DOCSET);
    QueryResult result = new QueryResult();
    getDocListC(result, qc);
    return result.getDocListAndSet();
  }


There is also an off-by-one error in CollapseFilter; you can find the fix
on Jira.

Cheers,
Cuong

On Sat, Apr 18, 2009 at 4:41 AM, Jeff Newburn  wrote:

> We are currently trying to do the same thing.  With the patch unaltered we
> can use fq as long as collapsing is turned on.  If we just send a normal
> document level query with an fq parameter it blows up.
>
> Additionally, it does not appear that the collapse.facet option works at
> all.
>
> --
> Jeff Newburn
> Software Engineer, Zappos.com
> jnewb...@zappos.com - 702-943-7562
>
>
> > From: climbingrose 
> > Reply-To: 
> > Date: Fri, 17 Apr 2009 16:53:00 +1000
> > To: solr-user 
> > Subject: CollapseFilter with the latest Solr in trunk
> >
> > Hi all,
> >
> > Have any one try to use CollapseFilter with the latest version of Solr in
> > trunk? However, it looks like Solr 1.4 doesn't allow calling
> setFilterList()
> > and setFilter() on one instance of the QueryCommand. I modified the code
> in
> > QueryCommand to allow this:
> >
> > public QueryCommand setFilterList(Query f) {
> > //  if( filter != null ) {
> > //throw new IllegalArgumentException( "Either filter or
> filterList
> > may be set in the QueryCommand, but not both." );
> > //  }
> >   filterList = null;
> >   if (f != null) {
> > filterList = new ArrayList(2);
> > filterList.add(f);
> >   }
> >   return this;
> > }
> >
> > However, I still have a problem which prevent query filters from working
> > when used in conjunction with CollapseFilter. In other words, query
> filters
> > doesn't seem to have any effects on the result set when CollapseFilter is
> > used.
> >
> > The other problem is related to OpenBitSet:
> >
> > java.lang.ArrayIndexOutOfBoundsException: 2183
> > at org.apache.lucene.util.OpenBitSet.fastSet(OpenBitSet.java:242)
> > at org.apache.solr.search.CollapseFilter.addDoc(CollapseFilter.java:202)
> > at org.apache.solr.search.CollapseFilter.adjacentCollapse(CollapseFilter.java:161)
> > at org.apache.solr.search.CollapseFilter.<init>(CollapseFilter.java:141)
> > at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:217)
> > at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
> > at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
> > at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
> > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
> > at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
> > at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
> > at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
> > at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
> > at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
> > at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
> > at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
> > at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
> > at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
> > at org.apache.

DisMax query and date boosting

2007-07-19 Thread climbingrose

Hi all,

I'm puzzling over how to boost a date field in a DisMax query. Atm, my qf is
"title^5 summary^1". However, what I really want is for documents with the
latest "listedDate" to score higher. For example, documents with
listedDate:[NOW-1DAY TO *] should get an additional boost over documents with
listedDate:[* TO NOW-10DAY]. Any idea?

--
Regards,

Cuong Hoang


Re: DisMax query and date boosting

2007-07-19 Thread climbingrose

Thanks for both answers. Which one is better in terms of performance? bq or
bf?

On 7/20/07, Daniel Alheiros <[EMAIL PROTECTED]> wrote:


Sorry just correcting myself:
your_date_field:[NOW-24HOURS TO NOW]^10.0

Regards,
Daniel

On 19/7/07 15:25, "Daniel Alheiros" <[EMAIL PROTECTED]> wrote:

> I think in this case you can use a "bq" (Boost Query) so you can apply
this
> boost to the range you want.
>
> your_date_field:[NOW/DAY-24HOURS TO NOW]^10.0
>
> This example will boost your documents with date within the last 24h.
>
> Regards,
> Daniel
>
> On 19/7/07 14:45, "climbingrose" <[EMAIL PROTECTED]> wrote:
>
>> Hi all,
>>
>> I'm puzzling over how to boost a date field in a DisMax query. Atm, my
qf is
>> "title^5 summary^1". However, what I really want to do is to allow
document
>> with latest "listedDate" to have better score. For example, documents
with
>> listedDate:[NOW-1DAY TO *] have additional score over documents with
>> listedDate:[* TO NOW-10DAY]. Any idea?
>
>
> http://www.bbc.co.uk/
> This e-mail (and any attachments) is confidential and may contain
personal
> views which are not the views of the BBC unless specifically stated.
> If you have received it in error, please delete it from your system.
> Do not use, copy or disclose the information in any way nor act in
reliance on
> it and notify the sender immediately.
> Please note that the BBC monitors e-mails sent or received.
> Further communication will signify your consent to this.
>


http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain personal
views which are not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in
reliance on it and notify the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.





--
Regards,

Cuong Hoang


Re: DisMax query and date boosting

2007-07-19 Thread climbingrose

Just tried the bq approach and it works beautifully. Exactly what I was
looking for. Still, I'd like to know which approach is preferred. Thanks
again, guys.

On 7/20/07, climbingrose <[EMAIL PROTECTED]> wrote:


Thanks for both answers. Which one is better in terms of performance? bq
or bf?

On 7/20/07, Daniel Alheiros < [EMAIL PROTECTED]> wrote:
>
> Sorry just correcting myself:
> your_date_field:[NOW-24HOURS TO NOW]^ 10.0
>
> Regards,
> Daniel
>
> On 19/7/07 15:25, "Daniel Alheiros" <[EMAIL PROTECTED]> wrote:
>
> > I think in this case you can use a "bq" (Boost Query) so you can apply
> this
> > boost to the range you want.
> >
> > your_date_field:[NOW/DAY-24HOURS TO NOW]^10.0
> >
> > This example will boost your documents with date within the last 24h.
> >
> > Regards,
> > Daniel
> >
> > On 19/7/07 14:45, "climbingrose" <[EMAIL PROTECTED]> wrote:
> >
> >> Hi all,
> >>
> >> I'm puzzling over how to boost a date field in a DisMax query. Atm,
> my qf is
> >> "title^5 summary^1". However, what I really want to do is to allow
> document
> >> with latest "listedDate" to have better score. For example, documents
> with
> >> listedDate:[NOW-1DAY TO *] have additional score over documents with
> >> listedDate:[* TO NOW-10DAY]. Any idea?
> >
> >
> > http://www.bbc.co.uk/
> > This e-mail (and any attachments) is confidential and may contain
> personal
> > views which are not the views of the BBC unless specifically stated.
> > If you have received it in error, please delete it from your system.
> > Do not use, copy or disclose the information in any way nor act in
> reliance on
> > it and notify the sender immediately.
> > Please note that the BBC monitors e-mails sent or received.
> > Further communication will signify your consent to this.
> >
>
>
> http://www.bbc.co.uk/
> This e-mail (and any attachments) is confidential and may contain
> personal views which are not the views of the BBC unless specifically
> stated.
> If you have received it in error, please delete it from your system.
> Do not use, copy or disclose the information in any way nor act in
> reliance on it and notify the sender immediately.
> Please note that the BBC monitors e-mails sent or received.
> Further communication will signify your consent to this.
>
>


--
Regards,

Cuong Hoang





--
Regards,

Cuong Hoang


Re: DisMax query and date boosting

2007-07-19 Thread climbingrose

Thanks for the answer Chris. The DisMax query handler is just amazing!
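
For the record, the kind of boost function I ended up experimenting with
looks roughly like this (a sketch using the recip/rord functions):

bf=recip(rord(listedDate),1,1000,1000)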

On 7/20/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: Just tried the bq approach and it works beautifully. Exactly what I was
: looking for. Still, I'd like to know which approach is the preferred?
Thanks
: again guys.

i personally recommend the function approach, because it gives you a more
gradual falloff in terms of the scores of documents ... the BQ approach
works great for simple boosting of "things in the last N days should score
really high" but 1 millisecond after that cutoff the score plummets
immediately.

side note...

: > > Sorry just correcting myself:
: > > your_date_field:[NOW-24HOURS TO NOW]^ 10.0

the first example is perfectly fine, and will be more efficient because it
will take better advantage of the field cache...

: > > > your_date_field:[NOW/DAY-24HOURS TO NOW]^10.0

...if you don't round down to the nearest day, then every request will
generate a new query which will get put in the filterCache.  if a day
isn't granular enough for you, you can round to the nearest hour (or even
minute) but i strongly suggest you round to something so you don't wind up
using millisecond precision

your_date_field:[NOW/HOUR-1DAY TO NOW]^10.0



-Hoss





--
Regards,

Cuong Hoang


Re: mandatory and optional fields in the dismaxrequesthandler

2007-07-30 Thread climbingrose
I think I have the same question as Arnaud. For example, my dismax query has
qf=title^5 description^2. Now if I search for "Java developer", I want to
make sure that the results have at least "java" or "developer" in the title.
Is this possible with a dismax query?

On 7/30/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
>
> : Is it possible to specify precisely one or more mandatory fields in a
> : DismaxRequestHandler?
>
> what would the semantics making a field mandatory mean?  considering your
> specific example...
>
> : <str name="qf">
> :   text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
> : </str>

Date rounding up

2007-08-08 Thread climbingrose
Hi all,

I think there might be something wrong with the date/time rounding. I
tried this query: "q=*:*&fq=listedDate:[NOW/DAY-1DAY TO *]", which I think
should return results from yesterday onwards. So if today is the 9th of
August, it should return all results from the 8th of August. However, Solr
also returns results from the 7th of August. Any idea?

-- 
Regards,

Cuong Hoang


Re: Spell Check Handler

2007-08-10 Thread climbingrose
The spellchecker handler doesn't seem to work with multi-word queries. For
example, when I try to spellcheck "Java developar", it returns nothing,
while if I try "developar", the spellchecker correctly returns "developer".
I followed the setup on the wiki.

Regards,

Cuong Hoang

On 7/10/07, Charles Hornberger <[EMAIL PROTECTED]> wrote:
>
> For what it's worth, I recently did a quick implementation of the
> spellchecker feature, and I simply created another field in my schema
> (Iike 'spell' in Tristan's example below). After feeding content into
> my search index, I used the spell field to add one single-field
> document for every distinct word in my document collection (I'm
> assuming the content folks have run spell-checkers :-)). E.g.:
>
> aardvark
> abacus
> abbot
> acacia
> etc.
>
> I also added some extra documents for proper names that appear in my
> documents. For instance, there are a couple fields that have
> comma-separated list of names, so I for each of those -- in addition
> to documents for "john", "doe", and "jane", which were generated by
> the naive word-splitting done in the first pass -- I added documents
> like so:
>
> john doe
> jane doe
> etc.
>
> You could do the same for other searchable multi-word tokens in your
> input -- song/album/book/movie titles, publisher names, geographic
> names (cities, neighborhoods, etc.), product names, and so on.
>
> -Charlie
>
> On 7/9/07, Tristan Vittorio <[EMAIL PROTECTED]> wrote:
> > I think there is some confusion regarding how the spell checker actually
> > uses the termSourceField.  It is suggested that you use a simple field
> > type such as "string"; however, since this field type does not tokenize
> > or split words, it is only useful in situations where the whole field is
> > considered a dictionary "word":
> > <add>
> > <doc><field name="title">Accountant</field></doc>
> > <doc><field name="title">Auditor</field></doc>
> > <doc><field name="title">Solicitor</field></doc>
> > </add>
> >
> > The following example will not work with the spell checker since the
> > whole field is considered a single word or string:
> >
> > <add>
> > <doc><field name="title">Accountant reveals that Accounting is boring</field></doc>
> > </add>
> >
> > I might suggest that you create an additional field in your schema that
> > takes advantage of the StandardTokenizer and StandardFilter, which
> > doesn't perform a great deal of processing on the field yet should
> > provide decent results when used with the spell checker:
> >
> > <fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StandardFilterFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >             ignoreCase="true" expand="true"/>
> >     <filter class="solr.StandardFilterFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > If you want this field to be automatically populated with the contents
> > of the title field when a document is added to the index, simply use a
> > copyField:
> >
> > <copyField source="title" dest="spell"/>
> >
> > Hope this helps; let me know if this is still not clear. I probably
> > will add it to the wiki page soon.
> >
> > cheers,
> > Tristan
> >
> >
> >
> > On 7/9/07, climbingrose <[EMAIL PROTECTED]> wrote:
> > >
> > > Thanks for the quick reply. However, I'm still not able to set up the
> > > spellchecker. Solr does create the spell directory under data but
> > > doesn't seem to build the spellchecker index. Here are snippets of my
> > > solrconfig.xml:
> > >
> > > <requestHandler name="spellchecker" class="solr.SpellCheckerRequestHandler"
> > >                 startup="lazy">
> > >   <lst name="defaults">
> > >     <int name="suggestionCount">1</int>
> > >     <float name="accuracy">0.5</float>
> > >   </lst>
> > >   <!-- the directory where the spellchecker index lives -->
> > >   <str name="spellcheckerIndexDir">spell</str>
> > >   <!-- the field to build the spell index from -->
> > >   <str name="termSourceField">title</str>
> > > </requestHandler>
> > >
> > >
> > >
> > > I tried this url:
> > >
> > > http://localhost:8984/solr/select/?q=Accountent&qt=spellchecker&cmd=rebuild
> > >
> > > and receive this:
> > >
> > > <response>
> > > <lst name="responseHeader">
> > >   <int name="status">0</int>
> > >   <int name="QTime">2</int>
> > > </lst>
> > > <str name="cmd">rebuild</str>
> > > </response>
> > >
> > >
> > > On 7/9/07, Tristan Vittorio <[EMAIL PROTECTED]> wrote:
> > > >
> > > > The 

Re: Spell Check Handler

2007-08-10 Thread climbingrose
After looking at the SpellChecker code, I realised that it only supports
single words. I made a very naive modification of SpellCheckerHandler to get
multi-word support. Now the other problem I have is how to have different
fields in the SpellChecker index. For example, since my query has two parts,
"description" and "location", I don't want to build a spellchecker index
which combines both "description" and "location" into one termSourceField.
I want to check the "description" part against the "description" field in
the spellchecker index and the "location" part against the "location" field.
Otherwise I might get irrelevant suggestions for the "location" part, since
the number of terms in "location" is generally much smaller than in
"description". Any ideas?
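
For reference, the naive modification boils down to something like this
(a sketch against the Lucene SpellChecker API; error handling omitted):

import java.io.IOException;
import org.apache.lucene.search.spell.SpellChecker;

// Suggest a correction per whitespace-separated word and re-join them.
public static String correctPhrase(SpellChecker spellChecker, String query)
    throws IOException {
  String[] words = query.split("\\s+");
  StringBuilder corrected = new StringBuilder();
  for (int i = 0; i < words.length; i++) {
    String[] suggestions = spellChecker.suggestSimilar(words[i], 1);
    if (i > 0) corrected.append(' ');
    corrected.append(suggestions.length > 0 ? suggestions[0] : words[i]);
  }
  return corrected.toString();
}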

Thanks.

On 8/11/07, climbingrose <[EMAIL PROTECTED]> wrote:
>
> The spellchecker handler doesn't seem to work with multi-word query. For
> example, when I tried to spellcheck "Java developar", it returns nothing
> while if I tried "developar", spellchecker correctly returns "developer".
> I followed the setup on the wiki.
>
> Regards,
>
> Cuong Hoang
>
> On 7/10/07, Charles Hornberger <[EMAIL PROTECTED]> wrote:
> >
> > For what it's worth, I recently did a quick implementation of the
> > spellchecker feature, and I simply created another field in my schema
> > (Iike 'spell' in Tristan's example below). After feeding content into
> > my search index, I used the spell field into add one single-field
> > document for every distinct word in my document collection (I'm
> > assuming the content folks have run spell-checkers :-)). E.g.:
> >
> > aardvark
> > abacus
> > abbot
> > acacia
> > etc.
> >
> > I also added some extra documents for proper names that appear in my
> > documents. For instance, there are a couple fields that have
> > comma-separated list of names, so I for each of those -- in addition
> > to documents for "john", "doe", and "jane", which were generated by
> > the naive word-splitting done in the first pass -- I added documents
> > like so:
> >
> > john doe
> > jane doe
> > etc.
> >
> > You could do the same for other searchable multi-word tokens in your
> > input -- song/album/book/movie titles, publisher names, geographic
> > names (cities, neighborhoods, etc.), product names, and so on.
> >
> > -Charlie
> >
> > On 7/9/07, Tristan Vittorio <[EMAIL PROTECTED]> wrote:
> > > I think there is some confusion regarding how the spell checker
> > actually
> > > uses the termSourceField.  It is suggested that you use a simple field
> > type
> > > such a "string", however since this field type does not tokenize or
> > split
> > > words, it is only useful in situations where the whole field is
> > considered a
> > > dictionary "word":
> > >
> > > 
> > > 
> > > Accountant
> > > <http://localhost:8984/solr/select/?q=Accountent&qt=spellchecker&cmd=rebuildand
> > > > > name="title">Auditor
> > > Solicitor
> > >  > > 
> > >
> > > The follow example case will not work with spell checker since the
> > whole
> > > field is considered a single word or string:
> > >
> > > 
> > > 
> > > Accountant reveals that Accounting is
> > boring
> > >  > > 
> > >
> > > I might suggest that you create an additional field in your schema
> > that
> > > takes advantage of the StandardTokenizer and StandardFilter which
> > doesn't
> > > perform a great deal of processing on the field yet should provide
> > decent
> > > results when used with the spell checker:
> > >
> > >  > positionIncrementGap="100">
> > >   
> > > 
> > > 
> > > 
> > > 
> > >   
> > >   
> > > 
> > >  > > ignoreCase="true" expand="true"/>
> > > 
> > > 
> > > 
> > >   
> > > 
> > >
> > > If you want this field to be automatically populated with the contents
> > of
> > > the title field when a document is added to the index, simply use a
> > > copyField:
> > >
> > > 
> > >
> > > Hope this helps, let me know if 

Re: Spell Check Handler

2007-08-10 Thread climbingrose
OK, I just need to define 2 spellcheckers in solrconfig.xml for my purpose.
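Something like this, I think (a sketch; param names as I remember them from
the wiki setup):

<requestHandler name="spellcheck_description"
                class="solr.SpellCheckerRequestHandler" startup="lazy">
  <str name="spellcheckerIndexDir">spell_description</str>
  <str name="termSourceField">description</str>
</requestHandler>

<requestHandler name="spellcheck_location"
                class="solr.SpellCheckerRequestHandler" startup="lazy">
  <str name="spellcheckerIndexDir">spell_location</str>
  <str name="termSourceField">location</str>
</requestHandler>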

On 8/11/07, climbingrose <[EMAIL PROTECTED]> wrote:
>
> After looking the SpellChecker code, I realised that it only supports
> single-word. I made a very naive modification of SpellCheckerHandler to get
> multi-word support. Now the other problem that I have is how to have
> different fields in SpellChecker index. For example, since my query has two
> parts: "description" and "location", I don't want to build a spellchecker
> index which combines both "description" and "location" into one
> termSourceField. I want to check "description" part with the "description"
> field in the spellchecker index and "location" part with "location" field in
> the index. Otherwise I might have irrelevant suggestions for the "location"
> part since the number of terms in "location" is generally much smaller
> compared with that of "description". Any ideas?
>
> Thanks.
>
> On 8/11/07, climbingrose <[EMAIL PROTECTED]> wrote:
> >
> > The spellchecker handler doesn't seem to work with multi-word query. For
> > example, when I tried to spellcheck "Java developar", it returns nothing
> > while if I tried "developar", spellchecker correctly returns
> > "developer". I followed the setup on the wiki.
> >
> > Regards,
> >
> > Cuong Hoang
> >
> > On 7/10/07, Charles Hornberger < [EMAIL PROTECTED]> wrote:
> > >
> > > For what it's worth, I recently did a quick implementation of the
> > > spellchecker feature, and I simply created another field in my schema
> > > (Iike 'spell' in Tristan's example below). After feeding content into
> > > my search index, I used the spell field into add one single-field
> > > document for every distinct word in my document collection (I'm
> > > assuming the content folks have run spell-checkers :-)). E.g.:
> > >
> > > aardvark
> > > abacus
> > > abbot
> > > acacia
> > > etc.
> > >
> > > I also added some extra documents for proper names that appear in my
> > > documents. For instance, there are a couple fields that have
> > > comma-separated list of names, so I for each of those -- in addition
> > > to documents for "john", "doe", and "jane", which were generated by
> > > the naive word-splitting done in the first pass -- I added documents
> > > like so:
> > >
> > > john doe
> > > jane doe
> > > etc.
> > >
> > > You could do the same for other searchable multi-word tokens in your
> > > input -- song/album/book/movie titles, publisher names, geographic
> > > names (cities, neighborhoods, etc.), product names, and so on.
> > >
> > > -Charlie
> > >
> > > On 7/9/07, Tristan Vittorio <[EMAIL PROTECTED]> wrote:
> > > > I think there is some confusion regarding how the spell checker
> > > actually
> > > > uses the termSourceField.  It is suggested that you use a simple
> > > field type
> > > > such a "string", however since this field type does not tokenize or
> > > split
> > > > words, it is only useful in situations where the whole field is
> > > considered a
> > > > dictionary "word":
> > > >
> > > > 
> > > > 
> > > > Accountant
> > > > <http://localhost:8984/solr/select/?q=Accountent&qt=spellchecker&cmd=rebuildand
> > > > > > > name="title">Auditor
> > > > Solicitor
> > > >  > > > 
> > > >
> > > > The follow example case will not work with spell checker since the
> > > whole
> > > > field is considered a single word or string:
> > > >
> > > > 
> > > > 
> > > > Accountant reveals that Accounting is
> > > boring
> > > >  > > > 
> > > >
> > > > I might suggest that you create an additional field in your schema
> > > that
> > > > takes advantage of the StandardTokenizer and StandardFilter which
> > > doesn't
> > > > perform a great deal of processing on the field yet should provide
> > > decent
> > > > results when used with the spell checker:
> > > >
> > > >  > > positio

Re: Spell Check Handler

2007-08-11 Thread climbingrose
That's exactly what I did with my custom version of the SpellCheckerHandler.
However, I didn't handle suggestionCount and only returned the one corrected
phrase which contains the "best" corrected terms. There is an issue on
Lucene issue tracker regarding multi-word spellchecker:
https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
.


On 8/11/07, Pieter Berkel <[EMAIL PROTECTED]> wrote:
>
> On 11/08/07, climbingrose <[EMAIL PROTECTED]> wrote:
> >
> > The spellchecker handler doesn't seem to work with multi-word query. For
> > example, when I tried to spellcheck "Java developar", it returns nothing
> > while if I tried "developar", spellchecker correctly returns
> "developer".
> > I
> > followed the setup on the wiki.
>
>
> While I suppose the general case for using the spelling checker would be a
> query containing a single misspelled word, it would be quite useful if the
> handler applied the analyzer specified by the termSourceField fieldType to
> the query input and then checked the spelling of each query token. This
> would seem to be the most flexible way of supporting multi-word queries
> (provided the termSourceField didn't use any stemmer filters I suppose).
>
> Piete
>



-- 
Regards,

Cuong Hoang


Re: FunctionQuery and boosting documents using date arithmetic

2007-08-11 Thread climbingrose
I'm using a date boosting function as well, namely:
F = recip(rord(creationDate),1,1000,1000)^10. However, since I have around
10,000 documents added in one day, rord(createDate) returns very different
values for the same createDate. For example, the last document added will
have rord(createdDate) = 1, while the first document added that day will
have rord(createdDate) = 10,000. When rord(createdDate) > 10,000, the value
of F approaches 0. Therefore, the boost query doesn't make any difference
between the last document added today and a document added 10 days ago. Now
if I replace 1000 in F with a much larger number, the boost function
suddenly gives the last few documents an enormous boost and makes the other
query scores irrelevant.
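
(For reference, recip(x,m,a,b) computes a/(m*x + b), so with
recip(rord(createdDate),1,1000,1000) the newest document contributes about
1000/1001 ~= 1.0 while the 10,000th-newest contributes 1000/11000 ~= 0.09,
and anything older contributes essentially nothing.)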

So in my case (and many others', I believe), the "true" date value would be
more appropriate. I'm thinking along the same lines of adding a timestamp.
It wouldn't add much overhead this way, would it?

Regards,



On 8/11/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
>
> : Actually, just thinking about this a bit more, perhaps adding a function
> : call such as parseDate() might add too much overhead to the actual
> query,
> : perhaps it would be better to first convert the date to a timestamp at
> index
> : time and store it in a field type slong?  This might be more efficient
> but
>
> i would agree with you there, this is where a more robust (ie:
> less efficient) DateField-ish class that supports configuration options
> to specify:
>   1) the output format
>   2) the input format(s)
>   3) the indexed format
> ...as SimpleDateFormatter pattern strings would be handy.  The
> ValueSource it uses could return seconds (or some other unit based on
> another config option) since epoch as the intValue.
>
> it's been discussed before, but there are a lot of tricky issues involved
> which is probably why no one has really tackled it.
>
> : that still leaves the problem of obtaining the current timestamp to use
> in
> : the boost function.
>
> it would be pretty easy to write a ValueSource that just knew about "now"
> as seconds since epoch.
>
> : > While it seems to work pretty well, I've realised that this may not be
> : > quite as effective as i had hoped given that the calculation is based
> on the
> : > ordinal of the field value rather than the value of the field
> itself.  In
> : > cases where the field type is 'date' and the actual field values are
> not
> : > distributed evenly across all documents in the index, the value
> returned by
> : > rord() is not going to give a true reflection of document age.  For
> example,
>
> be careful what you wish for.  you are 100% correct that functions using
> the (r)ord value of a DateField aren't a function of true age, but
> depending on how you look at it that may be better than using the real age
> (i think so anyway).  While it sounds appealing to say that docA should
> score half as high as docB if it is twice as old, that typically isn't all
> that important when dealing with recent dates; and when dealing with older
> dates the ordinal value tends to approximate it decently well ... where a
> true measure of age might screw you up is when you have situations where
> few/no new articles get published on weekends (or late at night).  it's
> also very confusing to people when the ordering of documents changes even
> though no new documents have been published -- that can easily happen if
> you are heavily boosting on a true age calculation but will never happen
> when dealing with an ordinal ranking of documents by age.
>
> (although, this could be compensated for by doing all of your true age
> calculations relative the "min age" of all articles in your index -- but
> you would still get really weird 'big' shifts in scores as soon as that
> first article gets published on monday morning.
>
>
> -Hoss
>
>


-- 
Regards,

Cuong Hoang


Re: Spell Check Handler

2007-08-11 Thread climbingrose
Yeah. How stable is the patch, Karl? Is it possible to use it in a
production environment?

On 8/12/07, karl wettin <[EMAIL PROTECTED]> wrote:
>
>
> 11 aug 2007 kl. 10.36 skrev climbingrose:
>
> > There is an issue on
> > Lucene issue tracker regarding multi-word spellchecker:
> > https://issues.apache.org/jira/browse/LUCENE-550
>
> I think you mean LUCENE-626 that sort of depends on LUCENE-550.
>
>
> --
> karl
>
>
>
>


-- 
Regards,

Cuong Hoang


Re: FunctionQuery and boosting documents using date arithmetic

2007-08-12 Thread climbingrose
We add around 10,000 docs during weekdays and 5,000 during weekends.

On 8/12/07, Pieter Berkel <[EMAIL PROTECTED]> wrote:
>
> Do you consistently add 10,000 documents to your index every day or does
> the
> number of new documents added per day vary?
>
>
> On 11/08/07, climbingrose <[EMAIL PROTECTED]> wrote:
> >
> > I'm having the date boosting function as well. I'm using this function:
> > F = recip(rord(creationDate),1,1000,1000)^10. However, since I have
> around
> > 10,000 of documents added in one day, rord(createDate) returns very
> > different values for the same createDate. For example, the last document
> > added with have rord(createdDate) =1 while the last document added will
> > have
> > rord(createdDate) = 10,000. When createDate > 10,000, value of F is
> > approaching 0. Therefore, the boost query doesn't make any difference
> > between the the last document added today and the document added 10 days
> > ago. Now if I replace 1000 in F with a large number, say 10,  the
> > boost
> > function  suddenly gives the last few documents enormous boost and make
> > the
> > other query scores irrelevant.
> >
> > So in my case (and many others' I believe), the "true" date value would
> be
> > more appropriate. I'm thinking along the same line of adding timestamp.
> It
> > wouldn't add much overhead this way, would it?
> >
>



-- 
Regards,

Cuong Hoang


Re: Spell Check Handler

2007-08-12 Thread climbingrose
I'm happy to contribute code for the SpellCheckerRequestHandler. I'll post
the code once I strip off stuff related to our product.

On 8/12/07, Pieter Berkel <[EMAIL PROTECTED]> wrote:
>
> <http://issues.apache.org/jira/browse/LUCENE-626>On 11/08/07,
> climbingrose<
> [EMAIL PROTECTED]> wrote:
> >
> > That's exactly what I did with my custom version of the
> > SpellCheckerHandler.
> > However, I didn't handle suggestionCount and only returned the one
> > corrected
> > phrase which contains the "best" corrected terms. There is an issue on
> > Lucene issue tracker regarding multi-word spellchecker:
> >
> https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> >
>
>
> I'd be interested to take a look at your modifications to the
> SpellCheckerHandler, how did you handle phrase queries? maybe we can open
> a
> JIRA issue to expand the spell checking functionality to perform analysis
> on
> multi-word input values.
>
> I did find http://issues.apache.org/jira/browse/LUCENE-626 after looking
> at
> LUCENE-550, but since these patches are not yet included in the Lucene
> trunk
> yet it might be a little difficult to justify implementing them in Solr.
>



-- 
Regards,

Cuong Hoang


Re: Spell Check Handler

2007-08-17 Thread climbingrose
Thanks Karl. I'll check it out!

On 8/18/07, karl wettin <[EMAIL PROTECTED]> wrote:
>
> I updated LUCENE-626 last night. It should now run smooth without
> LUCENE-550, but smoother with.
>
> Perhaps it is something you can use.
>
>
> 12 aug 2007 kl. 14.24 skrev climbingrose:
>
> > I'm happy to contribute code for the SpellCheckerRequestHandler.
> > I'll post
> > the code once I strip off stuff related to our product.
> >
> > On 8/12/07, Pieter Berkel <[EMAIL PROTECTED]> wrote:
> >>
> >> <http://issues.apache.org/jira/browse/LUCENE-626>On 11/08/07,
> >> climbingrose<
> >> [EMAIL PROTECTED]> wrote:
> >>>
> >>> That's exactly what I did with my custom version of the
> >>> SpellCheckerHandler.
> >>> However, I didn't handle suggestionCount and only returned the one
> >>> corrected
> >>> phrase which contains the "best" corrected terms. There is an
> >>> issue on
> >>> Lucene issue tracker regarding multi-word spellchecker:
> >>>
> >> https://issues.apache.org/jira/browse/LUCENE-550?
> >> page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> >>>
> >>
> >>
> >> I'd be interested to take a look at your modifications to the
> >> SpellCheckerHandler, how did you handle phrase queries? maybe we
> >> can open
> >> a
> >> JIRA issue to expand the spell checking functionality to perform
> >> analysis
> >> on
> >> multi-word input values.
> >>
> >> I did find http://issues.apache.org/jira/browse/LUCENE-626 after
> >> looking
> >> at
> >> LUCENE-550, but since these patches are not yet included in the
> >> Lucene
> >> trunk
> >> yet it might be a little difficult to justify implementing them in
> >> Solr.
> >>
> >
> >
> >
> > --
> > Regards,
> >
> > Cuong Hoang
>
>


-- 
Regards,

Cuong Hoang


Re: Embedded about 50% faster for indexing

2007-08-27 Thread climbingrose
Haven't tried the embedded server, but I think I have to agree with Mike.
We're currently sending batches of 2000 jobs to the Solr server, and the
amount of time required to transfer documents over HTTP is insignificant
compared with the time required to index them. So unless you are sending
documents one by one, embedded Solr shouldn't give you much more of a
performance boost.
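
Batching here just means posting many documents per add message; a sketch
with made-up fields:

<add>
  <doc><field name="id">1</field><field name="title">Java developer</field></doc>
  <doc><field name="id">2</field><field name="title">Accountant</field></doc>
  <!-- ...and so on, up to a few thousand docs per POST -->
</add>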

On 8/25/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
>
> On 24-Aug-07, at 2:29 PM, Wu, Daniel wrote:
>
> >> -Original Message-
> >> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf
> >> Of Yonik Seeley
> >> Sent: Friday, August 24, 2007 2:07 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Embedded about 50% faster for indexing
> >>
> >> One thing I'd like to avoid is everyone trying to embed just
> >> for performance gains. If there is really that much
> >> difference, then we need a better way for people to get that
> >> without resorting to Java code.
> >>
> >> -Yonik
> >>
> >
> > Theoretically and practically, embedded solution will be faster than
> > going through http/xml.
>
> This is only true if the http interface adds significant overhead to
> the cost of indexing a document, and I don't see why this should be
> so, as indexing is relatively heavyweight.  setting up the connection
> could be expensive, but this can be greatly mitigated by sending more
> than one doc per http request, using persistent connections, and
> threading.
>
> -Mike
>



-- 
Regards,

Cuong Hoang


Re: Embedded about 50% faster for indexing

2007-08-27 Thread climbingrose
Agreed. I was actually thinking of developing with the embedded version
early this year for one of my projects. I'm sure it will be needed in cases
where running another web server is overkill.

On 8/28/07, Jonathan Woods <[EMAIL PROTECTED]> wrote:
>
> I don't think you should apologise for highlighting embedded usage.  For
> circumstances in which you're at liberty to run a Solr instance in the
> same
> JVM as an app which uses it, I find it very strange that you should have
> to
> use anything _other_ than embedded, and jump through all the unnecessary
> hoops (XML conversion, HTTP transport) that this implies.  It's a bit like
> suggesting you should throw away Java method invocations altogether, and
> write everything in XML-RPC.
>
> Bit of a pet issue of mine!  I'll be creating a JIRA issue on the subject
> soon.
>
> Jon
>
> > -Original Message-
> > From: Sundling, Paul [mailto:[EMAIL PROTECTED]
> > Sent: 28 August 2007 03:24
> > To: solr-user@lucene.apache.org
> > Subject: RE: Embedded about 50% faster for indexing
> >
> > At this point I think I'm going recommend against embedded,
> > regardless of any performance advantage.  The level of
> > documentation is just too low, while the XML API is clearly
> > documented.  It's clear that XML is preferred.
> >
> > The embedded example on the wiki is pretty good, but until
> > mutliple core support comes out in the next version, you have
> > to use multiple SolrCore.  If they are accessed in the same
> > webapp, then you can't just set JNDI (since you can only have
> > one value).  So you have to use a Config object as alluded to
> > in the example.  However, you look at the code and there is
> > no javadoc for the constructor.  The constructor args are
> > (String name, InputStream is, String prefix).  I think name
> > is a unique name for the solr core, but that is a guess.
> > Inputstream may be a stream to the solr home, but it could be
> > anything.  Prefix may be a URI prefix.  These are all guesses
> > without trying to read through the code.
> >
> > When I look at SolrCore, it looks like it's a singleton, so
> > maybe I can't even access more than one SolrCore using
> > embedded anyway.  :(  So I apologize for highlighting Embedded.
> >
> > Anyway it's clear how to do multiple solr cores using XML.
> > You just have different post URI for the difference cores.
> > You can easily inject that with Spring and externalize the
> > config.  Simple and easy.  So I concede XML is the way to go. :)
> >
> > Paul Sundling
> >
> > -Original Message-
> > From: Mike Klaas [mailto:[EMAIL PROTECTED]
> > Sent: Monday, August 27, 2007 5:50 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Embedded about 50% faster for indexing
> >
> >
> > On 27-Aug-07, at 12:44 PM, Sundling, Paul wrote:
> >
> > > Whether embedded solr should give me a performance boost or not, it
> > > did.
> > > :)  I'm not surprised, since it skips XML parsing.
> > Although you never
> > > know where cycles are used for sure until you profile.
> >
> > It certainly is possible that XML parsing dwarfs indexing, but I'd
> > expect that only to occur under very light analysis and field
> > storage
> > workloads.
> >
> > > I tried doing more records per post (200) and it was
> > actually slightly
> >
> > > slower and seemed to require more memory.  This makes sense because
> > > you
> > > have to take up more memory for the StringBuilder to store the much
> > > larger XML.  For 10,000 it was much slower.  For that size I would
> > > need
> > > to XML streaming or something to make it work.
> > >
> > > The solr war was on the same machine, so network overhead was only
> > > from
> > > using loopback.
> >
> > The big question is still your connection handling strategy:
> > are you
> > using persistent http connections?  Are you threadedly indexing?
> >
> > cheers,
> > -Mike
> >
> > > Paul Sundling
> > >
> > > -Original Message-
> > > From: climbingrose [mailto:[EMAIL PROTECTED]
> > > Sent: Monday, August 27, 2007 12:22 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Embedded about 50% faster for indexing
> > >
> > >
> > > Haven't tried the embedded server but I think I have to agree with
> > > Mike.
> > > We're currently sending 2000 job ba

Re: Searching Versioned Resources

2007-09-12 Thread climbingrose
I think you can use the CollapseFilter to collapse on the "version" field.
However, I think you need to modify the CollapseFilter code to sort by
"version" so that the latest version is returned.

On 9/13/07, Adrian Sutton <[EMAIL PROTECTED]> wrote:
>
> Hi all,
> The document's we're indexing are versioned and generally we only
> want search results to return the latest version of a document,
> however there's a couple of scenarios where I'd like to be able to
> include previous versions in the search result.
>
> It feels like a straight-forward case of a filter, but given that
> each document has independent version numbers it's hard to know what
> to filter on. The only solution I can think of at the moment is to
> index each new version twice - once with the version and once with
> version=latest. We'd then tweak the ID field in such a way that there
> is only one version of each document with version=latest. It's then
> simple to use a filter for version=latest when we search.
>
> Is there a better way? Is there a way to achieve this without having
> to index the document twice?
>
> Thanks in advance,
>
> Adrian Sutton
> http://www.symphonious.net
>
>
>
>


-- 
Regards,

Cuong Hoang


Synchronize large number of records with Solr

2007-09-14 Thread climbingrose
Hi all,

I've been struggling to find a good way to synchronize Solr with a large
number of records. We collect our data from a number of sources and each
source produces around 50,000 docs. Each of these documents has a "sourceId"
field indicating the source of the document. Now assuming we're indexing all
documents from SourceA (sourceId="SourceA"), the majority of these docs are
already in Solr and we don't want to update them. However, there might be
some docs in Solr that are not in the batch, and we do want to delete them
from the index. So in summary:

1) If a doc is already in Solr, do nothing
2) If a doc is in the batch but not in Solr, index it
3) If a doc is in Solr but not in the batch, remove it from Solr.

The tricky part is 1), because if not for that requirement, I could simply
delete all documents with sourceId="SourceA" and reindex all documents from
SourceA. Any suggestions?

Thanks.

-- 
Regards,

Cuong Hoang


Re: Synchronize large number of records with Solr

2007-09-14 Thread climbingrose
Hi Erik,

>>So in your case #1, documents are reindexed with this scheme - so if you
>>truly need to skip a reindexing for some reason (why, though?) you'll
>>need to come up with some other mechanism.  [perhaps update could be
>>enhanced to allow ignoring a duplicate id rather than reindexing?]

It's pretty easy to ignore duplicate ids during indexing, but it won't solve
my problem. I think the batch number works well in your case because you
reindex existing documents, which then get the updated batch number. In my
case, I can't update existing documents and therefore, even if I use this
approach, there is no way to know whether a document is to be deleted. I
think I will need to store all ids in the batch in a DocSet and then compare
them with the list of all ids after indexing. That way I can at least get
rid of all expired documents. It's just not as elegant as using a batch
identifier.
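
Roughly, the cleanup step I have in mind looks like this sketch (plain Java
collections for illustration; in practice I'd stream the indexed ids from a
sourceId:SourceA query rather than hold two full sets in memory):

import java.util.HashSet;
import java.util.Set;

public class SourceSync {
    /**
     * Given the ids present in the incoming feed and the ids currently
     * indexed for the same source, work out which documents have expired.
     */
    public static Set<String> expiredIds(Set<String> batchIds, Set<String> indexedIds) {
        Set<String> expired = new HashSet<String>(indexedIds);
        expired.removeAll(batchIds); // in Solr but not in the feed -> delete
        return expired;
    }
}

Each expired id would then be removed with an ordinary delete-by-id request.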


Re: can solr do it?

2007-09-25 Thread climbingrose
I don't think you can with the current Solr because each instance runs in a
separate web app.

On 9/25/07, James liu <[EMAIL PROTECTED]> wrote:
>
> if use multi solr with one index, it will cache individually.
>
> so i think can it share their cache.(they have same config)
>
> --
> regards
> jl
>



-- 
Regards,

Cuong Hoang


Re: Solr replication

2007-10-01 Thread climbingrose
1)On solr.master:
+Edit scripts.conf:
solr_hostname=localhost
solr_port=8983
rsyncd_port=18983
+Enable and start rsync:
rsyncd-enable; rsyncd-start
+Run snapshooter:
snapshooter
After running this, you should be able to see a new folder named snapshot.*
in the data directory.
You can configure solrconfig.xml to trigger snapshooter after a commit or
optimise.

2) On slave:
+Edit scripts.conf:
solr_hostname=solr.master
solr_port=8986
rsyncd_port=18986
data_dir=
webapp_name=solr
master_host=localhost
master_data_dir=$MASTER_SOLR_HOME/data/
master_status_dir=$MASTER_SOLR_HOME/logs/clients/
+Run snappuller:
snappuller -P 18983
+Run snapinstaller:
snapinstaller

You should setup crontab to run snappuller and snapinstaller periodically.



On 10/1/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> Hi !
>
> I'm really new to Solr !
>
> Could anybody please explain me with a short example how I can setup a
> simple Solr replication with 3 machines (a master node and 2 slaves) ?
>
> This is my conf:
>
> * master (linux 2.6.20) :
> - Hostname "solr.master" with IP "192.168.1.1"
> * 2 slaves (linux 2.6.20) :
> - Hostname "solr.slave1" with IP "192.168.1.2"
> - Hostname "solr.slave2" with IP "192.168.1.3"
>
> N.B: sorry if the question was already asked before, but I could't find
> anything better than the "CollectionDistribution" on the Wiki.
>
> Regards
> Y.
>
>


-- 
Regards,

Cuong Hoang


Re: Re: Re: Solr replication

2007-10-01 Thread climbingrose
sh /bin/commit should trigger a refresh. However, this command is executed as
part of snapinstaller, so you shouldn't have to run it manually.

On 10/1/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> One more question about replication.
>
> Now that the replication is working, how can I see the changes on slave
> nodes ?
>
> The page statistics :
>
> "http://solr.slave1:8983/solr/admin/stats.jsp";
>
> doesn't reflect the correct number of indexed documents and still shows
> numDocs=0.
>
> Is there any command to tell Solr (on slave node) to sync itself with
> disk ?
>
> cheers
> Y.
>
> Message d'origine
> >De: [EMAIL PROTECTED]
> >A: solr-user@lucene.apache.org
> >Sujet: Re: Re: Solr replication
> >Date: Mon,  1 Oct 2007 15:00:46 +0200
> >
> >Works like a charm. Thanks very much.
> >
> >cheers
> >Y.
> >
> >Message d'origine
> >>Date: Mon, 1 Oct 2007 21:55:30 +1000
> >>De: climbingrose
> >>A: solr-user@lucene.apache.org
> >>Sujet: Re: Solr replication
> >>
> >>1)On solr.master:
> >>+Edit scripts.conf:
> >>solr_hostname=localhost
> >>solr_port=8983
> >>rsyncd_port=18983
> >>+Enable and start rsync:
> >>rsyncd-enable; rsyncd-start
> >>+Run snapshooter:
> >>snapshooter
> >>After running this, you should be able to see a new folder named
> snapshot.*
> >>in data/index folder.
> >>You can can solrconfig.xml to trigger snapshooter after a commit or
> >>optimise.
> >>
> >>2) On slave:
> >>+Edit scripts.conf:
> >>solr_hostname=solr.master
> >>solr_port=8986
> >>rsyncd_port=18986
> >>data_dir=
> >>webapp_name=solr
> >>master_host=localhost
> >>master_data_dir=$MASTER_SOLR_HOME/data/
> >>master_status_dir=$MASTER_SOLR_HOME/logs/clients/
> >>+Run snappuller:
> >>snappuller -P 18983
> >>+Run snapinstaller:
> >>snapinstaller
> >>
> >>You should setup crontab to run snappuller and snapinstaller
> periodically.
> >>
> >>
> >>
> >>On 10/1/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >>>
> >>> Hi !
> >>>
> >>> I'm really new to Solr !
> >>>
> >>> Could anybody please explain me with a short example how I can setup a
> >>> simple Solr replication with 3 machines (a master node and 2 slaves) ?
> >>>
> >>> This is my conf:
> >>>
> >>> * master (linux 2.6.20) :
> >>> - Hostname "solr.master" with IP "192.168.1.1"
> >>> * 2 slaves (linux 2.6.20) :
> >>> - Hostname "solr.slave1" with IP "192.168.1.2"
> >>> - Hostname "solr.slave2" with IP "192.168.1.3"
> >>>
> >>> N.B: sorry if the question was already asked before, but I could't
> find
> >>> anything better than the "CollectionDistribution" on the Wiki.
> >>>
> >>> Regards
> >>> Y.
> >>>
> >>>
> >>
> >>
> >>--
> >>Regards,
> >>
> >>Cuong Hoang
> >>
> >>
> >
>
>


-- 
Regards,

Cuong Hoang


Re: getting number of stored documents via rest api

2007-10-10 Thread climbingrose
I think a search for "*:*" is the simplest way to do it. I don't think you
can do anything faster through the web service.
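
Note that you can pass rows=0 so no documents are actually fetched; the
response still carries numFound even though nothing is returned. For example:

http://localhost:8983/solr/select?q=*:*&rows=0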

On 10/11/07, Stefan Rinner <[EMAIL PROTECTED]> wrote:
>
> Hi
>
> for some tests I need to know how many documents are stored in the
> index - is there a fast & easy way to retrieve this number (instead
> of searching for "*:*" and counting the results)?
> I already took a look at the stats.jsp code - but there the number of
> documents is retrieved via an api call to SolrInfoRegistry and not
> the webservice.
>
> thanks
>
> - stefan
>



-- 
Regards,

Cuong Hoang


Re: Spell Check Handler

2007-10-11 Thread climbingrose
Hi all,

I've been so busy the last few days that I haven't replied to this email. I
modified SpellCheckerHandler a while ago to include support for multiword
queries. To be honest, I didn't have time to write unit tests for the code.
However, I deployed it in a production environment and it has been working
for me so far. My version, however, makes two assumptions:

1) I assume that when the user enters a misspelled multiword query, we should
only check the words that are actually misspelled. For example, if the user
enters "life expectancy calculatar", which has "calculator" misspelled, we
should only spellcheck "calculatar".
2) I only return the best string for a misspelled query.

I guess I can just directly paste the code here so that others can adapt it
for their own purposes. If you have any questions, just send me an email.
I'll be happy to help you.

StringBuffer buf = null;
if (null != words && !"".equals(words.trim())) {
    Analyzer analyzer = req.getSchema().getField(field).getType().getAnalyzer();
    TokenStream source = analyzer.tokenStream(field, new StringReader(words));
    Token t;
    boolean hasSuggestion = false;
    boolean termExists = false;
    while (true) {
        try {
            t = source.next();
        } catch (IOException e) {
            t = null;
        }
        if (t == null)
            break;

        String termText = t.termText();
        String[] suggestions = spellChecker.suggestSimilar(termText,
                numSug, req.getSearcher().getReader(), restrictToField, true);
        if (suggestions != null && suggestions.length > 0) {
            // the term has a more popular correction: use it
            if (!suggestions[0].equals(termText)) {
                hasSuggestion = true;
            }
            if (buf == null) {
                buf = new StringBuffer(suggestions[0]);
            } else {
                buf.append(" ").append(suggestions[0]);
            }
        } else if (spellChecker.exist(termText)) {
            // no suggestion, but the term itself is in the dictionary: keep it
            termExists = true;
            if (buf == null) {
                buf = new StringBuffer(termText);
            } else {
                buf.append(" ").append(termText);
            }
        } else {
            // unknown term with no suggestion: give up on the whole phrase
            hasSuggestion = false;
            termExists = false;
            break;
        }
    }
    try {
        source.close();
    } catch (IOException e) {
        // ignore
    }
    // String[] suggestions = spellChecker.suggestSimilar(words, numSug,
    //         nullReader, restrictToField, onlyMorePopular);
    if (hasSuggestion || (!hasSuggestion && termExists))
        rsp.add("suggestions", buf.toString());
    else
        rsp.add("suggestions", null);
}



On 10/11/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> Hoss,
>
> I had a feeling someone would be quoting Yonik's Law of Patches!  ;-)
>
> For now, this is done.
>
> I created the changes, created JavaDoc comments on the various settings
> and their expected output, created a JUnit test for the
> SpellCheckerRequestHandler
> which tests various components of the handler, and I also created the
> supporting configuration files for the JUnit tests (schema and solrconfig
> files).
>
> I attached the patch to the JIRA issue so now we just have to wait until
> it gets
> added back in to the main code stream.
>
> For anyone who is interested, here is a link to the JIRA:
> https://issues.apache.org/jira/browse/SOLR-375
>
> Could someone please drop me a hint on how to update the wiki or any other
> documentation that could benefit to being updated; I'll like to help out
> as much
> as possible, but first I need to know "how". ;-)
>
> When these changes do get committed back in to the daily build, please
> review the generated JavaDoc for information on how to utilize these new
> features.
> If anyone has any questions, or comments, please do not hesitate to ask.
>
> As a general note of a self-critique on these changes, I am not 100% sure
> of the way I
> implemented the "nested" structure when the "multiWords" parameter is
> used.  My interest
> is that it should work smoothly with some other technology such as
> Prototype using the
> JSon output type.  Unfortunately, I will not be getting a chance to start
> on that coding until
> next week so it is up in the air as to if this structure will be conducive
> or not.  I am planning
> on providing more details in the documentations as far as how to utilize
> these modifications
> in Prototype and AJax when I get a chance (even provide links to a
> production site so you
> can see it in action and view the source if interested).  So stay tuned...
>
>Thanks for everyones time,
>   Scott Tabar
>
>  Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : If you like, I can p

Re: Spell Check Handler

2007-10-11 Thread climbingrose
Just to clarify this line of code:

String[] suggestions = spellChecker.suggestSimilar(termText, numSug,
req.getSearcher().getReader(), restrictToField, true);

I only return suggestions if they are more popular than termText. You
probably need to use code in Scott's patch to make this behaviour
configurable.

On 10/11/07, climbingrose <[EMAIL PROTECTED]> wrote:
>
> Hi all,
>
> I've been so busy the last few days so I haven't replied to this email. I
> modified SpellCheckerHandler a while ago to include support for multiword
> query. To be honest, I didn't have time to write unit test for the code.
> However, I deployed it in a production environment and it has been working
> for me so far. My version, however, has two assumptions:
>
> 1) I assumpt that when user enter a misspelled multiword query, we should
> only check for words that are actually misspelled. For example, if user
> enter "life expectancy calculatar", which has "calculator" misspelled, we
> should only spellcheck "calculatar".
> 2) I only return the best string for a mispelled query.
>
> I guess I can just directly paste the code here so that others can adapt
> for their own purposes. If you have any question, just send me an email.
> I'll happy to help  you.
>
> StringBuffer buf = null;
> if (null != words && !"".equals(words.trim())) {
> Analyzer analyzer = req.getSchema
> ().getField(field).getType().getAnalyzer();
>
> TokenStream source = analyzer.tokenStream(field, new
> StringReader(words));
> Token t;
> boolean hasSuggestion = false;
> boolean termExists = false;
> while (true) {
> try {
> t = source.next();
> } catch (IOException e) {
> t = null;
> }
> if (t == null)
> break;
>
> String termText = t.termText();
> String[] suggestions = spellChecker.suggestSimilar(termText,
> numSug, req.getSearcher().getReader(), restrictToField, true);
> if (suggestions != null && suggestions.length > 0) {
> if (!suggestions[0].equals(termText)) {
> hasSuggestion = true;
> }
> if (buf == null) {
> buf = new StringBuffer(suggestions[0]);
> } else
> buf.append(" ").append(suggestions[0]);
> } else if (spellChecker.exist(termText)){
> termExists = true;
> if (buf == null) {
> buf = new StringBuffer(termText);
> } else
> buf.append(" ").append(termText);
> } else {
> hasSuggestion = false;
> termExists= false;
> break;
> }
> }
> try {
> source.close();
> } catch (IOException e) {
> // ignore
> }
> // String[] suggestions = spellChecker.suggestSimilar(words,
> numSug,
> // nullReader, restrictToField, onlyMorePopular);
> if (hasSuggestion || (!hasSuggestion && termExists))
> rsp.add("suggestions", buf.toString());
> else
> rsp.add("suggestions", null);
>
>
>
> On 10/11/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >
> > Hoss,
> >
> > I had a feeling someone would be quoting Yonik's Law of Patches!  ;-)
> >
> > For now, this is done.
> >
> > I created the changes, created JavaDoc comments on the various settings
> > and their expected output, created a JUnit test for the
> > SpellCheckerRequestHandler
> > which tests various components of the handler, and I also created the
> > supporting configuration files for the JUnit tests (schema and
> > solrconfig files).
> >
> > I attached the patch to the JIRA issue so now we just have to wait until
> > it gets
> > added back in to the main code stream.
> >
> > For anyone who is interested, here is a link to the JIRA:
> > https://issues.apache.org/jira/browse/SOLR-375
> >
> > Could someone please drop me a hint on how to update the wiki or any
> > other
> > documentation that could benefit to being updated; I'll like to help out
> > as much
> > as possible, but first I need to know "how". ;-)

Re: multiple delete by id in one delete command?

2007-11-18 Thread climbingrose
The easiest solution I know is a delete-by-query over the ids:
<delete><query>id:1 OR id:2 OR ...</query></delete>
If you know that all of these ids can be found by issuing a query, you
can do delete by query directly:
<delete><query>YOUR_DELETE_QUERY_HERE</query></delete>

Cheers

On Nov 19, 2007 4:18 PM, Norberto Meijome <[EMAIL PROTECTED]> wrote:
> Hi everyone,
>
> I'm trying to issue, via curl to SOLR (testing at the moment), 3 deletes by 
> id.
> I tried sending :
>
> 123
>
> and solr didn't like it at all.
>
> When I changed it to :
>
> 123
>
> as in :
>
> curl http://localhost:8983/vcs/update -H "Content-Type: text/xml" 
> --data-binary 
> '816bc47fd52ffb9c6059e6975eafa168949d51dfa93dbe3c1eca169edd19b353f3f80e65482a5be353e7110f5308949d51dfa93dbe3c1eca169edd19b3'
>
> only the 1st ( id =1 , or id = 
> 816bc47fd52ffb9c6059e6975eafa168949d51dfa93dbe3c1eca169edd19b3 gets deleted 
> (after a commit, of course).
>
> So i figure I will have to issue a series of independent
> <delete><id>xxx</id></delete> commands... Is it not possible to bunch them
> all together as it's possible with <add> ?
>
>
> thanks!!
> Beto
> _
> {Beto|Norberto|Numard} Meijome
>
> "Imagination is more important than knowledge."
>   Albert Einstein, On Science
>
> I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
> Reading disclaimers makes you go blind. Writing them is worse. You have been 
> Warned.
>



-- 
Regards,

Cuong Hoang


Re: Pagination with Solr

2007-11-19 Thread climbingrose
Hi David,

Do you use one of the Solr clients available at
http://wiki.apache.org/solr/IntegratingSolr? These clients should already
have done all the XML parsing work for you. I speak from Solrj experience.

IMO, your approach is the one most commonly used when it comes to
pagination. Solr's caching mechanisms should speed up the request for the
next page.
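
With Solrj the whole thing looks roughly like this sketch (the server URL
and the query are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class PagingExample {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        int rows = 10;
        SolrQuery query = new SolrQuery("title:java").setStart(0).setRows(rows);
        QueryResponse rsp = server.query(query);
        SolrDocumentList results = rsp.getResults();
        long numFound = results.getNumFound(); // total hits, no manual parsing
        if (numFound > rows) {
            query.setStart(rows); // fetch the second page
            rsp = server.query(query);
        }
    }
}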

Cheers,

On Nov 20, 2007 10:27 AM, Dave C. <[EMAIL PROTECTED]> wrote:
> Hello again,
>
> I'm trying to accomplish very basic pagination with my Solr search results.
>
> What I'm trying is to parse the response for "numFound:" and if 
> this number is greater than the "rows" parameter, I send another search 
> request to Solr with a new "start" parameter.
> Is there a better way to do this?  Specifically, is there another way to 
> obtain the "numFound" rather than parsing the response stream/string?
>
> Thanks a lot,
> David
>
> _
> Share life as it happens with the new Windows Live.Download today it's FREE!
> http://www.windowslive.com/share.html?ocid=TXT_TAGLM_Wave2_sharelife_112007



-- 
Regards,

Cuong Hoang


Re: Finding all possible synonyms for a word

2007-11-19 Thread climbingrose
One approach is to extend SynonymFilter so that it reads synonyms from
database instead of a file. SynonymFilter is just a Java class so you
can do whatever you want with it :D. From what I remember, the filter
initialises a list of all input synonyms and stores them in memory.
Therefore, you need to make sure that all the synonyms can fit into
memory at runtime.

On Nov 20, 2007 1:54 AM, Kishore AVK. Veleti <[EMAIL PROTECTED]> wrote:
> Hi Eswar,
>
> Thanks for the update.
>
> I have gone through the below link provided by you and what I understood from 
> it is, we need to have all possible synonyms in a text file. This file need 
> to be given as input for "SynonymFilterFactory" to work. If my understanding 
> is right then the approach may not suit my requirement. Reason is I need to 
> find synonyms of all the keywords in category description and store those 
> synonyms in the above said input file. The file may be too big.
>
> Let me know if my understanding is wrong.
>
>
> Thanks,
> Kishore Veleti A.V.K.
>
>
>
>
> -Original Message-
> From: Eswar K [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 19, 2007 11:22 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Finding all possible synonyms for a word
>
> Kishore,
>
> Solr has a SynonymFilterFactory which might be off use to you (
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46)
>
>
> Regards,
> Eswar
>
> On Nov 18, 2007 10:39 PM, Kishore AVK. Veleti <[EMAIL PROTECTED]>
> wrote:
>
> > Hi All,
> >
> > I am new to Lucene / SOLR and developing a POC as part of research. Check
> > below my requirement and problem statement. Need help on how I can index the
> > data such data I have a very good search functionality in my POC.
> >
> > --
> > Requirement:
> > --
> >
> > Assume my web application is an Online book store and it sell all
> > categories of books like Computers, Social Studies, Physical Sciences etc.
> > Each of these categories has sub-categories. For example Computers has
> > sub-categories like Software Engineering, Java, SQL Server etc
> >
> > I have a database table called Categories and it contains both Parent
> > Category descriptions and also Child Category descriptions.
> >
> > Data structure of Category table is:
> >
> > Category_ID_Primay_Key  integer
> > Parent_Category_ID  integer
> > Category_Name varchar(100)
> > Category_Description varchar(1000)
> >
> >
> > --
> > My Search UI:
> > --
> >
> > My search page is very simple. We have a text field with "Search" button.
> >
> > --
> > User Action:
> > --
> >
> > User enter below search text in above text field and clicks on "Search"
> > button.
> >
> > "Books on Data Center"
> >
> > --
> > What is my expected behavior:
> > --
> >
> > Since the word "Data Center" more relevant computers I should show books
> > related to computers.
> >
> > --
> > My Problem statement and Question to you all:
> > --
> >
> > To have a better search in my web applications what kind of strategy
> > should I have and index the data accordingly in SOLR/Lucene.
> >
> > In my Lucene Index I may or may not have the word "data center". Still I
> > should be able to return "data center"
> >
> > One thought I have is as follows:
> >
> > Modify the Category table by adding one more column to it:
> >
> > Category_ID_Primay_Key  integer
> > Parent_Category_ID  integer
> > Category_Name varchar(100)
> > Category_Description varchar(1000)
> > Category_Description_Keywords varchar(8000)
> >
> > Now take each word in "Category_description", find synonyms of it and
> > store that data in Category_Description_Keywords column. After doing it,
> > index the Category table records in SOLR/Lucene.
> >
> > Below are my questions to you all:
> >
> > Question 1:
> > Need your feedbacks on above approach or any other approach which help me
> > to make my search better that returns most relevant results to the user.
> >
> > Question 2:
> > Can you suggest me Java based best Open Source or commercial synonym
> > engines. I want such a best synonym engine that gives me all possible
> > synonyms of a word.
> >
> >
> >
> > Thanks in Advance,
> > Kishore Veleti A.V.K.
> >
>



-- 
Regards,

Cuong Hoang


Re: Finding all possible synonyms for a word

2007-11-19 Thread climbingrose
Correction to my last message: you need to modify or extend
SynonymFilterFactory instead of SynonymFilter. SynonymFilterFactory is
responsible for initialising SynonymFilter and populating the list of
synonyms. Have a look at the source code; I think it's pretty easy to
understand. What you probably need to do is add more parameters, such as
database host, username, password and the actual database, in the
init() method.
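
The DB part itself is straightforward JDBC. A sketch (the table and column
names are made up; the resulting lines would then be fed through the same
rule-parsing code the stock SynonymFilterFactory uses for synonyms.txt):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class DbSynonymLoader {
    /** Loads one comma-separated synonym rule (e.g. "tv,television,telly") per row. */
    public static List<String> loadRules(String jdbcUrl, String user, String pass)
            throws SQLException {
        List<String> rules = new ArrayList<String>();
        Connection conn = DriverManager.getConnection(jdbcUrl, user, pass);
        try {
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery("SELECT rule FROM synonyms"); // hypothetical table
            while (rs.next()) {
                rules.add(rs.getString(1));
            }
            rs.close();
            st.close();
        } finally {
            conn.close();
        }
        return rules;
    }
}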

On Nov 20, 2007 3:18 PM, climbingrose <[EMAIL PROTECTED]> wrote:
> One approach is to extend SynonymFilter so that it reads synonyms from
> database instead of a file. SynonymFilter is just a Java class so you
> can do whatever you want with it :D. From what I remember, the filter
> initialises a list of all input synonyms and store them in memory.
> Therefore, you need to make sure that all the synonyms can fit into
> memory at runtime.
>
>
> On Nov 20, 2007 1:54 AM, Kishore AVK. Veleti <[EMAIL PROTECTED]> wrote:
> > Hi Eswar,
> >
> > Thanks for the update.
> >
> > I have gone through the below link provided by you and what I understood 
> > from it is, we need to have all possible synonyms in a text file. This file 
> > need to be given as input for "SynonymFilterFactory" to work. If my 
> > understanding is right then the approach may not suit my requirement. 
> > Reason is I need to find synonyms of all the keywords in category 
> > description and store those synonyms in the above said input file. The file 
> > may be too big.
> >
> > Let me know if my understanding is wrong.
> >
> >
> > Thanks,
> > Kishore Veleti A.V.K.
> >
> >
> >
> >
> > -Original Message-
> > From: Eswar K [mailto:[EMAIL PROTECTED]
> > Sent: Monday, November 19, 2007 11:22 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Finding all possible synonyms for a word
> >
> > Kishore,
> >
> > Solr has a SynonymFilterFactory which might be off use to you (
> > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46)
> >
> >
> > Regards,
> > Eswar
> >
> > On Nov 18, 2007 10:39 PM, Kishore AVK. Veleti <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Hi All,
> > >
> > > I am new to Lucene / SOLR and developing a POC as part of research. Check
> > > below my requirement and problem statement. Need help on how I can index 
> > > the
> > > data such data I have a very good search functionality in my POC.
> > >
> > > --
> > > Requirement:
> > > --
> > >
> > > Assume my web application is an Online book store and it sell all
> > > categories of books like Computers, Social Studies, Physical Sciences etc.
> > > Each of these categories has sub-categories. For example Computers has
> > > sub-categories like Software Engineering, Java, SQL Server etc
> > >
> > > I have a database table called Categories and it contains both Parent
> > > Category descriptions and also Child Category descriptions.
> > >
> > > Data structure of Category table is:
> > >
> > > Category_ID_Primay_Key  integer
> > > Parent_Category_ID  integer
> > > Category_Name varchar(100)
> > > Category_Description varchar(1000)
> > >
> > >
> > > --
> > > My Search UI:
> > > --
> > >
> > > My search page is very simple. We have a text field with "Search" button.
> > >
> > > --
> > > User Action:
> > > --
> > >
> > > User enter below search text in above text field and clicks on "Search"
> > > button.
> > >
> > > "Books on Data Center"
> > >
> > > --
> > > What is my expected behavior:
> > > --
> > >
> > > Since the word "Data Center" more relevant computers I should show books
> > > related to computers.
> > >
> > > --
> > > My Problem statement and Ques

Re: Near Duplicate Documents

2007-11-21 Thread climbingrose
The duplication detection mechanism in Nutch is quite primitive. I
think it uses an MD5 signature generated from the content of a field.
The generation algorithm is described here:
http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html.

The problem with this approach is that an MD5 hash is very sensitive: a
one-letter difference will generate a completely different hash. You
probably have to roll your own near-duplication detection algorithm.
My advice is to have a look at the existing literature on near-duplication
detection techniques and then implement one of them. I know Google has
some papers that describe a technique called minhash. I read the paper
and found it very interesting. I'm not sure if you can implement that
algorithm because they have patented it. That said, there is plenty of
literature on near-dup detection, so you should be able to find one for
free!
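
To give a feel for the simplest family of these techniques, here is a
sketch of w-shingling with a Jaccard similarity measure (the whitespace
tokenisation and whatever threshold you pick are assumptions):

import java.util.HashSet;
import java.util.Set;

public class ShingleSimilarity {
    /** Builds the set of w-word shingles for a document. */
    static Set<String> shingles(String text, int w) {
        String[] tokens = text.toLowerCase().split("\\s+");
        Set<String> result = new HashSet<String>();
        for (int i = 0; i + w <= tokens.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < w; j++) {
                if (j > 0) sb.append(' ');
                sb.append(tokens[i + j]);
            }
            result.add(sb.toString());
        }
        return result;
    }

    /** Jaccard similarity |A n B| / |A u B|; close to 1.0 means near-duplicate. */
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        Set<String> inter = new HashSet<String>(a);
        inter.retainAll(b);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }
}

Minhash is essentially a way of estimating this Jaccard similarity without
comparing the full shingle sets.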

On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote:
> Otis,
>
> Thanks for your response.
>
> I just gave a quick look to the Nutch Forum and find that there is an
> implementation to obtain de-duplicate documents/pages but none for Near
> Duplicates documents. Can you guide me a little further as to where exactly
> under Nutch I should be concentrating, regarding near duplicate documents?
>
> Regards,
> Rishabh
>
> On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
> wrote:
>
>
> > To whomever started this thread: look at Nutch.  I believe something
> > related to this already exists in Nutch for near-duplicate detection.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > - Original Message 
> > From: Mike Klaas <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Sunday, November 18, 2007 11:08:38 PM
> > Subject: Re: Near Duplicate Documents
> >
> > On 18-Nov-07, at 8:17 AM, Eswar K wrote:
> >
> > > Is there any idea implementing that feature in the up coming
> >  releases?
> >
> > Not currently.  Feel free to contribute something if you find a good
> > solution .
> >
> > -Mike
> >
> >
> > > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> > >
> > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> > >>> We have a scenario, where we want to find out documents which are
> > >> similar in
> > >>> content. To elaborate a little more on what we mean here, lets
> > >>> take an
> > >>> example.
> > >>>
> > >>> The example of this email chain in which we are interacting on,
> > >>> can be
> > >> best
> > >>> used for illustrating the concept of near dupes (We are not getting
> > >> confused
> > >>> with threads, they are two different things.). Each email in this
> > >>> thread
> > >> is
> > >>> treated as a document by the system. A reply to the original mail
> > >>> also
> > >>> includes the original mail in which case it becomes a near
> > >>> duplicate of
> > >> the
> > >>> orginal mail (depending on the percentage of similarity).
> > >>> Similarly it
> > >> goes
> > >>> on. The near dupes need not be limited to emails.
> > >>
> > >> I think this is what's known as "shingling."  See
> > >> http://en.wikipedia.org/wiki/W-shingling
> > >> Lucene (and therefore Solr) does not implement shingling.  The
> > >> "MoreLikeThis" query might be close enough, however.
> > >>
> > >> -Stuart
> > >>
> >
> >
> >
> >
> >
>



-- 
Regards,

Cuong Hoang


Re: Help with Debian solr/jetty install?

2007-11-21 Thread climbingrose
Make sure you have the JDK installed, not just the JRE. Also try setting
the JAVA_HOME environment variable:

apt-get install sun-java5-jdk




On Nov 21, 2007 5:50 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Phillip,
>
> I won't go into details, but I'll point out that the Java compiler is called 
> javac and if memory serves me well, it is defined in one of Jetty's XML 
> config files in its etc/ dir.  The java compiler is used to compile JSPs that 
> Solr uses for the admin UI.  So, make sure you have javac and make sure Jetty 
> can find it.
>
> Otis
>
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> - Original Message 
> From: Phillip Farber <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, November 20, 2007 5:55:27 PM
> Subject: Help with Debian solr/jetty install?
>
>
> Hi,
>
> I've successfully run as far as the example admin page on Debian linux
>  2.6.
>
> So I installed the solr-jetty packaged for Debian testing which gives
>  me
> Jetty 5.1.14-1 and Solr 1.2.0+ds1-1.  Jetty starts fine and so does the
>
> Solr home page at http://localhost:8280/solr
>
> But I get an error when I try to run http://localhost:8280/solr/admin
>
> HTTP ERROR: 500
> No Java compiler available
>
> I have sun-java6-jre and sun-java6-jdk packages installed.  I'm new to
> servlet containers and java webapps.  What should I be looking for to
> fix this or what information could I provide the list to get me moving
> forward from here?
>
> I've included the trace from the Jetty log, and the java properties
>  dump
> from the example below.
>
> Thanks,
> Phil
>
> ---
>
> Java properties (from the example):
> --
>
> sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386
> java.vm.version = 1.6.0-b105
> java.vm.name = Java HotSpot(TM) Client VM
> user.dir = /tmp/apache-solr-1.2.0/example
> java.runtime.version = 1.6.0-b105
> os.arch = i386
> java.io.tmpdir = /tmp
>
> java.library.path =
> /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386/client:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib
> java.class.version = 50.0
> jetty.home = /tmp/apache-solr-1.2.0/example
> sun.management.compiler = HotSpot Client Compiler
> os.version = 2.6.22-2-686
> java.class.path =
> /tmp/apache-solr-1.2.0/example:/tmp/apache-solr-1.2.0/example/lib/jetty-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jetty-util-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/servlet-api-2.5-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/ant-1.6.5.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/core-3.1.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-2.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-api-2.1.jar:/usr/share/ant/lib/ant.jar
> java.home = /usr/lib/jvm/java-6-sun-1.6.0.00/jre
> java.version = 1.6.0
> java.ext.dirs =
> /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/ext:/usr/java/packages/lib/ext
> sun.boot.class.path =
> /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/classes
>
>
>
>
> Jetty log (from the error under Debian Solr/Jetty):
> 
>
> org.apache.jasper.JasperException: No Java compiler available
> at
> org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:460)
> at
> org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:367)
> at
>  org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329)
> at
>  org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at
>  org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
> at
> org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
> at
>  org.mortbay.jetty.servlet.Dispatcher.dispatch(Dispatcher.java:286)
> at
>  org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:171)
> at org.mortbay.jetty.servlet.Default.handleGet(Default.java:302)
> at org.mortbay.jetty.servlet.Default.service(Default.java:223)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at
>  org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
> at
> org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:830)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:185)
> at
> org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:821)
> at
> org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.j

Re: Near Duplicate Documents

2007-11-21 Thread climbingrose
Hi Ken,

It's correct that uncommon words are most likely not showing up in the
signature. However, I was trying to say that if two documents have 99%
of their tokens in common and differ in one token with frequency > quantised
frequency, the two resulting hashes are completely different. If you
want true near-dup detection, what you would like to have is two
hashes that differ only in 1-2 bytes. That way, the signatures will
truly reflect the content of the documents they represent. However, with
this approach, you need a bit more work to cluster near-dup documents.
Basically, once you have a hash function as I described above,
finding similar documents comes down to the Hamming distance problem: two
docs are near dups if their hashes differ in k positions (with k
small, perhaps < 3).

On Nov 22, 2007 2:35 AM, Ken Krugler <[EMAIL PROTECTED]> wrote:
> >The duplication detection mechanism in Nutch is quite primitive. I
> >think it uses a MD5 signature generated from the content of a field.
> >The generation algorithm is described here:
> >http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html.
> >
> >The problem with this approach is MD5 hash is very sensitive: one
> >letter difference will generate completely different hash.
>
> I'm confused by your answer, assuming it's based on the page
> referenced by the URL you provided.
>
> The approach by TextProfileSignature would only generate a different
> MD5 hash with a single letter change if that change resulted in a
> change in the quantized frequency for that word. And if it's an
> uncommon word, then it wouldn't even show up in the signature.
>
> -- Ken
>
>
> >You
> >probably have to roll your own near duplication detection algorithm.
> >My advice is have a look at existing literature on near duplication
> >detection techniques and then implement one of them. I know Google has
> >some papers that describe a technique called minhash. I read the paper
> >and found it's very interesting. I'm not sure if you can implement the
> >algorithm because they have patented it. That said, there are plenty
> >literature on near dup detection so you should be able to get one for
> >free!
> >
> >On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote:
> >>  Otis,
> >>
> >>  Thanks for your response.
> >>
> >  > I just gave a quick look to the Nutch Forum and find that there is an
> >>  implementation to obtain de-duplicate documents/pages but none for Near
> >>  Duplicates documents. Can you guide me a little further as to where 
> >> exactly
> >  > under Nutch I should be concentrating, regarding near duplicate 
> > documents?
> >  >
> >  > Regards,
> >>  Rishabh
> >>
> >>  On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
> >>  wrote:
> >>
> >>
> >>  > To whomever started this thread: look at Nutch.  I believe something
> >>  > related to this already exists in Nutch for near-duplicate detection.
> >>  >
> >>  > Otis
> >>  > --
> >>  > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >>  >
> >>  > - Original Message 
> >>  > From: Mike Klaas <[EMAIL PROTECTED]>
> >>  > To: solr-user@lucene.apache.org
> >>  > Sent: Sunday, November 18, 2007 11:08:38 PM
> >>  > Subject: Re: Near Duplicate Documents
> >>  >
> >>  > On 18-Nov-07, at 8:17 AM, Eswar K wrote:
> >>  >
> >>  > > Is there any idea implementing that feature in the up coming
> >>  >  releases?
> >>  >
> >  > > Not currently.  Feel free to contribute something if you find a good
> >>  > solution .
> >  > >
> >>  > -Mike
> >>  >
> >>  >
> >>  > > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> >>  > >
> >>  > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> >>  > >>> We have a scenario, where we want to find out documents which are
> >>  > >> similar in
> >>  > >>> content. To elaborate a little more on what we mean here, lets
> >>  > >>> take an
> >>  > >>> example.
> >>  > >>>
> >>  > >>> The example of this email chain in which we are interacting on,
> >>  > >>> can be
> >>  > >> best
> >>  > >>> used for illustrating the concept of near dupes (We are not getting
> >>  > >> confused
> >>  > >>> with threads, they are two different things.). Each email in this
> >>  > >>> thread
> >>  > >> is
> >>  > >>> treated as a document by the system. A reply to the original mail
> >>  > >>> also
> >>  > >>> includes the original mail in which case it becomes a near
> >>  > >>> duplicate of
> >>  > >> the
> >>  > >>> orginal mail (depending on the percentage of similarity).
> >>  > >>> Similarly it
> >>  > >> goes
> >>  > >>> on. The near dupes need not be limited to emails.
> >>  > >>
> >>  > >> I think this is what's known as "shingling."  See
> >>  > >> http://en.wikipedia.org/wiki/W-shingling
> >>  > >> Lucene (and therefore Solr) does not implement shingling.  The
> >>  > >> "MoreLikeThis" query might be close enough, however.
> >>  > >>
> >  > > >> -Stuart
>
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you c

Re: Get last updated/committed document

2007-11-23 Thread climbingrose
Assuming that you have the timestamp field defined:
q=*:*&sort=timestamp desc
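
If you only need the single most recent document, add rows=1, e.g.:

http://localhost:8983/solr/select?q=*:*&sort=timestamp+desc&rows=1

(This assumes an indexed timestamp field, e.g. one declared with
default="NOW" as in the example schema.)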

On Nov 23, 2007 10:43 PM, Thorsten Scherler
<[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I need to ask solr to return me the id of the last committed document.
>
> Is there a way to archive this via a standard lucene query or do I need
> a custom connector that gives me this information?
>
> TIA for any information
>
> salu2
> --
> Thorsten Scherler thorsten.at.apache.org
> Open Source Java  consulting, training and solutions
>
>



-- 
Regards,

Cuong Hoang


Access to SolrIndexSearcher in UpdateProcessor

2007-12-02 Thread climbingrose
Hi all,

I'm trying to implement a custom UpdateProcessor which requires access to
SolrIndexSearcher. However, I'm constantly running into "Too many open
files" exception. I'm confused about which is the correct way to get access
to SolrIndexSearcher in UpdateProcessor:

1) req.getSearcher()
2) req.getCore().getSearcher()
3) req.getCore().newSearcher("MyCustomerProcessorFactory");

I have tried 1) & 3) but both produce "Too many open files". The weird thing
with 3) is that the SolrIndexSearcher created gets set to null automatically by
Solr, so I didn't get a chance to call the searcher.close() method. I suspect
all searchers opened this way are set to null when a commit is made. Any
recommendations?
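
For completeness, the pattern I'm experimenting with now is the ref-counted
one (a sketch, assuming the trunk's org.apache.solr.util.RefCounted API):

// inside the UpdateProcessor
RefCounted<SolrIndexSearcher> ref = req.getCore().getSearcher();
try {
    SolrIndexSearcher searcher = ref.get();
    // ... use the searcher here ...
} finally {
    ref.decref(); // release the reference; don't close() a shared searcher
}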

-- 
Regards,

Cuong Hoang


Re: SOLR sorting - question

2007-12-04 Thread climbingrose
I don't think you have to: a sort field only needs to be indexed, it doesn't
have to appear in the returned field list. Just try the query on the REST
interface and you will know.

On Dec 5, 2007 9:56 AM, Kasi Sankaralingam <[EMAIL PROTECTED]> wrote:

> Do I need to select the fields in the query that I am trying to sort on?,
> for example if I want sort on update date then do I need to select that
> field?
>
> Thanks,
>



-- 
Regards,

Cuong Hoang


Re: solr + maven?

2007-12-05 Thread climbingrose
Hi Ryan,

I'm using Solr with Maven 2 in our project. Here is what my pom.xml looks
like:

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>1.3.0</version>
</dependency>

Since I have all of solrj's dependencies declared by other artifacts, I don't
need to declare any of them myself. You'll probably have to add the
commons-httpclient artifact.

On Dec 5, 2007 10:08 AM, Ryan McKinley <[EMAIL PROTECTED]> wrote:

> Is anyone managing solr projects with maven?  I see:
> https://issues.apache.org/jira/browse/SOLR-19
> but that is >1 year old
>
> If someone has a current pom.xml, can you post it on SOLR-19?
>
> I just started messing with maven, so I don't really know what I am
> doing yet.
>
> thanks
> ryan
>



-- 
Regards,

Cuong Hoang


Re: Replication hooks

2007-12-10 Thread climbingrose
I think there is an event listener interface for hooking into Solr events
such as post-commit, post-optimise and opening a new searcher. I can't
remember its name off the top of my head, but if you do a search for
*EventListener in Eclipse, you'll find it.
The Wiki shows how to trigger snapshooter after each commit and optimise.
You should be able to follow that example to create your own listener.

On Dec 11, 2007 1:03 PM, Tracy Flynn <[EMAIL PROTECTED]>
wrote:

> Hi,
>
> I'm interested in setting up simple replication. I've reviewed all the
> Wiki information, looked at the scripts etc. and understand most of
> what I see.
>
> There are some references to  'hooks in the code'  for both the master
> and slave nodes for handling replication. I've searched the 1.2 and
> trunk code bases for obvious phrases, but I can't identify these hooks.
>
> Can someone please point me to the correct place(s) to look?
>
> Thanks,
>
> Tracy Flynn
>



-- 
Regards,

Cuong Hoang


Re: Issues with postOptimize

2007-12-17 Thread climbingrose
Make sure that the user running Solr has permission to execute snapshooter.
Also, try ./snapshooter instead of snapshooter.

Good luck.

On Dec 18, 2007 10:57 AM, Sunny Bassan <[EMAIL PROTECTED]> wrote:

> I've set up solrconfig.xml to create a snap shot of an index after doing
> a optimize, but the snap shot cannot be created because of permission
> issues. I've set permissions to the bin, data and log directories to
> read/write/execute for all users. Even with these settings I cannot seem
> to be able to run snapshooter on the postOptimize event. Any ideas?
> Could it be a java permissions issue? Thanks.
>
> Sunny
>
> Config settings:
>
> <listener event="postOptimize" class="solr.RunExecutableListener">
>   <str name="exe">snapshooter</str>
>   <str name="dir">/search/replication_test/0/index/solr/bin</str>
>   <bool name="wait">true</bool>
> </listener>
>
> Error:
>
> Dec 17, 2007 7:45:19 AM org.apache.solr.core.RunExecutableListener exec
> FINE: About to exec snapshooter
> Dec 17, 2007 7:45:19 AM org.apache.solr.core.SolrException log
> SEVERE: java.io.IOException: Cannot run program "snapshooter" (in
> directory "/search/replication_test/0/index/solr/bin"):
> java.io.IOException: error=13, Permission denied
>  at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
>  at java.lang.Runtime.exec(Runtime.java:593)
>  at
> org.apache.solr.core.RunExecutableListener.exec(RunExecutableListener.ja
> va:70)
>  at
> org.apache.solr.core.RunExecutableListener.postCommit(RunExecutableListe
> ner.java:97)
>  at
> org.apache.solr.update.UpdateHandler.callPostOptimizeCallbacks(UpdateHan
> dler.java:105)
>  at
> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.
> java:516)
>  at
> org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestH
> andler.java:214)
>  at
> org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpd
> ateRequestHandler.java:84)
>  at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerB
> ase.java:77)
>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
>  at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.ja
> va:191)
>  at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.j
> ava:159)
>  at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Applica
> tionFilterChain.java:235)
>  at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilt
> erChain.java:206)
>  at
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValv
> e.java:233)
>  at
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValv
> e.java:175)
>  at
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java
> :128)
>  at
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java
> :102)
>  at
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.
> java:109)
>  at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:2
> 63)
>  at
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:84
> 4)
>  at
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(
> Http11Protocol.java:584)
>  at
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
>  at java.lang.Thread.run(Thread.java:619)
> Caused by: java.io.IOException: java.io.IOException: error=13,
> Permission denied
>  at java.lang.UNIXProcess.(UNIXProcess.java:148)
>  at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>  at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>  ... 23 more
>
>
>
>


-- 
Regards,

Cuong Hoang


Merry Christmas and happy new year

2007-12-24 Thread climbingrose
Good day all Solr users & developers,

May I wish you and your family a merry Xmas and happy new year. Hope that
new year brings you all health, wealth and peace. It's been my pleasure to
be on this mailing list and working with Solr. Thank you all!

-- 
Cheers,

Cuong Hoang


Filter Query and query score

2008-01-01 Thread climbingrose
Hi all,

Here is my situation:

I'm implementing some geographical search functions that allow users to
search for documents close to a location. Because not all documents have
proper location information that can be converted to a (latitude, longitude)
coordinate, I also have to use a normal full-text search to filter
documents. For example, if users search for "Java" in "Parramatta, NSW" and
the following documents are in the index:

Doc1: [Title=Java; Location=Parramatta, NSW; latitude=x1; longitude=x2]
Doc2: [Title=Java; Location=North Ryde, NSW; latitude=x3; longitude=x4]
Doc3: [Title=Java; Location=Parramatta]

My filter query looks like this:

fq=(+latitude:[lat1 TO lat2] +longitude:[lng1 TO lng2]) location:Parramatta
location:NSW

Now assuming that my query matches those three documents, I need to make
sure that documents with "Parramatta" in their locations have some kind of
boost to the final score. For example, I'd like to have Doc3 listed before
Doc2 because it has "Parramatta" in location field.

Is this possible? Thanks in advance!

-- 
Regards,

Cuong Hoang


Re: solr 1.3

2008-01-20 Thread climbingrose
I don't think they (the Solr developers) have a time frame for the 1.3
release. However, I've been using the latest code from the trunk and I can
tell you it's quite stable. The only problem is that the documentation
sometimes doesn't cover the latest changes in the code. You'll probably have
to dig into the code itself or post a question here, and many people will be
happy to help you.

On Jan 21, 2008 12:07 PM, anuvenk <[EMAIL PROTECTED]> wrote:

>
> when will this be released? where can i find the list of
> improvements/enhancements in 1.3 if its been documented already?
> --
> View this message in context:
> http://www.nabble.com/solr-1.3-tp14989395p14989395.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Regards,

Cuong Hoang


Re: solr 1.3

2008-01-20 Thread climbingrose
I'm using code pulled directly from Subversion.

On Jan 21, 2008 12:34 PM, anuvenk <[EMAIL PROTECTED]> wrote:

>
> Thanks. Would this be the latest code from the trunk that you mentioned?
> http://people.apache.org/builds/lucene/solr/nightly/solr-2008-01-19.zip
>
>
> climbingrose wrote:
> >
> > I don't think they (Solr developers) have a time frame for 1.3 release.
> > However, I've been using the latest code from the trunk and I can tell
> you
> > it's quite stable. The only problem is the documentation sometimes
> doesn't
> > cover lastest changes in the code. You'll probably have to dig into the
> > code
> > itself or post a question here and many people will be happy to help
> you.
> >
> > On Jan 21, 2008 12:07 PM, anuvenk <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> when will this be released? where can i find the list of
> >> improvements/enhancements in 1.3 if its been documented already?
> >> --
> >> View this message in context:
> >> http://www.nabble.com/solr-1.3-tp14989395p14989395.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
> > --
> > Regards,
> >
> > Cuong Hoang
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/solr-1.3-tp14989395p14989689.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Regards,

Cuong Hoang


Accented search

2008-03-10 Thread climbingrose
Hi guys,

I'm running into some problems with accented (UTF-8) languages. I'd love to
hear some ideas about how to use Solr with those languages. Basically, I
want to achieve what Google did with UTF-8 languages.

My requirements include:
1) Accent insensitive search and proper highlighting:
  For example, we have 2 documents:

  Doc A (title:Lập Trình Viên)
  Doc B (title:Lap Trinh Vien)

  if the user enters "Lập Trình Viên", then Doc B is also matched and "Lập
Trình Viên" is highlighted.
  On the other hand, if the query is "Lap Trinh Vien", Doc A is also
matched.
2) Assign proper scores to accented or non-accented searches:
  if the user enters "Lập Trình Viên", then Doc A should be given a higher
score than Doc B.
  if the query is "Lap Trinh Vien", Doc A should be given a higher score.

Any ideas guys? Thanks in advance!

-- 
Regards,

Cuong Hoang


Re: Accented search

2008-03-11 Thread climbingrose
Hi Peter,

It looks like a very promising approach for us. I'm going to implement a
custom Tokeniser based on your suggestions and see how it goes. Thank you
all for your comments!

Cheers

On Wed, Mar 12, 2008 at 2:37 AM, Binkley, Peter <[EMAIL PROTECTED]>
wrote:

> We've done this in a pre-Solr Lucene context by using the position
> increment: when a token contains accented characters, you add a stripped
> version of that token with a zero increment, so that for matching purposes
> the original and the stripped version are at the same position. Accents are
> not stripped from queries. The effect is that an accented search matches
> your Doc A, and an unaccented search matches Docs A and B. We do that after
> lower-casing the token.
>
> There are some limitations: users might start to expect that they can
> freely add accents to restrict their search to accented hits, but if they
> don't match the accents exactly they won't get any hits: e.g. if a word
> contains two accented characters and the user only accents one of them in
> their query, they won't match the accented or the unaccented version.
>
> Peter
>
> Peter Binkley
> Digital Initiatives Technology Librarian
> Information Technology Services
> 4-30 Cameron Library
> University of Alberta Libraries
> Edmonton, Alberta
> Canada T6G 2J8
> Phone: (780) 492-3743
> Fax: (780) 492-9243
> e-mail: [EMAIL PROTECTED]
>
> ~ The code is willing, but the data is weak. ~
>
>
> -Original Message-
> From: climbingrose [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 10, 2008 10:01 PM
> To: solr-user@lucene.apache.org
> Subject: Accented search
>
> Hi guys,
>
> I'm running to some problems with accented (UTF-8) language. I'd love to
> hear some ideas about how to use Solr with those languages. Basically, I
> want to achieve what Google did with UTF-8 language.
>
> My requirements including:
> 1) Accent insensitive search and proper highlighting:
>  For example, we have 2 documents:
>
>  Doc A (title:Lập Trình Viên)
>  Doc B (title:Lap Trinh Vien)
>
>  if the user enters "Lập Trình Viên", then Doc B is also matched and "Lập
> Trình Viên" is highlighted.
>  On the other hand, if the query is "Lap Trinh Vien", Doc A is also
> matched.
> 2) Assign proper scores to accented or non-accented searches:
>  if the user enters "Lập Trình Viên", then Doc A should be given higher
> score than DOC B.
>  if the query is "Lap Trinh Vien", Doc A should be given higher score.
>
> Any ideas guys? Thanks in advance!
>
> --
> Regards,
>
> Cuong Hoang
>



-- 
Regards,

Cuong Hoang


Minimum should match and PhraseQuery

2008-03-19 Thread climbingrose
Hi all,

I thought many people would encounter the situation I'm having here.
Basically, we'd like to have a PhraseQuery with "minimum should match"
property similar to BooleanQuery. Consider the query "Senior Java
Developer":

1) I'd like to do a PhraseQuery on "Senior Java Developer" with a slop of
say 2, so that the query only matches documents with these words located in
proximity. I don't want to match documents like "Senior ... Java ...
Developer" where the terms are scattered far apart.
2) I also want to relax the PhraseQuery a bit so that it not only matches
"Senior Java Developer"~2 but also matches "Java Developer"~2, of course
with a lower score. I can programmatically generate all the combinations,
but that's not going to be efficient if the user issues a query with many
terms.

Is it possible to do this with Solr and Lucene?
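
The by-hand version I'd like to avoid looks roughly like this in plain
Lucene (a sketch against the 2.x-era API; the field name and boost are just
examples, and the terms are assumed to be already lower-cased to match an
analysed field):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;

public class RelaxedPhraseExample {
    public static BooleanQuery build() {
        BooleanQuery q = new BooleanQuery();

        // Full phrase: "Senior Java Developer"~2
        PhraseQuery full = new PhraseQuery();
        full.add(new Term("title", "senior"));
        full.add(new Term("title", "java"));
        full.add(new Term("title", "developer"));
        full.setSlop(2);
        q.add(full, BooleanClause.Occur.SHOULD);

        // Relaxed sub-phrase: "Java Developer"~2, with a lower boost so it
        // scores below the full-phrase match.
        PhraseQuery sub = new PhraseQuery();
        sub.add(new Term("title", "java"));
        sub.add(new Term("title", "developer"));
        sub.setSlop(2);
        sub.setBoost(0.5f);
        q.add(sub, BooleanClause.Occur.SHOULD);

        return q;
    }
}

For an n-term query there are on the order of n^2 contiguous sub-phrases,
which is exactly where the efficiency worry comes from.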

-- 
Cheers,

Cuong Hoang


Re: Minimum should match and PhraseQuery

2008-03-23 Thread climbingrose
Thanks Chris. I'll probably have to repost this on the Lucene mailing list.


On Sun, Mar 23, 2008 at 9:49 AM, Chris Hostetter <[EMAIL PROTECTED]>
wrote:

>
> the topic has come up before on the lucene java lists (although i can't
> think of any good search terms to find the old threads .. I can't really
> remember how people have described this idea in the past)
>
> I don't remember anyone ever suggesting/sharing a general purpose
> solution intrinsically more efficient than if you just generated all the
> permutations yourself
>
> : 2) I also want to relax PhraseQuery a bit so that it not only match
> "Senior
> : Java Developer"~2 but also matches "Java Developer"~2 but of course with
> a
> : lower score. I can programmatically generate on the combination but it's
> not
> : gonna be efficient if user issues query with many terms.
>
>
>
> -Hoss
>
>


-- 
Regards,

Cuong Hoang


Re: Simple Solr POST using java

2008-05-10 Thread climbingrose
Agreed. I've been using Solrj on a production site for 9 months without any
problems at all. You should probably give it a try instead of dealing with
all those low-level details.
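
A minimal sketch (class names as in the 1.3-era Solrj; the URL and field
names are placeholders):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrjPostExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL: point this at your own Solr instance.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");                    // placeholder unique key
        doc.addField("url", "http://example.com/page"); // placeholder field
        server.add(doc);
        server.commit(); // make the new document visible to searches
    }
}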


On Sun, May 11, 2008 at 4:14 AM, Chris Hostetter <[EMAIL PROTECTED]>
wrote:

>
> : please post a snippet of Java code to add a document to the Solr index
> that
> : includes the URL reference as a String?
>
> you mean like this one...   :)
>
>
> http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/util/SimplePostTool.java?view=markup
>
> FWIW: if you want to talk to Solr from a Java app, the SolrJ client API
> is probably worth looking into rather then dealing with the HTTP
> connections and XML formating directly...
>
> http://wiki.apache.org/solr/Solrj
>
>
> -Hoss
>
>


-- 
Regards,

Cuong Hoang


Re: query for number of field entries in a multivalued field?

2008-05-23 Thread climbingrose
Probably the easiest way to do this is to keep track of the number of items
yourself at index time and then retrieve (or query on) it later.
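
That is, when building the document, store the count in its own field — a
quick Solrj sketch (the field names are just examples):

import java.util.Arrays;
import java.util.List;

import org.apache.solr.common.SolrInputDocument;

public class ItemCountExample {
    public static SolrInputDocument build() {
        List<String> tags = Arrays.asList("red", "green", "blue");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");           // placeholder unique key
        for (String tag : tags) {
            doc.addField("tag", tag);          // the multivalued field
        }
        doc.addField("tagCount", tags.size()); // number of entries, queryable
        return doc;
    }
}

A range query like tagCount:[3 TO *], or a function query on tagCount, then
answers the original question — assuming tagCount is declared as a sortable
int (sint) so ranges behave numerically.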

On Wed, May 21, 2008 at 7:57 AM, Brian Whitman <[EMAIL PROTECTED]>
wrote:

> Any way to query how many items are in a multivalued field? (Or use a
> functionquery against that # or anything?)
>
>


-- 
Regards,

Cuong Hoang


Re: Announcement of Solr Javascript Client

2008-05-25 Thread climbingrose
Hi Matthias,

How would you prevent the Solr server from being exposed to the outside
world with this javascript client? I prefer running Solr behind a firewall
and accessing it from server-side code.

Cheers.

On Mon, May 26, 2008 at 7:27 AM, Matthias Epheser <[EMAIL PROTECTED]>
wrote:

> Hi users,
>
> As initially described in this thread [1] I am currently working on a
> javascript client library for solr. The idea is based on a demo [2] that
> introduces a reusable javascript widget client.
>
> I spent the last weeks evaluating the best fitting technologies that ensure
> a clean generic port of the demo into the solr project. The goal is to make
> it easy to use and include in webpages on the one hand, and creating a clean
> interface to the solr server on the other hand.
>
> With this announcement, I want to ask the community for their experience
> with solr and javascript and would appreciate feedback about this proposal:
>
> - javascript toolkit: JQuery, because it is already shipped with the solr
> webapp
>
> - Using a manager object on the client that holds all widgets and takes
> care of the communication to the solr server.
>
> - Using the JSONResponsewriter to get the data to the widgets so they could
> update their ui.
>
> These technologies seem to be the currently best ones IMHO, any
> feedback/experiences welcome.
>
> Regards,
> matthias
>
>
>
>
>
>
>
>
> [1]
> http://www.nabble.com/-GSOC-proposal-%3A-Solr-javascript-client-library-to16422808.html#a16430329
> [2] http://lovo.test.dev.indoqa.com/mepheser/moobrowser/
>



-- 
Regards,

Cuong Hoang


Ideas on how to implement "sponsored results"

2008-06-03 Thread climbingrose
Hi all,

I'm trying to implement "sponsored results" in Solr search results similar
to that of Google. We index products from various sites and would like to
allow certain sites to promote their products. My approach is to query a
slave instance to get sponsored results for user queries in addition to the
normal search results. This part is easy. However, since the number of
products indexed for each site can be very different (100, 1000, 10,000 or
60,000 products), we need a way to fairly distribute the sponsored results
among sites.

My initial thought is utilising the field collapsing patch to collapse the
search results on the siteId field. You can imagine that this will create a
series of "buckets" of results, each bucket representing results from one
site. After that, 2 or 3 buckets will be selected at random, and one or two
results will be picked at random from each. However, since I want these
sponsored results to be relevant to user queries, I'd only want to consider
the first 30 results in each bucket.
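
A sketch of the selection step I have in mind (plain Java; the names are
invented, and it assumes the per-site top-30 buckets have already been
fetched):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class SponsoredPicker {
    private static final Random RANDOM = new Random();

    /** Picks up to maxAds results, at most one per randomly chosen site. */
    public static List<String> pick(Map<String, List<String>> top30BySite,
            int maxAds) {
        List<String> siteIds = new ArrayList<String>(top30BySite.keySet());
        Collections.shuffle(siteIds, RANDOM); // new buckets on every refresh
        List<String> picked = new ArrayList<String>();
        for (String siteId : siteIds) {
            if (picked.size() >= maxAds)
                break;
            List<String> bucket = top30BySite.get(siteId);
            if (bucket.isEmpty())
                continue;
            // Random result from the relevance-ordered top 30 of this site.
            picked.add(bucket.get(RANDOM.nextInt(bucket.size())));
        }
        return picked;
    }
}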

Obviously, it's desirable that if the user refreshes the page, new sponsored
results will be displayed. On the other hand, I also want to keep the
advantages of the Solr cache.

What would be the best way to implement this functionality? Thanks.

Cheers,
Cuong


Re: Ideas on how to implement "sponsored results"

2008-06-03 Thread climbingrose
Hi Alexander,

Thanks for your suggestion. I think my problem is a bit different from
yours. We don't have any sponsored words; we have to retrieve sponsored
results directly from the index. This is because a site can have 60,000
products, which makes it hard to insert/update keywords for every product. I
can live with that by issuing a separate query to fetch sponsored results.
My problem is to distribute sponsored results equally between sites, so that
each site has an opportunity to show its sponsored results no matter how
many products it has. For example, if site A has 60,000 products and site B
has only 2000, then sponsored products from site B will have a very small
chance of being displayed.


On Wed, Jun 4, 2008 at 2:56 AM, Alexander Ramos Jardim <
[EMAIL PROTECTED]> wrote:

> Cuong,
>
> I have implemented sponsored words for a client. I don't know if my working
> can help you but I will expose it and let you decide.
>
> I have an index containing products entries that I created a field called
> sponsored words. What I do is to boost this field , so when these words are
> matched in the query that products appear first on my result.
>
> 2008/6/3 climbingrose <[EMAIL PROTECTED]>:
>
> > Hi all,
> >
> > I'm trying to implement "sponsored results" in Solr search results
> similar
> > to that of Google. We index products from various sites and would like to
> > allow certain sites to promote their products. My approach is to query a
> > slave instance to get sponsored results for user queries in addition to
> the
> > normal search results. This part is easy. However, since the number of
> > products indexed for each site can be very different (100, 1000, 10,000
> > or 60,000 products), we need a way to fairly distribute the sponsored
> > results
> > among sites.
> >
> > My initial thought is utilising field collapsing patch to collapse the
> > search results on siteId field. You can imagine that this will create a
> > series of "buckets of results", each bucket representing results from a
> > site. After that, 2 or 3 buckets will randomly be selected from which I
> > will
> > randomly select one or two results from. However, since I want these
> > sponsored results to be relevant to user queries, I'd like only want to
> > have
> > the first 30 results in each buckets.
> >
> > Obviously, it's desirable that if the user refreshes the page, new
> > sponsored
> > results will be displayed. On the other hand, I also want to have the
> > advantages of Solr cache.
> >
> > What would be the best way to implement this functionality? Thanks.
> >
> > Cheers,
> > Cuong
> >
>
>
>
> --
> Alexander Ramos Jardim
>



-- 
Regards,

Cuong Hoang


Re: Multiple Schema File

2008-06-04 Thread climbingrose
Hi Sachit,

I think what you could do is create all the "core fields" of your models,
such as username, role, title, body, images... You can name them with
prefixes like user.username, user.role, article.title, article.body... If
you want to dynamically add more fields to your schema, you can use dynamic
fields and keep a mapping between your model's properties and these fields
somewhere. Have a look at the default schema.xml for examples. I used this
approach on a previous project and it worked fine for me.
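
A sketch of the indexing side of that mapping (Solrj; the names and the *_s
dynamicField pattern are just examples):

import java.util.Map;

import org.apache.solr.common.SolrInputDocument;

public class UserDocMapper {
    public static SolrInputDocument toDoc(String id, String username,
            String role, Map<String, String> extraProps) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        // Core fields, declared explicitly in schema.xml with a prefix:
        doc.addField("user.username", username);
        doc.addField("user.role", role);
        // Ad-hoc properties ride on a dynamicField pattern such as *_s:
        for (Map.Entry<String, String> e : extraProps.entrySet()) {
            doc.addField("user." + e.getKey() + "_s", e.getValue());
        }
        return doc;
    }
}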

Cheers,
Cuong

On Thu, Jun 5, 2008 at 3:43 PM, Sachit P. Menon <[EMAIL PROTECTED]>
wrote:

> Hi folks,
>
>
>
> I have a scenario as follows:
>
>
>
> I have a CMS where in I'm storing all the contents. I need to index all
> these
> contents and have a search on these indexes. For indexing, I can define a
> schema for all the contents. Some of the properties are like title,
> headline,
> body, keywords, images, etc.
>
> Now I have a user management wherein I store all the user information. I
> need
> to index this also. This may have properties like user name, role, joining
> date, etc.
>
>
>
> I want to use only one Solr instance. That means I can have only one schema
> file.
>
> How can I define all these totally different properties in one schema file?
>
> The unique id storage for content and user management may also be
> different.
> How can I achieve this?
>
>
>
>
>
> Thanks and Regards
>
> Sachit P. Menon| Programmer Analyst| MindTree Ltd. |West Campus, Phase-1,
> Global Village, RVCE Post, Mysore Road, Bangalore-560 059, INDIA |Voice +91
> 80 26264000 |Extn  64872|Fax +91 80 26264100 | Mob : +91
> 9986747356|www.mindtree.com
> <
> https://indiamail.mindtree.com/exchweb/bin/redir.asp?URL=http://www.mindtree
> .com/>  |
>
>
>
>
>
> DISCLAIMER:
> This message (including attachment if any) is confidential and may be
> privileged. If you have received this message by mistake please notify the
> sender by return e-mail and delete this message from your system. Any
> unauthorized use or dissemination of this message in whole or in part is
> strictly prohibited.
> E-mail may contain viruses. Before opening attachments please check them
> for viruses and defects. While MindTree Limited (MindTree) has put in place
> checks to minimize the risks, MindTree will not be responsible for any
> viruses or defects or any forwarded attachments emanating either from within
> MindTree or outside.
> Please note that e-mails are susceptible to change and MindTree shall not
> be liable for any improper, untimely or incomplete transmission.
> MindTree reserves the right to monitor and review the content of all
> messages sent to or from MindTree e-mail address. Messages sent to or from
> this e-mail address may be stored on the MindTree e-mail system or else
> where.
>



-- 
Regards,

Cuong Hoang


Re: searching only within allowed documents

2008-06-11 Thread climbingrose
It depends on your query. The second query is better if you know that the
fieldb:bar filter query will be reused often, since it will be cached
separately from the main query. The first query occupies one cache entry
while the second one occupies two cache entries, one in the queryCache and
one in the filterCache. Therefore, if you're not going to reuse fieldb:bar,
the second query is better.
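
In Solrj terms the two variants look like this (a sketch; the field names
are the ones from the question, and plain AND stands in for the mm=100%
dismax form):

import org.apache.solr.client.solrj.SolrQuery;

public class FilterVariants {
    // One big query: cached as a single entry in the queryCache.
    public static SolrQuery allInQ() {
        return new SolrQuery("fielda:foo AND fieldb:bar");
    }

    // Main query plus filter: fieldb:bar goes into the filterCache and can
    // be reused by any other query that applies the same filter.
    public static SolrQuery withFq() {
        SolrQuery q = new SolrQuery("fielda:foo");
        q.addFilterQuery("fieldb:bar");
        return q;
    }
}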

On Wed, Jun 11, 2008 at 10:53 PM, Geoffrey Young <[EMAIL PROTECTED]>
wrote:

>
>
>  Solr allows you to specify filters in separate parameters that are
>> applied to the main query, but cached separately.
>>
>> q=the user query&fq=folder:f13&fq=folder:f24
>>
>
> I've been wanting more explanation around this for a while, so maybe now is
> a good time to ask :)
>
> the "cached separately" verbiage here is the same as in the twiki, but I
> don't really understand what it means.  more precisely, I'm wondering what
> the real performance, caching, etc differences are between
>
>  q=fielda:foo+fieldb:bar&mm=100%
>
> and
>
>  q=fielda:foo&fq=fieldb:bar
>
> my situation is similar to the original poster's in that documents matching
> fielda is very large and common (say theaters across the world) while fieldb
> would narrow it considerably (one by country, then one by zipcode, etc).
>
> thanks
>
> --Geoff
>
>
>


-- 
Regards,

Cuong Hoang


Re: searching only within allowed documents

2008-06-11 Thread climbingrose
Just to correct myself: in the last sentence, the first query is better if
fieldb:bar isn't reused often.

On Thu, Jun 12, 2008 at 2:02 PM, climbingrose <[EMAIL PROTECTED]>
wrote:

> It depends on your query. The second query is better if you know that
> fieldb:bar filtered query will be reused often since it will be cached
> separately from the query. The first query occuppies one cache entry while
> the second one occuppies two cache entries, one in queryCache and one in
> filteredCache. Therefore, if you're not going to reuse fieldb:bar, the
> second query is better.
>
>
> On Wed, Jun 11, 2008 at 10:53 PM, Geoffrey Young <
> [EMAIL PROTECTED]> wrote:
>
>>
>>
>>  Solr allows you to specify filters in separate parameters that are
>>> applied to the main query, but cached separately.
>>>
>>> q=the user query&fq=folder:f13&fq=folder:f24
>>>
>>
>> I've been wanting more explanation around this for a while, so maybe now
>> is a good time to ask :)
>>
>> the "cached separately" verbiage here is the same as in the twiki, but I
>> don't really understand what it means.  more precisely, I'm wondering what
>> the real performance, caching, etc differences are between
>>
>>  q=fielda:foo+fieldb:bar&mm=100%
>>
>> and
>>
>>  q=fielda:foo&fq=fieldb:bar
>>
>> my situation is similar to the original poster's in that documents
>> matching fielda is very large and common (say theaters across the world)
>> while fieldb would narrow it considerably (one by country, then one by
>> zipcode, etc).
>>
>> thanks
>>
>> --Geoff
>>
>>
>>
>
>
> --
> Regards,
>
> Cuong Hoang




-- 
Regards,

Cuong Hoang


Mobile phone shop + Solr

2006-09-12 Thread climbingrose

Hi all,

I've been watching the development of Solr over the last few months. I'm
starting to build a mobile phone shop with faceted browsing to allow users
to filter the catalogue in a friendly manner. Since the site will probably
expand in a year or two, I need some advice regarding the design and
implementation of the system.

At the moment, I have an SQL table holding mobile phone information (id,
brand, name, size, screen, weight...). The total number of fields is around
41. I want to be able to browse the catalogue by facets based on, for
example, the weight of the phone (100-150g, >150g) or color (red, black...).
Another feature of the system is that it allows users to sell their own
mobile phones on the website. For example, if they browse Nokia 6600 page,
there'll be a link for them to sell their own 6600 phones. The user can then
provide some more information about the phone he/she is selling such as
condition, description... Obviously, I need to publish the mobile phone
catalogue to Solr. How about the information that users submit when they
sell the phone? Should I publish them to Solr as well?

Can Solr do faceted browsing based on the content of a field? For example,
our mobile phone table has a field named color, which might look like this:

red
chocolate
red
black
red

Can Solr produce something like:

red(3)
chocolate(1)
black(1)


Because the website is currently being developed in Java, I'm a little bit
worried about the use of curl. I want to be able to programmatically
submit/delete documents from Java code rather than through the command line.
I read the JavaSolr page but it seems that the code isn't stable enough for
production. Can anyone who has successfully developed a Java client for Solr
give me some suggestions regarding this matter?

Thanks.

--
Regards,

Cuong Hoang


Re: Mobile phone shop + Solr

2006-09-12 Thread climbingrose

Because the mobile phone info has many fields (>40), I don't want to
repeatedly submit it to Solr.

On 9/13/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:


On 9/12/06, climbingrose <[EMAIL PROTECTED]> wrote:
> Obviously, I need to publish the mobile phone
> catalogue to  Solr. How about the information that users submit when
they
> sell the phone? Should I publish them to Solr as well?

It can often make a system simpler if the web front-end only has one
data source to worry about.

> Can Solr produce something like:
>
> red(3)
> chocolate(1)
> black(1)

Yes, it was always able to (with custom Java code), but it just got a
lot easier for simple things like this.  It's now supported
out-of-the-box:
http://wiki.apache.org/solr/SimpleFacetParameters
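
For a color field like the one above, a request along these lines (the
stock example port and field name are assumed) gives back the per-value
counts:

http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=color

The counts come back in the facet_counts/facet_fields section of the
response.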

> Because the website is currently being developed in Java, I'm a little
bit
> worried about the use of curl.

Curl is just a tool that talks HTTP.. it's not "part" of Solr at all -
just a convenient way of testing things.

> I want to be able to programmatically
> submit/delete document from Java code rather than through command line.
I
> read JavaSolr page but it seems that the code isn't stable enough for
> production.

CNET's in-house client was too tied up with other stuff to cleanly
separate out and open-source.  As you can see, an open source client
is being worked on: http://issues.apache.org/jira/browse/SOLR-20


-Yonik





--
Regards,

Cuong Hoang


Re: Mobile phone shop + Solr

2006-09-13 Thread climbingrose

I probably need to visualise my models:

MobileInfo (1) <---> (1...*) SellingItem

MobileInfo has many fields describing the characteristics of a mobile phone
model (color, size..). SellingItem is an "instance" of MobileInfo that is
currently sold by a user. So in ERD terms, SellingItem will probably have a
foreign key called MobileInfoId that references the primary key of
MobileInfo. Now obviously, I need to index MobileInfo to support faceted
browsing. How should I index SellingItem? The simplest way probably is to
combine the mobile phone specs in MobileInfo with the fields in SellingItem,
and then index all of them. In this case, if I have 1000 SellingItems
referencing a particular MobileInfo, I have to repeat the fields in
MobileInfo a thousand times.

On 9/13/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: Because the mobile phone info has many fields (>40), I don't want to
: repeatedly submit it to Solr.

i'm not really sure what you mean by "repeatedly submit to Solr" or how it
relates to having more than 40 fields.  40 fields really isn't that many.

To give you a basis of comparison: the last Solr index i built from
scratch had 47 <field> declarations, and 4 <dynamicField> declarations
...those 4 dynamic fields result in approximately 1200 'fields' in the
index -- not every document has a value for every field, but the average
is above 200 fields per document.



-Hoss





--
Regards,

Cuong Hoang


Multiple schemas

2006-09-26 Thread climbingrose

Hi all,

Am I right that we can only have one schema per solr server? If so, how
would you deal with the issue of submitting completely different data models
(such as clothes and cars)?
Thanks.

--
Regards,

Cuong Hoang


Solr use case

2006-10-11 Thread climbingrose

Hi all,

Is it true that Solr is mainly used for applications that rarely change the
underlying data? As I understand it, if you submit new data or modify
existing data on the Solr server, you have to "refresh" the cache somehow to
display the updated data. If my application frequently gets new data/updates
from users, should I use Solr? I love faceted browsing and dynamic
properties so much, but I need to justify the choice of Solr. Thanks. By the
way, does anyone have any performance measures that can be shared (apart
from the one on the Wiki)? As I estimate, my application will probably have
half a million docs, each of which has around 15 properties; does anyone
know what type of hardware I would need for reasonable performance?

Thanks.

--
Regards,

Cuong Hoang


Dynamic fields performance question

2007-03-25 Thread climbingrose

Hi all,

I'm developing an application that potentially creates thousands of dynamic
fields.  Does anyone know if large number of dynamic fields will degrade
Solr performance?

Thanks.


--
Regards,

Cuong Hoang


Re: Dynamic fields performance question

2007-03-26 Thread climbingrose

Thanks Yonik. I think both of the conditions hold true for our application
;).

On 3/27/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:


On 3/26/07, climbingrose <[EMAIL PROTECTED]> wrote:
> I'm developing an application that potentially creates thousands of
dynamic
> fields.  Does anyone know if large number of dynamic fields will degrade
> Solr performance?

Thousands of fields won't be a problem if
- you don't sort on most of them (sorting by a field takes up memory)
- you can omit norms on most of them

Provided the above is true, differences in searching + indexing
performance shouldn't be noticeable.

-Yonik





--
Regards,

Cuong Hoang


Re: history

2007-07-08 Thread climbingrose

Coincidentally, I have a very similar use case. Thanks for the advice.

On 7/8/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:


On 7/7/07, Brian Whitman <[EMAIL PROTECTED]> wrote:
> I have been trying to plan out a history function for Solr. When I
> update a document with an existing unique key, I would like the older
> version to stay around and get tagged with the date and some metadata
> to indicate it's not "live." Any normal search would not touch
> history documents.

Interesting...
One might be able to accomplish this with the update processors that
Ryan & I have been batting around for the last few days, in
conjunction with updateable documents, which is on-deck.

The first idea that comes to mind is that during an update, you could
change the id of the older document to be something like
id_<timestamp>, and reindex it with the addition of a live:false
field.

For normal queries, use a filter of -live:false.
For all old versions of a document, use a prefix query id:mydocid_*
for all versions of a document, use query id:mydocid*

So if you can hold off a little bit, you shouldn't need a custom query
handler.  This will be a good use case to ensure that our request
processors and updateable documents are powerful enough.

-Yonik





--
Regards,

Cuong Hoang


Re: Spell Check Handler

2007-07-09 Thread climbingrose

Hi Tristan,

Is this spellchecker available in the 1.2 release, or do I have to build the
trunk? I tried your instructions but Solr returns nothing:

http://localhost:8984/solr/select/?q=title_text:java&qt=spellchecker&cmd=rebuild

Result:

<response>
<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">3</int>
</lst>
<str name="cmd">rebuild</str>
</response>

Thanks.


On 7/8/07, Tristan Vittorio <[EMAIL PROTECTED]> wrote:


Hi Otis,

I have written a draft wiki entry for the spell checker:
http://wiki.apache.org/solr/SpellCheckerRequestHandler

I've learned that my initial observation about the suggestion ordering was
incorrect, it does in fact order the results by popularity (or term
frequency) of the word in the termSourceField, the problem I experienced
was
caused by setting termSourceField to a field of type "text", which heavily
stemmed and analyzed the words.  I found that using the StandardTokenizer
and StandardFilter and removing the PorterStemmer and LowerCaseFilter from
the field schema really improved the spell checker performance.

I haven't included this info on the wiki page yet, I'll try to update it
soon when I have a bit more time.

cheers,
Tristan



On 7/8/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>
> Tristan - good summary - want to copy that to the Solr Wiki?
>
> Thanks,
> Otis
>
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
>
> - Original Message 
> From: Tristan Vittorio <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Saturday, July 7, 2007 1:51:15 AM
> Subject: Re: Spell Check Handler
>
> I couldn't find any documention on the spell check handler either but
> found
> enough information from the solrconfig.xml file, simply search for
> "SpellCheckerRequestHandler" (online version here):
>
>
http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/solrconfig.xml
>
> You can view the original development discussion from JIRA (not sure how
> helpful that will be for you though):
> https://issues.apache.org/jira/browse/SOLR-81
>
> In a nutshell, the configuration parameters available are::
>
> suggestionCount: determines how many spelling suggestions are returned.
> accuracy: a float value between 1.0 and 0.0 on how close the suggested
> words
> should match the original word being checked.
> spellcheckerIndexDir and  termSourceField: check solrconfig.xml for a
full
> explanation.
>
> In order to use the spell checking hander for the first time, you need
to
> explicitly build the spelling index with a sample query something like
> this:
>
>
http://localhost:8080/solr/select/?q=macrosoft&qt=spellchecker&cmd=rebuild
> 
> Depending on how large you main index is, this rebuild operation could
> take
> a while.  Subsequent queries can omit '&cmd=rebuild' and will return
> results
> much faster:
>
> http://localhost:8080/solr/select/?q=macrosoft&qt=spellchecker
> 
> The order of the suggestions returned seems to be based on the accuracy
> figure (i.e. how close it matches the original word). it would be great
to
> be able to sort these suggested results based on term frequency /
document
> frequency of the suggested word in the main index, since the most
accurate
> suggestion may not always be the most relevant.
>
> As far as I can tell there is currently no way of doing this using the
> spellchecker handler alone (you could always run seperate standard
queries
> on each word suggestion and order by numDocs, but that would be very
> inefficient), has anybody else tried to achieve this?
>
> cheers,
> Tristan
>
>
>
> On 7/7/07, Andrew Nagy <[EMAIL PROTECTED] > wrote:
> >
> > Hello, is there any documentation on how to use the new spell check
> > module?
> >
> > Thanks
> > Andrew
> >
>
>
>
>





--
Regards,

Cuong Hoang


Re: Spell Check Handler

2007-07-09 Thread climbingrose

Thanks for the quick reply. However, I'm still not able to set up the
spellchecker. Solr does create the spell directory under data but doesn't
seem to build the spellchecker index. Here are snippets of my config:

<requestHandler name="spellchecker" class="solr.SpellCheckerRequestHandler" startup="lazy">
  <lst name="defaults">
    <int name="suggestionCount">1</int>
    <float name="accuracy">0.5</float>
  </lst>
  <str name="spellcheckerIndexDir">spell</str>
  <str name="termSourceField">title</str>
</requestHandler>

I tried this url:
http://localhost:8984/solr/select/?q=Accountent&qt=spellchecker&cmd=rebuild
and received this:

<response>
<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">2</int>
</lst>
<str name="cmd">rebuild</str>
</response>

On 7/9/07, Tristan Vittorio <[EMAIL PROTECTED]> wrote:


The spellchecker should be available in 1.2 release, your query is
incorrect, try the following:


http://localhost:8984/solr/select/?q=java&qt=spellchecker&termSourceField=title_text&cmd=rebuild

the 'q' parameter must only contain the word being checked; you must
specify
the field separately.  You can set "termSourceField" in your
solrconfig.xml file so you do not need to explicitly set it each time
you want to run a
spell check query. Also make sure your field isn't heavily processed (i.e.
with porter stemmer analyzers) otherwise the suggestions will look a bit
weird / mangled.  Take a look at the wiki page for more info:

http://wiki.apache.org/solr/SpellCheckerRequestHandler

cheers,
Tristan



On 7/9/07, climbingrose <[EMAIL PROTECTED]> wrote:
>
> Hi Tristan,
>
> Is this spellchecker available in 1.2 release or I have to build the
> trunk.
> I tried your instructions but Solr returns nothing:
>
>
>
http://localhost:8984/solr/select/?q=title_text:java&qt=spellchecker&cmd=rebuild
>
> Result:
>
> <response>
> <lst name="responseHeader">
>  <int name="status">0</int>
>  <int name="QTime">3</int>
> </lst>
> <str name="cmd">rebuild</str>
> </response>
>
> Thanks.
>
>
> On 7/8/07, Tristan Vittorio <[EMAIL PROTECTED]> wrote:
> >
> > Hi Otis,
> >
> > I have written a draft wiki entry for the spell checker:
> > http://wiki.apache.org/solr/SpellCheckerRequestHandler
> >
> > I've learned that my initial observation about the suggestion ordering
> was
> > incorrect, it does in fact order the results by popularity (or term
> > frequency) of the word in the termSourceField, the problem I
experienced
> > was
> > caused by setting termSourceField to a field of type "text", which
> heavily
> > stemmed and analyzed the words.  I found that using the
> StandardTokenizer
> > and StandardFilter and removing the PorterStemmer and LowerCaseFilter
> from
> > the field schema really improved the spell checker performance.
> >
> > I haven't included this info on the wiki page yet, I'll try to update
it
> > soon when I have a bit more time.
> >
> > cheers,
> > Tristan
> >
> >
> >
> > On 7/8/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> > >
> > > Tristan - good summary - want to copy that to the Solr Wiki?
> > >
> > > Thanks,
> > > Otis
> > >
> > > . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> > > Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
> > >
> > > - Original Message 
> > > From: Tristan Vittorio <[EMAIL PROTECTED]>
> > > To: solr-user@lucene.apache.org
> > > Sent: Saturday, July 7, 2007 1:51:15 AM
> > > Subject: Re: Spell Check Handler
> > >
> > > I couldn't find any documention on the spell check handler either
but
> > > found
> > > enough information from the solrconfig.xml file, simply search for
> > > "SpellCheckerRequestHandler" (online version here):
> > >
> > >
> >
>
http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/solrconfig.xml
> > >
> > > You can view the original development discussion from JIRA (not sure
> how
> > > helpful that will be for you though):
> > > https://issues.apache.org/jira/browse/SOLR-81
> > >
> > > In a nutshell, the configuration parameters available are::
> > >
> > > suggestionCount: determines how many spelling suggestions are
> returned.
> > > accuracy: a float value between 1.0 and 0.0 on how close the
suggested
> > > words
> > > should match the original word being checked.
> > > spellcheckerIndexDir and  termSourceField: check solrconfig.xml for
a
> > full
> > > explanation.
> > >
> > > In order to use the spell checking hander for the first time, you
need
> > to
> > > explicitly build the spelling index with a sample query something
like
> > > this:
> > >
> > >
> >
>
http://localhost:8080/solr/select/?q=macrosoft&qt=spellchecker&cmd=rebuild
> > > <http://localhost:8080/solr/select/?q=macroso

A few questions regarding multi-word synonyms and parameters encoding

2007-07-10 Thread climbingrose

Hi all,

I've been using Solr for the last few projects and the experience has been
great. I'll post a link to the website once it launches. I just have a few
questions regarding synonyms and parameters encoding:

1) Are multi-word synonyms possible now in Solr? For example, can I have
synonyms like:
"I.T. & T", "IT & T", "Information Technologies", "Computer science"
I read a message on the mailing list some time ago (I think back in mid
2006) saying that there is no clean way to implement this. Is it possible
now? In my case, I have two fields, category and location, in which category
is of the default string type and location is of the default text type:
+ The category field is used only for faceting by category; therefore, no
analysis needs to be done. Can I use the synonyms config above to do a facet
query on the category field so that Solr will combine items having any of
these categories into one facet value? For example:

I.T. & T (10)
IT & T (20)
Information Technologies (30)
Computer science (40)

Can I have something like:

I.T. & T (100)

Or do I have to manually run a filter query for each variant, like
category:"I.T. & T", and count the results?

+ The location field is used for searching by city, state and post code.
Since I collect the data from different sources, the records may mix and
match conventions. For example, in one record I might have "Inner Sydney,
NSW" while in another I might have "Inner Sydney, New South Wales". In
Australia, NSW and New South Wales are used interchangeably, so when users
search for "NSW", I want "New South Wales" records to be returned and vice
versa. How could I achieve this? The "location" field is of the default
text type.

2) I'm having trouble with using facet values in my URL. For example, I have
a "title" facet field in my query and it returns something like:

Software engineer
C++ Programmer
C Programmer & PHP developer

Now I want to create a link for each of these values so that the user can
filter the results by that title by clicking on the link. For example, if I
click on "Software Engineer", the results are narrowed down to just include
records with "Software Engineer" in their title. Since the "title" field can
contain special chars like '+', '&' ..., I really can't find a clean way to
do this. At the moment, I replace all the spaces with '+' and it seems to
work for words like "Software engineer" (converted to "Software+Engineer").
However, "C++ Programmer" is converted to "C+++Programmer", and it doesn't
seem to work (returns no results). Any ideas?
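
I suspect the clean fix is to URL-encode the raw facet value instead of
hand-replacing spaces — a sketch of what I mean (names are just examples):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FacetLinkEncoder {
    public static String facetLink(String title)
            throws UnsupportedEncodingException {
        // Encodes '+' as %2B, space as '+', '&' as %26, and so on.
        return "/search?title=" + URLEncoder.encode(title, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(facetLink("C++ Programmer"));
        // prints: /search?title=C%2B%2B+Programmer
    }
}

The server side then just URL-decodes the parameter back to the exact stored
value before filtering on it.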

Looking back, this is such a long email. If you've reached this point,
thanks a lot for your time!!!

--
Regards,

Cuong Hoang


Slow facet with custom Analyser

2007-07-16 Thread climbingrose

Hi all,

My facet browsing performance was decent on my system until I added my
custom Analyser. Initially, I faceted the "title" field, which is of the
default string type (no analysers, tokenisers...), and got quick responses
(the first query is just under 1s, subsequent queries are < 0.1s). I created
a custom analyser which is not much different from the DefaultAnalyzer in
the FieldType class. Essentially, this analyser does not do any
tokenisation; it only converts the value to lower case and removes extra
spaces, unwanted chars and words. After I applied the analyser to the
"title" field, facet performance degraded considerably. Every query is now
> 1.2s and the filterCache hit ratio is extremely small:

lookups : 918485

hits : 23
hitratio : 0.00
inserts : 918487
evictions : 917971
size : 512
cumulative_lookups : 918485
cumulative_hits : 23
cumulative_hitratio : 0.00
cumulative_inserts : 918487
cumulative_evictions : 917971



Any ideas? Here is my analyser code:

import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.solr.analysis.SolrAnalyzer;

public class FacetTextAnalyser extends SolrAnalyzer {

    final int maxChars;
    final Set<Character> ignoredChars;
    final Set<String> ignoredWords;

    public final static char[] IGNORED_CHARS = {'/', '\\', '\'', '\"',
            '#', '&', '!', '?', '*', '>', '<', ','};
    public static final String[] IGNORED_WORDS = {
            "a", "an", "and", "are", "as", "at", "be", "but", "by",
            "for", "if", "in", "into", "is",
            "no", "not", "of", "on", "or", "such",
            "that", "the", "their", "then", "there", "these",
            "they", "this", "to", "was", "will", "with"
    };

    public FacetTextAnalyser() {
        maxChars = 255;
        ignoredChars = new HashSet<Character>();
        for (int i = 0; i < IGNORED_CHARS.length; i++) {
            ignoredChars.add(IGNORED_CHARS[i]);
        }
        ignoredWords = new HashSet<String>();
        for (int i = 0; i < IGNORED_WORDS.length; i++) {
            ignoredWords.add(IGNORED_WORDS[i]);
        }
    }

    public FacetTextAnalyser(int maxChars, Set<Character> ignoredChars,
            Set<String> ignoredWords) {
        this.maxChars = maxChars;
        this.ignoredChars = ignoredChars;
        this.ignoredWords = ignoredWords;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new Tokenizer(reader) {
            char[] cbuf = new char[maxChars];

            public Token next() throws IOException {
                // Read the whole field value (up to maxChars) as one token.
                int n = input.read(cbuf, 0, maxChars);
                if (n <= 0)
                    return null;
                char[] temp = new char[n];
                int index = 0;
                boolean space = true;
                for (int i = 0; i < n; i++) {
                    char c = cbuf[i];
                    if (ignoredChars.contains(cbuf[i])) {
                        c = ' ';
                    }
                    if (Character.isWhitespace(c)) {
                        if (space)
                            continue;
                        temp[index] = ' ';
                        // A word just ended: if it is an ignored word, roll
                        // the write index back to drop it.
                        if (index > 0) {
                            int j = index - 1;
                            while (temp[j] != ' ' && j > 0) {
                                j--;
                            }
                            String str = (j == 0) ? new String(temp, 0, index)
                                    : new String(temp, j + 1, index - j - 1);
                            if (ignoredWords.contains(str))
                                index = j;
                        }
                        index++;
                        space = true;
                    } else {
                        temp[index] = Character.toLowerCase(c);
                        index++;
                        space = false;
                    }
                }
                temp[0] = Character.toUpperCase(temp[0]);
                String s = new String(temp, 0, index);
                return new Token(s, 0, n);
            }
        };
    }
}



Here is how I declare the analyser:

<fieldtype name="facettext" class="solr.TextField">
  <analyzer class="FacetTextAnalyser"/>
</fieldtype>

--
Regards,

Cuong Hoang


Re: Slow facet with custom Analyser

2007-07-16 Thread climbingrose

Thanks Yonik. In my case, there is only one "title" value per document, so
is there a way to force Solr to work the old way? My analyser doesn't break
up the "title" field into multiple tokens. It only tries to format the
field value (lower-casing it and removing unwanted chars and words).
Therefore, it's no different from using the single-valued "string" type.

I'll try your first recommendation to see how it goes.

Thanks again.

On 7/17/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:


Since you went from a non multi-valued "string" type (which Solr knows
has at most one value per document) to a custom analyzer type (which
could produce multiple tokens per document), Solr switched tactics
from using the FieldCache for faceting to using the filterCache.

Right now, you could try to
1) use facet.enum.cache.minDf=1000 (don't use the fieldCache except
for large facets)
2) expand the size of the fieldcache to 100 if you have the memory

Optimizing your index should also speed up faceting (but that is a lot
of facets).

-Yonik

On 7/16/07, climbingrose <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> My facet browsing performance has been decent on my system until I add
my
> custom Analyser. Initially, I facetted "title" field which is of default
> string type (no analysers, tokenisers...) and got quick responses (first
> query is just under 1s, subsequent queries are < 0.1s). I created a
custom
> analyser which is not much different from the DefaultAnalyzer in
FieldType
> class. Essentially, this analyzer will not do any tokonisations, but
only
> convert the value into lower case, remove spaces, unwanted chars and
words.
> After I applied the analyser to "title" field, facet performance
degraded
> considerably. Every query is now > 1.2s and the filterCache hit ratio is
> extremely small:
>
> lookups : 918485
> > hits : 23
> > hitratio : 0.00
> > inserts : 918487
> > evictions : 917971
> > size : 512
> > cumulative_lookups : 918485
> > cumulative_hits : 23
> > cumulative_hitratio : 0.00
> > cumulative_inserts : 918487
> > cumulative_evictions : 917971





--
Regards,

Cuong Hoang


Re: Slow facet with custom Analyser

2007-07-16 Thread climbingrose

I've tried both of your recommendations (facet.enum.cache.minDf=1000 and
optimising the index). The query time is around 0.4-0.5s now, but it's still
slow compared to the old "string" type. I haven't tried increasing the
filterCache, but 100,000 cached items looks a bit too much for my server at
the moment. It's quite a pity that we can't force Solr to use the
FieldCache. I think I might pre-process the "title" field and index it as
"string" instead of using the analyser. However, that defeats the purpose of
having pluggable analysers, tokenisers...
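
A rough sketch of the pre-processing I mean (it mirrors what
FacetTextAnalyser does, minus the stop-word removal; the class name is just
an example):

public class TitleNormalizer {
    public static String normalize(String raw) {
        String s = raw.toLowerCase()
                .replaceAll("[/\\\\'\"#&!?*<>,]", " ") // the ignored chars
                .replaceAll("\\s+", " ")
                .trim();
        if (s.length() == 0)
            return s;
        // Re-capitalise the first letter, as the analyser does.
        return Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }
}

The result would go into a plain single-valued "string" field, which keeps
the fast FieldCache faceting path.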

On 7/17/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:


On 7/16/07, climbingrose <[EMAIL PROTECTED]> wrote:
> Thanks Yonik. In my case, there is only one "title" field per document
so is
> there a way to force Solr to work the old way? My analyser doesn't break
up
> the "title" field into multiple tokens. It only tries to format the
field
> value (to lower case, remove unwanted chars and words). Therefore, it's
no
> difference from using "string" single-valued type.

There is currently no way to force Solr to use the FieldCache method.

Oh, and in
"2) expand the size of the fieldcache to 100 if you have the memory"
should have been filterCache, not fieldcache.

-Yonik

> I'll try your first recommendation to see how it goes.

faceting typically proceeds much faster on an optimized index too.

-Yonik





--
Regards,

Cuong Hoang


Re: Slow facet with custom Analyser

2007-07-16 Thread climbingrose

Thanks for the suggestion, Chris. I modified SimpleFacets to check for
[f.foo.]facet.field.type==(single|multi)
and the performance has improved significantly.

On 7/17/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: > ...but i don't understand why both checking isTokenized() ...
shouldn't
: > multiValued() be enough?
:
: A field could return "false" for multiValued() and still have multiple
: tokens per document for that field.

ah .. right ... sorry: multiValued() indicates whether multiple discrete
values can be added to the field (and stored if the field is stored) but
says nothing about what the Analyzer may do with any single value.

perhaps we should really have an [f.foo.]facet.field.type=(single|multi)
param to let clients indicate when they know exactly which method they
want used (getFacetTermEnumCounts vs getFieldCacheCounts) ... if the
property is not set, the default can be determined using the
"sf.multiValued() || ft.isTokenized() || ft instanceof BoolField" logic.


-Hoss





--
Regards,

Cuong Hoang