Implementing a facet search

2009-04-07 Thread Sajith Weerakoon
Hi all, I want to implement a facet search for my application. Can someone help me out? Thanks, Regards, Sajith Vimukthi Weerakoon.

Re: Term Counts/Term Frequency Vector Info

2009-04-07 Thread Grant Ingersoll
You can send arbitrary requests via SolrJ, just use the parameter map via the query method: http://lucene.apache.org/solr/api/solrj/org/apache/solr/client/solrj/SolrServer.html . -Grant On Apr 7, 2009, at 1:52 PM, Fink, Clayton R. wrote: These URLs give me what I want - word completion and t

Re: Unexpected sorting results when sorting with mutivalued filed

2009-04-07 Thread Chris Hostetter
: > The last value is used for sorting in multi-valued fields. What is the : > reason behind sorting on a multi-valued field? strictly speaking the behavior is non-deterministic. in most cases attempting to sort on a multi-valued field will generate an error. : Can't do much about it, that is

Re: Strange anomaly(?) with string matching in query

2009-04-07 Thread Chris Hostetter
: Does anybody have any further suggestions on what I might try in this : situation? Any tools perhaps that might help me put my finger on Solr's : pulse so I can figure out just what's going on in there at index and query : time? 1) FYI: you don't always need the settings on every filter to be

Re: using multisearcher

2009-04-07 Thread Chris Hostetter
If you've been using a MultiSearcher to query multiple *remote* searchers, then Distributed searching in solr should be appropriate. if you're used to using MultiSearcher as a way of aggregating from multiple *local* indexes, distributed searching is probably going to seem slow compared to wh

Re: Birthday (that's "day" not "date") search query?

2009-04-07 Thread Chris Hostetter
leap years don't just complicate the calculation when a person was born on Feb 29 ... even if no one was born on feb 29, answering the question "whose birthday is in the next/last X days?" is complicated by needing to know whether the current year is a leap year... : Or have two fields, dayofy

Re: Birthday (that's "day" not "date") search query?

2009-04-07 Thread Walter Underwood
Or have two fields, dayofyear and dayofleapyear, then use the right field in the right year. --wunder On 4/7/09 4:32 PM, "Stephen Weiss" wrote: > If someone's birthday falls on a leap year, in most countries their > birthday is considered to be February 28th unless it happens to be a > leap year
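The two-field idea above could be computed at index time roughly as follows. This is only a sketch using `java.time` (Java 8 or later): the class and method names are made up for illustration, the years 2001 and 2000 are arbitrary non-leap and leap reference years, and only the field names `dayofyear` and `dayofleapyear` come from the thread. Note that `MonthDay.atYear` maps February 29 to February 28 in a non-leap year, which is one possible convention, not necessarily the right one for every application.

```java
import java.time.LocalDate;
import java.time.MonthDay;
import java.time.Year;

class BirthdayFields {
    // Day number of the birthday in a non-leap year (2001 is an arbitrary non-leap year).
    // MonthDay.atYear adjusts February 29 to February 28 when the year is not a leap year.
    static int dayOfYear(LocalDate birthDate) {
        return MonthDay.from(birthDate).atYear(2001).getDayOfYear();
    }

    // Day number of the birthday in a leap year (2000 is an arbitrary leap year).
    static int dayOfLeapYear(LocalDate birthDate) {
        return MonthDay.from(birthDate).atYear(2000).getDayOfYear();
    }

    // Name of the field to query, depending on the calendar year being searched.
    static String fieldFor(int year) {
        return Year.isLeap(year) ? "dayofleapyear" : "dayofyear";
    }

    public static void main(String[] args) {
        LocalDate born = LocalDate.of(1980, 3, 1);
        System.out.println(dayOfYear(born));     // 60
        System.out.println(dayOfLeapYear(born)); // 61
        System.out.println(fieldFor(2009));      // dayofyear
    }
}
```

A range query on whichever field `fieldFor` returns for the current year then avoids any leap-year arithmetic at query time, apart from ranges that wrap around the end of the year.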

Re: Birthday (that's "day" not "date") search query?

2009-04-07 Thread Stephen Weiss
If someone's birthday falls on a leap year, in most countries their birthday is considered to be February 28th unless it happens to be a leap year. You could make the field a float, encode the day number as 59.5, so it will match where it should, and write special handling along these line

Re: More than one language in the same document

2009-04-07 Thread Koji Sekiguchi
ashokc wrote: What I am doing right now is to capture all the content under "content_korea" for example, use 'copyField' to duplicate that content to "content_english". "content_korea" gets processed with CJK analyzers, and "content_english" gets processed with usual detailed index/query analyzer

Re: Some Kind of Crazy Histogram

2009-04-07 Thread Chris Hostetter
(for people who don't know, the schema browser and the Luke handler return a "histogram" for each field) : I have noticed that I can't seem to make sense of the histogram. For every : field the x-axis shows powers of 2 which make no sense for things like brand : name. Am I looking at it wrong

Re: More than one language in the same document

2009-04-07 Thread ashokc
What I am doing right now is to capture all the content under "content_korea" for example, use 'copyField' to duplicate that content to "content_english". "content_korea" gets processed with CJK analyzers, and "content_english" gets processed with usual detailed index/query analyzers, filters, syn

Re: Birthday (that's "day" not "date") search query?

2009-04-07 Thread Chris Hostetter
: Hi everyone, : I have an index that stores birth-dates, and I would like to search for : anybody whose birth-date is within X days of a certain month/day. For : example, I'd like to know if anybody's birthday is coming up within a : certain number of days, regardless of what year they were

Re: More than one language in the same document

2009-04-07 Thread Chris Hostetter
: I have documents where text from two languages, e.g. (english & korean) or : (english & german) are mixed up in a fairly intensive way. 20-30% of the if you search the list archives you'll find a lot of results for "languages" ... it's not something i deal with much but i believe using separ

Re: Incorrect sort with with function query in query parameters

2009-04-07 Thread Chris Hostetter
: Any documents marked deleted in this index are just the result of updates to : those documents. There are no purely deleted documents. Furthermore, the : field that I am ordering by in my function query remains untouched over the : updates. it doesn't matter whether it was an update or a true

Re: How to take Index Backup

2009-04-07 Thread Chris Hostetter
: 1. What is the userId to be given in scripts.conf file. it's just a username that the scripts will try to sudo to if specified ... it's a way of ensuring that all of the actions the script takes (logging, creating files, etc...) are executed by a specific unix user no matter who runs the scr

Re: Multicore Solr not showing Cache Stats

2009-04-07 Thread Chris Hostetter
: - Going to http://localhost:8983/core1/admin/stats.jsp#cache shows a : nearly empty Cache section. The only cache that shows up there is : fieldValueCache (which is really commented out in solrconfig.xml, but : Solr creates it anyway, which is normal). All other caches are missing. : : Any

Re: spectrum of Lucene queries in solr?

2009-04-07 Thread Chris Hostetter
: Sorry, I just realized I can use SolrIndexSearcher.search(Query, Hit)... : : that was my question basically. I wouldn't recommend it ... those methods bypass all of the goodness Solr adds on top of Lucene (caching, etc...) if you're writing plugin/embedded code where you have access to th

Re: DIH; Hardcode field value/replacement based on source column

2009-04-07 Thread Chris Hostetter
: Indeed. I wrote the following test: : : Pattern p = Pattern.compile("(.*)"); : Matcher m = p.matcher("xyz"); : Assert.assertEquals("", "Video", m.replaceAll("Video")); : : The test fails. It gives "VideoVideo" as the actual result. I guess there is : something about Matcher.replaceAll that I d
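The doubled result reported above can be reproduced with nothing but the JDK. A minimal sketch, assuming Java 8 or later; the class name and the two alternative patterns are illustrative and do not come from the thread. The pattern `(.*)` first matches the whole input and then matches once more as an empty string at the end of the input, so the replacement is applied twice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceAllDemo {
    // Replace every match of regex in input with the given replacement text.
    static String replace(String regex, String input, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.replaceAll(replacement);
    }

    public static void main(String[] args) {
        // "(.*)" matches "xyz" and then the empty string at the end of the input.
        System.out.println(replace("(.*)", "xyz", "Video"));   // VideoVideo
        // Anchoring the pattern leaves no room for the second, empty match.
        System.out.println(replace("^(.*)$", "xyz", "Video")); // Video
        // Requiring at least one character also rules out the empty match.
        System.out.println(replace("(.+)", "xyz", "Video"));   // Video
    }
}
```

Which of the two alternatives fits depends on whether an empty source value should still be replaced: `^(.*)$` replaces it, `(.+)` leaves it alone.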

RE: Not getting the proper result.

2009-04-07 Thread Chris Hostetter
StandardTokenizer is tricky. it does a lot of kooky things that probably made sense when it was written. you'll note in your output that the "term type" is getting set to "HOST": StandardTokenizer has decided that L.I.C looks like a hostname, so it's not splitting on the periods. : analys

RE: Wildcard searches

2009-04-07 Thread Vauthrin, Laurent
Looks like I was using the wrong field when searching (tokenized instead of untokenized) and this approach actually worked. Sorry for the confusion. -Original Message- From: Vauthrin, Laurent Sent: Monday, April 06, 2009 10:03 AM To: solr-user@lucene.apache.org Subject: RE: Wildcard sear

RE: Term Counts/Term Frequency Vector Info

2009-04-07 Thread Fink, Clayton R.
These URLs give me what I want - word completion and term counts. What I don't see is a way to call these via SolrJ. I could call the server directly using java.net classes and process the XML myself, I guess. There needs to be an auto suggest request class. http://localhost:8983/solr/autoSugge

_val:ord(field) (from wiki LargeIndexes)

2009-04-07 Thread Joe Pollard
I see this interesting line in the wiki page LargeIndexes http://wiki.apache.org/solr/LargeIndexes (sorting section towards the bottom) Using _val:ord(field) as a search term will sort the results without incurring the memory cost. I'd like to know what this means, but I'm having a bit of trou

RE: Coming up with a model of memory usage

2009-04-07 Thread Joe Pollard
It does end up in the right order (sorted), but it's very expensive. Sorting by a couple fields that each have fewer unique index values seems to limit the memory consumption greatly. -Original Message- From: Walter Underwood [mailto:wunderw...@netflix.com] Sent: Tuesday, April 07, 2009

Re: Coming up with a model of memory usage

2009-04-07 Thread Walter Underwood
Why tokenize the date? It sorts just fine as a string. --wunder On 4/7/09 8:50 AM, "Erick Erickson" wrote: > Your observations about date sorting are probably correct. The > issue is that the sort caches in Lucene look at the unique terms. > There are many more unique terms (nearly every one) in

RE: Coming up with a model of memory usage

2009-04-07 Thread Joe Pollard
Good info to have. Thanks Erick. -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Tuesday, April 07, 2009 10:51 AM To: solr-user@lucene.apache.org Subject: Re: Coming up with a model of memory usage Your observations about date sorting are probably correct.

RE: Coming up with a model of memory usage

2009-04-07 Thread Joe Pollard
Cool, great resource, thanks. -Original Message- From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] Sent: Tuesday, April 07, 2009 10:13 AM To: solr-user@lucene.apache.org Subject: Re: Coming up with a model of memory usage On Tue, Apr 7, 2009 at 8:25 PM, Joe Pollard wrote: > It

Re: How could I avoid reindexing same files?

2009-04-07 Thread Fergus McMenemie
>Thank you much Fergus, > >I was considering implementing a database which would hold a path name >and an MD5 sum of each file. Snap. That is close to what we did. However due to our previous duff full text search engine we had to hold this information in a separate checksums file. Solr is much bet

Re: Coming up with a model of memory usage

2009-04-07 Thread Erick Erickson
Your observations about date sorting are probably correct. The issue is that the sort caches in Lucene look at the unique terms. There are many more unique terms (nearly every one) in 2008-08-12T12:18:26.510 than when the field is split. You can reduce memory consumption when sorting even more by
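The point about unique terms can be illustrated without Lucene. The sketch below only counts distinct values in synthetic data; the class name, the generated timestamps and the split at the `T` separator are all assumptions made for the illustration, and the real saving depends entirely on how often dates and times of day repeat in an actual index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

class UniqueTermCount {
    // Number of distinct values, a rough proxy for the terms a sort cache has to hold.
    static int distinct(List<String> values) {
        return new HashSet<>(values).size();
    }

    // Synthetic timestamps: `days` dates, each carrying the same `timesPerDay` times of day.
    static List<String> sampleTimestamps(int days, int timesPerDay) {
        List<String> out = new ArrayList<>();
        for (int d = 1; d <= days; d++) {
            for (int t = 0; t < timesPerDay; t++) {
                out.add(String.format("2008-08-%02dT12:%02d:26.510", d, t));
            }
        }
        return out;
    }

    // Keep only the text before the 'T' separator (date) or after it (time).
    static List<String> part(List<String> timestamps, boolean datePart) {
        List<String> out = new ArrayList<>();
        for (String ts : timestamps) {
            int sep = ts.indexOf('T');
            out.add(datePart ? ts.substring(0, sep) : ts.substring(sep + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> full = sampleTimestamps(10, 50);
        System.out.println("full timestamp: " + distinct(full));
        System.out.println("date field:     " + distinct(part(full, true)));
        System.out.println("time field:     " + distinct(part(full, false)));
    }
}
```

With this made-up data every full timestamp is unique, while the two split fields together hold far fewer distinct values, which is the effect described in the thread.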

Re: How could I avoid reindexing same files?

2009-04-07 Thread Erik Hatcher
Note that Solr (trunk, soon to be 1.4) has a duplicate detection feature that may work for your need. See http://wiki.apache.org/solr/Deduplication (looks like docs need updating to say 1.4 here) and http://issues.apache.org/jira/browse/SOLR-799 Erik On Apr 7, 2009, at 11:25 AM, Ves

Re: How could I avoid reindexing same files?

2009-04-07 Thread Veselin K
Thank you much Fergus, I was considering implementing a database which would hold a path name and an MD5 sum of each file. Then as a part of Solr indexing, one could check against the DB if a file path exists, if Yes, then compare MD5 and only index if different. Regards, Veselin K On Tue, Apr
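The path-plus-MD5 check described above might look roughly like this. It is a sketch only: an in-memory `Map` stands in for the database, the class and method names are hypothetical, and the call that would actually send the document to Solr is left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

class ChecksumGate {
    // Hypothetical stand-in for the database table of path name -> MD5 sum.
    private final Map<String, String> checksums = new HashMap<>();

    // Hex-encoded MD5 digest of the given file content.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // True when the path is new or its content changed; the stored sum is updated either way.
    boolean needsIndexing(String path, byte[] content) {
        String sum = md5Hex(content);
        String previous = checksums.put(path, sum);
        return !sum.equals(previous);
    }

    public static void main(String[] args) {
        ChecksumGate gate = new ChecksumGate();
        byte[] v1 = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(gate.needsIndexing("/docs/a.txt", v1)); // true: first time seen
        System.out.println(gate.needsIndexing("/docs/a.txt", v1)); // false: unchanged
    }
}
```

In a real setup the sum would probably be recorded only after the document has been indexed successfully, so that a failed update is retried on the next run.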

Re: Coming up with a model of memory usage

2009-04-07 Thread Shalin Shekhar Mangar
On Tue, Apr 7, 2009 at 8:25 PM, Joe Pollard wrote: > It doesn't seem to matter whether fields are stored or not, but I've > found a rather striking difference in the memory requirements during > sorting. Sorting on a string field representing datetime like > '2008-08-12T12:18:26.510' is about twi

Re: Coming up with a model of memory usage

2009-04-07 Thread Joe Pollard
It doesn't seem to matter whether fields are stored or not, but I've found a rather striking difference in the memory requirements during sorting. Sorting on a string field representing datetime like '2008-08-12T12:18:26.510' is about twice as memory intense as sorting first by '2008-08-12' and th

using NGramTokenizerFactory for partial matching

2009-04-07 Thread Pete Smith
Hi, I want to use the NGramTokenizerFactory tokeniser to enable partial matching on a field in my index. For instance for the field: "Lorem ipsum" I want it to match "lor" "lorem" and "lorem i". However I am finding it matches the first two but not the third - the white space is causing problems

Re: solr 1.4 memory jvm

2009-04-07 Thread sunnyfr
Hi, So I did two tests on two servers; First server: with just replication every 20 min, as you can see: http://www.nabble.com/file/p22930179/cpu_without_request.png http://www.nabble.com/file/p22930179/cpu2_without_request.jpg Second server

Re: custom reranking

2009-04-07 Thread Grant Ingersoll
Yeah, that is a good idea. Some of it can be obtained already through the Editorial Boosting, some through function queries, similarity factory, custom sorting and other features. User feedback and click log analysis would be nice features to have as well. http://wiki.apache.org/solr/How

Regd. Difference check at the time of updation

2009-04-07 Thread Pooja Verlani
Hi all, I am looking for a mechanism to check the amount of difference between a document already in the index and the one updated with some new content. Basically, I want to design a criterion to decide whether or not to update the document with the new one. In case solr already has something lik

Re: ExtractingRequestHandler Question

2009-04-07 Thread Grant Ingersoll
Can you add the values as literals? http://wiki.apache.org/solr/ExtractingRequestHandler#head-88b9f55989c9878638e88be5d335b5126550f87c On Apr 3, 2009, at 8:29 PM, Venu Mittal wrote: Hi, I am using ExtractingRequestHandler to index rich text documents. The way I am doing it is I get some dat

Re: custom reranking

2009-04-07 Thread CIF Search
Would it not be a good idea to provide Ranking as a Solr plugin, in which users can write their custom ranking algorithms and reorder the results returned by Solr in whichever way they need? It may also help Solr users to incorporate learning (from search user feedback - such as click logs), and reor

Re: response time

2009-04-07 Thread CIF Search
yes, non cached. If I repeat a query the response is fast since the results are cached. 2009/4/7 Noble Paul നോബിള്‍ नोब्ळ् > are these the numbers for non-cached requests? > > On Tue, Apr 7, 2009 at 11:46 AM, CIF Search wrote: > > Hi, > > > > I have around 10 solr servers running indexes of aro

Re: response time

2009-04-07 Thread Noble Paul നോബിള്‍ नोब्ळ्
are these the numbers for non-cached requests? On Tue, Apr 7, 2009 at 11:46 AM, CIF Search wrote: > Hi, > > I have around 10 solr servers running indexes of around 80-85 GB each > and with 16,000,000 docs each. When I use distrib for querying, I am not > getting a satisfactory response time.

Re: solr 1.4 memory jvm

2009-04-07 Thread Noble Paul നോബിള്‍ नोब्ळ्
Let me assume that the graph shows the CPU idle time. How do I know that the spikes are during the replication? It is possible that you observe CPU spikes soon after the replication because that is when you will have very few cache hits, because searches are done live. Even if the index is very l

Re: solr 1.4 memory jvm

2009-04-07 Thread sunnyfr
Hi Noble, I turned autoWarming down to zero. And yes, it's during replication, which takes the whole data index. Because there are too many updates (2000 docs every 30 min), it always merges my index, so the replication brings back all of my data/index, which uses a big part of the CPU, like you can see on t

Re: How could I avoid reindexing same files?

2009-04-07 Thread Fergus McMenemie
Veselin, Well, as far as solr is concerned, there are two issues here:- 1) To stop the same document ending up in the indexes twice, use the document pathname as the unique ID. Then if you do index it twice, the previous index information will be discarded. Not very efficient, but it may be