Re: processing documents in solr

2013-07-29 Thread Joe Zhang
already processed. Fetch record from Solr, For each record, > > check the new DB, if the record is already processed. > > > > Regards > > Aditya > > www.findbestopensource.com > > > > > > > > > > > > On Mon, Jul 29, 2013 at 10:26 AM, Joe Zhang

Re: processing documents in solr

2013-07-28 Thread Joe Zhang
Basically, I was thinking about running a range query like Shawn suggested on the tstamp field, but unfortunately it was not indexed. Range queries only work on indexed fields, right? On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang wrote: > I've been thinking about tstamp solution in the
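
For reference, here is a rough sketch of what the range-query approach could look like once the field is searchable. It assumes tstamp has been redefined with indexed="true" in schema.xml and the documents reindexed. The helper name, the start date, and the rows value are made up for illustration; only the `field:[start TO end]` range syntax and the q/fq/sort/rows parameters are standard Solr.

```python
from urllib.parse import urlencode

def tstamp_range_query(start, end="*", rows=100):
    """Build a Solr query string selecting documents whose tstamp
    falls in [start, end]. Hypothetical helper; assumes tstamp is an
    indexed date field."""
    params = {
        "q": "*:*",                           # match everything ...
        "fq": f"tstamp:[{start} TO {end}]",   # ... then filter by time range
        "sort": "tstamp asc",                 # oldest first, for a stable walk
        "rows": rows,
    }
    return urlencode(params)

# Everything added since a (made-up) checkpoint, open-ended on the right.
query_string = tstamp_range_query("2013-07-28T00:00:00Z")
```

The resulting string would be appended to the collection's /select URL. Remembering the largest tstamp seen and using it as the next start value is one way to make the traversal incremental.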

Re: processing documents in solr

2013-07-28 Thread Joe Zhang
better performance, first I'd load just all the IDs, > > after, during processing I'd load each document. > > For what concern the incremental requirement, it should not be difficult > to > > write an hash function which maps a non-numerical ID to a value. >

Re: processing documents in solr

2013-07-27 Thread Joe Zhang
etion. On Sat, Jul 27, 2013 at 10:28 AM, Shawn Heisey wrote: > On 7/27/2013 11:17 AM, Joe Zhang wrote: > > Thanks for sharing, Roman. I'll look into your code. > > > > One more thought on your suggestion, Shawn. In fact, for the id, we need > > more than "

Re: processing documents in solr

2013-07-27 Thread Joe Zhang
my > current workload time, it will take longer and also somebody else will > *have to* invest their time and energy in testing it, reporting, etc. Of > course, feel free to create the jira yourself or reuse the code - > hopefully, you will improve it and let me know ;-) > > Roman

Re: processing documents in solr

2013-07-26 Thread Joe Zhang
Thanks. On Fri, Jul 26, 2013 at 11:34 PM, Shawn Heisey wrote: > On 7/27/2013 12:30 AM, Joe Zhang wrote: > > ==> so a "url" field would work fine? > > As long as it's guaranteed unique on every document (especially if it is > your uniqueKey) and goes in

Re: processing documents in solr

2013-07-26 Thread Joe Zhang
On Fri, Jul 26, 2013 at 11:18 PM, Shawn Heisey wrote: > On 7/26/2013 11:50 PM, Joe Zhang wrote: > > ==> Essentially we are doing pagination here, right? If performance is > not > > the concern, given that the index is dynamic, does the order of > > entries remain stab

Re: processing documents in solr

2013-07-26 Thread Joe Zhang
On a related note, inspired by what you said, Shawn, an auto-increment id seems perfect here. Yet I found there is no such support in Solr. The UUID only guarantees uniqueness. On Fri, Jul 26, 2013 at 10:50 PM, Joe Zhang wrote: > Thanks for your kind reply, Shawn. > > On Fri, Jul 26, 2013

Re: processing documents in solr

2013-07-26 Thread Joe Zhang
Thanks for your kind reply, Shawn. On Fri, Jul 26, 2013 at 10:27 PM, Shawn Heisey wrote: > On 7/26/2013 11:02 PM, Joe Zhang wrote: > > I have an ever-growing solr repository, and I need to process every > single > > document to extract statistics. What would be a reaso

processing documents in solr

2013-07-26 Thread Joe Zhang
Dear list: I have an ever-growing solr repository, and I need to process every single document to extract statistics. What would be a reasonable process that satisfies the following properties: - Exhaustive: I have to traverse every single document - Incremental: in other words, it has to allow me

Re: Question about field boost

2013-07-23 Thread Joe Zhang
Erick > > On Tue, Jul 23, 2013 at 1:43 AM, Jack Krupansky > wrote: > > That means that for that document "china" occurs in the title vs. > "snowden" > > found in a document but not in the title. > > > > > > -- Jack Krupansky >

Re: Question about field boost

2013-07-22 Thread Joe Zhang
Is my reading correct that the boost is only applied on "china" but not "snowden"? How can that be? My query is: q=china+snowden&qf=title^10 content On Mon, Jul 22, 2013 at 9:43 PM, Joe Zhang wrote: > Thanks for your hint, Jack. Here is the debug results, which I

Re: Question about field boost

2013-07-22 Thread Joe Zhang
score is dominated by your query terms in the non-title fields. > > -- Jack Krupansky > > -Original Message- From: Joe Zhang > Sent: Monday, July 22, 2013 11:06 PM > To: solr-user@lucene.apache.org > Subject: Question about field boost > > > Dear Solr experts: > >

Question about field boost

2013-07-22 Thread Joe Zhang
Dear Solr experts: Here is my query: defType=dismax&q=term1+term2&qf=title^100 content Apparently (at least I thought) my intention is to boost the title field. While I'm getting some non-trivial results, I'm surprised that the documents with both term1 and term2 in title (I know such docs do ex

Re: zero-valued retrieval scores

2013-07-12 Thread Joe Zhang
g/apache/lucene/search/similarities/TFIDFSimilarity.html> > > You would need to talk to the Nutch guys to see why THEY are setting > document boost to 0.0. > > > -- Jack Krupansky > > -Original Message- From: Joe Zhang > Sent: Friday, July 12, 2013 11:57 PM > To:

Re: zero-valued retrieval scores

2013-07-12 Thread Joe Zhang
ansky wrote: > Did you put a boost of 0.0 on the documents, as opposed to the default of > 1.0? > > x * 0.0 = 0.0 > > -- Jack Krupansky > > -Original Message- From: Joe Zhang > Sent: Friday, July 12, 2013 10:31 PM > To: solr-user@lucene.apache.org > Subject: zer

zero-valued retrieval scores

2013-07-12 Thread Joe Zhang
when I search a keyword (such as "apple"), most of the docs carry 0.0 as score. Here is an example from explain: <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html"> 0.0 = (MATCH) fieldWeight(content:appl in 51), product of: 1.0 = tf(termFreq(content:appl)=1) 2.

Re: document id in nutch/solr

2013-06-23 Thread Joe Zhang
Can somebody help with this one, please? On Fri, Jun 21, 2013 at 10:36 PM, Joe Zhang wrote: > A quite standard configuration of nutch seems to automatically map "url" > to "id". Two questions: > > - Where is such mapping defined? I can't find it anywhere i

document id in nutch/solr

2013-06-21 Thread Joe Zhang
A quite standard configuration of nutch seems to automatically map "url" to "id". Two questions: - Where is such mapping defined? I can't find it anywhere in nutch-site.xml or schema.xml. The latter does define the "id" field as well as its uniqueness, but not the mapping. - Given that nutch nutc

Re: what does a zero score mean?

2013-06-21 Thread Joe Zhang
ce to see it structured > properly. > > Upayavira > > On Tue, Jun 18, 2013, at 02:52 PM, Joe Zhang wrote: > > I did include "debugQuery=on" in the query, but nothing extra showed up > > in > > the response. > > > > > > On Mon, Jun 17, 2013 at 10:29 PM,

Re: what does a zero score mean?

2013-06-18 Thread Joe Zhang
I did include "debugQuery=on" in the query, but nothing extra showed up in the response. On Mon, Jun 17, 2013 at 10:29 PM, Gora Mohanty wrote: > On 18 June 2013 10:49, Joe Zhang wrote: > > I issued a simple query ("apple") to my collection and got 201 documen

what does a zero score mean?

2013-06-17 Thread Joe Zhang
I issued a simple query ("apple") to my collection and got 201 documents back, all of which are scored 0. What does this mean? --- The documents do contain the query words.
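
For anyone landing on this thread: the replies in the "zero-valued retrieval scores" thread point at a document boost of 0.0 ("x * 0.0 = 0.0"). Below is a minimal sketch of why a single zero factor wipes out the score even when the term matches. It assumes the classic Lucene TF-IDF formulas (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1))), and the document counts are invented for illustration, not taken from the actual index.

```python
import math

def field_weight(term_freq, doc_freq, num_docs, field_norm):
    """Approximate the classic TF-IDF fieldWeight shown in explain output:
    tf * idf * fieldNorm. fieldNorm folds in the index-time boost, so a
    boost of 0.0 makes it 0.0. (Sketch only; real norms are lossy-encoded.)"""
    tf = math.sqrt(term_freq)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    return tf * idf * field_norm

# Hypothetical numbers: the term occurs once, in 50 of 201 documents.
normal = field_weight(term_freq=1, doc_freq=50, num_docs=201, field_norm=1.0)
zeroed = field_weight(term_freq=1, doc_freq=50, num_docs=201, field_norm=0.0)

print(normal > 0.0)  # the document matches and scores normally
print(zeroed)        # same match, but the zero norm forces the score to zero
```

So a match with score 0 does not mean the term is missing. It more likely means a multiplicative factor such as fieldNorm is zero, which is worth checking in the explain output.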

Re: Internal statistics in Solr index?

2012-12-21 Thread Joe Zhang
Thank you very much! This is a good starting point! On Fri, Dec 21, 2012 at 6:15 AM, Erick Erickson wrote: > Have you seen the functions here: > http://wiki.apache.org/solr/FunctionQuery#Relevance_Functions > > Best > Erick > > > On Thu, Dec 20, 2012 at 1:18 PM, Joe Zhang

Re: search behavior on a case-sensitive field

2012-12-03 Thread Joe Zhang
the problem. > > You can also change splitOnCaseChange="1" to splitOnCaseChange="0" to > avoid the term splitting issue. > > Be sure to completely reindex in either case. > > -- Jack Krupansky > > -Original Message- From: Joe Zhang > Sent:

search behavior on a case-sensitive field

2012-12-03 Thread Joe Zhang
I have a search like this: When I query "COST", it gives reasonable results (n1); When I query "CoSt", however, it gives me n2 (>n1) results, and I can't locate actual

Re: behavior of solr.KeepWordFilterFactory

2012-12-03 Thread Joe Zhang
s are included. > > > On Mon, Dec 3, 2012 at 3:04 PM, Joe Zhang wrote: > > > In other words, what I wanted to achieve is case-sensitive indexing on a > > small set of words. Can anybody help? > > > > On Sun, Dec 2, 2012 at 11:56 PM, Joe Zhang wrote: >

Re: behavior of solr.KeepWordFilterFactory

2012-12-02 Thread Joe Zhang
In other words, what I wanted to achieve is case-sensitive indexing on a small set of words. Can anybody help? On Sun, Dec 2, 2012 at 11:56 PM, Joe Zhang wrote: > To be more specific, this is the data type I was using: > > positionIncremen

Re: behavior of solr.KeepWordFilterFactory

2012-12-02 Thread Joe Zhang
To be more specific, this is the data type I was using: On Sun, Dec 2, 2012 at 11:51 PM, Joe Zhang wrote: > yes, that is

Re: behavior of solr.KeepWordFilterFactory

2012-12-02 Thread Joe Zhang
ache/solr/analysis/KeepWordFilter.html > , > I am pretty sure it is the correct behavior of this filter :) > > I guess you are trying to use this filter to index some special words in > Chinese? > > > On Mon, Dec 3, 2012 at 1:54 PM, Joe Zhang wrote: > > > I defined the

Re: duplicated URL sent from Nutch to solr index

2012-12-02 Thread Joe Zhang
Sorry I didn't make it perfectly clear. The "id" field is URL. On Sun, Dec 2, 2012 at 11:33 PM, Joe Zhang wrote: > Thanks! > > > On Sun, Dec 2, 2012 at 11:20 PM, Xi Shen wrote: > >> If the value for "id" field is the same, the old entry will b

Re: duplicated URL sent from Nutch to solr index

2012-12-02 Thread Joe Zhang
Thanks! On Sun, Dec 2, 2012 at 11:20 PM, Xi Shen wrote: > If the value for "id" field is the same, the old entry will be updated; if > it is new, a new entry will be created & indexed. > > This is my experience. :) > > > On Mon, Dec 3, 2012 at 1:45 PM

behavior of solr.KeepWordFilterFactory

2012-12-02 Thread Joe Zhang
I defined the following data type in my solr schema.xml when I use the type "testkeep" to index a test field, my true expectation was to make sure solr indexes the uppercase form of a small list of words in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of securing the clo

Re: multiple indexes?

2012-12-02 Thread Joe Zhang
This is very helpful. Thanks a lot, Shawn and Dikchant! So in the default single-core situation, the index would live in data/index, correct? On Fri, Nov 30, 2012 at 11:02 PM, Shawn Heisey wrote: > On 11/30/2012 10:11 PM, Joe Zhang wrote: > >> May I ask: how to set up multiple indexes,

Re: stopwords in solr

2012-11-27 Thread Joe Zhang
that is really strange. so basic stopwords such as "a" and "the" are not eliminated from the index? On Tue, Nov 27, 2012 at 11:16 PM, 曹霖 wrote: > just no stopwords are considered in that case > > 2012/11/28 Joe Zhang > > > t no stopwords are considered in > > this case > > >