Hi,

Yes, the stemming and other analysis features of Solr are nice. A search result from Solr gives you each occurrence of X in Y through highlighting; the regex fragmenter can be configured to extract, for example, a sentence as context. You can also get the number of occurrences (term frequency, TF) from the term vectors. TF also plays a role in scoring, as you point out. It just sounds like a bit of overkill to me for your use case.
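To make that concrete, a SolrJ query for it could look roughly like this. This is an untested sketch: the field name "content", the query term and the regex pattern are assumptions, and the term-vector part only works if the TermVectorComponent is registered in your solrconfig.xml and the field has termVectors="true" in schema.xml:

import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class OccurrenceQuery {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery q = new SolrQuery("content:linux");        // assumed field and term
    q.setHighlight(true);
    q.set("hl.fl", "content");
    q.set("hl.snippets", 50);                            // many fragments, not just one
    q.set("hl.fragmenter", "regex");                     // select the regex fragmenter
    q.set("hl.regex.pattern", "[-\\w ,:;'\"]{20,200}");  // crude "sentence" pattern
    q.set("tv", true);                                   // TermVectorComponent, if registered
    q.set("tv.tf", true);                                // include term frequencies

    QueryResponse rsp = server.query(q);

    // One entry per matching document: docId -> (field -> fragments);
    // each fragment is one occurrence of the term in its context.
    Map<String, Map<String, List<String>>> hl = rsp.getHighlighting();
    for (Map.Entry<String, Map<String, List<String>>> doc : hl.entrySet()) {
      List<String> fragments = doc.getValue().get("content");
      if (fragments == null) continue;
      for (String fragment : fragments) {
        System.out.println(doc.getKey() + ": " + fragment);
      }
    }
    // The raw term-vector section (with tf per term) comes back untyped, under
    // rsp.getResponse().get("termVectors") -- SolrJ has no typed accessor for it.
  }
}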
I don't have enough hands-on experience with Hadoop yet to tell you how best to do it in M/R, but see the PS at the very bottom, below the quoted thread, for a rough, untested sketch of the general shape. I suppose there are good, light-weight text-extraction frameworks out there that could do much of what you need. Did you know you can also embed Solr, through EmbeddedSolrServer, to include it in a workflow (see SOLR-1301)? Also, I just found http://sna-projects.com/azkaban/, which looks promising for controlling advanced Hadoop workflows. Just some pointers.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 3. sep. 2010, at 19.53, Scott Gonyea wrote:

> I've been considering the use of Hadoop, since that's what Nutch uses. Unless I piggy-back onto Nutch's MR job when creating a Solr index, I'm wondering if it's overkill. I can see ways of working it into a MapReduce workflow, but it would involve dumping the database onto HDFS beforehand. I'm still debating that one, with myself.
>
> One other thing I want to take advantage of is Lucene/Solr's filter factories (?). I'm not sure I have the terminology right, but there are a lot of advanced text-parsing features; e.g., a search for "reality" would also turn up "reale." It seems that I would want to perform my "find words, filter out any white-listed context, and re-inject" step after Nutch stuffs Solr with all of its crawl data.
>
> So, perhaps I can get help starting at #1 of your suggestion:
>
> How would I best extract a phrase from Solr? I.e., can I tell Solr "give me each occurrence of X in document Y," or (and I'm guessing this is it) where would I look to perform that kind of search myself?
>
> Thinking about it, I imagine that Solr might tend to "flatten" words in its index. I.e., the string "reality" only really occurs once in a given page's index, and (maybe?) it'll have some boost reflecting the number of times it appeared. Please excuse my obscene generalizations :(.
>
> I'm going to do some more digging through Solr. I appreciate your help. I am a bit of a beggar when it comes to seeking out help on where to start. But, as I mentioned on the Nutch list, I will contribute all of my changes back to Solr. I'll also look to improve documentation, which I still owe Nutch, but that's queued up for when there's a lull.
>
> Thank you, - Scott
>
> On Fri, Sep 3, 2010 at 1:19 AM, Jan Høydahl / Cominvent <jan....@cominvent.com> wrote:
>
>> Hi,
>>
>> This smells like a job for Hadoop and perhaps Mahout, unless your use cases are totally ad-hoc research. After Nutch has fetched the sites, kick off some MapReduce jobs for each case you wish to study:
>> 1. Extract phrases/contexts
>> 2. For each context, perform detection and whitelisting
>> 3. In the reduce step, sum it all up, and write the results to some store
>> 4. Now you may index a "report" per site into Solr, with links to the original pages for each context
>>
>> You may be able to represent your grammar as textual rules instead of code. Your latency may be minutes instead of milliseconds, though...
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Training in Europe - www.solrtraining.com
>>
>> On 3. sep. 2010, at 01.03, Scott Gonyea wrote:
>>
>>> Hi Grant,
>>>
>>> Thanks for replying--sorry for sticking this on dev; I had imagined that development against the Solr codebase would be inevitable.
>>>
>>> The application has to do with regulatory and legal compliance work by a non-profit, and is "socially good," but I need to abstract the problem/goals, as it's not mine to disclose.
>>>
>>> Crawl several websites (e.g., slashdot, engadget, etc.), inject them into Solr, and search for a given word.
>>>
>>> Issue 1: How many times did that word appear on the URL returned by Solr?
>>>
>>> Suppose that word is "Linux" and you want to make sure that each occurrence of "Linux" also acknowledges that "Linux" is "GNU/Linux" (pedantry gone wild). Now, suppose that "GNU Linux" is OK. And even "GNU projects such as Linux" is OK too. So, now:
>>>
>>> Issue 2: Suppose that your goal is now to separate the noise from the signal. You therefore "white list" occurrences in which "Linux" appears without a "GNU/" prefix, yet which you've deemed acceptable within the given context. "GNU/Linux" would be a starting point for any of your white-listing tasks.
>>>
>>> Simply iterating over what is--and is not--a "white list" just doesn't scale, on a lot of levels. So my approach is to maintain a separate datastore, which contains a list of phrases that are worthy of whomever's attention, as well as a whole lot of "phrase-contexts" -- that is, the context in which each phrase appeared.
>>>
>>> Suppose that one website lists "Linux" 20 times; the goal is to white-list all 20 of those occurrences. If "Linux" appears 20 times within the same context, then you might only need one "white list" entry to knock out all 20. Further, the white-listing can generally be applied to other sites in which the same contexts appear.
>>>
>>> I'd love to get some thoughts on how to tackle this problem, but I think that kicking off separate documents, within Solr, for each specific occurrence would be the simplest path. But again, I'd love some thoughts on how else I might do this, or where I should start my coding :)
>>>
>>> Thank you very much,
>>> Scott Gonyea
>>>
>>> On Thu, Sep 2, 2010 at 2:12 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>>>
>>>> Dropping d...@lucene.a.o.
>>>>
>>>> How about we step back and you explain the problem you are trying to solve, as opposed to the proposed solution below? You can likely do what you want below in Solr/Lucene (modulo replacing the index with a new document), but the bigger question is "is that the best way to do it?" I think if you give us that context, then perhaps we can brainstorm on solutions.
>>>>
>>>> Thanks,
>>>> Grant
>>>>
>>>> On Sep 1, 2010, at 8:29 PM, Scott Gonyea wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm looking to get some direction on where I should focus my attention with regard to the Solr codebase and documentation. Rather than write a ton of stuff no one wants to read, I'll just start with a use case. For context, the data originates from Nutch crawls and is indexed into Solr.
>>>>>
>>>>> Imagine a web page has the following content (4 occurrences of "Johnson" are bolded):
>>>>>
>>>>> --content_--
>>>>> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla dictum johnson in at tortor. Nulla eu nulla magna, nec sodales est. Sed johnSon sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada Johnsons mi. Morbi eget ligula nisi. Ut fringilla ullamcorper sem.
>>>>> --_content--
>>>>>
>>>>> First: I would like to have the entire "content" block be indexed within Solr. This is done and definitely not an issue.
>>>>>
>>>>> Second (+): during the injection of crawl data into Solr, I would like to grab every occurrence of a specific word or phrase, with "Johnson" being my example above. I want to take every such phrase (without collision), as well as its unique context, and inject that into its own, separate Solr document. For example, the above "content" example, having been indexed in its entirety, would also be the source of 4 additional documents. In each one, "Johnson" would only appear once. All of the text before and after "Johnson" would be BOUND BY any other occurrence of "Johnson," e.g.:
>>>>>
>>>>> --index1_--
>>>>> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla dictum
>>>>> --_index1--
>>>>> --index2_--
>>>>> sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla dictum johnson in at tortor. Nulla eu nulla magna, nec sodales est. Sed
>>>>> --_index2--
>>>>> --index3_--
>>>>> in at tortor. Nulla eu nulla magna, nec sodales est. Sed johnSon sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada
>>>>> --_index3--
>>>>> --index4_--
>>>>> sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada Johnsons mi. Morbi eget ligula nisi. Ut fringilla ullamcorper sem.
>>>>> --_index4--
>>>>>
>>>>> Q: How much of this is feasible in "present-day Solr," and how much of it do I need to produce in a patch of my own? Can anyone give me some direction on where I should look in approaching this problem (i.e., libs / classes / confs)? I sincerely appreciate it.
>>>>>
>>>>> Third: I would later like to go through the above child documents and dismiss any that appear within a given context. For example, I may deem "ipsum dolor Johnson sit amet" as not being useful, and I'd want to delete any documents matching that particular phrase-context. The deletion is trivial; with the 2nd item resolved, this becomes pretty much a non-issue.
>>>>>
>>>>> Q: The question, more or less, comes from the fact that my source data is from a web crawler. When recrawled, I need to repeat the process of dismissing phrase-contexts that are not relevant to me. Where is the best place to perform this work? I could easily perform queries after indexing my crawl, but that seems needlessly intensive. I think the answer will be "wherever I implement #2," but assumptions can be painfully expensive.
>>>>>
>>>>> Thank you for reading my bloated e-mail. Again, I'm mostly just looking to be pointed at various pieces of the Lucene / Solr code-base, and am trolling for any insight people might share.
>>>>>
>>>>> Scott Gonyea
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
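PS: Here is the rough, untested M/R sketch I mentioned at the top, covering steps 1-3 of the pipeline from my earlier mail (Hadoop 0.20 "mapreduce" API). The target term, the context-window size and the whitelist are hardcoded assumptions, each input line is treated as one page's text, and a real job would read Nutch segments instead of plain-text files:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ContextExtractor {

  // Step 1: extract a context window around each occurrence of the target term.
  // Step 2: drop occurrences whose window contains a whitelisted phrase.
  public static class ContextMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Pattern TERM =
        Pattern.compile("\\blinux\\b", Pattern.CASE_INSENSITIVE); // assumed target term
    private static final int WINDOW = 40;                         // chars each side (assumption)
    private static final Set<String> WHITELIST =
        new HashSet<String>(Arrays.asList("gnu/linux"));          // assumed whitelist
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String page = value.toString();
      Matcher m = TERM.matcher(page);
      while (m.find()) {
        int from = Math.max(0, m.start() - WINDOW);
        int to = Math.min(page.length(), m.end() + WINDOW);
        String context = page.substring(from, to).toLowerCase();
        // Crude check: any whitelisted phrase anywhere in the window clears the hit.
        boolean whitelisted = false;
        for (String ok : WHITELIST) {
          if (context.contains(ok)) { whitelisted = true; break; }
        }
        if (!whitelisted) {
          ctx.write(new Text(context), ONE); // emit the offending context
        }
      }
    }
  }

  // Step 3: sum the surviving occurrences per distinct context.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "context-extractor");
    job.setJarByClass(ContextExtractor.class);
    job.setMapperClass(ContextMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Step 4 would then be a small follow-up step, e.g. with EmbeddedSolrServer, that indexes the summed report per site into Solr.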