I've been considering the use of Hadoop, since that's what Nutch uses.
Unless I piggy-back onto Nutch's MR job when it creates the Solr index,
though, I wonder if it's overkill. I can see ways of working it into a
MapReduce workflow, but that would involve dumping the database onto
HDFS beforehand. I'm still debating that one with myself.
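For what it's worth, the job I've been sketching against your steps 1-3
below has roughly this shape (rough and untested; the phrase and
white-list patterns are toy placeholders, and none of this is existing
Nutch or Solr code):

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ContextJob {

  // Your steps 1-2: emit each phrase-context, minus white-listed ones.
  // Pretend each input record is one page's extracted text.
  public static class ContextMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private static final Pattern PHRASE =
        Pattern.compile("\\blinux\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern WHITELISTED =
        Pattern.compile("GNU[ /]Linux", Pattern.CASE_INSENSITIVE);

    @Override
    protected void map(LongWritable offset, Text page, Context ctx)
        throws IOException, InterruptedException {
      String content = page.toString();
      Matcher m = PHRASE.matcher(content);
      while (m.find()) {
        // Crude fixed window for now; the real cut would be at the
        // neighboring occurrences, per my earlier mail.
        int from = Math.max(0, m.start() - 60);
        int to = Math.min(content.length(), m.end() + 60);
        String phraseContext = content.substring(from, to);
        if (!WHITELISTED.matcher(phraseContext).find()) {
          ctx.write(new Text(phraseContext), ONE);
        }
      }
    }
  }

  // Your step 3: sum identical contexts, so a single white-list entry
  // can knock out all of them at once.
  public static class ContextReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text phraseContext, Iterable<IntWritable> counts,
        Context ctx) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      ctx.write(phraseContext, new IntWritable(sum));
    }
  }
}

The summed contexts would then feed your step 4 "report" documents.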
One other thing that I want to take advantage of is Lucene/Solr's
filter factories (?). I'm not sure I have the terminology right, but
there are a lot of advanced text-analysis features; e.g., a search for
"reality" would also turn up "reale." It seems that I would want to
perform my "find words, filter out any white-listed context, and
re-inject" step after Nutch stuffs Solr with all of its crawl data.

So perhaps I can get help starting at #1 of your suggestion: how would
I best extract a phrase from Solr? That is, can I tell Solr "give me
each occurrence of X in document Y"? Or (and I'm guessing this is it)
where would I look to perform that kind of search myself? Thinking
about it, I imagine that Solr might tend to "flatten" words in its
index; i.e., the string "reality" only really occurs once in a given
page's index, and (maybe?) it'll have some boost reflecting the number
of times it appeared. Please excuse my obscene generalizations :(.
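Partially answering myself after a first pass through the javadocs: the
index does seem to flatten each word to a single entry per document,
but if the field is indexed with term vectors plus positions and
offsets, every individual occurrence is kept. I *think* the
Lucene-level read looks roughly like this (3.x-era API, untested, so
please correct me; the field name is my own):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;

public class OccurrenceDump {

  // Assumes the field is declared with termVectors="true",
  // termPositions="true", and termOffsets="true" in schema.xml.
  public static void dump(IndexReader reader, int docId, String field,
      String term) throws IOException {
    TermFreqVector raw = reader.getTermFreqVector(docId, field);
    if (!(raw instanceof TermPositionVector)) {
      return; // no term vector (or no positions/offsets) stored
    }
    TermPositionVector tpv = (TermPositionVector) raw;
    int idx = tpv.indexOf(term); // the term as it looks *after* analysis
    if (idx < 0) {
      return; // term does not occur in this document
    }
    TermVectorOffsetInfo[] offsets = tpv.getOffsets(idx);
    if (offsets == null) {
      return; // offsets were not stored for this field
    }
    for (TermVectorOffsetInfo o : offsets) {
      // One entry per occurrence; the character offsets point back
      // into the stored text, which is what I'd cut contexts from.
      System.out.println(term + " @ chars " + o.getStartOffset()
          + "-" + o.getEndOffset());
    }
  }
}

Solr's TermVectorComponent appears to expose the same data over HTTP,
if I'd rather not drop down to Lucene.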
I'm going to do some more digging through the Solr codebase. I
appreciate your help; I am a bit of a beggar when it comes to seeking
out help on where to start. But, as I mentioned on the Nutch list, I
will contribute all of my changes back to Solr. I'll also look to
improve the documentation, which I still owe Nutch, but that's queueing
up for when there's a lull.

Thank you,
- Scott

On Fri, Sep 3, 2010 at 1:19 AM, Jan Høydahl / Cominvent <
jan....@cominvent.com> wrote:

> Hi,
>
> This smells like a job for Hadoop and perhaps Mahout, unless your use
> cases are totally ad hoc research.
> After Nutch has fetched the sites, kick off some MapReduce jobs for
> each case you wish to study:
> 1. Extract phrases/contexts
> 2. For each context, perform detection and whitelisting
> 3. In the reduce step, sum it all up, and write the results to some store
> 4. Now you may index a "report" per site into Solr, with links to the
> original pages for each context
>
> You may be able to represent your grammar as textual rules instead of
> code. Your latency may be minutes instead of milliseconds, though...
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 3. sep. 2010, at 01.03, Scott Gonyea wrote:
>
> > Hi Grant,
> >
> > Thanks for replying--sorry for sticking this on dev; I had imagined
> > that development against the Solr codebase would be inevitable.
> >
> > The application has to do with regulatory and legal compliance work
> > by a non-profit, and is "socially good," but I need to abstract the
> > problem and goals, as it's not mine to disclose.
> >
> > Crawl several websites (e.g., slashdot, engadget, etc.), inject them
> > into Solr, and search for a given word.
> >
> > Issue 1: How many times did that word appear on the URL returned by
> > Solr?
> >
> > Suppose that word is "Linux" and you want to make sure that each
> > occurrence of "Linux" also acknowledges that "Linux" is "GNU/Linux"
> > (pedanticism gone wild). Now suppose that "GNU Linux" is OK, and even
> > "GNU projects such as Linux" is OK too. So, now:
> >
> > Issue 2: Suppose that your goal is to separate the noise from the
> > signal. You therefore "white-list" occurrences in which "Linux"
> > appears without a "GNU/" prefix, yet which you've deemed acceptable
> > within the given context. "GNU/Linux" would be a starting point for
> > any of your white-listing tasks.
> >
> > Simply iterating over what is--and is not--a "white list" just
> > doesn't scale, on a lot of levels. So my approach is to maintain a
> > separate datastore, which contains a list of phrases that are worthy
> > of whomever's attention, as well as a whole lot of
> > "phrase-contexts"--that is, the contexts in which the phrase
> > appeared.
> >
> > Suppose that one website lists "Linux" 20 times; the goal is to
> > white-list all 20 of those occurrences. Or perhaps "Linux" appears 20
> > times within the same context; then you might only need 1 white-list
> > entry to knock out all 20. Further, the white-listing can generally
> > be applied to other sites in which those contexts appear.
> >
> > I'd love to get some thoughts on how to tackle this problem, but I
> > think that kicking off separate documents, within Solr, for each
> > specific occurrence would be the simplest path. But again, I'd love
> > some thoughts on how else I might do this, or where I should start my
> > coding :)
> >
> > Thank you very much,
> > Scott Gonyea
> >
> > On Thu, Sep 2, 2010 at 2:12 PM, Grant Ingersoll <gsing...@apache.org>
> > wrote:
> >
> >> Dropping d...@lucene.a.o.
> >>
> >> How about we step back and you explain the problem you are trying to
> >> solve, as opposed to the proposed solution to it. You can likely do
> >> what you want below in Solr/Lucene (modulo replacing the index with
> >> a new document), but the bigger question is "is that the best way to
> >> do it?" I think if you give us that context, then perhaps we can
> >> brainstorm on solutions.
> >>
> >> Thanks,
> >> Grant
> >>
> >> On Sep 1, 2010, at 8:29 PM, Scott Gonyea wrote:
> >>
> >>> Hi,
> >>>
> >>> I'm looking to get some direction on where I should focus my
> >>> attention with regard to the Solr codebase and documentation.
> >>> Rather than write a ton of stuff no one wants to read, I'll just
> >>> start with a use case. For context, the data originates from Nutch
> >>> crawls and is indexed into Solr.
> >>>
> >>> Imagine a web page has the following content (4 occurrences of
> >>> "Johnson" are bolded):
> >>>
> >>> --content_--
> >>> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit.
> >>> Aenean id urna et justo fringilla dictum johnson in at tortor.
> >>> Nulla eu nulla magna, nec sodales est. Sed johnSon sed elit non
> >>> lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel
> >>> malesuada Johnsons mi. Morbi eget ligula nisi. Ut fringilla
> >>> ullamcorper sem.
> >>> --_content--
> >>>
> >>> First: I would like to have the entire "content" block be indexed
> >>> within Solr. This is done and definitely not an issue.
> >>>
> >>> Second (+): during the injection of crawl data into Solr, I would
> >>> like to grab every occurrence of a specific word or phrase, with
> >>> "Johnson" being my example for the above. I want to take every such
> >>> phrase (without collision), as well as its unique context, and
> >>> inject that into its own, separate Solr index. For example, the
> >>> above "content" block, having been indexed in its entirety, would
> >>> also be the source of 4 additional indexes. In each index,
> >>> "Johnson" would only appear once. All of the text before and after
> >>> "Johnson" would be BOUND BY any other occurrence of "Johnson,"
> >>> e.g.:
> >>>
> >>> --index1_--
> >>> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit.
> >>> Aenean id urna et justo fringilla dictum
> >>> --_index1--
> >>>
> >>> --index2_--
> >>> sit amet, consectetur adipiscing elit. Aenean id urna et justo
> >>> fringilla dictum johnson in at tortor. Nulla eu nulla magna, nec
> >>> sodales est. Sed
> >>> --_index2--
> >>>
> >>> --index3_--
> >>> in at tortor. Nulla eu nulla magna, nec sodales est. Sed johnSon
> >>> sed elit non lorem sagittis fermentum. Mauris a arcu et sem
> >>> sagittis rhoncus vel malesuada
> >>> --_index3--
> >>>
> >>> --index4_--
> >>> sed elit non lorem sagittis fermentum. Mauris a arcu et sem
> >>> sagittis rhoncus vel malesuada Johnsons mi. Morbi eget ligula nisi.
> >>> Ut fringilla ullamcorper sem.
> >>> --_index4--
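> >>>
> >>> In code, the splitting I picture is roughly the following (a toy
> >>> sketch of my own, not anything that exists in Solr today; plain
> >>> case-insensitive substring matching stands in for whatever analysis
> >>> would really apply):
> >>>
> >>> import java.util.ArrayList;
> >>> import java.util.List;
> >>> import java.util.regex.Matcher;
> >>> import java.util.regex.Pattern;
> >>>
> >>> public class ContextSplitter {
> >>>
> >>>   // For occurrence i, the context runs from the end of occurrence
> >>>   // i-1 to the start of occurrence i+1 (document edges at the
> >>>   // ends), so each context contains exactly one occurrence.
> >>>   public static List<String> split(String content, String phrase) {
> >>>     Pattern p = Pattern.compile(Pattern.quote(phrase),
> >>>         Pattern.CASE_INSENSITIVE);
> >>>     List<int[]> hits = new ArrayList<int[]>();
> >>>     Matcher m = p.matcher(content);
> >>>     while (m.find()) {
> >>>       hits.add(new int[] { m.start(), m.end() });
> >>>     }
> >>>     List<String> contexts = new ArrayList<String>();
> >>>     for (int i = 0; i < hits.size(); i++) {
> >>>       int from = (i == 0) ? 0 : hits.get(i - 1)[1];
> >>>       int to = (i == hits.size() - 1) ? content.length()
> >>>                                       : hits.get(i + 1)[0];
> >>>       contexts.add(content.substring(from, to).trim());
> >>>     }
> >>>     return contexts;
> >>>   }
> >>> }
> >>>
> >>> Run over the "content" block above, that should reproduce index1
> >>> through index4.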
> >>>
> >>> Q:
> >>> How much of this is feasible in present-day Solr, and how much of
> >>> it do I need to produce in a patch of my own? Can anyone give me
> >>> some direction on where I should look in approaching this problem
> >>> (i.e., libs/classes/confs)? I sincerely appreciate it.
> >>>
> >>> Third: I would later like to go through the above child indexes and
> >>> dismiss any that appear within a given context. For example, I may
> >>> deem "ipsum dolor Johnson sit amet" as not being useful, and I'd
> >>> want to delete any indexes matching that particular phrase-context.
> >>> The deletion is trivial and, with the 2nd item resolved, this
> >>> becomes a fairly non-issue.
> >>>
> >>> Q:
> >>> The question, more or less, comes from the fact that my source data
> >>> comes from a web crawler. When a site is recrawled, I need to
> >>> repeat the process of dismissing phrase-contexts that are not
> >>> relevant to me. Where is the best place to perform this work? I
> >>> could easily perform queries after indexing my crawl, but that
> >>> seems needlessly intensive. I think the answer to that will be
> >>> "wherever I implement #2," but assumptions can be painfully
> >>> expensive.
> >>>
> >>> Thank you for reading my bloated e-mail. Again, I'm mostly just
> >>> looking to be pointed at various pieces of the Lucene/Solr
> >>> codebase, and am trolling for any insight that people might share.
> >>>
> >>> Scott Gonyea
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston
> >> Oct 7-8
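
P.S. For the "dismiss a phrase-context" step in my Sep 1 mail quoted
above, I'm assuming a plain delete-by-query does the job once each
occurrence lives in its own document. A SolrJ sketch (untested;
"context" is a field name I made up, and the URL is just the local
default):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class DismissContext {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    // Quoting makes it a phrase query; every per-occurrence document
    // whose context matches the white-listed phrase is removed.
    solr.deleteByQuery("context:\"ipsum dolor Johnson sit amet\"");
    solr.commit();
  }
}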