I've been considering using Hadoop, since that's what Nutch uses.  Unless I
piggyback onto Nutch's MapReduce job when creating the Solr index, though,
I wonder if it's overkill.  I can see ways of working it into a MapReduce
workflow, but that would involve dumping the database onto HDFS beforehand.
I'm still debating that one with myself.

One other thing I want to take advantage of is Lucene/Solr's filter
factories (I'm not sure I have the terminology right, but I mean the
advanced text-analysis features; e.g., a search for "reality" would also
turn up "reale").  It seems that I would want to perform my "find words,
filter out any white-listed contexts, and re-inject" step after Nutch
stuffs Solr with all of its crawl data.
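
If I've got the right docs, that kind of analysis chain is declared with
filter factories in schema.xml, along these lines (the field type name is
my own invention):

  <fieldType name="text_stemmed" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>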

So, perhaps I can get help starting at #1 of your suggestion:

How would I best extract a phrase from Solr?  That is, can I tell Solr
"give me each occurrence of X in document Y," or (and I'm guessing this is
it) where would I look to perform that kind of search myself?
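
From my reading so far, Solr's TermVectorComponent (tv=true with
tv.positions=true) may cover part of this.  On the raw Lucene side, I'm
imagining something like the following (untested, Lucene 3.x API; the
reader, docId, and "content" field names are my assumptions):

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermPositions;

  // Print every token position of "johnson" within a single document.
  static void printOccurrences(IndexReader reader, int docId) throws IOException {
    TermPositions tp = reader.termPositions(new Term("content", "johnson"));
    try {
      if (tp.skipTo(docId) && tp.doc() == docId) {
        int freq = tp.freq();  // number of occurrences in this document
        for (int i = 0; i < freq; i++) {
          System.out.println("position: " + tp.nextPosition());
        }
      }
    } finally {
      tp.close();
    }
  }

If I need character offsets rather than token positions (to rebuild the
surrounding context), it looks like I'd have to store term vectors with
offsets and read them back through TermPositionVector.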

Thinking about it, I imagine that Solr might tend to "flatten" words in its
index.  That is, the string "reality" only really occurs once in a given
page's entry in the inverted index, and (maybe?) it carries a frequency or
boost reflecting the number of times it appeared.  Please excuse my obscene
generalizations :(.

I'm going to do some more digging through the Solr codebase.  I appreciate
your help.  I am a bit of a beggar when it comes to seeking out help on
where to start.  But, as I mentioned on the Nutch list, I will contribute
all of my changes back to Solr.  I'll also look to improve documentation,
which I still owe Nutch, but that's queued up for when there's a lull.

Thank you, - Scott

On Fri, Sep 3, 2010 at 1:19 AM, Jan Høydahl / Cominvent
<jan....@cominvent.com> wrote:

> Hi,
>
> This smells like a job for Hadoop and perhaps Mahout, unless your use cases
> are totally ad-hoc research.
> After Nutch has fetched the sites, kick off some MapReduce jobs for each
> case you wish to study:
> 1. Extract phrases/contexts
> 2. For each context, perform detection and whitelisting
> 3. In the reduce step, sum it all up, and write the results to some store
> 4. Now you may index a "report" per site into Solr, with links to the
> original pages for each context
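>
> As a rough, untested sketch of steps 1-3 in one job (the input is assumed
> to be (url, pageText) pairs; surrounding() and Whitelist are placeholders
> for your own extraction and rule code):
>
>   import java.io.IOException;
>   import java.util.regex.*;
>   import org.apache.hadoop.io.*;
>   import org.apache.hadoop.mapreduce.*;
>
>   public class ContextJob {
>     // Steps 1 + 2: emit (context, url) for each non-whitelisted hit.
>     public static class ExtractMapper extends Mapper<Text, Text, Text, Text> {
>       private static final Pattern TARGET = Pattern.compile("(?<!GNU/)\\bLinux\\b");
>       protected void map(Text url, Text page, Context ctx)
>           throws IOException, InterruptedException {
>         Matcher m = TARGET.matcher(page.toString());
>         while (m.find()) {
>           // surrounding() is a placeholder for your context extraction.
>           String context = surrounding(page.toString(), m.start(), m.end());
>           if (!Whitelist.matches(context)) {
>             ctx.write(new Text(context), url);
>           }
>         }
>       }
>     }
>     // Step 3: sum identical contexts before writing to a store.
>     public static class SumReducer extends Reducer<Text, Text, Text, IntWritable> {
>       protected void reduce(Text context, Iterable<Text> urls, Context ctx)
>           throws IOException, InterruptedException {
>         int n = 0;
>         for (Text url : urls) n++;
>         ctx.write(context, new IntWritable(n));
>       }
>     }
>   }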
>
> You may be able to represent your grammar as textual rules instead of code.
> Your latency may be minutes instead of milliseconds though...
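>
> To illustrate the textual-rules idea, a flat file (format invented on the
> spot) could hold something like:
>
>   target:    Linux
>   require:   GNU/Linux
>   whitelist: GNU Linux
>   whitelist: GNU Projects such as Linux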
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 3. sep. 2010, at 01.03, Scott Gonyea wrote:
>
> > Hi Grant,
> >
> > Thanks for replying--sorry for sticking this on dev; I had imagined that
> > development against the Solr codebase would be inevitable.
> >
> > The application has to do with regulatory and legal compliance work by
> > a non-profit, and is "socially good," but I need to abstract the
> > problem/goals, as they're not mine to disclose.
> >
> > Crawl several websites, e.g. slashdot, engadget, etc., inject them into
> > Solr, and search for a given word.
> >
> > Issue 1: How many times did that word appear on the URL returned by
> > Solr?
> >
> > Suppose that word is "Linux" and you want to make sure that each
> > occurrence of "Linux" also acknowledges that "Linux" is "GNU/Linux"
> > (pedantry gone wild).  Now, suppose that "GNU Linux" is OK.  And even
> > "GNU Projects such as Linux" is OK too.  So, now:
> >
> > Issue 2: Suppose that your goal is now to separate the noise from the
> > signal.  You therefore "white list" occurrences in which "Linux"
> > appears without a "GNU/" prefix, yet which you've deemed acceptable
> > within the given context.  "GNU/Linux" would be a starting point for
> > any of your white-listing tasks.
> >
> > Simply iterating over what is--and is not--a "white list" just doesn't
> > scale, on a lot of levels.  So my approach is to maintain a separate
> > datastore, which contains a list of phrases that are worthy of
> > someone's attention, as well as a whole lot of "phrase-contexts," i.e.
> > the contexts in which each phrase appeared.
> >
> > Suppose that one website lists "Linux" 20 times; the goal is to
> > white-list all 20 of those occurrences.  If "Linux" appears 20 times
> > within the same context, you might only need 1 "white list" entry to
> > knock out all 20.  Further, the white-listing can generally be applied
> > to other sites in which the same contexts appear.
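> >
> > In rough Java (an untested fragment; the whitelist here is just the
> > examples from above, and would really live in that separate datastore):
> >
> >   private static final Pattern BARE_LINUX =
> >       Pattern.compile("(?<!GNU/)\\bLinux\\b");
> >   private static final Set<String> WHITELIST = new HashSet<String>(
> >       Arrays.asList("GNU Linux", "GNU Projects such as Linux"));
> >
> >   // True if this phrase-context still deserves someone's attention.
> >   static boolean needsAttention(String context) {
> >     if (!BARE_LINUX.matcher(context).find()) {
> >       return false;  // every "Linux" is already "GNU/Linux"
> >     }
> >     for (String accepted : WHITELIST) {
> >       if (context.contains(accepted)) {
> >         return false;  // an acceptable phrasing we've already blessed
> >       }
> >     }
> >     return true;
> >   }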
> >
> > I'd love to get some thoughts on how to tackle this problem, but I
> > think that kicking off separate documents, within Solr, for each
> > specific occurrence... would be the simplest path.  But again, I'd
> > welcome thoughts on how else I might do this, or where I should start
> > my coding :)
> >
> > Thank you very much,
> > Scott Gonyea
> >
> > On Thu, Sep 2, 2010 at 2:12 PM, Grant Ingersoll <gsing...@apache.org>
> > wrote:
> >
> >> Dropping d...@lucene.a.o.
> >>
> >> How about we step back: please explain the problem you are trying to
> >> solve, as opposed to the proposed solution below.  You can likely do
> >> what you want below in Solr/Lucene (modulo replacing the index with a
> >> new document), but the bigger question is "is that the best way to do
> >> it?"  I think if you give us that context, then perhaps we can
> >> brainstorm on solutions.
> >>
> >> Thanks,
> >> Grant
> >>
> >>
> >> On Sep 1, 2010, at 8:29 PM, Scott Gonyea wrote:
> >>
> >>> Hi,
> >>>
> >>> I'm looking to get some direction on where I should focus my
> >>> attention with regard to the Solr codebase and documentation.  Rather
> >>> than write a ton of stuff no one wants to read, I'll just start with
> >>> a use case.  For context, the data originates from Nutch crawls and
> >>> is indexed into Solr.
> >>>
> >>> Imagine a web page has the following content (4 occurrences of
> >>> "Johnson" are bolded):
> >>>
> >>> --content_--
> >>> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit.
> >>> Aenean id urna et justo fringilla dictum johnson in at tortor. Nulla
> >>> eu nulla magna, nec sodales est. Sed johnSon sed elit non lorem
> >>> sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel
> >>> malesuada Johnsons mi. Morbi eget ligula nisi. Ut fringilla
> >>> ullamcorper sem.
> >>> --_content--
> >>>
> >>> First: I would like to have the entire "content" block be indexed
> >>> within Solr.  This is done and definitely not an issue.
> >>>
> >>> Second (+): during the injection of crawl data into Solr, I would
> >>> like to grab every occurrence of a specific word or phrase, with
> >>> "Johnson" being my example for the above.  I want to take every such
> >>> phrase (without collision), as well as its unique context, and inject
> >>> that into its own, separate Solr index.  For example, the above
> >>> "content" example, having been indexed in its entirety, would also be
> >>> the source of 4 additional indexes.  In each index, "Johnson" would
> >>> only appear once.  All of the text before and after "Johnson" would
> >>> be BOUND BY any other occurrence of "Johnson."  E.g.:
> >>>
> >>> --index1_--
> >>> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit.
> >>> Aenean id urna et justo fringilla dictum
> >>> --_index1--
> >>>
> >>> --index2_--
> >>> sit amet, consectetur adipiscing elit. Aenean id urna et justo
> >>> fringilla dictum johnson in at tortor. Nulla eu nulla magna, nec
> >>> sodales est. Sed
> >>> --_index2--
> >>>
> >>> --index3_--
> >>> in at tortor. Nulla eu nulla magna, nec sodales est. Sed johnSon sed
> >>> elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis
> >>> rhoncus vel malesuada
> >>> --_index3--
> >>>
> >>> --index4_--
> >>> sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis
> >>> rhoncus vel malesuada Johnsons mi. Morbi eget ligula nisi. Ut
> >>> fringilla ullamcorper sem.
> >>> --_index4--
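> >>>
> >>> In rough Java, the splitting rule I have in mind is something like
> >>> this (an untested fragment; each occurrence keeps the text between
> >>> its neighboring occurrences, with the document edges as the outer
> >>> bounds):
> >>>
> >>>   static List<String> contexts(String content, String term) {
> >>>     Matcher m = Pattern.compile(Pattern.quote(term),
> >>>         Pattern.CASE_INSENSITIVE).matcher(content);
> >>>     List<int[]> hits = new ArrayList<int[]>();
> >>>     while (m.find()) hits.add(new int[] { m.start(), m.end() });
> >>>
> >>>     List<String> out = new ArrayList<String>();
> >>>     for (int i = 0; i < hits.size(); i++) {
> >>>       // from the end of the previous hit (or document start)...
> >>>       int from = (i == 0) ? 0 : hits.get(i - 1)[1];
> >>>       // ...to the start of the next hit (or document end).
> >>>       int to = (i == hits.size() - 1) ? content.length()
> >>>                                       : hits.get(i + 1)[0];
> >>>       out.add(content.substring(from, to).trim());
> >>>     }
> >>>     return out;
> >>>   }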
> >>>
> >>> Q:
> >>> How much of this is feasible in "present-day Solr," and how much of
> >>> it do I need to produce in a patch of my own?  Can anyone give me
> >>> some direction on where I should look in approaching this problem
> >>> (i.e., libs / classes / confs)?  I sincerely appreciate it.
> >>>
> >>> Third: I would later like to go through the above child indexes and
> >>> dismiss any that appear within a given context.  For example, I may
> >>> deem "ipsum dolor Johnson sit amet" as not being useful, and I'd want
> >>> to delete any indexes matching that particular phrase-context.  The
> >>> deletion is trivial and, with the 2nd item resolved, this becomes a
> >>> non-issue.
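> >>>
> >>> Presumably that's just a delete-by-query posted to /update, against
> >>> whatever field ends up holding the context (field name invented
> >>> here):
> >>>
> >>>   <delete><query>context:"ipsum dolor Johnson sit amet"</query></delete>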
> >>>
> >>> Q:
> >>> The question, more or less, comes from the fact that my source data
> >>> is from a web crawler.  When sites are recrawled, I need to repeat
> >>> the process of dismissing phrase-contexts that are not relevant to
> >>> me.  Where is the best place to perform this work?  I could easily
> >>> perform queries after indexing my crawl, but that seems needlessly
> >>> intensive.  I think the answer will be "wherever I implement #2," but
> >>> assumptions can be painfully expensive.
> >>>
> >>>
> >>> Thank you for reading my bloated e-mail.  Again, I'm mostly just
> >>> looking to be pointed at various pieces of the Lucene/Solr codebase,
> >>> and am trolling for any insight that people might share.
> >>>
> >>> Scott Gonyea
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
> >>
> >>
>
>
