Dropping d...@lucene.a.o. How about we step back: could you please explain the problem you are trying to solve, as opposed to the proposed solution below? You can likely do what you want in Solr/Lucene (modulo replacing the index with a new document), but the bigger question is "is that the best way to do it?" If you give us that context, then perhaps we can brainstorm on solutions.

Thanks,
Grant

On Sep 1, 2010, at 8:29 PM, Scott Gonyea wrote:

> Hi,
>
> I'm looking to get some direction on where I should focus my attention, with
> regard to the Solr codebase and documentation. Rather than write a ton of
> stuff no one wants to read, I'll just start with a use-case. For context,
> the data originates from Nutch crawls and is indexed into Solr.
>
> Imagine a web page has the following content (4 occurrences of "Johnson" are
> bolded):
>
> --content_--
> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean id
> urna et justo fringilla dictum johnson in at tortor. Nulla eu nulla magna,
> nec sodales est. Sed johnSon sed elit non lorem sagittis fermentum. Mauris a
> arcu et sem sagittis rhoncus vel malesuada Johnsons mi. Morbi eget ligula
> nisi. Ut fringilla ullamcorper sem.
> --_content--
>
> First; I would like to have the entire "content" block be indexed within
> Solr. This is done and definitely not an issue.
>
> Second (+); during the injection of crawl data into Solr, I would like to
> grab every occurrence of a specific word, or phrase, with "Johnson" being my
> example for the above. I want to take every such phrase (without collision),
> as well as its unique context, and inject that into its own, separate Solr
> index. For example, the above "content" example, having been indexed in its
> entirety, would also be the source of 4 additional indexes. In each index,
> "Johnson" would only appear once. All of the text before and after "Johnson"
> would be BOUND BY any other occurrence of "Johnson." eg:
>
> --index1_--
> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean id
> urna et justo fringilla dictum
> --_index1--
>
> --index2_--
> sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla
> dictum johnson in at tortor. Nulla eu nulla magna, nec sodales est. Sed
> --_index2--
>
> --index3_--
> in at tortor. Nulla eu nulla magna, nec sodales est. Sed johnSon sed elit non
> lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada
> --_index3--
>
> --index4_--
> sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus
> vel malesuada Johnsons mi. Morbi eget ligula nisi. Ut fringilla ullamcorper
> sem.
> --_index4--
>
> Q:
> How much of this is feasible in "present-day Solr" and how much of it do I
> need to produce in a patch of my own? Can anyone give me some direction on
> where I should look, in approaching this problem (ie, libs / classes /
> confs)? I sincerely appreciate it.
>
> Third; I would later like to go through the above child indexes and dismiss
> any that appear within a given context. For example, I may deem "ipsum dolor
> Johnson sit amet" as not being useful and I'd want to delete any indexes
> matching that particular phrase-context. The deletion is trivial and, with
> the 2nd item resolved, this becomes largely a non-issue.
>
> Q:
> The question, more or less, comes from the fact that my source data is from a
> web crawler. When recrawled, I need to repeat the process of dismissing
> phrase-contexts that are not relevant to me. Where is the best place to
> perform this work? I could easily perform queries, after indexing my crawl,
> but that seems needlessly intensive. I think the answer to that will be
> "wherever I implement #2", but assumptions can be painfully expensive.
>
> Thank you for reading my bloated e-mail. Again, I'm mostly just looking to
> be pointed to various pieces of the Lucene / Solr code-base, and am trolling
> for any insight that people might share.
>
> Scott Gonyea

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
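
For readers following the thread, the splitting described in the second item of Scott's message can be sketched as plain string processing, independent of Solr. This is only an illustration under stated assumptions, not something taken from the Solr or Nutch code: the function name `context_windows` is hypothetical, matching is assumed to be case-insensitive, and suffixed forms such as "Johnsons" are assumed to count as occurrences because the example in the message treats them that way. Each snippet runs from the end of the previous occurrence to the start of the next one.

```python
import re

def context_windows(text, term):
    """Return one snippet per occurrence of `term`.

    Each snippet starts after the previous occurrence (or at the start of
    the text) and stops before the next occurrence (or at the end).
    """
    # Assumption: case-insensitive match, and a trailing word suffix
    # ("Johnsons") is treated as part of the same occurrence.
    pattern = re.compile(r"\b" + re.escape(term) + r"\w*", re.IGNORECASE)
    matches = list(pattern.finditer(text))

    windows = []
    for i, match in enumerate(matches):
        start = matches[i - 1].end() if i > 0 else 0
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        windows.append(text[start:end].strip())
    return windows

# The sample content from the message, joined into one string.
content = (
    "Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean id "
    "urna et justo fringilla dictum johnson in at tortor. Nulla eu nulla magna, "
    "nec sodales est. Sed johnSon sed elit non lorem sagittis fermentum. Mauris a "
    "arcu et sem sagittis rhoncus vel malesuada Johnsons mi. Morbi eget ligula "
    "nisi. Ut fringilla ullamcorper sem."
)

windows = context_windows(content, "Johnson")
for number, snippet in enumerate(windows, start=1):
    print(f"--index{number}_--\n{snippet}\n--_index{number}--")
```

A step like this could plausibly run on the crawl output before anything is posted to Solr, with each snippet sent as its own document; whether that is the right place for it is exactly the open question in the thread.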