Re: Having an issue with the solr.PatternReplaceCharFilterFactory not replacing characters correctly

2012-06-24 Thread Timothy Potter
Awesome find Jack - thanks! Copied the "replaceWith" bit from
http://lucidworks.lucidimagination.com/display/solr/CharFilterFactories

Cheers,
Tim

On Sat, Jun 23, 2012 at 8:16 PM, Jack Krupansky  wrote:
> The char filter's attribute name is "replacement", not "replaceWith". I
> tried it and it seems to work fine (with Solr 3.6).
>
>
>   <charFilter class="solr.PatternReplaceCharFilterFactory"
>               pattern="(\w)\1{2,}+"
>               replacement="$1$1"/>
>
> See:
> http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceCharFilterFactory.html
>
> -- Jack Krupansky
>
> -Original Message- From: Timothy Potter
> Sent: Saturday, June 23, 2012 7:11 PM
> To: solr-user@lucene.apache.org
> Subject: Having an issue with the solr.PatternReplaceCharFilterFactory not
> replacing characters correctly
>
>
> Using 3.5 (also tried trunk), I have the following charFilter defined
> on my fieldType (just extended text_general to keep things simple):
>
>   <charFilter class="solr.PatternReplaceCharFilterFactory"
>               pattern="(\w)\1{2,}+"
>               replaceWith="$1$1"/>
>
> The intent of this charFilter is to match any characters that are
> repeated in a string more than twice and collapse down to a max of
> two, i.e.
>
> fooobarrrr  =>  foobarr
>
> Using the analysis form, I end up with: fba
>
> Here is the full <fieldType> definition (just the one addition of the
> leading <charFilter>):
>
>   <fieldType name="text_general" class="solr.TextField"
>    positionIncrementGap="100">
>     <analyzer type="index">
>       <charFilter class="solr.PatternReplaceCharFilterFactory"
>          pattern="(\w)\1{2,}+"
>          replaceWith="$1$1"/>
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>        words="stopwords.txt" enablePositionIncrements="true" />
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>        words="stopwords.txt" enablePositionIncrements="true" />
>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>        ignoreCase="true" expand="true"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>   </fieldType>
>
> It seems like my regex and replacement strategy should work ... to
> prove it, I wrote a little Regex.java class in which I borrowed some
> from the PatternReplaceCharFilter class ... when I execute the
> following with my little hack, I get the expected results:
>
> [~/dev]$ java Regex "(\\w)\\1{2,}+" fooobarrrr "\$1\$1"
> result: foobarr
>
> Is this a known issue, or does anyone know how to work around this? If
> not, I'll open a JIRA but wanted to check here first.
>
> Cheers,
> Tim
>
> Regex.java  
>
>
> import java.util.regex.Pattern;
> import java.util.regex.Matcher;
>
> public class Regex {
>   public static void main(String[] args) throws Exception {
>       String toCompile = args[0];
>       Pattern p = Pattern.compile(toCompile);
>       System.out.println("result: "+processPattern(p, args[1], args[2]));
>   }
>
>  // borrowed from PatternReplaceCharFilter.java
>  private static CharSequence processPattern(Pattern pattern,
>      CharSequence input, String replacement) {
>    final Matcher m = pattern.matcher(input);
>
>    final StringBuffer cumulativeOutput = new StringBuffer();
>    int cumulative = 0;
>    int lastMatchEnd = 0;
>    while (m.find()) {
>      final int groupSize = m.end() - m.start();
>      final int skippedSize = m.start() - lastMatchEnd;
>      lastMatchEnd = m.end();
>      final int lengthBeforeReplacement =
>          cumulativeOutput.length() + skippedSize;
>      m.appendReplacement(cumulativeOutput, replacement);
>      final int replacementSize =
>          cumulativeOutput.length() - lengthBeforeReplacement;
>      if (groupSize != replacementSize) {
>        if (replacementSize < groupSize) {
>          cumulative += groupSize - replacementSize;
>          int atIndex = lengthBeforeReplacement + replacementSize;
>          //System.err.println(atIndex + "!" + cumulative);
>          //addOffCorrectMap(atIndex, cumulative);
>        }
>      }
>    }
>    m.appendTail(cumulativeOutput);
>    return cumulativeOutput;
>  }
> }


Re: Store matching synonyms only

2012-06-24 Thread arc68274
Jack, Lee, thanks so much for your suggestions.

On Sat, Jun 23, 2012 at 11:25 PM, Lee Carroll
wrote:

> If you go down the keep-word route you can return the "tags" to the
> front end app using a facet field query. This often fits with many
> use-cases for doc tags.
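>
> For example, something like this (assuming the keep-word field described
> below is named "tags" - just a placeholder):
>
> http://localhost:8983/solr/select?q=id:123&facet=true&facet.field=tags&rows=0
>
> The facet counts over the (here, single-document) result set give you the
> tags without the field having to be stored.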
>
> lee c
>
> On 23 June 2012 22:37, Jack Krupansky  wrote:
> > One important footnote: the "keep words/synonym analyzer" approach will
> > index the desired keywords for efficient search, but the stored value
> > that would be returned in response to a query request would be the full
> > original text. If you wish to return only the final list of matched
> > synonyms, you will need to go the custom update processor or
> > preprocessor route.
> >
> > -- Jack Krupansky
> >
> > -Original Message- From: Jack Krupansky
> > Sent: Saturday, June 23, 2012 4:29 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Store matching synonyms only
> >
> >
> > There are a number of ways this can be accomplished, including as a
> > preprocessor or a custom update processor, but you may be able to get by
> > with a tokenized field without term vectors combined with a "keep words"
> > filter and an index-time synonym filter that uses "replace mode".
> >
> > So, in addition to storing the text in a normal text field, do a
> > copyField to a separate text field which has
> > omitTermFreqAndPositions=true, since this field only needs to indicate
> > the presence of a keyword and not its position or frequency. It would
> > have a custom field type which starts its index analyzer with a "keep
> > words" token filter (solr.KeepWordFilterFactory) with a word list file
> > which contains all words used in your synonyms. This eliminates all
> > words that do not match one of your synonym words.
> >
> > Then add a synonym filter that operates in replace mode - expand=false
> > and ignoreCase=true, with entries such as:
> >
> > feline,cat,lion,tiger
> >
> > See:
> > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> >
> > This would index "The cat sat on the tiger's mat" as simply "feline".
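> >
> > A minimal sketch of the wiring (field, type, and file names are just
> > placeholders, untested):
> >
> > <field name="text" type="text_general" indexed="true" stored="true"/>
> > <field name="tags" type="text_tags" indexed="true" stored="false"
> >  omitTermFreqAndPositions="true"/>
> > <copyField source="text" dest="tags"/>
> >
> > <fieldType name="text_tags" class="solr.TextField">
> >   <analyzer>
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"
> >      ignoreCase="true"/>
> >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >      ignoreCase="true" expand="false"/>
> >   </analyzer>
> > </fieldType>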
> >
> > -- Jack Krupansky
> >
> > -Original Message- From: ben ausden
> > Sent: Saturday, June 23, 2012 1:21 PM
> > To: solr-user@lucene.apache.org
> > Subject: Store matching synonyms only
> >
> > Hi,
> >
> > Is it possible to store only the matching synonyms found in a piece of
> > text?
> >
> > A use case might be: automatically "tag" documents at index time based on
> > synonyms.txt, and then retrieve the stored tags at query time.
> >
> > For example, given the text field:
> >
> >  "The cat sat on the mat"
> >
> > and a synonyms.txt file containing:
> >
> > feline,cat,lion,tiger
> >
> > the resulting tag for this document would be "feline". Multiple synonym
> > matches would result in multiple tags.
> >
> > Is this possible with Solr by default, or is the classification/tagging
> > best done outside Solr before I store the document?
> >
> > Thanks.
>


Re: Having an issue with the solr.PatternReplaceCharFilterFactory not replacing characters correctly

2012-06-24 Thread Jack Krupansky
Yeah, it was kind of unfortunate that the posted example in SOLR-1653 used 
"replaceWith" but the committed code used "replacement". The detailed 
commentary on the issue notes the change, but the change occurred between 
the last posted patch and the commit. The source code and javadoc "rule", 
but we tend to assume that the Jira is more accurate than it necessarily is.


-- Jack Krupansky

-Original Message- 
From: Timothy Potter

Sent: Sunday, June 24, 2012 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Having an issue with the solr.PatternReplaceCharFilterFactory 
not replacing characters correctly


Awesome find Jack - thanks! Copied the "replaceWith" bit from
http://lucidworks.lucidimagination.com/display/solr/CharFilterFactories

Cheers,
Tim





Re: Solr 4.0 with Near Real Time and Faceted Search in Replicated topology

2012-06-24 Thread Mark Miller
SolrCloud won't change anything for you.

Performance will depend, as always. If the standard faceting methods are too
slow (really fast reopening can limit the use of caches), there is a faceting
method that works per segment, I'm told. That may help a lot in some cases.
There are trade-offs, so you want to test and experiment.
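
If I remember right, the per-segment method is exposed as
facet.method=fcs and only applies to single-valued fields - e.g.,
request params along these lines (the field name is just an example):

  q=*:*&facet=true&facet.field=category&facet.method=fcs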

-- 
Mark Miller



On Thursday, June 21, 2012 at 12:42 PM, Niran Fajemisin wrote:

> Hi all,
> 
> We're thinking of moving forward with Solr 4.0 and we plan to have a master
> index server and at least two slave servers. The master server will be used
> primarily for indexing, and queries will be load balanced across the
> replicated slave servers. I would like to know if, with the current support
> for Near Real Time search in 4.0, there's support for Faceted Search -
> keeping in mind that the searches will be performed against the slave
> servers and not the master (indexing) server.
> 
> If it's not supported, will we need to use SolrCloud to gain the benefits of 
> Near Real Time search when performing Faceted Searches?
> 
> Any insight would be greatly appreciated.
> 
> Thanks all!  



Re: Custom close to index metadata / pass commit data to writer.commit

2012-06-24 Thread Jozef Vilcek
On Sun, Jun 24, 2012 at 1:18 AM, Erick Erickson  wrote:
> see: https://issues.apache.org/jira/browse/SOLR-2701.
>

Hey, that is what I want :) Thanks for the reference. Unfortunately,
there seems to be no progress on it (as far as I can tell).
I would be able to use commitData in a rather non-invasive way in the
3.6 release, but I worry about future releases ...
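
For reference, the raw Lucene side of this is simple enough - a minimal
sketch against the Lucene 3.6 API (index path and key names are made up),
not wired into Solr:

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class CommitUserDataDemo {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File("/tmp/demo-index"));
    IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_36,
            new StandardAnalyzer(Version.LUCENE_36)));

    // Attach a feed checkpoint to the commit point itself.
    Map<String, String> commitData = new HashMap<String, String>();
    commitData.put("feedTimestamp", "2012-06-24T12:00:00Z");
    writer.commit(commitData);
    writer.close();

    // Read it back from the latest commit without touching any documents.
    Map<String, String> userData = IndexReader.getCommitUserData(dir);
    System.out.println(userData.get("feedTimestamp"));
  }
}

Since the user data lives in the segments file, it would replicate along
with the index, which is exactly the property I need for the DC failover
case.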


> But there's an easier alternative. Just have a _very special_ document
> with a known <uniqueKey> that you index at the end of the run that
> 1> has no fields in common with any other document (except uniqueKey)
> 2> contains whatever data you want to carry around in whatever format you
> want.
>
> Now whenever you query for that document by ID, you get your info. And
> since you can't search the doc until after it's been committed, you know
> that the preceding documents have all been persisted.
>
> Of course whenever you send a version of the doc it will overwrite the
> one before since it has the same <uniqueKey>.
>

Yes, we thought about having this data stored as a "special" document
type, but conceptually it just does not feel right. Also, I fear the
extra modifications, and having to maintain query defaults so that this
document is never returned for any kind of search query ...


> Best
> Erick
>
> On Fri, Jun 22, 2012 at 5:34 AM, Jozef Vilcek  wrote:
>> Hi everyone,
>>
>> I am seeking a solution to store some custom data very close to /
>> within the index. I have found a possibility to pass commit "user" data
>> to IndexWriter:
>> http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/IndexWriter.html#commit(java.util.Map)
>> which is, from what I understand, stored somewhere close to the segments
>> "metadata" like index version, generation, ...
>>
>> Now, I see no easy way to accumulate and pass along such data with
>> Solr 3.6. DirectUpdateHandler2 commits implicitly via close rather
>> than invoking the commit API. I can extend DirectUpdateHandler2 and
>> alter the closeWriter method, but still ... I am not yet clear how to
>> pass along request-level params, which are not available at the
>> DirectUpdateHandler2 level. It seems that passing commitData is not
>> supported (maybe deliberately, by design) and is not going to be: when
>> I look at Solr trunk, I see the implicit commit removed and
>> writer.commit called with commitData, but no easy way to pass custom
>> commit data nor to hook in.
>>
>> Any recommendations for how to store some data close to index?
>>
>> To throw some light on why I want this ... Basically I want to store
>> there some kind of time stamp which defines what is already in the
>> index with respect to feeding updates from the external world. My
>> index is replicated to another index instance in a different data
>> center (serving traffic as well). When the default document feed in
>> DC1 goes south for some reason, the backup in DC2 jumps in to keep
>> updates alive ... but it has to know from where the feed should start
>> ... that would be the kind of time stamp stored and replicated with
>> the index.
>>
>> Many thanks in advance.
>>
>> Best,
>> Jozef


Re: Custom close to index metadata / pass commit data to writer.commit

2012-06-24 Thread Erick Erickson
Yeah, it's a bit kludgy I admit. But it's usable right now, pragmatism
rules sometimes...

But never returning this doc is actually relatively easy: just put your
data in a field that no other document has. There's no requirement that
any document in Solr have any field in common with any other document,
except when required="true".
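
Concretely, the marker doc could be as simple as this (field names made
up; *_s assumes a dynamic string field in your schema):

<add>
  <doc>
    <field name="id">__index_metadata__</field>
    <field name="feed_checkpoint_s">2012-06-24T12:00:00Z</field>
  </doc>
</add>

You'd fetch it with q=id:__index_metadata__, and could append
fq=-feed_checkpoint_s:[* TO *] to normal queries to keep it out of
results entirely.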

I guess you might see this doc in the case of *:* queries, but I _believe_
you'd be pretty safe here because this document will be at the end of your
insertions. There is some possibility of order-shuffling when merging, but
in the worst case this doc would be at the end of its segment. So a simple
test for the "very special <uniqueKey>" at the app level should ensure
that nobody sees it.

You could also do something like encrypt it just in case and live with
the resulting
garbage display in the unlikely case someone actually saw it.

But personally I think that's all overkill; if it's some simple
statistical data that the end user might be puzzled by but that wouldn't
compromise your app, just living with it might be worth it.

FWIW
Erick



Re: How can I optimize Sorting on multiple text fields

2012-06-24 Thread Alok Bhandari
Thanks for the inputs.

Eric, yes, I was referring to the String data-type. The reason I asked
this is that for a single customer we have multiple users, and each user
may apply different search criteria before sorting on the field, so if
we can cache the sorted results it may improve the user experience with
better performance.
