Yes, I guess that could be an option, but I'm not very experienced with Java
development or Solr modifications.
As my main goal was to create a similar sig in C#, I just use the C# method to
create the sig myself before indexing, instead of using the Solr Deduplicate function.

That way, when searching, I can use the same method and be certain the sig
is the same.
As the algorithm used is the same as TextProfileSignature's, the result is the same
as using Solr Deduplicate.

Frederico 
 


-----Original Message-----
From: lboutros [mailto:boutr...@gmail.com] 
Sent: Friday, 8 April 2011 10:11
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

Couldn't you extend the TextProfileSignature and modify the TokenComparator
class to use lexical order when tokens have the same frequency?
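
A minimal sketch of that idea, not the actual Solr code: the Token class
below is a hypothetical stand-in with the same two fields (val, cnt), and
the names StableTokenOrder, BY_COUNT_THEN_TEXT and sortedValues are made up
for illustration. The comparator sorts by descending count and breaks ties
on the token text, so the final order no longer depends on how the tokens
came out of the HashMap.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class StableTokenOrder {

    // Hypothetical stand-in for the Token class used by the signature code.
    static class Token {
        final String val; // token text
        final int cnt;    // how many times the token occurs

        Token(String val, int cnt) {
            this.val = val;
            this.cnt = cnt;
        }
    }

    // Higher counts first; tokens with equal counts are ordered by their text.
    static final Comparator<Token> BY_COUNT_THEN_TEXT = new Comparator<Token>() {
        @Override
        public int compare(Token t1, Token t2) {
            int byCount = Integer.compare(t2.cnt, t1.cnt);
            if (byCount != 0) {
                return byCount;
            }
            // String.compareTo compares UTF-16 code units, not locale rules.
            return t1.val.compareTo(t2.val);
        }
    };

    // Returns the token texts in sorted order without modifying the input list.
    static List<String> sortedValues(List<Token> profile) {
        List<Token> copy = new ArrayList<Token>(profile);
        Collections.sort(copy, BY_COUNT_THEN_TEXT);
        List<String> out = new ArrayList<String>();
        for (Token t : copy) {
            out.add(t.val);
        }
        return out;
    }
}
```

With this ordering, a profile holding zebra (2), apple (2) and solr (3)
sorts to solr, apple, zebra whatever order the tokens were added in. A C#
port would presumably need an ordinal string comparison (for example
String.CompareOrdinal) rather than a culture-aware one to match compareTo.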

Ludovic.

2011/4/8 Frederico Azeiteiro [via Lucene] <
ml-node+2794604-1683988626-383...@n3.nabble.com>

> Hi.
>
> Yes, I managed to create a stable comparator in C# for profile.
> The problem comes before that, at:
>
> ...
> tokens.put(s, tok);
> ...
>
> Imagine you have 2 tokens with the same frequency: the stable sort
> comparator for profile will maintain their original order.
> The problem is that the original order comes from the way they are
> inserted in the HashMap 'tokens' and not from the order in which the tokens
> appear in the original text.
>
> Frederico
>
> -----Original Message-----
> From: lboutros [mailto:[hidden email]]
> Sent: Friday, 8 April 2011 09:49
> To: [hidden email]
> Subject: Re: Using MLT feature
>
> It seems that tokens are sorted by frequency:
>
> ...
> Collections.sort(profile, new TokenComparator());
> ...
>
>
> and
>
> private static class TokenComparator implements Comparator<Token> {
>     public int compare(Token t1, Token t2) {
>       return t2.cnt - t1.cnt;
>     }
> }
>
> and cnt is the token count.
>
> Ludovic.
>
> 2011/4/7 Frederico Azeiteiro [via Lucene] <[hidden email]>
>
>
> > Well, at this point I'm more focused on the Deduplicate issue.
> >
> > Using a Min_token_len of 4 I'm getting nice comparison results. MLT returns
> > a lot of similar docs that I don't consider similar, even after tuning the
> > parameters.
> >
> > While finishing this issue, I found out that the signature also contains the
> > field name, meaning that if you wish to sign both the title and text fields,
> > your signature will be a hash of ("text"+"text value"+"title"+"title
> > value").
> >
> > In any case, I found that the HashMap used in the hash algorithm inserts
> > the tokens by some HashMap-internal sort method that I can't understand :),
> > and so it is impossible to copy to the C# implementation.
> >
> > Thank you for all your help,
> > Frederico
> >
> >
>
>
> -----
> Jouve
> France.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Using-MLT-feature-tp2774454p2794585.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
>
>


-----
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Using-MLT-feature-tp2774454p2794622.html
Sent from the Solr - User mailing list archive at Nabble.com.
