Re: Facets, termvectors, relevancy and Multi word tokenizing

2014-03-11 Thread epnRui
Hi Iorixxx!

I have not optimized the index but the day after this post I saw I didn't
have this problem anymore.

I will follow your advice next time!

Now I'm avoiding so much manipulation at index time and doing more of the
work in the Java code on the client side.

If I had time I would implement a new tokenizer...



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Facets-termvectors-relevancy-and-Multi-word-tokenizing-tp4120101p4122862.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Implementing a customised tokenizer

2014-03-11 Thread epnRui
Hi Ahmet,

I think expungeDeletes is done automatically through SolrJ, so I don't
think it was that.
The problem apparently solved itself. I wonder if it has to do with an
automatic optimization of the Solr indexes?
Otherwise it was something like an XY problem :P

Thanks for the help!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Implementing-a-customised-tokenizer-tp4121355p4122864.html


Facets, termvectors, relevancy and Multi word tokenizing

2014-02-27 Thread epnRui
Hi everyone!

I have a problem for which I have searched but haven't found a solution yet,
and I'm rather confused at the moment.

I have an application that stores human-readable texts in my Solr index.
It finds the most relevant terms in each text, I think using
termvectors and facets, and it stores the facet terms.

All works fine, but now I need the most relevant terms to include terms
of two or more words, like "European Union", which is quite a frequent term
in my system. At the moment the facets still contain "European" and
"Union" as two separate terms.

So, questions are:
 - Is it possible to have facets of two or more words?
 - Can I tokenize a phrase into words but, when it comes across "European
Union", have it generate one token, "European Union", rather than two tokens,
"European" and "Union"?
 - Can termvectors be used to find relevancy of multi-word terms like
"European Union" ?
 - Can I use SynonymFilterFactory that would transform: "EU, UE, European
Union, Union Europeene" into "European Union" ?
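
The mapping asked about in the last bullet can be illustrated outside Solr. What follows is only a minimal plain-Java sketch, assuming the four variants listed above and a hypothetical class name SynonymNormalizer. It does not use SolrJ or SynonymFilterFactory; it just shows the idea of folding every variant into one canonical phrase on the client side before the text is sent for indexing.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: folds known variants into one canonical phrase.
final class SynonymNormalizer {
    // Variants from the question, lower-cased; longer phrases are listed first
    // so that they are tried before the short abbreviations.
    private static final Map<String, String> CANONICAL = new LinkedHashMap<>();
    static {
        CANONICAL.put("union europeene", "European Union");
        CANONICAL.put("european union", "European Union");
        CANONICAL.put("eu", "European Union");
        CANONICAL.put("ue", "European Union");
    }

    // One case-insensitive alternation over all variants, on word boundaries.
    private static final Pattern VARIANTS;
    static {
        StringBuilder alternation = new StringBuilder();
        for (String variant : CANONICAL.keySet()) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(variant));
        }
        VARIANTS = Pattern.compile("\\b(" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);
    }

    static String normalize(String text) {
        Matcher m = VARIANTS.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String canonical = CANONICAL.get(m.group(1).toLowerCase(Locale.ROOT));
            m.appendReplacement(out, Matcher.quoteReplacement(canonical));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("The EU and the Union Europeene met"));
    }
}
```

Word boundaries keep "Europe" from matching the "eu" variant. The sketch does nothing about tokenization itself, so the canonical phrase would still be split into two words by a standard tokenizer.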

At index time I have the following analyzer for English:


  






  



Thank you for the help!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Facets-termvectors-relevancy-and-Multi-word-tokenizing-tp4120101.html


Re: Facets, termvectors, relevancy and Multi word tokenizing

2014-02-28 Thread epnRui
Hi Ahmet!!

I went ahead and did something I thought was not a clean solution, and
then when I read your post I found we had thought of the same solution,
including the European_Parliament with the _  :)

So I guess there is no cleaner way to do this, except maybe
implementing my own Tokenizer and Filters, but I honestly couldn't find a
tutorial on implementing a customized Solr Tokenizer. If I end up needing to
do it I will write a tutorial.

So for now I'm using PatternReplaceCharFilterFactory to replace "European
Parliament" with European_Parliament (initially I didn't use the
md5hash European_Parliament).

Then, after StandardTokenizerFactory has run, I replace it back with
"European Parliament". Well, I guess I just found a way to get a two-word token
:)
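
The protect, tokenize, restore sequence described above can be sketched in plain Java. This is only an illustration under assumptions: the class name PhraseProtector and the phrase list are hypothetical, and a simple regex split stands in for StandardTokenizerFactory, so it is not the Solr analysis chain itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: keeps listed phrases together as single tokens.
final class PhraseProtector {
    // Assumed phrase list; the thread only names these two.
    private static final List<String> PHRASES =
            Arrays.asList("European Parliament", "European Union");

    static List<String> tokenize(String text) {
        // Step 1: join the words of each phrase with '_' so the splitter
        // sees one word (the role of the char filter in the post).
        String protectedText = text;
        for (String phrase : PHRASES) {
            protectedText = protectedText.replace(phrase, phrase.replace(' ', '_'));
        }

        // Step 2: split on anything that is not a letter, digit or '_'
        // (a crude stand-in for the real tokenizer).
        List<String> tokens = new ArrayList<>();
        for (String raw : protectedText.split("[^\\p{L}\\p{N}_]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            // Step 3: put the space back (the role of the token filter).
            tokens.add(raw.replace('_', ' '));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The European Parliament voted today."));
    }
}
```

One limitation of this sketch: step 3 turns every underscore into a space, so a word that genuinely contains an underscore would be altered too. A less common joining string would avoid that.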

I had seen ShingleFilterFactory, but the problem is that I don't need the
whole phrase split into tokens of 2 words, which is what I understood it does. Of
course I would need some filter that reads a .txt listing the tokens to
merge, like "European" and "Parliament".

I still have another problem now, but maybe I'll find a solution after I
read the page you pointed me to, which seems great. Solr is treating #European
as both #European and European, meaning it creates 2 facets for one token. I want it
to treat it only as #European. I ran the analyzer debugger in my Solr
admin console and I don't see how it can be doing that.
Would you know of a reason for this?

Thanks for your reply. The page you pointed me to seems excellent and I'll read
it through.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Facets-termvectors-relevancy-and-Multi-word-tokenizing-tp4120101p4120361.html


Re: Facets, termvectors, relevancy and Multi word tokenizing

2014-03-03 Thread epnRui
Hi guys,

I'm on my way to solving it properly.

This is what my field looks like now:



  









  

I still have one case where I'm facing issues, because in fact I want to
preserve the #:
 - #European Parliament is translated into one token instead of two:
"#European" and "Parliament"... anyway, I have some ideas on how to do it.
I'll let you know what the final solution is.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Facets-termvectors-relevancy-and-Multi-word-tokenizing-tp4120101p4120948.html


Implementing a customised tokenizer

2014-03-05 Thread epnRui
I have managed to understand how to properly implement a CharFilter and a
Filter and how to change the words in them, but I fail to understand how the
Tokenizer works...

I also can't find any tutorials on it.
Could you provide an example implementation of incrementToken showing how to
manipulate the tokens?
Is there any documentation on this?
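
For readers of this archive, the pull-style contract behind incrementToken can be sketched without Lucene on the classpath. The class below is a hypothetical plain-Java analogy, not the Lucene Tokenizer API: each call advances to the next token and returns true, or returns false when the input is exhausted. In Lucene the current term and offsets live in attributes such as CharTermAttribute and OffsetAttribute rather than in plain fields as assumed here.

```java
// Hypothetical analogy of a pull-style tokenizer; not a Lucene class.
final class PullTokenizer {
    private final String input;
    private int pos = 0;

    // State of the current token, valid only after incrementToken() returned true.
    private String term;
    private int startOffset;
    private int endOffset;

    PullTokenizer(String input) {
        this.input = input;
    }

    // Advances to the next run of letters or digits.
    boolean incrementToken() {
        int n = input.length();
        // Skip separators before the next token.
        while (pos < n && !Character.isLetterOrDigit(input.charAt(pos))) {
            pos++;
        }
        if (pos >= n) {
            return false; // end of input, no more tokens
        }
        startOffset = pos;
        while (pos < n && Character.isLetterOrDigit(input.charAt(pos))) {
            pos++;
        }
        endOffset = pos;
        term = input.substring(startOffset, endOffset);
        return true;
    }

    String term() { return term; }
    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }

    public static void main(String[] args) {
        PullTokenizer tokenizer = new PullTokenizer("Hello, Solr user");
        // The consumer drives the loop, one token per call.
        while (tokenizer.incrementToken()) {
            System.out.println(tokenizer.term() + " ["
                    + tokenizer.startOffset() + "," + tokenizer.endOffset() + ")");
        }
    }
}
```

The point of the design is that the consumer pulls one token at a time and the tokenizer overwrites its own state on every call, so a token has to be copied if it is needed after the next call.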

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Implementing-a-customised-tokenizer-tp4121355.html


Re: Facets, termvectors, relevancy and Multi word tokenizing

2014-03-05 Thread epnRui
Hi guys,

So, I keep facing this problem which I can't solve. I thought it was due to
HTML anchors containing the name of the hashtag, and thus repeating it, but
it's not.

So the use case is:
1 - I need to consider hashtags as tokens.
2 - The hashtag has to show up in the facets.

Right now if I index this text:
"Action, sanctions or diplomacy: which way forward for the #EU & #Ukraine?
Tell us @LinkedIn debate http://t.co/umf9olxH9f"

I get the tokens as follows (see image for more detail):
action  sanction  diplomacy  forward  #eu  #ukraine  tell  linkedin  debate
umf9olxh9f
ace bate

 

Then, if I look at the facets after indexing, I find that (for
Ukraine) the facet count is increased for both "Ukraine" and "#Ukraine",
instead of only for #Ukraine.

Does anyone have any idea of why this is happening?
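
The behaviour wanted in the use case above (a hashtag is one token and is counted once, under its "#" form only) can be written down as a small client-side check. This is a sketch under assumptions, not an explanation of what Solr does: the class name HashtagFacets and the token pattern are hypothetical, and URLs and @mentions are not treated specially.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts tokens, keeping a leading '#' attached.
final class HashtagFacets {
    // A token is an optional leading '#' followed by letters or digits.
    private static final Pattern TOKEN = Pattern.compile("#?[\\p{L}\\p{N}]+");

    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Lower-case so "#Ukraine" and "#ukraine" share one bucket.
            String token = m.group().toLowerCase(Locale.ROOT);
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("which way forward for the #EU & #Ukraine? Tell us"));
    }
}
```

Because the "#" is consumed as part of the match, "#Ukraine" never produces a separate "ukraine" entry here. Comparing this expected result with the facet counts returned by Solr may help narrow down which stage of the analysis chain emits the extra term.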



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Facets-termvectors-relevancy-and-Multi-word-tokenizing-tp4120101p4121389.html


Re: Implementing a customised tokenizer

2014-03-07 Thread epnRui
Hi iorixxx!

Thanks for replying. I managed to get by well enough not to need a
customized tokenizer implementation. That would be a pain in ...

Anyway, now I have another problem, which is related to the following:

 - I had previously used char and pattern replacements (charfilters and
filters) at index time to replace "EP" with "European Parliament". At that
point, it increased the facet_field count for "European Parliament".
Well, now I have a big problem: I have already deleted the document
which generated the "European Parliament" and still that facet_field.count
will not go down!! Is there a way either to remove a facet_field or to
reduce its count manually?

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Implementing-a-customised-tokenizer-tp4121355p4121957.html


Re: Facets, termvectors, relevancy and Multi word tokenizing

2014-03-07 Thread epnRui
Hi guys!

I solved my problem on the client side but at least I solved it...

Anyway, now I have another problem, which is related to the following:

 - I had previously used char and pattern replacements (charfilters and
filters) at index time to replace "EP" with "European Parliament". At that
point, it increased the facet_field count for "European Parliament".
Well, now I have a big problem: I have already deleted the document
which generated the "European Parliament" and still that facet_field.count
will not go down!! Is there a way either to remove a facet_field or to
reduce its count manually?

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Facets-termvectors-relevancy-and-Multi-word-tokenizing-tp4120101p4121958.html


setting up master and slave in same machine with diff ip's and same port

2013-01-23 Thread epnRui
Hi everyone 

It's my first post here, so I hope I'm doing it in the right place.

I'm a software developer and I'm setting up a DEV environment in Ubuntu with
the same configuration as in PROD. (Apparently this IT department doesn't
know the difference between a developer and a sys admin.)

In PROD we have a Solr master and a Solr slave, on two different IPs. Let's say:
Master 192.10.1.1 
Slave 192.10.1.2 

In DEV I have only one server: 
10.1.1.1 

All of them are Ubuntu servers. 

Can I put master and slave on 10.1.1.1 (DEV) without touching any Solr
configuration (no IP change, no port change) and still make it work?

Basically, what I'm looking for is the Ubuntu server configuration I'd have to
do to make this work.

Thanks a lot



--
View this message in context: 
http://lucene.472066.n3.nabble.com/setting-up-master-and-slave-in-same-machine-with-diff-ip-s-and-same-port-tp4035795.html


Re: setting up master and slave in same machine with diff ip's and same port

2013-01-31 Thread epnRui
Hi,

I solved the issue by setting up two different virtual network adapters on
the Ubuntu server.

case closed ;)


thanks for the help!!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/setting-up-master-and-slave-in-same-machine-with-diff-ip-s-and-same-port-tp4035795p4037713.html


Stopping solr

2013-01-31 Thread epnRui
Hi people,


First of all, this forum is a godsend!!!

Second:

I have a master / slave configuration, using replication.

Currently in production I have only one server; there's no backup server
(really...).
The web application is public; everyone can see it.

 - How often, in your experience, does Solr crash, and why?
 - If I kill the Solr master and slave, do I usually also need to delete the
indexes? Or should everything be fine upon restarting?
 - If I want to upgrade or patch the Solr master and slave, is there a way
to keep the services that feed from them from failing? Solr in my application is
used for indexing social network feeds, like Facebook posts... What
I'm trying to achieve is that the user keeps seeing the web page working
normally (of course, with the old indexed feeds) if Solr crashes.
Maybe I can set up an extra Solr slave as a backup system?

I know these are "innocent" questions, but I am learning sys admin work;
apparently my IT department thinks I'm the "do it all" guy and that IT people
need to both develop and sys admin. If I told you where I work you would fall
off your chair.


Best regards,
Rui



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Stopping-solr-tp4037715.html