Re: If zookeeper is down, SolrCloud nodes will not start correctly, even if zookeeper is started later

2015-10-07 Thread Shawn Heisey
On 10/6/2015 10:22 PM, Adrian Liew wrote:
> Hence, the issue is that upon startup of the three machines, the startup of ZK
> and Solr is out of sequence, which causes SolrCloud to behave unexpectedly.
> Note there is a Jira ticket filed for Solr 4.9 and above to track an
> improvement for this issue.
> (https://issues.apache.org/jira/browse/SOLR-5129)

That issue is unresolved, so it has not been fixed in any Solr version.

At this time, if you do not have Zookeeper quorum (a majority of your ZK
nodes fully operational), you will not be able to successfully start
SolrCloud nodes.  The issue has low priority because there is a viable
workaround -- ensure that ZK has quorum before starting or restarting
any Solr node.

Thinking out loud:  Until this issue is fixed, I think this means that a
3-node setup where all three nodes use the zookeeper embedded in Solr
will require a strange startup sequence if none of the nodes are running:

* Start node 1. Solr will not start correctly -- no ZK quorum.
* Start node 2. Solr might start correctly, not sure.
* Start node 3. This should start correctly.
* Restart node 1. With ZK nodes 2 and 3 running, this will work.
* Restart node 2 if it did not start properly the first time.

I really have no idea whether the second node startup will work properly.

Thanks,
Shawn



RE: If zookeeper is down, SolrCloud nodes will not start correctly, even if zookeeper is started later

2015-10-07 Thread Adrian Liew
Hi Shawn

Thanks for informing me. I guess the worst case scenario is that all 3 ZK
services are down, and that is unlikely to be the case. At this juncture, as you
said, the viable workaround is a manual approach of starting the services in
sequence to ensure a quorum can be established. So the proper sequence in a 3 ZK
+ Solr (both ZK and Solr on each server) server setup will be as follows:

Downed situation with one or more ZK services:
1. Restart all ZK Services first on all three machines
2. Restart all Solr Services on all three machines

Please do clarify if the above is correct and I will be happy to take this 
approach and communicate to my customer.

Many thanks.

Regards,
Adrian 

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Wednesday, October 7, 2015 4:09 PM
To: solr-user@lucene.apache.org
Subject: Re: If zookeeper is down, SolrCloud nodes will not start correctly, 
even if zookeeper is started later

On 10/6/2015 10:22 PM, Adrian Liew wrote:
> Hence, the issue is that upon startup of the three machines, the startup
> of ZK and Solr is out of sequence, which causes SolrCloud to behave
> unexpectedly. Note there is a Jira ticket filed for Solr 4.9 and above
> to track an improvement for this issue.
> (https://issues.apache.org/jira/browse/SOLR-5129)

That issue is unresolved, so it has not been fixed in any Solr version.

At this time, if you do not have Zookeeper quorum (a majority of your ZK nodes 
fully operational), you will not be able to successfully start SolrCloud nodes. 
 The issue has low priority because there is a viable workaround -- ensure that 
ZK has quorum before starting or restarting any Solr node.

Thinking out loud:  Until this issue is fixed, I think this means that a 3-node 
setup where all three nodes use the zookeeper embedded in Solr will require a 
strange startup sequence if none of the nodes are running:

* Start node 1. Solr will not start correctly -- no ZK quorum.
* Start node 2. Solr might start correctly, not sure.
* Start node 3. This should start correctly.
* Restart node 1. With ZK nodes 2 and 3 running, this will work.
* Restart node 2 if it did not start properly the first time.

I really have no idea whether the second node startup will work properly.

Thanks,
Shawn



Highlighting tag is not showing occasionally

2015-10-07 Thread Zheng Lin Edwin Yeo
Hi,

Has anyone faced the problem where, when using highlighting, there are
results which are returned, but there is no highlighting in the result (i.e.
no <em> tag)?

I found that there is a match in another field which I did not include in
my hl.fl parameters when I do fl=*, but that same word actually does appear
in content.

I would like to find out why sometimes there is a match in content, but it
is not highlighted (the word is not in the stopword list). Did I make any
mistakes in my configuration?

I've included my highlighting request handler from solrconfig.xml here.

<requestHandler name="/highlight" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">text</str>
    <str name="fl">id, title, content_type, last_modified, url, score</str>

    <str name="hl">on</str>
    <str name="hl.fl">id, title, content, author, tag</str>
    <str name="hl.highlightMultiTerm">true</str>
    <str name="hl.usePhraseHighlighter">true</str>
    <str name="hl.encoder">html</str>
    <str name="hl.fragsize">200</str>

    <str name="group">true</str>
    <str name="group.field">signature</str>
    <str name="group.ngroups">true</str>
    <str name="group.limit">100</str>
  </lst>
</requestHandler>


Regards,
Edwin


Re: Pressed optimize and now SOLR is not indexing while optimize is going on

2015-10-07 Thread Eric Torti
Hello Shawn,

I'm sorry to diverge this thread a little bit, but could you please point me
to resources that explain in depth how the OS uses the non-Java memory to
cache index data?

> Whatever RAM is left over after you give 12GB to Java for Solr will be
> used automatically by the operating system to cache index data on the
> disk.  Solr is completely reliant on that caching for good performance.

I'm puzzled as to why the physical memory of Solr's host machine is always
used up, and I think some resources on that would help me understand it.

Thanks

On Tue, Oct 6, 2015 at 5:07 PM, Siddhartha Singh Sandhu <
sandhus...@gmail.com> wrote:

> Thank you for helping out.
>
> Further inquiry: I am committing records to my Solr implementation and they
> are not showing up in my search. I am searching on the default id.
> Is this related to the fact that I don't have enough memory, so my Solr is
> taking a lot of time to actually make the indexed documents available
> instantly?
>
> I also looked at the solr log when I sent in my curl commit with my
> record (which I cannot see in the Solr instance even after sending it
> repeatedly), but it didn't throw an error.
>
> I got this as my response on insertion of that record:
>
> {"responseHeader":{"status":0,"QTime":57}}
>
> Thank you.
>
> Sid.
>
> On Tue, Oct 6, 2015 at 3:21 PM, Shawn Heisey  wrote:
>
> > On 10/6/2015 8:18 AM, Siddhartha Singh Sandhu wrote:
> > > A have a few questions about optimize. Is the search index fully
> > searchable
> > > after a commit?
> >
> > If openSearcher is true on the commit, then changes to the index
> > (additions, replacements, deletions) will be visible when the commit
> > completes.
> >
> > > How much time does one have to wait in case of a hard commit for the
> > index
> > > to be available?
> >
> > This is impossible to answer.  It will take as long as it takes, and the
> > time will depend on many factors, so it is nearly impossible to
> > predict.  The only way to know is to try it ... and the number you get
> > on one test may be very different than what you actually see once the
> > system is in production.
> >
> > > I have an index of 180G. Do I need to hit the optimize on this chunk.
> > This
> > > is a single core. Say I cannot get in a cloud env because of cost but
> > this
> > > is a fairly large
> > > amazon machine where I have given SOLR 12G of memory.
> >
> > Whatever RAM is left over after you give 12GB to Java for Solr will be
> > used automatically by the operating system to cache index data on the
> > disk.  Solr is completely reliant on that caching for good performance.
> > A perfectly ideal system for that index and heap size would have 192GB
> > of RAM, which is enough to entirely cache the index.  I personally
> > wouldn't expect good performance with less than 96GB.  Some systems with
> > a 180GB index and a 12GB heap might be OK with 64GB total memory, while
> > others with the same size index will require more.
> >
> >
> >
> https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >
> > If the index is on SSD, then RAM is *slightly* less important, and
> > performance usually goes up with SSD ... but an SSD cannot completely
> > replace RAM, because RAM is much faster.  With SSD, you can get away
> > with less RAM than you can on a spinning disk system, but depending on a
> > bunch of factors, it may not be a LOT less RAM.
> >
> > https://wiki.apache.org/solr/SolrPerformanceProblems
> >
> > Optimizing the index is almost never necessary with recent versions.  In
> > almost all cases optimizing will get you a performance increase, but it
> > comes at a huge cost in terms of resource utilization to DO the
> > optimize.  While the optimize is happening performance will likely be
> > worse, possibly a LOT worse.  Newer versions of Solr (Lucene) have
> > closed the gap on performance with non-optimized indexes, so it doesn't
> > gain you as much in performance as it did in earlier versions.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: Pivot facets

2015-10-07 Thread Alessandro Benedetti
I agree with Hoss. Is this what you are expecting?

Indexing ...

Doc 1 :
Country: England
Region: Greater London
City: London

Doc2 :
Country:England
City: Manchester

Query results:

Country: England (2)
    Region: Greater London (1)
        City: London (1)
    Region: missing (1)
        City: Manchester (1)

If this is what you want, then go with Hoss's suggestion and use the missing
count!
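
For illustration, a minimal SolrJ sketch of such a request (my sketch, not the
poster's code; field names are taken from the example above):

import org.apache.solr.client.solrj.SolrQuery;

public class PivotWithMissing {
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        // Pivot from Country down through Region to City.
        q.add("facet.pivot", "Country,Region,City");
        // Emit a bucket for documents with no value at a level
        // (e.g. Doc2 above, which has no Region).
        q.set("facet.missing", true);
        return q;
    }
}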

Cheers

On 6 October 2015 at 18:55, Chris Hostetter 
wrote:

>
> It's not entirely clear what your queries/data look like, or what results
> you are expecting to get back; please consider asking your question again
> with more details...
> https://wiki.apache.org/solr/UsingMailingLists
>
> ...in the mean time the best guess i can make is that perhaps you aren't
> familiar with the "facet.missing" request param?  try adding
> facet.missing=true to your request and see if that gives you what you are
> looking for...
>
>
> https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-Thefacet.missingParameter
>
> https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-Pivot%28DecisionTree%29Faceting
>
>
>
> -Hoss
> http://www.lucidworks.com/
>



-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: efficient sort by title (multi word field)

2015-10-07 Thread Alessandro Benedetti
Hi Gili,
If you want to use sorting, you will need to index the extra string field,
and I suggest you enable docValues on that field.
DocValues are not compatible with analysed text fields, and neither is
sorting.

[1]
https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-ThesortParameter

Cheers

On 6 October 2015 at 19:27, Gili Nachum  wrote:

> Hi, wanted to make sure I'm implementing sort in an efficient way...
>
> I need to allow users to sort by documents' title field. A title can
> contain 1-20 words.
> Title examples: "new project meeting minutes - Oct 2015 - new chance on the
> horizon" or "how to create a wonderful presentation".
>
> I'm already indexing title as a TextField, and I'm not comfortable with
> indexing it again as an extra StrField, plus the extra FieldCache memory
> I'll need.
> I can probably avoid the FieldCache by using docValues to reduce memory usage.
>
> *But... is there a more efficient way to provide this sort? Perhaps
> something that takes advantage of the title field being a chain of words
> with whitespace between them?*
>
> My index is 100's of millions of documents over 8 shards.
>
> Thanks.
>



-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Query to count matching terms and disable 'coord' multiplication

2015-10-07 Thread Alessandro Benedetti
Hi,
Regarding 1), you should take a look at all the Similarity implementations;
maybe there's a good fit there for your use case!

Another interesting read could be:
http://opensourceconnections.com/blog/2014/01/20/build-your-own-custom-lucene-query-and-scorer/
from Doug.
I remember I saw that kind of TF-only scorer, but I can't recall where.
You should take a look; if I find something better I'll let you know.

2) To disable norms you can omit norms at indexing time and avoid using
them, or go again with a custom scorer; a Similarity-based sketch follows.
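
For question 2), a minimal sketch of the custom-Similarity route (my
illustration, not from the thread: it assumes the classic TF-IDF
DefaultSimilarity of Lucene 4.x/5.x, and would still need to be wired in via a
<similarity> element in schema.xml):

import org.apache.lucene.search.similarities.DefaultSimilarity;

// Sketch: neutralize the coord and queryNorm multipliers by making both constant.
public class NoCoordNoNormSimilarity extends DefaultSimilarity {
    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1.0f; // don't scale scores by the fraction of query terms matched
    }

    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1.0f; // drop the query normalization factor entirely
    }
}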

Cheers

On 6 October 2015 at 20:30, Tim Hearn  wrote:

> Hello everyone,
>
> I have two questions
>
> 1) Is there a way to query Solr to rank results based purely on the number
> of terms in the query which are contained in the document?
> Example:
> doc1: 'foo bar poo car foo'
> q1: 'foo, car, two, start'
> score(doc1, q1) = 2 (since foo and car both occur in doc1 - never mind
> that foo occurs twice)
>
> This is also the numerator of the 'coord' factor
>
> 2) Is there a way to disable the 'coord' and 'query norm' multiplication of
> query results all together?
>



-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Pressed optimize and now SOLR is not indexing while optimize is going on

2015-10-07 Thread Toke Eskildsen
On Wed, 2015-10-07 at 07:03 -0300, Eric Torti wrote:
> I'm sorry to diverge this thread a little bit, but could you please point me
> to resources that explain in depth how the OS uses the non-Java memory to
> cache index data?

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Shawn Heisey:
> > Whatever RAM is left over after you give 12GB to Java for Solr will be
> > used automatically by the operating system to cache index data on the
> > disk.  Solr is completely reliant on that caching for good performance.
> 
> I'm puzzled as to why the physical memory of solr's host machine is always
> used up and I think some resources on that would help me understand it.

It is not used up as such: Add "Disk cache" and "Free space" (or
whatever your monitoring tool calls them) and you will have the amount
of memory available for new processes. If you start a new and
memory-hungry process, it will take the memory from the free pool first,
then from the disk cache.


- Toke Eskildsen, State and University Library, Denmark




RE: Run Solr 5.3.0 as a Service on Windows using NSSM

2015-10-07 Thread Adrian Liew
Hi Edwin,

You may want to explore some of the configuration properties for ZooKeeper:

http://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html#sc_zkMulitServerSetup

My recommendation is to try running your batch files outside of NSSM so it is
easier to debug and observe what you see from the command window. I don't think
ZK and Solr can be automated well on startup using NSSM, due to the fact that ZK
services need to be running before you start up the Solr services. I just had a
conversation with Shawn on this topic. NSSM cannot do the magic startup in a
cluster setup; for that, you may need to write custom scripting to get it right.

Back to your original issue, I guess it is worth exploring timeout values. Then 
again, I will leave the real Solr experts to chip in their thoughts.

Best regards,

Adrian Liew 


-Original Message-
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com] 
Sent: Wednesday, October 7, 2015 1:40 PM
To: solr-user@lucene.apache.org
Subject: Re: Run Solr 5.3.0 as a Service on Windows using NSSM

Hi Adrian,

I've waited for more than 5 minutes, and most of the time when I refresh it says
that the page cannot be found. Once or twice the main Admin page loaded,
but none of the cores were loaded.

I have 20 cores which I'm loading. The cores are of various sizes; the
maximum one is 38GB. Others range from 10GB to 15GB, and there are some which
are less than 1GB.

My overall core size is about 200GB.

Regards,
Edwin


On 7 October 2015 at 12:11, Adrian Liew  wrote:

> Hi Edwin,
>
> I have setup NSSM on Solr 5.3.0 in an Azure VM and can start up Solr 
> with a base standalone installation.
>
> You may have to give Solr some time to bootstrap things and wait for 
> the page to reload. Are you still seeing the page after 1 minute or so?
>
> What are your core sizes? And how many cores are you trying to load?
>
> Best regards,
> Adrian
>
> -Original Message-
> From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
> Sent: Wednesday, October 7, 2015 11:46 AM
> To: solr-user@lucene.apache.org
> Subject: Run Solr 5.3.0 as a Service on Windows using NSSM
>
> Hi,
>
> I tried to follow this to start my Solr as a service using NSSM.
> http://www.norconex.com/how-to-run-solr5-as-a-service-on-windows/
>
> Everything is fine when I start the services under Component Services.
> However, when I tried to point to the Solr Admin page, it says that 
> the page cannot be found.
>
> I have tried the same thing in Solr 5.1, and it was able to work. Not 
> sure why it couldn't work for Solr 5.2 and Solr 5.3.
>
> Is there any changes required to what is listed on the website?
>
> Regards,
> Edwin
>


Re: Run Solr 5.3.0 as a Service on Windows using NSSM

2015-10-07 Thread Upayavira
Wrap your script that starts Solr with one that checks it can access
Zookeeper before attempting to start Solr; that way, once ZK starts,
Solr will come up. Then hand *that* script to NSSM.
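
A minimal sketch of such a check in Java, using ZooKeeper's "ruok"/"imok"
four-letter-word protocol (the wrapper could equally be a batch or shell
script; the host, port, and class name here are assumptions):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ZkCheck {
    // Returns true if the ZooKeeper server at host:port answers "imok"
    // to the "ruok" four-letter command, i.e. it is up and serving.
    static boolean zkIsOk(String host, int port) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 3000);
            OutputStream out = s.getOutputStream();
            out.write("ruok".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            s.shutdownOutput();
            InputStream in = s.getInputStream();
            byte[] buf = new byte[4];
            int n = in.read(buf);
            return n == 4 && "imok".equals(new String(buf, StandardCharsets.US_ASCII));
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        while (!zkIsOk("localhost", 2181)) { // host/port are assumptions
            Thread.sleep(5000);              // poll until ZK is reachable
        }
        // ...now launch bin\solr.cmd, e.g. via ProcessBuilder.
    }
}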

And finally, when one of you has got a setup that works with NSSM
starting Solr via the default bin\solr.cmd script, create a patch and
upload it to JIRA. It would be a valuable thing for Solr to have a
*standard* way to start Solr on Windows as a service. I recall checking
the NSSM license and it wouldn't be an issue to include it within Solr -
or to have a script that assumes it is installed.

Upayavira

On Wed, Oct 7, 2015, at 11:49 AM, Adrian Liew wrote:
> Hi Edwin,
> 
> You may want to explore some of the configuration properties for ZooKeeper:
> 
> http://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html#sc_zkMulitServerSetup
> 
> My recommendation is to try running your batch files outside of NSSM so it is
> easier to debug and observe what you see from the command window. I don't
> think ZK and Solr can be automated well on startup using NSSM, due to the
> fact that ZK services need to be running before you start up the Solr
> services. I just had a conversation with Shawn on this topic. NSSM cannot
> do the magic startup in a cluster setup; for that, you may need to write
> custom scripting to get it right.
> 
> Back to your original issue, I guess it is worth exploring timeout
> values. Then again, I will leave the real Solr experts to chip in their
> thoughts.
> 
> Best regards,
> 
> Adrian Liew 
> 
> 
> -Original Message-
> From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com] 
> Sent: Wednesday, October 7, 2015 1:40 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Run Solr 5.3.0 as a Service on Windows using NSSM
> 
> Hi Adrian,
> 
> I've waited for more than 5 minutes, and most of the time when I refresh
> it says that the page cannot be found. Once or twice the main Admin
> page loaded, but none of the cores were loaded.
> 
> I have 20 cores which I'm loading. The cores are of various sizes; the
> maximum one is 38GB. Others range from 10GB to 15GB, and there are some
> which are less than 1GB.
> 
> My overall core size is about 200GB.
> 
> Regards,
> Edwin
> 
> 
> On 7 October 2015 at 12:11, Adrian Liew  wrote:
> 
> > Hi Edwin,
> >
> > I have setup NSSM on Solr 5.3.0 in an Azure VM and can start up Solr 
> > with a base standalone installation.
> >
> > You may have to give Solr some time to bootstrap things and wait for 
> > the page to reload. Are you still seeing the page after 1 minute or so?
> >
> > What are your core sizes? And how many cores are you trying to load?
> >
> > Best regards,
> > Adrian
> >
> > -Original Message-
> > From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
> > Sent: Wednesday, October 7, 2015 11:46 AM
> > To: solr-user@lucene.apache.org
> > Subject: Run Solr 5.3.0 as a Service on Windows using NSSM
> >
> > Hi,
> >
> > I tried to follow this to start my Solr as a service using NSSM.
> > http://www.norconex.com/how-to-run-solr5-as-a-service-on-windows/
> >
> > Everything is fine when I start the services under Component Services.
> > However, when I tried to point to the Solr Admin page, it says that 
> > the page cannot be found.
> >
> > I have tried the same thing in Solr 5.1, and it was able to work. Not 
> > sure why it couldn't work for Solr 5.2 and Solr 5.3.
> >
> > Is there any changes required to what is listed on the website?
> >
> > Regards,
> > Edwin
> >


Fuzzy search for names and phrases

2015-10-07 Thread vit
Could someone share experience applying fuzzy name search using Solr?
It should not be just edit distance; I also want to cover
split-and-merge cases like "OneIndustrial" vs "One Industrial", etc.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Fuzzy-search-for-names-and-phrases-tp4233209.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: EdgeNGramFilterFactory question

2015-10-07 Thread vit
Any experience with EdgeNGramFilterFactory will be appreciated.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNGramFilterFactory-question-tp4233034p4233210.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: If zookeeper is down, SolrCloud nodes will not start correctly, even if zookeeper is started later

2015-10-07 Thread Shawn Heisey
On 10/7/2015 3:06 AM, Adrian Liew wrote:
> Thanks for informing me. I guess the worst case scenario is that all 3 ZK
> services are down, and that is unlikely to be the case. At this juncture, as you
> said, the viable workaround is a manual approach of starting the services in
> sequence to ensure a quorum can be established. So the proper sequence in a 3
> ZK + Solr (both ZK and Solr on each server) server setup will be as follows:
> 
> Downed situation with one or more ZK services:
> 1. Restart all ZK Services first on all three machines
> 2. Restart all Solr Services on all three machines
> 
> Please do clarify if the above is correct and I will be happy to take this 
> approach and communicate to my customer.

If zookeeper is external (not embedded in Solr), your procedure is
correct -- ensure enough ZK nodes are started to reach quorum, then
start Solr.  If using the embedded zookeeper (-DzkRun) then you would
want to follow the procedure I outlined in my last message.

Thanks,
Shawn



Unexpected delayed document deletion with atomic updates

2015-10-07 Thread John Smith
Hi,

I'm bumping into the following problem with update XML messages. The idea
is to record the number of clicks for a document: each time, a message
is sent to .../update such as this one:

<add>
  <doc>
    <field name="id">abc</field>
    <field name="Clicks" update="inc">1</field>
    <field name="Boost" update="set">1.05</field>
  </doc>
</add>

(Clicks is an int field; Boost is a float field that is updated to reflect
the change in popularity, using a formula based on the number of clicks.)

At the moment in the dev environment, changes are committed immediately.
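
For reference, a minimal SolrJ sketch of the same kind of update (my
reconstruction: the core URL is an assumption, and the inc/set operations
mirror the XML above):

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ClickUpdate {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "abc"); // key of the existing document

        Map<String, Object> inc = new HashMap<>();
        inc.put("inc", 1);         // atomically increment Clicks
        doc.addField("Clicks", inc);

        Map<String, Object> set = new HashMap<>();
        set.put("set", 1.05f);     // atomically set Boost
        doc.addField("Boost", set);

        client.add(doc);
        client.commit();
        client.close();
    }
}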


When a document is updated, the changes are indeed reflected in the
search results. If I click on the same document again, all goes well.
But when I click on another document, the latter gets updated as
expected, but the former is plainly deleted. It can no longer be found,
and the admin core Overview page counts 1 document less. If I click on a
3rd document, so goes the 2nd one.


The schema is the default one, amended to remove unneeded fields and add
new ones, nothing fancy. All fields are stored="true" and there's no
<copyField>. I've tried versions 5.2.1 & 5.3.1 in standalone mode, with
the same outcome. It looks like a bug to me, but I might have overlooked
something? This is my first attempt at atomic updates.

Thanks,
John.



Re: Pressed optimize and now SOLR is not indexing while optimize is going on

2015-10-07 Thread Shawn Heisey
On 10/7/2015 4:03 AM, Eric Torti wrote:
> I'm sorry to diverge this thread a little bit, but could you please point me to
> resources that explain in depth how the OS uses the non-Java memory to cache index data?
> memory to cache index data?
> 
>> Whatever RAM is left over after you give 12GB to Java for Solr will be
>> used automatically by the operating system to cache index data on the
>> disk.  Solr is completely reliant on that caching for good performance.
> 
> I'm puzzled as to why the physical memory of solr's host machine is always
> used up and I think some resources on that would help me understand it.

Toke's reply is excellent, and describes the situation from Lucene's
perspective.  Solr is a Lucene program, so the same information applies.

Here's more generic information on how the OS uses memory for caching
for most programs:

https://en.wikipedia.org/wiki/Page_cache

Note that some programs, like MySQL and Microsoft Exchange, skip the OS
cache and take care of caching internally.

Thanks,
Shawn



Re: Solr cross core join special condition

2015-10-07 Thread Ryan Josal
I developed a join transformer plugin that did that (although it didn't
flatten the results like that).  The one thing that was painful about it is
that the TextResponseWriter has references to both the IndexSchema and
SolrReturnFields objects for the primary core.  So when you add a
SolrDocument from another core it returned the wrong fields.  I worked
around that by transforming the SolrDocument to a NamedList.  Then when it
gets to processing the IndexableFields it uses the wrong IndexSchema, I
worked around that by transforming each field to a hard Java object
(through the IndexSchema and FieldType of the correct core).  I think it
would be great to patch TextResponseWriter with multi-core writing
abilities, but there is one question: how can it tell which core a
SolrDocument or IndexableField is from?  Seems we'd have to add an
attribute for that.

The other possibly simpler thing to do is execute the join at index time
with an update processor.

Ryan

On Tuesday, October 6, 2015, Mikhail Khludnev 
wrote:

> On Wed, Oct 7, 2015 at 7:05 AM, Ali Nazemian  > wrote:
>
> > it
> > seems there is not any way to do that right now and it should be
> developed
> > somehow. Am I right?
> >
>
> yep
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> >
>


Re: Pressed optimize and now SOLR is not indexing while optimize is going on

2015-10-07 Thread Eric Torti
Cool, Toke and Shawn!

That's exactly what I was looking for. I'll have a look at those resources
and if something is yet unclear I'll open a thread for it.

Thanks for the information,

Eric

On Wed, Oct 7, 2015 at 10:29 AM, Shawn Heisey  wrote:

> On 10/7/2015 4:03 AM, Eric Torti wrote:
> > I'm sorry to diverge this thread a little bit, but could you please point
> > me to resources that explain in depth how the OS uses the non-Java
> > memory to cache index data?
> >
> >> Whatever RAM is left over after you give 12GB to Java for Solr will be
> >> used automatically by the operating system to cache index data on the
> >> disk.  Solr is completely reliant on that caching for good performance.
> >
> > I'm puzzled as to why the physical memory of solr's host machine is
> always
> > used up and I think some resources on that would help me understand it.
>
> Toke's reply is excellent, and describes the situation from Lucene's
> perspective.  Solr is a Lucene program, so the same information applies.
>
> Here's more generic information on how the OS uses memory for caching
> for most programs:
>
> https://en.wikipedia.org/wiki/Page_cache
>
> Note that some programs, like MySQL and Microsoft Exchange, skip the OS
> cache and take care of caching internally.
>
> Thanks,
> Shawn
>
>


Re: Pressed optimize and now SOLR is not indexing while optimize is going on

2015-10-07 Thread Siddhartha Singh Sandhu
Thanks from my end too. And thanks for the question Eric that added a lot
to my understanding as well.

Regards.
Sid.

On Wed, Oct 7, 2015 at 10:04 AM, Eric Torti  wrote:

> Cool, Toke and Shawn!
>
> That's exactly what I was looking for. I'll have a look at those resources
> and if something is yet unclear I'll open a thread for it.
>
> Thanks for the information,
>
> Eric
>
> On Wed, Oct 7, 2015 at 10:29 AM, Shawn Heisey  wrote:
>
> > On 10/7/2015 4:03 AM, Eric Torti wrote:
> > > I'm sorry to diverge this thread a little bit, but could you please
> > > point me to resources that explain in depth how the OS uses the
> > > non-Java memory to cache index data?
> > >
> > >> Whatever RAM is left over after you give 12GB to Java for Solr will be
> > >> used automatically by the operating system to cache index data on the
> > >> disk.  Solr is completely reliant on that caching for good
> performance.
> > >
> > > I'm puzzled as to why the physical memory of solr's host machine is
> > always
> > > used up and I think some resources on that would help me understand it.
> >
> > Toke's reply is excellent, and describes the situation from Lucene's
> > perspective.  Solr is a Lucene program, so the same information applies.
> >
> > Here's more generic information on how the OS uses memory for caching
> > for most programs:
> >
> > https://en.wikipedia.org/wiki/Page_cache
> >
> > Note that some programs, like MySQL and Microsoft Exchange, skip the OS
> > cache and take care of caching internally.
> >
> > Thanks,
> > Shawn
> >
> >
>


Is solr.StandardDirectoryFactory an MMapDirectory?

2015-10-07 Thread Eric Torti
Hello,

I'm running a 5.2.1 SolrCloud cluster and I see that one of my cores
is configured under solrconfig.xml to use

<directoryFactory name="DirectoryFactory"
    class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
I'm just starting to grasp different strategies for Directory
implementation. Can I assume that solr.StandardDirectoryFactory is a
MMapDirectory as described by Uwe Schindler in this post about the use
of virtual memory?
[http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html]

Thanks!

Best,

Eric Torti


Re: Is solr.StandardDirectoryFactory an MMapDirectory?

2015-10-07 Thread Shawn Heisey
On 10/7/2015 8:48 AM, Eric Torti wrote:
> <directoryFactory name="DirectoryFactory"
>     class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
>
> I'm just starting to grasp different strategies for Directory
> implementation. Can I assume that solr.StandardDirectoryFactory is a
> MMapDirectory as described by Uwe Schindler in this post about the use
> of virtual memory?
> [http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html]

After a look at the code, I found that StandardDirectoryFactory should
use MMap if the OS and Java version support it.  If support isn't there,
it will use conventional file access methods.  As far as I know, all
64-bit Java versions and 64-bit operating systems will support MMap.

The factory you *should* be using is NRTCachingDirectoryFactory, and you
should enable the updateLog to ensure data reliability.

Thanks,
Shawn



Re: Pressed optimize and now SOLR is not indexing while optimize is going on

2015-10-07 Thread Walter Underwood
Unix has a “buffer cache”, often called a file cache. This chapter discusses 
the Linux buffer cache, which is very similar to other Unix implementations. 
Essentially, all unused RAM is used to make disk access faster.

http://www.tldp.org/LDP/sag/html/buffer-cache.html 


wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 7, 2015, at 3:40 AM, Toke Eskildsen  wrote:
> 
> On Wed, 2015-10-07 at 07:03 -0300, Eric Torti wrote:
>> I'm sorry to diverge this thread a little bit, but could you please point me to
>> resources that explain in depth how the OS uses the non-Java
>> memory to cache index data?
> 
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> 
> Shawn Heisey:
>>> Whatever RAM is left over after you give 12GB to Java for Solr will be
>>> used automatically by the operating system to cache index data on the
>>> disk.  Solr is completely reliant on that caching for good performance.
>> 
>> I'm puzzled as to why the physical memory of solr's host machine is always
>> used up and I think some resources on that would help me understand it.
> 
> It is not used up as such: Add "Disk cache" and "Free space" (or
> whatever your monitoring tool calls them) and you will have the amount
> of memory available for new processes. If you start a new and
> memory-hungry process, it will take the memory from the free pool first,
> then from the disk cache.
> 
> 
> - Toke Eskildsen, State and University Library, Denmark
> 
> 



Re: Solr cross core join special condition

2015-10-07 Thread Susheel Kumar
You may want to take a look at the new Solr feature of Streaming API &
Expressions, https://issues.apache.org/jira/browse/SOLR-7584?filter=12333278,
for making joins between collections.

On Wed, Oct 7, 2015 at 9:42 AM, Ryan Josal  wrote:

> I developed a join transformer plugin that did that (although it didn't
> flatten the results like that).  The one thing that was painful about it is
> that the TextResponseWriter has references to both the IndexSchema and
> SolrReturnFields objects for the primary core.  So when you add a
> SolrDocument from another core it returned the wrong fields.  I worked
> around that by transforming the SolrDocument to a NamedList.  Then when it
> gets to processing the IndexableFields it uses the wrong IndexSchema, I
> worked around that by transforming each field to a hard Java object
> (through the IndexSchema and FieldType of the correct core).  I think it
> would be great to patch TextResponseWriter with multi core writing
> abilities, but there is one question, how can it tell which core a
> SolrDocument or IndexableField is from?  Seems we'd have to add an
> attribute for that.
>
> The other possibly simpler thing to do is execute the join at index time
> with an update processor.
>
> Ryan
>
> On Tuesday, October 6, 2015, Mikhail Khludnev 
> wrote:
>
> > On Wed, Oct 7, 2015 at 7:05 AM, Ali Nazemian  > > wrote:
> >
> > > it
> > > seems there is not any way to do that right now and it should be
> > developed
> > > somehow. Am I right?
> > >
> >
> > yep
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > 
> > >
> >
>


Re: EdgeNGramFilterFactory question

2015-10-07 Thread Walter Underwood
You would need an analyzer or char filter factory that removed all spaces. But 
then you would only get one “edge”. That would make “to be or not to be” into 
the single token “tobeornottobe”. I don’t think that fixes anything.

Stemming and prefix matching do very different things. Use them in different 
analysis chains stored in separate fields.

The exact example you list will work fine with stemming and phrase search. 
Check out the phrase search support in the edismax query parser.
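
A minimal SolrJ sketch of that combination (my illustration; the field name
"title_stemmed" and its stemming analysis chain are assumptions):

import org.apache.solr.client.solrj.SolrQuery;

public class EdismaxPhraseQuery {
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery("looking for closest hotels");
        q.set("defType", "edismax");
        q.set("qf", "title_stemmed"); // fields to match individual terms in
        q.set("pf", "title_stemmed"); // fields that get a boost on phrase matches
        return q;
    }
}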

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 6, 2015, at 1:47 PM, vit  wrote:
> 
> I have Solr 4.2
> 
> 1) Is it possible to somehow use EdgeNGramFilterFactory ignoring white
> spaces in n-grams?
> 
> 2) Is it possible to use EdgeNGramFilterFactory in combination with stemming
> ?
>Say applying this to "look for close hotel" instead of "looking for
> closest hotels"
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/EdgeNGramFilterFactory-question-tp4233034.html
> Sent from the Solr - User mailing list archive at Nabble.com.



how to deployed another web project into jetty server(solr inbuilt)

2015-10-07 Thread Mugeesh Husain
I am using Solr-5.3 with the inbuilt Jetty server. Everything Solr-related is
working fine. My problem: I have a Spring-based web application for admin
configuration, and I don't have another server; I want to deploy it to the
Jetty server.

I have googled but still could not find a suitable answer.

Can we deploy another war file in this container? Yes or no?


Thanks.
Mugeesh
 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-deployed-another-web-project-into-jetty-server-solr-inbuilt-tp4233288.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to deployed another web project into jetty server(solr inbuilt)

2015-10-07 Thread Daniel Collins
The short answer is that technically it might be possible, but it's not a
supported configuration. As of Solr 5.x (I forget the exact version), the
use of Jetty is an implementation detail: you should treat Solr as a black
box; whether it uses Jetty or not is irrelevant, and not something you can
"piggy back" on.

It should be trivial to set up a small Jetty installation that deploys your
app; why can't you do that?


On 7 October 2015 at 17:37, Mugeesh Husain  wrote:

> I am using Solr-5.3 with the inbuilt Jetty server. Everything Solr-related is
> working fine. My problem: I have a Spring-based web application for admin
> configuration, and I don't have another server; I want to deploy it to the
> Jetty server.
>
> I have googled but still could not find a suitable answer.
>
> Can we deploy another war file in this container? Yes or no?
>
>
> Thanks.
> Mugeesh
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/how-to-deployed-another-web-project-into-jetty-server-solr-inbuilt-tp4233288.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Unexpected delayed document deletion with atomic updates

2015-10-07 Thread Erick Erickson
This certainly should not be happening. I'd
take a careful look at what you actually send.
My _guess_ is that you're not sending the update
command you think you are

As a test you could just curl (or use post.jar) to
send these types of commands up individually.

Perhaps looking at the solr log would help too...

Best,
Erick

On Wed, Oct 7, 2015 at 6:32 AM, John Smith  wrote:
> Hi,
>
> I'm bumping into the following problem with update XML messages. The idea
> is to record the number of clicks for a document: each time, a message
> is sent to .../update such as this one:
>
> <add>
>   <doc>
>     <field name="id">abc</field>
>     <field name="Clicks" update="inc">1</field>
>     <field name="Boost" update="set">1.05</field>
>   </doc>
> </add>
>
> (Clicks is an int field; Boost is a float field that is updated to reflect
> the change in popularity, using a formula based on the number of clicks.)
>
> At the moment in the dev environment, changes are committed immediately.
>
>
> When a document is updated, the changes are indeed reflected in the
> search results. If I click on the same document again, all goes well.
> But when I click on another document, the latter gets updated as
> expected, but the former is plainly deleted. It can no longer be found,
> and the admin core Overview page counts 1 document less. If I click on a
> 3rd document, so goes the 2nd one.
>
>
> The schema is the default one, amended to remove unneeded fields and add
> new ones, nothing fancy. All fields are stored="true" and there's no
> <copyField>. I've tried versions 5.2.1 & 5.3.1 in standalone mode, with
> the same outcome. It looks like a bug to me, but I might have overlooked
> something? This is my first attempt at atomic updates.
>
> Thanks,
> John.
>


Instant Page Previews

2015-10-07 Thread Lewin Joy (TMS)
Hi,

Is there anyway we can implement instant page previews in solr?
Just saw that Google Search Appliance has this out of the box.
Just like what google.com had previously. We need to display the content of the 
result record when hovering over the link.

Thanks,
Lewin






Re: Is solr.StandardDirectoryFactory an MMapDirectory?

2015-10-07 Thread Eric Torti
Thanks, Shawn.

> After a look at the code, I found that StandardDirectoryFactory should
> use MMap if the OS and Java version support it.  If support isn't there,
> it will use conventional file access methods.  As far as I know, all
> 64-bit Java versions and 64-bit operating systems will support MMap.

Considering our JVM is 64-bit, that probably explains why we're
experiencing MMapDirectory-like behaviour on our cluster (i.e. high
non-JVM-related memory use).

As to NRTCachingDirectoryFactory, when looking up the docs we were in
doubt about what it means to have a "highish reopen rate".

> public class NRTCachingDirectory

> This class is likely only useful in a near-real-time context, where indexing
> rate is lowish but reopen rate is highish, resulting in many tiny files
> being written.

Can we read "high reopen rate" as "frequent soft commits"? (In our
case, hard commits do not open a searcher. But soft commits do).

Considering it does mean "frequent soft commits", I'd say that it
doesn't fit our setup, because we have an index rate of about 10
updates/s and we perform a soft commit every 15 min. So our scenario
is not near real time in that sense. In light of this, do you think
using NRTCachingDirectory is still convenient?

Best,

Eric



On Wed, Oct 7, 2015 at 12:08 PM, Shawn Heisey  wrote:
> On 10/7/2015 8:48 AM, Eric Torti wrote:
>> <directoryFactory name="DirectoryFactory"
>>     class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
>>
>> I'm just starting to grasp different strategies for Directory
>> implementation. Can I assume that solr.StandardDirectoryFactory is a
>> MMapDirectory as described by Uwe Schindler in this post about the use
>> of virtual memory?
>> [http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html]
>
> After a look at the code, I found that StandardDirectoryFactory should
> use MMap if the OS and Java version support it.  If support isn't there,
> it will use conventional file access methods.  As far as I know, all
> 64-bit Java versions and 64-bit operating systems will support MMap.
>
> The factory you *should* be using is NRTCachingDirectoryFactory, and you
> should enable the updateLog to ensure data reliability.
>
> Thanks,
> Shawn
>


Re: Is solr.StandardDirectoryFactory an MMapDirectory?

2015-10-07 Thread Eric Torti
Correcting:

When I mentioned high non-JVM memory usage, what I probably meant was
high virtual memory allocation.

On Wed, Oct 7, 2015 at 3:00 PM, Eric Torti  wrote:
> Thanks, Shawn.
>
>> After a look at the code, I found that StandardDirectoryFactory should
>> use MMap if the OS and Java version support it.  If support isn't there,
>> it will use conventional file access methods.  As far as I know, all
>> 64-bit Java versions and 64-bit operating systems will support MMap.
>
> Considering our JVM is 64-bit, that probably explains why we're
> experiencing MMapDirectory-like behaviour on our cluster (i.e. high
> non-JVM-related memory use).
>
> As to NRTCachingDirectoryFactory, when looking up the docs we were in
> doubt about what it means to have a "highish reopen rate".
>
>> public class NRTCachingDirectory
>
>> This class is likely only useful in a near-real-time context, where indexing
>> rate is lowish but reopen rate is highish, resulting in many tiny files
>> being written.
>
> Can we read "high reopen rate" as "frequent soft commits"? (In our
> case, hard commits do not open a searcher. But soft commits do).
>
> Considering it does mean "frequent soft commits", I'd say that it
> doesn't fit our setup, because we have an index rate of about 10
> updates/s and we perform a soft commit every 15 min. So our scenario
> is not near real time in that sense. In light of this, do you think
> using NRTCachingDirectory is still convenient?
>
> Best,
>
> Eric
>
>
>
> On Wed, Oct 7, 2015 at 12:08 PM, Shawn Heisey  wrote:
>> On 10/7/2015 8:48 AM, Eric Torti wrote:
>>> <directoryFactory name="DirectoryFactory"
>>>     class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
>>>
>>> I'm just starting to grasp different strategies for Directory
>>> implementation. Can I assume that solr.StandardDirectoryFactory is a
>>> MMapDirectory as described by Uwe Schindler in this post about the use
>>> of virtual memory?
>>> [http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html]
>>
>> After a look at the code, I found that StandardDirectoryFactory should
>> use MMap if the OS and Java version support it.  If support isn't there,
>> it will use conventional file access methods.  As far as I know, all
>> 64-bit Java versions and 64-bit operating systems will support MMap.
>>
>> The factory you *should* be using is NRTCachingDirectoryFactory, and you
>> should enable the updateLog to ensure data reliability.
>>
>> Thanks,
>> Shawn
>>


Re: Instant Page Previews

2015-10-07 Thread Alexandre Rafalovitch
I don't think that particular functionality has anything directly to do
with Solr.

You will have a server component that indexes the web page (I am
guessing) into Solr. That same component can generate the preview image.
Your frontend UI will get the URL/id from Solr and display the related
image.

Solr will enable you to find those documents/links quickly, but the
rest of the pipeline is not something it gives you out of the box.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 7 October 2015 at 13:49, Lewin Joy (TMS)  wrote:
> Hi,
>
> Is there anyway we can implement instant page previews in solr?
> Just saw that Google Search Appliance has this out of the box.
> Just like what google.com had previously. We need to display the content of 
> the result record when hovering over the link.
>
> Thanks,
> Lewin
>
>
>
>


Lose Solr config on zookeeper when it is restarted

2015-10-07 Thread CrazyDiamond
Sometimes when ZooKeeper (single mode) is restarted, it loses the Solr
collections. Furthermore, when I manually upload the config again, no state.json
is created in the collection; a clusterstate.json is created instead.
I use Solr 5.1.0.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Lose-Solr-config-on-zookeeper-when-it-is-restarted-tp421.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Is solr.StandardDirectoryFactory an MMapDirectory?

2015-10-07 Thread Shawn Heisey
On 10/7/2015 12:00 PM, Eric Torti wrote:
> Can we read "high reopen rate" as "frequent soft commits"? (In our
> case, hard commits do not open a searcher. But soft commits do).
>
> Considering it does mean "frequent soft commits", I'd say that it
> doesn't fit our setup because we have an index rate of about 10
> updates/s and we perform a soft commit at each 15min. So our scenario
> is not near real time in that sense. In light of this, do you thing
> using NRTCachingDirectory is still convenient?

The NRT factory achieves high speed in NRT situations by flushing very
small updates to RAM instead of the disk.  As more updates come in,
older index segments sitting in RAM will eventually be flushed to disk,
so a sustained flood of updates doesn't really achieve a speed increase,
but a short burst of updates will be searchable *very* quickly.

NRTCachingDirectoryFactory was chosen for Solr examples (and I think
it's the Solr default) because it has no real performance downsides, but
has a strong possibility to be noticeably faster than the standard
factory in NRT situations.

The only problem with it is that small index segments from recent
updates might only exist in RAM, and not get flushed to disk, so they
would be lost if Solr dies or is killed suddenly.  This is part of why
the updateLog feature exists -- when Solr is started, the transaction
logs will be replayed, inserting/replacing (at a minimum) all documents
indexed since the last hard commit.  When the replay is finished, you
will not lose data.  This does require a defined uniqueKey to operate
correctly.
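
In Lucene terms, what the NRT factory does roughly amounts to the following
sketch (my illustration; the path is an assumption and the size thresholds are
merely illustrative -- Solr configures all of this for you):

import java.nio.file.Paths;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NRTCachingDirectory;

public class NrtDirectorySketch {
    public static Directory open() throws Exception {
        Directory base = new MMapDirectory(Paths.get("/var/solr/data/index"));
        // Segments smaller than 5MB are kept in RAM, up to 60MB total,
        // instead of being written straight to disk.
        return new NRTCachingDirectory(base, 5.0, 60.0);
    }
}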

Thanks,
Shawn



admin-extra

2015-10-07 Thread Upayavira
Do you use admin-extra within the admin UI?

If so, please go to [1] and document your use case. The feature
currently isn't implemented in the new admin UI, and without use-cases,
it likely won't be - so if you want it in there, please help us
understand how you use it!

Thanks!

Upayavira

[1] https://issues.apache.org/jira/browse/SOLR-8140


Re: Lose Solr config on zookeeper when it is restarted

2015-10-07 Thread Upayavira


On Wed, Oct 7, 2015, at 09:42 PM, CrazyDiamond wrote:
> Sometimes when ZooKeeper (single mode) is restarted, it loses the Solr
> collections. Furthermore, when I manually upload the config again, no
> state.json
> is created in the collection; a clusterstate.json is created instead.
> I use Solr 5.1.0.

How are you starting zookeeper? Embedded within Solr? Stand-alone?

If you are starting it embedded, check the solr/zoo_data directory -
that's where Zookeeper is writing its info. If that is getting lost
somehow, you could lose your collections/etc.

Upayavira


Re: Unexpected delayed document deletion with atomic updates

2015-10-07 Thread Upayavira
What ID are you using? Are you possibly using the same ID field for
both, so the second document you visit causes the first to be
overwritten?

Upayavira

On Wed, Oct 7, 2015, at 06:38 PM, Erick Erickson wrote:
> This certainly should not be happening. I'd
> take a careful look at what you actually send.
> My _guess_ is that you're not sending the update
> command you think you are
> 
> As a test you could just curl (or use post.jar) to
> send these types of commands up individually.
> 
> Perhaps looking at the solr log would help too...
> 
> Best,
> Erick
> 
> On Wed, Oct 7, 2015 at 6:32 AM, John Smith 
> wrote:
> > Hi,
> >
> > I'm bumping into the following problem with update XML messages. The idea
> > is to record the number of clicks for a document: each time, a message
> > is sent to .../update such as this one:
> >
> > <add>
> >   <doc>
> >     <field name="id">abc</field>
> >     <field name="Clicks" update="inc">1</field>
> >     <field name="Boost" update="set">1.05</field>
> >   </doc>
> > </add>
> >
> > (Clicks is an int field; Boost is a float field that is updated to reflect
> > the change in popularity, using a formula based on the number of clicks.)
> >
> > At the moment in the dev environment, changes are committed immediately.
> >
> >
> > When a document is updated, the changes are indeed reflected in the
> > search results. If I click on the same document again, all goes well.
> > But when I click on another document, the latter gets updated as
> > expected, but the former is plainly deleted. It can no longer be found,
> > and the admin core Overview page counts 1 document less. If I click on a
> > 3rd document, so goes the 2nd one.
> >
> >
> > The schema is the default one, amended to remove unneeded fields and add
> > new ones, nothing fancy. All fields are stored="true" and there's no
> > <copyField>. I've tried versions 5.2.1 & 5.3.1 in standalone mode, with
> > the same outcome. It looks like a bug to me, but I might have overlooked
> > something? This is my first attempt at atomic updates.
> >
> > Thanks,
> > John.
> >


Re: Numeric Sorting with 0 and NULL Values

2015-10-07 Thread Todd Long
Todd Long wrote
> I'm curious as to where the loss of precision would be when using
> "-(Double.MAX_VALUE)" as you mentioned? Also, any specific reason why you
> chose that over Double.MIN_VALUE (sorry, just making sure I'm not missing
> something)?

So, to answer my own question: it looks like Double.MIN_VALUE is somewhat
misleading (or poorly named, perhaps?)... from the javadoc, it states "A
constant holding the smallest positive nonzero value of type double". In
this case, the cast to int/long would result in 0 with the loss of precision,
which is definitely not what I want (and back to the original issue). It
would certainly seem that -Double.MAX_VALUE is the way to go! This is
something that I was not aware of with Double... thank you.


Chris Hostetter-3 wrote
> ...i mention this as being a workarround for floats/doubles because the 
> functions are evaluated as doubles (no "casting" or "forced integer 
> context" type support at the moment), so with integer/float fields there 
> would be some loss of precision.

I'm still curious of whether or not there would be any cast issue going from
double to int/long within the "def()" function. Any additional details would
be greatly appreciated.
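
For reference, a minimal SolrJ sketch of the workaround discussed in this
thread ("price_d" is a hypothetical double field; -1.7976931348623157E308 is
-Double.MAX_VALUE written out):

import org.apache.solr.client.solrj.SolrQuery;

public class MissingLastSort {
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery("*:*");
        // Documents with no value get -Double.MAX_VALUE from def(),
        // so they sort after every real value in a descending sort.
        q.addSort("def(price_d,-1.7976931348623157E308)", SolrQuery.ORDER.desc);
        return q;
    }
}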



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Numeric-Sorting-with-0-and-NULL-Values-tp4232654p4233361.html
Sent from the Solr - User mailing list archive at Nabble.com.


words n-gram analyser

2015-10-07 Thread vit
Does Solr 4.2 have an n-gram filter over words, not characters, like
EdgeNGramFilterFactory?

I hoped NGramTokenFilterFactory served this purpose, but it looks like it also
creates n-grams over characters.

I used it this way
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10"/>
in the hope that I would get 3-word to 10-word grams.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/words-n-gram-analyser-tp4233362.html
Sent from the Solr - User mailing list archive at Nabble.com.


Scramble data

2015-10-07 Thread Tarala, Magesh
Folks,
I have a strange question. We have a Solr implementation that we would like to
demo to external customers, but we don't want to display the real data, which
contains our customer information and so is sensitive. What's the best way
to scramble the data in the Solr query results? By best I mean the simplest way
with the least amount of work. BTW, we have a .NET front-end application.

Thanks,
Magesh





Re: Lose Solr config on zookeeper when it is restarted

2015-10-07 Thread CrazyDiamond
ZK is stand-alone. But I think the Solr node is ephemeral.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Lose-Solr-config-on-zookeeper-when-it-is-restarted-tp421p4233376.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Lose Solr config on zookeeper when it is restarted

2015-10-07 Thread Erick Erickson
Sounds like you're somehow mixing old and new versions of the ZK state
when you restart. I have no idea how that would be happening, but...

It's consistent with somehow creating collections in the new format,
where state.json is kept per collection, but then defaulting on restart
to the old format, where there was one gigantic clusterstate.json.

One deceptive thing is that with the new format, the clusterstate.json
node will exist but be empty, and underneath the collections node
there'll be a state.json for that collection.

Best,
Erick

On Wed, Oct 7, 2015 at 6:31 PM, CrazyDiamond  wrote:
> ZK is stand-alone. But I think the Solr node is ephemeral.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Lose-Solr-config-on-zookeeper-when-it-is-restarted-tp421p4233376.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Scramble data

2015-10-07 Thread Erick Erickson
Probably sanitize the data on the front end? Something simple like putting
"REDACTED" in all of the customer-sensitive fields.

You might also write a DocTransformer plugin; all you have to do is
subclass DocTransformer and override one very simple "transform" method.
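
A minimal sketch of such a transformer (my illustration: field names are
assumptions, and in Solr 5.x you would also need a TransformerFactory
registered in solrconfig.xml so it can be invoked from the fl parameter):

import org.apache.solr.common.SolrDocument;
import org.apache.solr.response.transform.DocTransformer;

// Sketch: blank out sensitive fields before they are written to the response.
public class RedactingTransformer extends DocTransformer {
    private static final String[] SENSITIVE = {"customer_name", "customer_email"};

    @Override
    public String getName() {
        return "redact";
    }

    @Override
    public void transform(SolrDocument doc, int docid) {
        for (String f : SENSITIVE) {
            if (doc.containsKey(f)) {
                doc.setField(f, "REDACTED");
            }
        }
    }
}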

Best,
Erick

On Wed, Oct 7, 2015 at 5:09 PM, Tarala, Magesh  wrote:
> Folks,
> I have a strange question. We have a Solr implementation that we would like 
> to demo to external customers. But we don't want to display the real data, 
> which contains our customer information and so is sensitive data. What's the 
> best way to scramble the data of the Solr Query results? By best I mean the 
> simplest way with least amount of work. BTW, we have a .NET front end 
> application.
>
> Thanks,
> Magesh
>
>
>


Re: words n-gram analyser

2015-10-07 Thread Erick Erickson
I think that ShingleFilterFactory is what you're looking for, see:
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory
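
A quick Lucene-level sketch of what it produces (my illustration; this uses a
recent Lucene API -- in the 4.2 line the tokenizer constructor additionally
takes a Version and a Reader):

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShingleDemo {
    public static void main(String[] args) throws Exception {
        WhitespaceTokenizer tok = new WhitespaceTokenizer();
        tok.setReader(new StringReader("one industrial park drive"));
        // Word-level n-grams ("shingles") of 2 to 3 words;
        // single words are also emitted by default.
        TokenStream ts = new ShingleFilter(tok, 2, 3);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term); // e.g. "one industrial", "one industrial park"
        }
        ts.end();
        ts.close();
    }
}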

Best,
Erick

On Wed, Oct 7, 2015 at 4:29 PM, vit  wrote:
> Does Solr 4.2 have an n-gram filter over words, not characters, like
> EdgeNGramFilterFactory?
>
> I hoped NGramTokenFilterFactory served this purpose, but it looks like it also
> creates n-grams over characters.
>
> I used it this way
> <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10"/>
> in the hope that I would get 3-word to 10-word grams.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/words-n-gram-analyser-tp4233362.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Run Solr 5.3.0 as a Service on Windows using NSSM

2015-10-07 Thread Zheng Lin Edwin Yeo
Hi Adrian and Upayavira,

It works fine when I start Solr outside NSSM.
As for NSSM, so far I haven't tried the automatic startup yet. I start
the services for ZooKeeper and Solr in NSSM manually from the Windows
Component Services, so ZooKeeper will have been started before I start
Solr.

I'll also try to write a script that checks Solr can access
ZooKeeper before attempting to start Solr.

Regards,
Edwin


On 7 October 2015 at 19:16, Upayavira  wrote:

> Wrap your script that starts Solr with one that checks it can access
> Zookeeper before attempting to start Solr; that way, once ZK starts,
> Solr will come up. Then hand *that* script to NSSM.
>
> And finally, when one of you has got a setup that works with NSSM
> starting Solr via the default bin\solr.cmd script, create a patch and
> upload it to JIRA. It would be a valuable thing for Solr to have a
> *standard* way to start Solr on Windows as a service. I recall checking
> the NSSM license and it wouldn't be an issue to include it within Solr -
> or to have a script that assumes it is installed.
>
> Upayavira
>
> On Wed, Oct 7, 2015, at 11:49 AM, Adrian Liew wrote:
> > Hi Edwin,
> >
> > You may want to try explore some of the configuration properties to
> > configure in zookeeper.
> >
> >
> http://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html#sc_zkMulitServerSetup
> >
> > My recommendation is to try running your batch files outside of NSSM so it
> > is easier to debug and observe what you see from the command window. I
> > don't think ZK and Solr startup can be automated well using NSSM, due to
> > the fact that the ZK services need to be running before you start up the
> > Solr services. I just had a conversation with Shawn on this topic. NSSM
> > cannot do the magic startup in a cluster setup. Instead, you may need to
> > write custom scripting to get it right.
> >
> > Back to your original issue, I guess it is worth exploring timeout
> > values. Then again, I will leave the real Solr experts to chip in their
> > thoughts.
> >
> > Best regards,
> >
> > Adrian Liew
> >
> >
> > -Original Message-
> > From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
> > Sent: Wednesday, October 7, 2015 1:40 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Run Solr 5.3.0 as a Service on Windows using NSSM
> >
> > Hi Adrian,
> >
> > I've waited for more than 5 minutes, and most of the time when I refresh
> > it says that the page cannot be found. Once or twice the main Admin
> > page loaded, but none of the cores were loaded.
> >
> > I have 20 cores which I'm loading. The cores are of various sizes; the
> > largest is 38GB. Others range from 10GB to 15GB, and there are some
> > which are less than 1GB.
> >
> > My overall core size is about 200GB.
> >
> > Regards,
> > Edwin
> >
> >
> > On 7 October 2015 at 12:11, Adrian Liew  wrote:
> >
> > > Hi Edwin,
> > >
> > > I have setup NSSM on Solr 5.3.0 in an Azure VM and can start up Solr
> > > with a base standalone installation.
> > >
> > > You may have to give Solr some time to bootstrap things and wait for
> > > the page to reload. Are you still seeing the error after a minute or so?
> > >
> > > What are your core sizes? And how many cores are you trying to load?
> > >
> > > Best regards,
> > > Adrian
> > >
> > > -Original Message-
> > > From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
> > > Sent: Wednesday, October 7, 2015 11:46 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Run Solr 5.3.0 as a Service on Windows using NSSM
> > >
> > > Hi,
> > >
> > > I tried to follow this to start my Solr as a service using NSSM.
> > > http://www.norconex.com/how-to-run-solr5-as-a-service-on-windows/
> > >
> > > Everything is fine when I start the services under Component Services.
> > > However, when I try to load the Solr Admin page, it says that
> > > the page cannot be found.
> > >
> > > I have tried the same thing in Solr 5.1, and it worked. Not sure
> > > why it doesn't work for Solr 5.2 and Solr 5.3.
> > >
> > > Is there any changes required to what is listed on the website?
> > >
> > > Regards,
> > > Edwin
> > >
>


tlog replay

2015-10-07 Thread Rallavagu

Solr 4.6.1, single shard, 4 node cloud, 3 node zk

I'd like to understand the behavior better when a large number of updates 
happen on the leader and a huge tlog (14G sometimes in my case) builds up 
on the other nodes. At the same time the leader's tlog is a few KB. So, 
what is the rate at which the changes from the transaction log are applied 
on the nodes? The autocommit interval is set to 15 seconds, after going through 
https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
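For reference, the hard commit setting in solrconfig.xml is along these
lines, with openSearcher=false as that article recommends:

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>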


Thanks


Re: tlog replay

2015-10-07 Thread Erick Erickson
Uhm, that's very weird. Updates are not applied from the tlog. Rather the
raw doc is forwarded to the replica which both indexes the doc and
writes it to the local tlog. So having a 14G tlog on a follower but a small
tlog on the leader is definitely strange, especially if it persists over time.

I assume the follower is healthy? And does this very large tlog disappear
after a while? I'd expect it to be aged out after a few commits of > 100 docs.

All that said, there have been a LOT of improvements since 4.6, so it might
be something that's been addressed in the intervening time.

Best,
Erick



On Wed, Oct 7, 2015 at 7:39 PM, Rallavagu  wrote:
> Solr 4.6.1, single shard, 4 node cloud, 3 node zk
>
> I'd like to understand the behavior better when a large number of updates
> happen on the leader and a huge tlog (14G sometimes in my case) builds up on
> the other nodes. At the same time the leader's tlog is a few KB. So, what is
> the rate at which the changes from the transaction log are applied on the
> nodes? The autocommit interval is set to 15 seconds, after going through
> https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> Thanks


Re: tlog replay

2015-10-07 Thread Rallavagu

Thanks Erick.

Eventually, the followers caught up and they are healthy, but the 14G tlog 
file still persists. Is there anything to look for? I will monitor and see 
how long it takes before it disappears.


Evaluating move to Solr 5.3.

On 10/7/15 7:51 PM, Erick Erickson wrote:

Uhm, that's very weird. Updates are not applied from the tlog. Rather the
raw doc is forwarded to the replica which both indexes the doc and
writes it to the local tlog. So having a 14G tlog on a follower but a small
tlog on the leader is definitely strange, especially if it persists over time.

I assume the follower is healthy? And does this very large tlog disappear
after a while? I'd expect it to be aged out after a few commits of > 100 docs.

All that said, there have been a LOT of improvements since 4.6, so it might
be something that's been addressed in the intervening time.

Best,
Erick



On Wed, Oct 7, 2015 at 7:39 PM, Rallavagu  wrote:

Solr 4.6.1, single shard, 4 node cloud, 3 node zk

I'd like to understand the behavior better when a large number of updates
happen on the leader and a huge tlog (14G sometimes in my case) builds up on
the other nodes. At the same time the leader's tlog is a few KB. So, what is
the rate at which the changes from the transaction log are applied on the
nodes? The autocommit interval is set to 15 seconds, after going through
https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Thanks


Re: tlog replay

2015-10-07 Thread Erick Erickson
The only way I can account for such a large file off the top of my
head is if, for some reason, the Solr on the node was somehow failing
to index documents and kept adding them to the log for a looong time.
But how that would happen without the node being in recovery mode I'm
not sure. I mean, the Solr instance would have to be otherwise healthy
but just not able to index docs, which makes no sense.

The usual question here is whether there were any messages in the solr
log file indicating
problems while this built up.

tlogs will build up to very large sizes if there are very long hard
commit intervals, but I don't
see how that interval would be different on the leader and follower.

So color me puzzled.

Best,
Erick

On Wed, Oct 7, 2015 at 8:09 PM, Rallavagu  wrote:
> Thanks Erick.
>
> Eventually, the followers caught up and they are healthy, but the 14G tlog
> file still persists. Is there anything to look for? I will monitor and see
> how long it takes before it disappears.
>
> Evaluating move to Solr 5.3.
>
> On 10/7/15 7:51 PM, Erick Erickson wrote:
>>
>> Uhm, that's very weird. Updates are not applied from the tlog. Rather the
>> raw doc is forwarded to the replica which both indexes the doc and
>> writes it to the local tlog. So having a 14G tlog on a follower but a
>> small
>> tlog on the leader is definitely strange, especially if it persists over
>> time.
>>
>> I assume the follower is healthy? And does this very large tlog disappear
>> after a while? I'd expect it to be aged out after a few commits of > 100
>> docs.
>>
>> All that said, there have been a LOT of improvements since 4.6, so it
>> might
>> be something that's been addressed in the intervening time.
>>
>> Best,
>> Erick
>>
>>
>>
>> On Wed, Oct 7, 2015 at 7:39 PM, Rallavagu  wrote:
>>>
>>> Solr 4.6.1, single shard, 4 node cloud, 3 node zk
>>>
>>> I'd like to understand the behavior better when a large number of
>>> updates happen on the leader and a huge tlog (14G sometimes in my case)
>>> builds up on the other nodes. At the same time the leader's tlog is a
>>> few KB. So, what is the rate at which the changes from the transaction
>>> log are applied on the nodes? The autocommit interval is set to 15
>>> seconds, after going through
>>>
>>> https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>>
>>> Thanks


Exclude documents having same data in two fields

2015-10-07 Thread Aman Tandon
Hi,

Is there a way in Solr to remove from the search results all those
documents in which two of the fields, *mapping* and *title*, are exactly
the same?

With Regards
Aman Tandon


Re: Unexpected delayed document deletion with atomic updates

2015-10-07 Thread John Smith
The ids are all different: they're unique numbers followed by a couple
of keywords. I've made a test with a small collection of 10 documents to
make sure I can manage them manually: all ids are confirmed as different.

I also dumped the exact command, here's one example:

<add>
  <doc>
    <field name="id">101084385_Sebago_ sebago shoes</field>
    <field name="Clicks" update="set">1</field>
    <field name="Boost" update="set">1.8701925463775</field>
  </doc>
</add>

It's sent as the body of a POST request to
http://127.0.0.1:8080/solr/ato_test/update?wt=json&commit=true, with a
Content-Type: text/xml header. I still noted the consistent loss of
another document with the update above.
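In curl terms the request is roughly this, with update.xml standing in
for the body above:

curl 'http://127.0.0.1:8080/solr/ato_test/update?wt=json&commit=true' \
  -H 'Content-Type: text/xml' --data-binary @update.xml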

John


On 08/10/15 00:38, Upayavira wrote:
> What ID are you using? Are you possibly using the same ID field for
> both, so the second document you visit causes the first to be
> overwritten?
>
> Upayavira
>
> On Wed, Oct 7, 2015, at 06:38 PM, Erick Erickson wrote:
>> This certainly should not be happening. I'd
>> take a careful look at what you actually send.
>> My _guess_ is that you're not sending the update
>> command you think you are
>>
>> As a test you could just curl (or use post.jar) to
>> send these types of commands up individually.
>>
>> Perhaps looking at the solr log would help too...
>>
>> Best,
>> Erick
>>
>> On Wed, Oct 7, 2015 at 6:32 AM, John Smith 
>> wrote:
>>> Hi,
>>>
>>> I'm bumping on the following problem with update XML messages. The idea
>>> is to record the number of clicks for a document: each time, a message
>>> is sent to .../update such as this one:
>>>
>>> <add>
>>>   <doc>
>>>     <field name="id">abc</field>
>>>     <field name="Clicks" update="set">1</field>
>>>     <field name="Boost" update="set">1.05</field>
>>>   </doc>
>>> </add>
>>>
>>> (Clicks is an int field; Boost is a float field, it's updated to reflect
>>> the change in popularity using a formula based on the number of clicks).
>>>
>>> At the moment in the dev environment, changes are committed immediately.
>>>
>>>
>>> When a document is updated, the changes are indeed reflected in the
>>> search results. If I click on the same document again, all goes well.
>>> But  when I click on an other document, the latter gets updated as
>>> expected but the former is plainly deleted. It can no longer be found
>>> and the admin core Overview page counts 1 document less. If I click on a
>>> 3rd document, so goes the 2nd one.
>>>
>>>
>>> The schema is the default one amended to remove unneeded fields and add
>>> new ones, nothing fancy. All fields are stored="true" and there's no
>>> . I've tried versions 5.2.1 & 5.3.1 in standalone mode, with
>>> the same outcome. It looks like a bug to me but I might have overlooked
>>> something? This is my first attempt at atomic updates.
>>>
>>> Thanks,
>>> John.
>>>



Re: Exclude documents having same data in two fields

2015-10-07 Thread NutchDev
One option could be to create another boolean field, field1_equals_field2,
and set it to true at index time for documents where the two fields match.
Use this field as a filter criterion when querying Solr.
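For example (type name assumed), a schema field like

<field name="field1_equals_field2" type="boolean" indexed="true" stored="false"/>

populated at index time can then be used to exclude the matches:

fq=-field1_equals_field2:true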



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Exclude-documents-having-same-data-in-two-fields-tp4233408p4233411.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Fuzzy search for names and phrases

2015-10-07 Thread NutchDev
WordDelimiterFilterFactory can handle cases like,

wi-fi ==> wifi
SD500 ==> sd 500
PowerShot ==> Power Shot

you can get more information at wiki page here,
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
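A sketch covering the cases above, with flags as documented on that wiki
page:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1"
        splitOnCaseChange="1" splitOnNumerics="1" preserveOriginal="1"/>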



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Fuzzy-search-for-names-and-phrases-tp4233209p4233413.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unexpected delayed document deletion with atomic updates

2015-10-07 Thread John Smith
Oh, I forgot Erick's mention of the logs: there's nothing unusual at
INFO level, the update request just gets mentioned. No exception. I
reran it at DEBUG level, but most of the log was related to Jetty.
Here's a line I noticed though:

org.apache.solr.servlet.HttpSolrCall; Closing out SolrRequest:
{wt=json&commit=true&update.chain=dedupe}

The update.chain parameter wasn't part of the original request, and
"dedupe" looks suspicious to me. Perhaps should I investigate further there?

Thanks,
John.


On 08/10/15 08:25, John Smith wrote:
> The ids are all different: they're unique numbers followed by a couple
> of keywords. I've made a test with a small collection of 10 documents to
> make sure I can manage them manually: all ids are confirmed as different.
>
> I also dumped the exact command, here's one example:
>
> <add><doc><field name="id">101084385_Sebago_ sebago shoes</field><field
> name="Clicks" update="set">1</field><field name="Boost"
> update="set">1.8701925463775</field></doc></add>
>
> It's sent as the body of a POST request to
> http://127.0.0.1:8080/solr/ato_test/update?wt=json&commit=true, with a
> Content-Type: text/xml header. I still noted the consistent loss of
> another document with the update above.
>
> John
>
>
> On 08/10/15 00:38, Upayavira wrote:
>> What ID are you using? Are you possibly using the same ID field for
>> both, so the second document you visit causes the first to be
>> overwritten?
>>
>> Upayavira
>>
>> On Wed, Oct 7, 2015, at 06:38 PM, Erick Erickson wrote:
>>> This certainly should not be happening. I'd
>>> take a careful look at what you actually send.
>>> My _guess_ is that you're not sending the update
>>> command you think you are
>>>
>>> As a test you could just curl (or use post.jar) to
>>> send these types of commands up individually.
>>>
>>> Perhaps looking at the solr log would help too...
>>>
>>> Best,
>>> Erick
>>>
>>> On Wed, Oct 7, 2015 at 6:32 AM, John Smith 
>>> wrote:
 Hi,

 I'm bumping on the following problem with update XML messages. The idea
 is to record the number of clicks for a document: each time, a message
 is sent to .../update such as this one:

 <add>
 <doc>
 <field name="id">abc</field>
 <field name="Clicks" update="set">1</field>
 <field name="Boost" update="set">1.05</field>
 </doc>
 </add>

 (Clicks is an int field; Boost is a float field, it's updated to reflect
 the change in popularity using a formula based on the number of clicks).

 At the moment in the dev environment, changes are committed immediately.


 When a document is updated, the changes are indeed reflected in the
 search results. If I click on the same document again, all goes well.
 But when I click on another document, the latter gets updated as
 expected but the former is plainly deleted. It can no longer be found
 and the admin core Overview page counts 1 document less. If I click on a
 3rd document, so goes the 2nd one.


 The schema is the default one amended to remove unneeded fields and add
 new ones, nothing fancy. All fields are stored="true" and there's no
 . I've tried versions 5.2.1 & 5.3.1 in standalone mode, with
 the same outcome. It looks like a bug to me but I might have overlooked
 something? This is my first attempt at atomic updates.

 Thanks,
 John.

>



Re: Lose Solr config on zookeeper when it is restarted

2015-10-07 Thread Upayavira
Are all instances of Solr the same version? Mixing versions could cause
what Erick describes.

On Thu, Oct 8, 2015, at 03:19 AM, Erick Erickson wrote:
> Sounds like you're somehow mixing old and new versions of the ZK state
> when you restart. I have no idea how that would be happening, but...
> 
> It would be consistent with creating collections in the new format,
> where state.json is kept per collection, but somehow defaulting on
> restart to the old format, where there was one gigantic
> clusterstate.json.
> 
> One deceptive thing is that with the new format, the clusterstate.json
> node will exist but be empty, and underneath the collections node
> there'll be a state.json for that collection.
> 
> Best,
> Erick
> 
> On Wed, Oct 7, 2015 at 6:31 PM, CrazyDiamond 
> wrote:
> > ZK is stand-alone. But I think the Solr node is ephemeral.
> >
> >
> >
> > --
> > View this message in context: 
> > http://lucene.472066.n3.nabble.com/Lose-Solr-config-on-zookeeper-when-it-is-restarted-tp421p4233376.html
> > Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unexpected delayed document deletion with atomic updates

2015-10-07 Thread Upayavira
Look for the DedupUpdateProcessor in an update chain.

That is there, but commented out IIRC in the techproducts sample
configs.

Perhaps you uncommented it to use your own update processors, but didn't
remove that component?
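From memory, the commented-out sample chain looks something like this
(details may differ by version):

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Worth checking overwriteDupes in your copy: if it is true and the
signature fields aren't present in your atomic updates, every update
computes the same signature, which would make each update delete the
previous document -- consistent with what you're seeing.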

On Thu, Oct 8, 2015, at 07:38 AM, John Smith wrote:
> Oh, I forgot Erick's mention of the logs: there's nothing unusual at
> INFO level, the update request just gets mentioned. No exception. I
> reran it at DEBUG level, but most of the log was related to Jetty.
> Here's a line I noticed though:
> 
> org.apache.solr.servlet.HttpSolrCall; Closing out SolrRequest:
> {wt=json&commit=true&update.chain=dedupe}
> 
> The update.chain parameter wasn't part of the original request, and
> "dedupe" looks suspicious to me. Perhaps should I investigate further
> there?
> 
> Thanks,
> John.
> 
> 
> On 08/10/15 08:25, John Smith wrote:
> > The ids are all different: they're unique numbers followed by a couple
> > of keywords. I've made a test with a small collection of 10 documents to
> > make sure I can manage them manually: all ids are confirmed as different.
> >
> > I also dumped the exact command, here's one example:
> >
> > <add><doc><field name="id">101084385_Sebago_ sebago shoes</field><field
> > name="Clicks" update="set">1</field><field name="Boost"
> > update="set">1.8701925463775</field></doc></add>
> >
> > It's sent as the body of a POST request to
> > http://127.0.0.1:8080/solr/ato_test/update?wt=json&commit=true, with a
> > Content-Type: text/xml header. I still noted the consistent loss of
> > another document with the update above.
> >
> > John
> >
> >
> > On 08/10/15 00:38, Upayavira wrote:
> >> What ID are you using? Are you possibly using the same ID field for
> >> both, so the second document you visit causes the first to be
> >> overwritten?
> >>
> >> Upayavira
> >>
> >> On Wed, Oct 7, 2015, at 06:38 PM, Erick Erickson wrote:
> >>> This certainly should not be happening. I'd
> >>> take a careful look at what you actually send.
> >>> My _guess_ is that you're not sending the update
> >>> command you think you are
> >>>
> >>> As a test you could just curl (or use post.jar) to
> >>> send these types of commands up individually.
> >>>
> >>> Perhaps looking at the solr log would help too...
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Wed, Oct 7, 2015 at 6:32 AM, John Smith 
> >>> wrote:
>  Hi,
> 
>  I'm bumping on the following problem with update XML messages. The idea
>  is to record the number of clicks for a document: each time, a message
>  is sent to .../update such as this one:
> 
>  <add>
>  <doc>
>  <field name="id">abc</field>
>  <field name="Clicks" update="set">1</field>
>  <field name="Boost" update="set">1.05</field>
>  </doc>
>  </add>
> 
>  (Clicks is an int field; Boost is a float field, it's updated to reflect
>  the change in popularity using a formula based on the number of clicks).
> 
>  At the moment in the dev environment, changes are committed immediately.
> 
> 
>  When a document is updated, the changes are indeed reflected in the
>  search results. If I click on the same document again, all goes well.
>  But when I click on another document, the latter gets updated as
>  expected but the former is plainly deleted. It can no longer be found
>  and the admin core Overview page counts 1 document less. If I click on a
>  3rd document, so goes the 2nd one.
> 
> 
>  The schema is the default one amended to remove unneeded fields and add
>  new ones, nothing fancy. All fields are stored="true" and there's no
>  . I've tried versions 5.2.1 & 5.3.1 in standalone mode, with
>  the same outcome. It looks like a bug to me but I might have overlooked
>  something? This is my first attempt at atomic updates.
> 
>  Thanks,
>  John.
> 
> >
>