Changing existing index to use block-join

2014-01-18 Thread dev

Hello,


I read about the possibility of having nested documents with Solr block-join since version 4.5.


I'm wondering if I can change an existing index to use this new feature. Right now I have an index which stores information about a journal and each of its articles. For example, a journal has the id - and the articles have ids like --01, --02, ….
I'm also already using a field called j-id in all documents to refer to the id of the journal (so all articles of the journal in the given example have the j-id -). I'm using this j-id to group all results of a journal with the grouping feature. Obviously this solution lacks some features, like faceting or finding the parent journal of an article without doing a second request.


So, the new block-join feature seems to solve some of these problems (sadly not all; as far as I can see, I can't get the parent document and the articles where the search term was found in a nested result).
So, my question: can I change my existing index by just adding an is_parent and a _root_ field and saving the journal id there like I did with j-id, or do I have to reindex all my documents?


I made some tests by adding the id of the parent journal to the _root_ field of the articles and trying a query like q={!parent which='is_parent:true'}+description:test, but it didn't seem to work. I only got an error message:


java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=1418849 (got docID=-1)
	at org.apache.lucene.index.BaseCompositeReader.readerIndex(BaseCompositeReader.java:182)
	at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:109)
	at org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
	at org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:657)
	at org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:270)
	at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:172)
	at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)
	at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)
	at org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)
	at org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:60)
	at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:698)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:426)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
	at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:929)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
	at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1002)
	at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:585)
	at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:724)

"code": 500


Do you have any advice on how to fix this or how to use block-join properly?

Thanks,
Gesh



Re: Changing existing index to use block-join

2014-01-20 Thread dev


Quoting Mikhail Khludnev:


On Sat, Jan 18, 2014 at 11:25 PM,  wrote:


So, my question now: can I change my existing index in just adding a
is_parent and a _root_ field and saving the journal id there like I did
with j-id or do I have to reindex all my documents?



Absolutely; to use block-join you need to index nested documents as blocks, as described at
http://blog.griddynamics.com/2013/09/solr-block-join-support.html, e.g.
https://gist.github.com/mkhludnev/6406734#file-t-shirts-xml
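
In update-XML terms, a block is simply the parent <doc> with its child <doc> elements nested inside it, all sent in a single update. A minimal sketch (ids and field names here are only illustrative):

<add>
  <doc>
    <field name="id">journal-1</field>
    <field name="is_parent">true</field>
    <doc>
      <field name="id">journal-1-article-01</field>
      <field name="description">first article</field>
    </doc>
    <doc>
      <field name="id">journal-1-article-02</field>
      <field name="description">second article</field>
    </doc>
  </doc>
</add>

The whole block has to be re-sent whenever any document in it changes.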



Thank you for the clarification.
But is there no way to add new children without indexing the parent document and all existing children again?


So, in the example on GitHub, if I want to add new sizes and colors to an existing T-shirt, I have to reindex the already existing T-shirt and all its variations again?


I understand that the blocks are created at index time, so I can't change an existing index to build blocks just by adding the _root_ field, but I don't understand why it's not possible to add new children. Or did I misinterpret your statement?


Thanks,
-Gesh



Searching and scoring with block join

2014-01-22 Thread dev

Hello again,

I'm using the Solr block-join feature to index a journal and all of its articles.

Here is a short example:



<add>
  <!-- markup was lost in the archive; field names other than id, title and is_parent are assumed -->
  <doc>
    <field name="id">527fcbf8-c140-4ae6-8f51-68cd2efc1343</field>
    <field name="title">Sozialmagazin</field>
    <field name="issue">8</field>
    <field name="year">2008</field>
    <field name="issn">0340-8469</field>
    ...
    <field name="publisher">juventa</field>
    ...
    <field name="is_parent">true</field>
    <doc>
      <field name="id">527fcb34-4570-4a86-b9e7-68cd2efc1343</field>
      <field name="title">A World out of Balance</field>
      <field name="page">62</field>
      <field name="author">Amthor</field>
      ...
      ...
    </doc>
    <doc>
      <field name="id">527fcbf8-84ec-424f-9d58-68cd2efc1343</field>
      <field name="title">Die Philosophie des Helfens</field>
      <field name="page">50</field>
      <field name="author">Keck</field>
      ...
      ...
    </doc>
  </doc>
</add>


I read about the search syntax in this article:
http://blog.griddynamics.com/2013/09/solr-block-join-support.html
Yet I'm wondering how to use it properly. If I want to make a "fulltext" search over all journals and their articles and get the journals with the highest score as the result, what should my query look like?
I know that I can't just make a query like {!parent which=is_parent:true}+Term; most likely I'll get this error: child query must only match non-parent docs, but parent docID= matched childScorer=class org.apache.lucene.search.TermScorer


So, how do I make a query that searches in both journals and articles and gives me the journals ordered by their score? How do I get the score of the child documents added to the score of the parent document?


Thank you for your help.

- Gesh




Re: Searching and scoring with block join

2014-01-22 Thread dev


Quoting Mikhail Khludnev:


On Wed, Jan 22, 2014 at 10:17 PM,  wrote:


I know that I can't just make a query like this: {!parent
which=is_parent:true}+Term, most likely I'll get this error: child query
must only match non-parent docs, but parent docID= matched
childScorer=class org.apache.lucene.search.TermScorer



Hello Gesh,

As it states there, the child clause should not match any parent docs, but the query +Term matches them because it applies some default field which, I believe, belongs to parent docs.

That blog has an example of searching across both 'scopes':
q=+BRAND_s:Nike +_query_:"{!parent which=type_s:parent}+COLOR_s:Red +SIZE_s:XL"
Mind the exact fields specified for both scopes. In your case you need to switch from conjunction '+' to disjunction.



Hello Mikhail,

Yes, that's correct.

I also already tried the query you gave as an example, but I have problems with the scoring.
I'm using edismax as defType, but I'm not quite sure how to use it with a {!parent} query.


For example, if I do this query, the score is always 0:
{!parent which=is_parent:true}+content_de:Test

The blog says: ToParentBlockJoinQuery supports a few modes of score calculation; the {!parent} parser has the None mode hardcoded.
So, can I change the hardcoded mode somehow? I didn't find any further documentation about the parameters of {!parent}.


If I do this request, the score seems to be calculated only from the matches found in "title":

title:Test _query_:"{!parent which=is_parent:true}+content_de:Test"

Sorry if I ask stupid questions, but I have just started to work with Solr and some techniques are not very familiar to me yet.


Thanks
-Gesh



Re: Searching and scoring with block join

2014-01-24 Thread dev


Quoting Mikhail Khludnev:


Nesting query parsers is shown at
http://blog.griddynamics.com/2013/12/grandchildren-and-siblings-with-block.html

Try starting from the following:
title:Test _query_:"{!parent which=is_parent:true}{!dismax qf=content_de}Test"
Also mind local-params referencing, e.g. {!... v=$nest}&nest=...
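
Spelled out for the journal/article fields used earlier, the referencing form would look roughly like this (the parameter name childq is arbitrary):

q=title:Test _query_:"{!parent which=is_parent:true v=$childq}"&childq={!dismax qf=content_de}Test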


Thank you for the hint.
I don't really see how {!dismax ...} and local-parameter referencing solve my problem.
I read your blog entry, but I have some trouble understanding how to apply your explanations.
Would you mind giving me a short example of how these query params help me to get a proper result with a combined score for parent and children?


Thank you very much.


There is no such param in
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/join/BlockJoinParentQParser.java#L67
Raise a feature request issue at least, and don't hesitate to contribute.


Ah, okay, it was a misunderstanding then.
I created an issue: https://issues.apache.org/jira/browse/SOLR-5662






Thanks
-Gesh



Indexing and searching documents in different languages

2013-04-09 Thread dev


Hello,

I'm trying to index a large number of documents in different languages.
I don't know the language of the document, so I'm using  
TikaLanguageIdentifierUpdateProcessorFactory to identify it.


So, this is my configuration in solrconfig.xml:

  <updateRequestProcessorChain name="langid">
    <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
      <!-- the parameter names below were lost in the archive and are restored from the
           standard langid configuration; only the values are original -->
      <bool name="langid">true</bool>
      <str name="langid.fl">title,subtitle,content</str>
      <str name="langid.langField">language_s</str>
      <str name="langid.threshold">0.3</str>
      <str name="langid.fallback">general</str>
      <str name="langid.whitelist">en,fr,de,it,es</str>
      <bool name="langid.map">true</bool>
      <bool name="langid.map.individual">true</bool>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

So, the detection works fine and I put some dynamic fields in  
schema.xml to store the results:
  <dynamicField name="*_en" type="text_en" indexed="true" stored="true" multiValued="true"/>
  <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" multiValued="true"/>
  <dynamicField name="*_de" type="text_de" indexed="true" stored="true" multiValued="true"/>
  <dynamicField name="*_it" type="text_it" indexed="true" stored="true" multiValued="true"/>
  <dynamicField name="*_es" type="text_es" indexed="true" stored="true" multiValued="true"/>
  <!-- the field names and types above are assumed; only the trailing attributes survived in the archive -->


My main problem now is how to search the documents without knowing the language of the searched document.
I don't want to have a huge query string like ?q=title_en:+term+subtitle_en:+term+title_de:+term...
Okay, I could use copyField and copy all fields into the "text" field, but "text" has the type text_general, so the language-specific indexing is not applied. I could at least use a combined field per language (like text_en, text_fr...), but still, my query string gets very long and adding new languages is terribly uncomfortable.


So, what can I do? Is there a better solution to index and search documents in many languages without knowing the language of the document and the query beforehand?


- Geschan



Re: Indexing and searching documents in different languages

2013-04-10 Thread dev

Thx, I'll try this approach.

Quoting Alexandre Rafalovitch:


Have you looked at edismax and the 'qf' fields parameter? It allows you to define the fields to search. Also, you can define those parameters in solrconfig.xml and not have to send them over the wire.

Finally, you can define several different request handlers (e.g. /ensearch, /frsearch) and have each of them use different 'qf' values, possibly with 'fl' also defined and with field-name aliasing from language-specific to generic names.
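
A minimal sketch of such a handler, assuming per-language fields like title_en and content_en (all names here are illustrative):

<requestHandler name="/ensearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title_en subtitle_en content_en</str>
    <str name="fl">id,score,title:title_en,subtitle:subtitle_en</str>
  </lst>
</requestHandler>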

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)










Very bad search performance with group=true

2013-06-11 Thread dev

Hi,

I'm indexing PDF documents to use full-text search with Solr.
To get the number of the page where a result was found, I index every page separately and group the results with a field called doc_id.
(See this thread: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3c1362242815.4092.140661199082425.338ed...@webmail.messagingengine.com%3E )


This works fine if I search within a single document, but if I search over the whole database for a term, the results are really slow, especially if group.limit is above 10. I have indexed about 150,000 pages so far, but in the end it will be more than 1,000,000 pages.


How can I improve search performance?

I'm using this configuration:

  <requestHandler name="/select" class="solr.SearchHandler">
    <!-- parameter names were lost in the archive; the names below are reconstructed from the
         stock example solrconfig.xml and from context; only the values are original -->
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="wt">json</str>
      <str name="indent">true</str>
      <str name="df">text</str>

      <str name="defType">edismax</str>
      <str name="qf">
        id^10.0 ean^10.0
        title^10.0 subtitle^10.0 original_title^5.0
        content^3.0
        content_en^3.0
        content_fr^3.0
        content_de^3.0
        content_it^3.0
        content_es^3.0
        keyword^5.0 text^0.5
        author^2.0 editor^1.0
        publisher^3.0 category^1.0 series^5.0 information^1.0
      </str>
      <str name="mm">100%</str>
      <str name="q.alt">*:*</str>
      <int name="rows">10</int>
      <str name="fl">id, title, subtitle, original_title, author, editor, publisher, category, series, score</str>

      <bool name="group">true</bool>
      <str name="group.field">doc_id</str>
      <int name="group.limit">20</int>
      <!-- the names of the remaining parameters were lost; their values were
           true, content_*, content, true (presumably group.ngroups plus the highlighting setup) -->
    </lst>
  </requestHandler>


Thanks for your help.

- Gesh



Get page number of searchresult of a pdf in solr

2013-02-28 Thread dev

Hello,

I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display the search results with a short snippet of the paragraph where the search term was found and a link that opens the document at the right page.


So what I need is the page number and a short text snippet of every  
search result.


I'm using Solr 4.1 for indexing PDF documents. The indexing itself works fine, but I don't know how to get the page number and paragraph of a search result. I only get the document the search term was found in.


-Gesh



Re: Get page number of searchresult of a pdf in solr

2013-03-01 Thread dev
Is it possible to write a plugin that converts each page separately with Tika and saves all pages in one document (maybe in a dynamic field like "page_*")? I would like to have only one document stored in Solr for each PDF (it fits better with the way my web application manages these documents, and I would like to use the same id as the unique identifier).



To be honest, I can't understand why Solr is not able to find the pages where the search term was found. It's a quite common task, in my opinion.


-Gesh

Quoting Michael Della Bitta:


My guess is the best way to do this is to index each page separately
and to store a link to the PDF/page in each document.

That would probably require you to preprocess the PDFs to turn each
one into a single page per PDF, or to extract the text per page
another way.
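
For illustration, the per-page documents could look something like this (all field names here are assumptions):

<add>
  <doc>
    <field name="id">mydoc.pdf-page-1</field>
    <field name="doc_id">mydoc.pdf</field>
    <field name="page">1</field>
    <field name="content">text extracted from page 1 ...</field>
  </doc>
  <doc>
    <field name="id">mydoc.pdf-page-2</field>
    <field name="doc_id">mydoc.pdf</field>
    <field name="page">2</field>
    <field name="content">text extracted from page 2 ...</field>
  </doc>
</add>

A search can then return the page field directly, and grouping on doc_id collapses the pages back to one result per PDF.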

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn't a Game









Help me understand these newrelic graphs

2014-03-13 Thread Software Dev
Here are some screen shots of our Solr Cloud cluster via Newrelic

http://postimg.org/gallery/2hyzyeyc/

We currently have a 5-node cluster and all indexing is done on separate machines and shipped over. Our machines are running on SSDs with 18G of RAM (index size is 8G). We only have 1 shard at the moment with replicas on all 5 machines. I'm guessing that's a bit of a waste?

How come when we do our bulk updating the response time actually decreases? I would think the load would be higher, therefore the response time should be higher. Is there any way I can decrease the response time?

Thanks


Re: Help me understand these newrelic graphs

2014-03-13 Thread Software Dev
Ahh.. it's including the add operation. That makes sense then. A bit silly on NR's part that they don't break it down.

Otis, our index is only 8G, so I don't consider that big by any means, but our queries can get a bit complex with a bit of faceting. Do you still think it makes sense to shard? How easy would this be to get working?


On Thu, Mar 13, 2014 at 4:02 PM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Hi,
>
> I think NR has support for breaking by handler, no?  Just checked - no.
>  Only webapp controller, but that doesn't apply to Solr.
>
> SPM should be more helpful when it comes to monitoring Solr - you can
> filter by host, handler, collection/core, etc. -- you can see the demo -
> https://apps.sematext.com/demo - though this is plain Solr, not SolrCloud.
>
> If your index is big or queries are complex, shard it and parallelize
> search.
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Thu, Mar 13, 2014 at 6:17 PM, ralph tice  wrote:
>
> > I think your response time is including the average response for an add
> > operation, which generally returns very quickly and due to sheer number
> are
> > averaging out the response time of your queries.  New Relic should break
> > out requests based on which handler they're hitting but they don't seem
> to.


Re: Help me understand these newrelic graphs

2014-03-14 Thread Software Dev
If that is the case, what would help?


On Thu, Mar 13, 2014 at 8:46 PM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> It really depends, hard to give a definitive instruction without more
> pieces of info.
> e.g. if your CPUs are all maxed out and you already have a high number of
> concurrent queries, then sharding may not be of any help at all.
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
> >
>


Re: Help me understand these newrelic graphs

2014-03-14 Thread Software Dev
Here is a screenshot of the host information:
http://postimg.org/image/vub5ihxix/

As you can see, we have 24-core CPUs and the load is only at 5-7.5.




Re: Help me understand these newrelic graphs

2014-03-17 Thread Software Dev
Otis, I want to get those spikes down lower if possible. As mentioned in the posts above, the 25 ms timing you are seeing is not really accurate, because that's the average response time for ALL requests, including the bulk add operations, which are generally super fast. Our true response time is around 100 ms.


On Fri, Mar 14, 2014 at 10:54 AM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Are you trying to bring that 24.9 ms response time down?
> Looks like there is room for more aggressive sharding there, yes.
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>


Solr Cloud collection keep going down?

2014-03-22 Thread Software Dev
We have 2 collections with 1 shard each, replicated over 5 servers in the cluster. We see a lot of flapping (down or recovering) on one of the collections. When this happens, the other collection hosted on the same machine is still marked as active. When this happens it takes a fairly long time (~30 minutes) for the collection to come back online, if at all. I find that it's usually more reliable to completely shut down Solr on the affected machine and bring it back up with its core disabled. We then re-enable the core when it's marked as active.

A few questions:

1) What is the healthcheck in SolrCloud? Put another way, what is failing that marks one collection as down but the other on the same machine as up?

2) Why does recovery take forever when a node goes down, even if it's only down for 30 seconds? Our index is only 7-8G and we are running on SSDs.

3) What can be done to diagnose and fix this problem?


Re: Solr Cloud collection keep going down?

2014-03-22 Thread Software Dev
iter.write(OutputStreamWriter.java:207)
	at org.apache.solr.util.FastWriter.flush(FastWriter.java:141)
	at org.apache.solr.util.FastWriter.write(FastWriter.java:55)
	at org.apache.solr.response.RubyWriter.writeStr(RubyResponseWriter.java:87)
	at org.apache.solr.response.JSONWriter.writeNamedListAsFlat(JSONResponseWriter.java:285)
	at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:301)
	at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)
	at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)
	at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)
	at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)
	at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)
	at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)
	at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)
	at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)
	at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)
	at org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)
	at org.apache.solr.response.RubyResponseWriter.write(RubyResponseWriter.java:37)
	at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:768)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:440)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:368)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
	at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.SocketException: Connection reset
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
	at org.eclipse.jetty.io.ByteArrayBuffer.writeTo(ByteArrayBuffer.java:375)
	at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:164)
	at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:182)
	at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)
	... 51 more

,code=500}




Re: Solr Cloud collection keep going down?

2014-03-24 Thread Software Dev
Shawn,

Thanks for pointing me in the right direction. After consulting the above document I *think* that the problem may be too large a heap, which may be affecting GC and hence causing ZK timeouts.

We have around 20G of memory on these machines with a min/max heap of 6G and 10G respectively (-Xms6G -Xmx10G). The rest was set aside for disk cache. Why did we choose 6-10? No other reason than that we wanted to allot enough for disk cache and everything else was thrown at Solr. Does this sound about right?

I took some screenshots from VisualVM and our New Relic reporting, as well as some relevant portions of our solrconfig.xml. Any thoughts/comments would be greatly appreciated.

http://postimg.org/gallery/4t73sdks/1fc10f9c/

Thanks




On Sat, Mar 22, 2014 at 2:26 PM, Shawn Heisey  wrote:
>
> Unless you are actually using the ping request handler, the healthcheck
> config will not matter.  Or were you referring to something else?
>
> Referencing the logs you included in your reply:  The EofException
> errors happen because your client code times out and disconnects before
> the request it made has completed.  That is most likely just a symptom
> that has nothing at all to do with the problem.
>
> Read the following wiki page.  What I'm going to say below will
> reference information you can find there:
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Relevant side note: The default zookeeper client timeout is 15 seconds.
>  A typical zookeeper config defines tickTime as 2 seconds, and the
> timeout cannot be configured to be more than 20 times the tickTime,
> which means it cannot go beyond 40 seconds.  The default timeout value
> 15 seconds is usually more than enough, unless you are having
> performance problems.
>
> If you are not actually taking Solr instances down, then the fact that
> you are seeing the log replay messages indicates to me that something is
> taking so much time that the connection to Zookeeper times out.  When it
> finally responds, it will attempt to recover the index, which means
> first it will replay the transaction log and then it might replicate the
> index from the shard leader.
>
> Replaying the transaction log is likely the reason it takes so long to
> recover.  The wiki page I linked above has a "slow startup" section that
> explains how to fix this.
>
> There is some kind of underlying problem that is causing the zookeeper
> connection to timeout.  It is most likely garbage collection pauses or
> insufficient RAM to cache the index, possibly both.
>
> You did not indicate how much total RAM you have or how big your Java
> heap is.  As the wiki page mentions in the SSD section, SSD is not a
> substitute for having enough RAM to cache a significant percentage of
> your index.
>
> Thanks,
> Shawn
>
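
The "slow startup" fix referenced above generally amounts to enabling a hard autoCommit with openSearcher=false so the transaction logs stay small; a sketch for solrconfig.xml (the interval is only illustrative):

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog/>
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>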


Question on highlighting edgegrams

2014-03-24 Thread Software Dev
In 3.5.0 we have the following.

<fieldType name="text_edgengram" class="solr.TextField" positionIncrementGap="100">
  <!-- markup was lost in the archive; the type name, tokenizer, lowercase filter and minGramSize
       are assumed; positionIncrementGap="100" and maxGramSize="30" are original -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

If we searched for "c" with highlighting enabled we would get back
results such as:

cdat
crocdile
cool beans

But in the latest Solr (4.7) we get the full words highlighted back.
Did something change between these versions with regard to highlighting?

Thanks


Re: Question on highlighting edgegrams

2014-03-25 Thread Software Dev
Bump



Replication (Solr Cloud)

2014-03-25 Thread Software Dev
I see that by default in SolrCloud my collections are replicating. Should this be disabled in SolrCloud, as this is already handled by it?

From the documentation:

"The Replication screen shows you the current replication state for
the named core you have specified. In Solr, replication is for the
index only. SolrCloud has supplanted much of this functionality, but
if you are still using index replication, you can use this screen to
see the replication state:"

I just want to make sure, before I disable it, that if we send an update to one server the document will be correctly replicated across all nodes. Thanks


Re: Replication (Solr Cloud)

2014-03-25 Thread Software Dev
Thanks for the reply. I'll make sure NOT to disable it.


Re: Solr Cloud collection keep going down?

2014-03-25 Thread Software Dev
Can anyone else chime in? Thanks



Re: Replication (Solr Cloud)

2014-03-25 Thread Software Dev
One other question. If I optimize a collection on one node, does this
get replicated to all others when finished?

On Tue, Mar 25, 2014 at 10:13 AM, Software Dev
 wrote:
> Thanks for the reply. Ill make sure NOT to disable it.


Re: Replication (Solr Cloud)

2014-03-25 Thread Software Dev
Ehh.. found out the hard way. I optimized the collection on one machine, and when it completed it replicated to the others and took my cluster down. Shitty.

On Tue, Mar 25, 2014 at 10:46 AM, Software Dev
 wrote:
> One other question. If I optimize a collection on one node, does this
> get replicated to all others when finished?
>
> On Tue, Mar 25, 2014 at 10:13 AM, Software Dev
>  wrote:
>> Thanks for the reply. Ill make sure NOT to disable it.


Re: Replication (Solr Cloud)

2014-03-25 Thread Software Dev
So it's generally a bad idea to optimize, I gather?

- In older versions it might have done them all at once, but I believe
that newer versions only do one core at a time.

On Tue, Mar 25, 2014 at 11:16 AM, Shawn Heisey  wrote:
> On 3/25/2014 11:59 AM, Software Dev wrote:
>>
>> Ehh.. found out the hard way. I optimized the collection on 1 machine
>> and when it was completed it replicated to the others and took my
>> cluster down. Shitty
>
>
> It doesn't get replicated -- each core in the collection will be optimized.
> In older versions it might have done them all at once, but I believe that
> newer versions only do one core at a time.
>
> Doing an optimize on a Solr core results in a LOT of I/O. If your Solr
> install is having performance issues, that will push it over the edge.  When
> SolrCloud ends up with a performance problem in one place, they tend to
> multiply and cause MORE problems.  It can get bad enough that the whole
> cluster goes down because it's trying to do a recovery on every node.  For
> that reason, it's extremely important that you have enough system resources
> available across your cloud (RAM in particular) to avoid performance issues.
>
> Thanks,
> Shawn
>


Re: Replication (Solr Cloud)

2014-03-25 Thread Software Dev
"In older versions it might have done them all at once, but I believe
that newer versions only do one core at a time."

It looks like it did it all at once and I'm on the latest (4.7)



Re: Question on highlighting edgegrams

2014-03-25 Thread Software Dev
Same problem here:
http://lucene.472066.n3.nabble.com/Solr-4-x-EdgeNGramFilterFactory-and-highlighting-td4114748.html



What contributes to disk IO?

2014-03-25 Thread Software Dev
What are the main contributing factors for SolrCloud generating a lot of disk IO?

A lot of reads? Writes? Insufficient RAM?

I would think that if there was enough disk cache available for the whole index, there would be little to no disk IO.


Re: Question on highlighting edgegrams

2014-03-26 Thread Software Dev
Is this a known bug?

On Tue, Mar 25, 2014 at 1:12 PM, Software Dev  wrote:
> Same problem here:
> http://lucene.472066.n3.nabble.com/Solr-4-x-EdgeNGramFilterFactory-and-highlighting-td4114748.html
>
> On Tue, Mar 25, 2014 at 9:39 AM, Software Dev  
> wrote:
>> Bump
>>
>> On Mon, Mar 24, 2014 at 3:00 PM, Software Dev  
>> wrote:
>>> In 3.5.0 we have the following.
>>>
>>> >> positionIncrementGap="100">
>>>   
>>> 
>>> 
>>> >> maxGramSize="30"/>
>>>   
>>>   
>>> 
>>> 
>>>   
>>> 
>>>
>>> If we searched for "c" with highlighting enabled we would get back
>>> results such as:
>>>
>>> cdat
>>> crocdile
>>> cool beans
>>>
>>> But in the latest Solr (4.7) we get the full words highlighted back.
>>> Did something change from these versions with regards to highlighting?
>>>
>>> Thanks


What are my options?

2014-03-27 Thread Software Dev
We have a collection named "items". These are simply products that we sell. A large part of our scoring involves boosting on certain metrics for each product (amount sold, total GMS, ratings, etc.). Some of these metrics are actually split across multiple tables.

We are currently re-indexing the complete document any time any of these values changes. I'm wondering if there is a better way?

Some ideas:

1) Partially update the document. Is this even possible?
2) Add a parent-child relationship between an item and its metrics?
3) Dump all metrics to a file and use that as it changes throughout the day? I forget the actual component that does it. Either way, can it handle multiple values?
4) Something else?

I appreciate any feedback. Thanks
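
For what it's worth, option 1 exists as Solr's atomic updates (they require the updateLog and, in practice, all fields to be stored); an update that only touches the metric fields would then look roughly like this (field names are illustrative):

<add>
  <doc>
    <field name="id">item-123</field>
    <field name="amount_sold" update="set">42</field>
    <field name="rating_count" update="inc">1</field>
  </doc>
</add>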


Re: Question on highlighting edgegrams

2014-03-27 Thread Software Dev
Certainly I am not the only user experiencing this?



Re: Question on highlighting edgegrams

2014-03-28 Thread Software Dev
Shalin,

I am running 4.7 and seeing this behavior :(

On Thu, Mar 27, 2014 at 10:36 PM, Shalin Shekhar Mangar
 wrote:
> Yes, there are known bugs with EdgeNGram filters. I think they are fixed in 
> 4.4
>
> See https://issues.apache.org/jira/browse/LUCENE-3907
>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.


Highlighting bug with edgegrams

2014-04-09 Thread Software Dev
In 3.5.0 we have the following.

<fieldType name="text_edgengram" class="solr.TextField" positionIncrementGap="100">
  <!-- markup was lost in the archive; the type name, tokenizer, lowercase filter and minGramSize
       are assumed; positionIncrementGap="100" and maxGramSize="30" are original -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

If we searched for "c" with highlighting enabled we would get back
results such as:

cdat
crocdile
cool beans

But in the latest Solr (4.7.1) we get the full words highlighted back.
Did something change between these versions with regard to highlighting?

Thanks

Found an old post but no info:

http://lucene.472066.n3.nabble.com/Solr-4-x-EdgeNGramFilterFactory-and-highlighting-td4114748.html


Re: Sharding and replicas (Solr Cloud)

2013-11-07 Thread Software Dev
Sorry about the confusion. I meant I created my config via the ZkCLI and then I wanted to create my core via the Collections API. I *think* I have it working, but I was wondering why there is a crazy number of core names under the admin "Core Selector"?

When I create X shards via the bootstrap command I think it only creates 1 core. Am I missing something?


On Thu, Nov 7, 2013 at 1:06 PM, Shawn Heisey  wrote:

> On 11/7/2013 1:58 PM, Mark wrote:
>
>> If I create my collection via the ZkCLI (https://cwiki.apache.org/
>> confluence/display/solr/Command+Line+Utilities) how do I configure the
>> number of shards and replicas?
>>
>
> I was not aware that  you could create collections with zkcli.  I did not
> think that was possible.  Use the collections API:
>
> http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_
> Collections_API
>
> Thanks,
> Shawn
>
>


Re: Sharding and replicas (Solr Cloud)

2013-11-07 Thread Software Dev
I too want to be in control of everything that is created.

Here is what I'm trying to do:

1) Start up a cluster of 5 Solr instances
2) Import the configuration to ZooKeeper
3) Manually create a collection via the Collections API with the number of
shards and the replication factor

Now there are some issues with step 3. After creating the collection and
reloading the GUI I always see:

   - *collection1:*
   org.apache.solr.common.cloud.ZooKeeperException:
   Could not find configName for collection collection1 found:null

until I restart the cluster. Is there a way around this?

Also, after creating the collection it creates a directory in
$SOLR_HOME/home. So in this example it created
${SOLR_HOME}/collection1_shard1_replica1 and
${SOLR_HOME}/collection1_shard1_replica2. What happens if I rename both
of these to the same name in the core admin?






On Thu, Nov 7, 2013 at 3:15 PM, Shawn Heisey  wrote:

> On 11/7/2013 2:52 PM, Software Dev wrote:
>
>> Sorry about the confusion. I meant I created my config via the ZkCLI and
>> then I wanted to create my core via the CollectionsAPI. I *think* I have
>> it
>> working but was wondering why there are a crazy amount of core names under
>> the admin "Core Selector"?
>>
>> When I create X amount of shards via the bootstrap command I think it only
>> creates 1 core. Am I missing something?
>>
>
> If you create it with numShards=1 and replicationFactor=2, you'll end up
> with a total of 2 cores across all your Solr instances.  For my simple
> cloud install, these are the numbers that I'm using.  One shard, a total of
> two copies.
>
> If you create it with the numbers given on the wiki page, numShards=3 and
> replicationFactor=4, there would be a total of 12 cores created across all
> your servers.  The maxShardsPerNode parameter defaults to 1, which means
> that only 1 core per instance (SolrCloud node) is allowed for that
> collection.  If there aren't enough Solr instances for the numbers you have
> entered, the creation will fail.
>
> I don't know any details about what the bootstrap_conf parameter actually
> does when it creates collections.  I've never used it - I want to be in
> control of the configs and collections that get created.
>
> Thanks,
> Shawn
>
>
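
In case it helps anyone landing on this thread, here is a minimal SolrJ 4.x
sketch of step 3, creating a collection through the Collections API with an
explicit numShards, replicationFactor and configName (the host, collection and
config names are hypothetical). Passing collection.configName explicitly may
also avoid the "Could not find configName" message above when the config
uploaded with ZkCLI does not share the collection's name.

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class CreateCollection {
    public static void main(String[] args) throws Exception {
        // Any live node of the cluster can accept Collections API calls.
        HttpSolrServer node = new HttpSolrServer("http://localhost:8983/solr");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("action", "CREATE");
        params.set("name", "collection1");
        params.set("numShards", 1);
        params.set("replicationFactor", 2);
        params.set("maxShardsPerNode", 1);
        params.set("collection.configName", "myconf"); // config uploaded earlier via ZkCLI

        QueryRequest create = new QueryRequest(params);
        create.setPath("/admin/collections");
        node.request(create);

        node.shutdown();
    }
}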


Solr Cloud Bulk Indexing Questions

2014-01-20 Thread Software Dev
We are testing our shiny new Solr Cloud architecture but we are
experiencing some issues when doing bulk indexing.

We have 5 Solr Cloud machines running and 3 indexing machines (separate
from the cloud servers). The indexing machines pull ids off a queue,
then index and ship each document over via a CloudSolrServer. It appears
that the indexers are too fast, because the load (particularly disk IO) on
the Solr Cloud machines spikes through the roof, making the entire cluster
unusable. It's kind of odd because the total index size is not even
large, i.e. < 10GB. Are there any optimizations/enhancements I could try to
help alleviate these problems?

I should note that for the above collection we only have 1 shard that's
replicated across all machines, so all machines have the full index.

Would we benefit from switching to a ConcurrentUpdateSolrServer where all
updates get sent to 1 machine and 1 machine only? We could then remove that
machine from the cluster that handles user requests.

Thanks for any input.
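
For reference, a ConcurrentUpdateSolrServer setup as floated above would look
roughly like the sketch below (the URL, queue size, thread count and field
names are made-up values). Note that it only reports errors per batch, not per
document, and the replies that follow suggest staying with CloudSolrServer in a
SolrCloud setup.

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CussIndexer {
    public static void main(String[] args) throws Exception {
        // Buffer up to 10,000 documents and drain the queue with 4 background threads.
        ConcurrentUpdateSolrServer server =
            new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 10000, 4);

        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title_t", "bulk doc " + i);
            server.add(doc);            // buffered and sent asynchronously
        }

        server.blockUntilFinished();    // wait for the queued updates to be flushed
        server.commit();
        server.shutdown();
    }
}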


Re: Solr Cloud Bulk Indexing Questions

2014-01-20 Thread Software Dev
We have a soft commit every 5 seconds and a hard commit every 30. As
far as docs/second, I would guess around 200/sec, which doesn't seem that
high.


On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson wrote:

> Questions: How often do you commit your updates? What is your
> indexing rate in docs/second?
>
> In a SolrCloud setup, you should be using a CloudSolrServer. If the
> server is having trouble keeping up with updates, switching to CUSS
> probably wouldn't help.
>
> So I suspect there's something not optimal about your setup that's
> the culprit.
>
> Best,
> Erick
>
> On Mon, Jan 20, 2014 at 4:00 PM, Software Dev 
> wrote:
> > We are testing our shiny new Solr Cloud architecture but we are
> > experiencing some issues when doing bulk indexing.
> >
> > We have 5 solr cloud machines running and 3 indexing machines (separate
> > from the cloud servers). The indexing machines pull off ids from a queue
> > then they index and ship over a document via a CloudSolrServer. It
> appears
> > that the indexers are too fast because the load (particularly disk io) on
> > the solr cloud machines spikes through the roof making the entire cluster
> > unusable. It's kind of odd because the total index size is not even
> > large..ie, < 10GB. Are there any optimization/enhancements I could try to
> > help alleviate these problems?
> >
> > I should note that for the above collection we have only have 1 shard
> thats
> > replicated across all machines so all machines have the full index.
> >
> > Would we benefit from switching to a ConcurrentUpdateSolrServer where all
> > updates get sent to 1 machine and 1 machine only? We could then remove
> this
> > machine from our cluster than that handles user requests.
> >
> > Thanks for any input.
>


Re: Solr Cloud Bulk Indexing Questions

2014-01-20 Thread Software Dev
We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do all
updates get sent to one machine or something?


On Mon, Jan 20, 2014 at 2:42 PM, Software Dev wrote:

> We commit have a soft commit every 5 seconds and hard commit every 30. As
> far as docs/second it would guess around 200/sec which doesn't seem that
> high.
>
>
> On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson 
> wrote:
>
>> Questions: How often do you commit your updates? What is your
>> indexing rate in docs/second?
>>
>> In a SolrCloud setup, you should be using a CloudSolrServer. If the
>> server is having trouble keeping up with updates, switching to CUSS
>> probably wouldn't help.
>>
>> So I suspect there's something not optimal about your setup that's
>> the culprit.
>>
>> Best,
>> Erick
>>
>> On Mon, Jan 20, 2014 at 4:00 PM, Software Dev 
>> wrote:
>> > We are testing our shiny new Solr Cloud architecture but we are
>> > experiencing some issues when doing bulk indexing.
>> >
>> > We have 5 solr cloud machines running and 3 indexing machines (separate
>> > from the cloud servers). The indexing machines pull off ids from a queue
>> > then they index and ship over a document via a CloudSolrServer. It
>> appears
>> > that the indexers are too fast because the load (particularly disk io)
>> on
>> > the solr cloud machines spikes through the roof making the entire
>> cluster
>> > unusable. It's kind of odd because the total index size is not even
>> > large..ie, < 10GB. Are there any optimization/enhancements I could try
>> to
>> > help alleviate these problems?
>> >
>> > I should note that for the above collection we have only have 1 shard
>> thats
>> > replicated across all machines so all machines have the full index.
>> >
>> > Would we benefit from switching to a ConcurrentUpdateSolrServer where
>> all
>> > updates get sent to 1 machine and 1 machine only? We could then remove
>> this
>> > machine from our cluster than that handles user requests.
>> >
>> > Thanks for any input.
>>
>
>


Re: Solr Cloud Bulk Indexing Questions

2014-01-20 Thread Software Dev
4.6.0


On Mon, Jan 20, 2014 at 2:47 PM, Mark Miller  wrote:

> What version are you running?
>
> - Mark
>
> On Jan 20, 2014, at 5:43 PM, Software Dev 
> wrote:
>
> > We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do all
> > updates get sent to one machine or something?
> >
> >
> > On Mon, Jan 20, 2014 at 2:42 PM, Software Dev  >wrote:
> >
> >> We commit have a soft commit every 5 seconds and hard commit every 30.
> As
> >> far as docs/second it would guess around 200/sec which doesn't seem that
> >> high.
> >>
> >>
> >> On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson <
> erickerick...@gmail.com>wrote:
> >>
> >>> Questions: How often do you commit your updates? What is your
> >>> indexing rate in docs/second?
> >>>
> >>> In a SolrCloud setup, you should be using a CloudSolrServer. If the
> >>> server is having trouble keeping up with updates, switching to CUSS
> >>> probably wouldn't help.
> >>>
> >>> So I suspect there's something not optimal about your setup that's
> >>> the culprit.
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Mon, Jan 20, 2014 at 4:00 PM, Software Dev <
> static.void@gmail.com>
> >>> wrote:
> >>>> We are testing our shiny new Solr Cloud architecture but we are
> >>>> experiencing some issues when doing bulk indexing.
> >>>>
> >>>> We have 5 solr cloud machines running and 3 indexing machines
> (separate
> >>>> from the cloud servers). The indexing machines pull off ids from a
> queue
> >>>> then they index and ship over a document via a CloudSolrServer. It
> >>> appears
> >>>> that the indexers are too fast because the load (particularly disk io)
> >>> on
> >>>> the solr cloud machines spikes through the roof making the entire
> >>> cluster
> >>>> unusable. It's kind of odd because the total index size is not even
> >>>> large..ie, < 10GB. Are there any optimization/enhancements I could try
> >>> to
> >>>> help alleviate these problems?
> >>>>
> >>>> I should note that for the above collection we have only have 1 shard
> >>> thats
> >>>> replicated across all machines so all machines have the full index.
> >>>>
> >>>> Would we benefit from switching to a ConcurrentUpdateSolrServer where
> >>> all
> >>>> updates get sent to 1 machine and 1 machine only? We could then remove
> >>> this
> >>>> machine from our cluster than that handles user requests.
> >>>>
> >>>> Thanks for any input.
> >>>
> >>
> >>
>
>


Removing a node from Solr Cloud

2014-01-21 Thread Software Dev
What is the process for completely removing a node from Solr Cloud? We
recently removed one but it's still showing up as "Gone" in the Cloud
admin.

Thanks


Setting leaderVoteWait for auto discovered cores

2014-01-21 Thread Software Dev
How is this accomplished? We currently have an empty solr.xml
(auto-discovery) so I'm not sure where to put this value?


Re: Removing a node from Solr Cloud

2014-01-21 Thread Software Dev
Thanks. Any way to accomplish this if the machine crashed (i.e., I can't unload
it from the admin)?


On Tue, Jan 21, 2014 at 11:25 AM, Anshum Gupta wrote:

> You could unload the cores. This optionally also deletes the data and
> instance directory.
> Look at http://wiki.apache.org/solr/CoreAdmin#UNLOAD.
>
>
> On Tue, Jan 21, 2014 at 10:22 AM, Software Dev  >wrote:
>
> > What is the process for completely removing a node from Solr Cloud? We
> > recently removed one but t its still showing up as "Gone" in the Cloud
> > admin.
> >
> > Thanks
> >
>
>
>
> --
>
> Anshum Gupta
> http://www.anshumgupta.net
>
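
For a node that is still reachable, the UNLOAD that Anshum mentions can also be
issued from SolrJ; a small sketch (the host and core name are hypothetical, and
the deleteIndex flag is optional):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class UnloadReplica {
    public static void main(String[] args) throws Exception {
        // Point at the node that still hosts the core backing the replica.
        HttpSolrServer node = new HttpSolrServer("http://solr-node1:8983/solr");

        // Unload the core; deleteIndex=true also removes its index files.
        CoreAdminRequest.unloadCore("collection1_shard1_replica2", true, node);

        node.shutdown();
    }
}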


Re: Solr Cloud Bulk Indexing Questions

2014-01-21 Thread Software Dev
Any other suggestions?


On Mon, Jan 20, 2014 at 2:49 PM, Software Dev wrote:

> 4.6.0
>
>
> On Mon, Jan 20, 2014 at 2:47 PM, Mark Miller wrote:
>
>> What version are you running?
>>
>> - Mark
>>
>> On Jan 20, 2014, at 5:43 PM, Software Dev 
>> wrote:
>>
>> > We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do all
>> > updates get sent to one machine or something?
>> >
>> >
>> > On Mon, Jan 20, 2014 at 2:42 PM, Software Dev <
>> static.void@gmail.com>wrote:
>> >
>> >> We commit have a soft commit every 5 seconds and hard commit every 30.
>> As
>> >> far as docs/second it would guess around 200/sec which doesn't seem
>> that
>> >> high.
>> >>
>> >>
>> >> On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson <
>> erickerick...@gmail.com>wrote:
>> >>
>> >>> Questions: How often do you commit your updates? What is your
>> >>> indexing rate in docs/second?
>> >>>
>> >>> In a SolrCloud setup, you should be using a CloudSolrServer. If the
>> >>> server is having trouble keeping up with updates, switching to CUSS
>> >>> probably wouldn't help.
>> >>>
>> >>> So I suspect there's something not optimal about your setup that's
>> >>> the culprit.
>> >>>
>> >>> Best,
>> >>> Erick
>> >>>
>> >>> On Mon, Jan 20, 2014 at 4:00 PM, Software Dev <
>> static.void@gmail.com>
>> >>> wrote:
>> >>>> We are testing our shiny new Solr Cloud architecture but we are
>> >>>> experiencing some issues when doing bulk indexing.
>> >>>>
>> >>>> We have 5 solr cloud machines running and 3 indexing machines
>> (separate
>> >>>> from the cloud servers). The indexing machines pull off ids from a
>> queue
>> >>>> then they index and ship over a document via a CloudSolrServer. It
>> >>> appears
>> >>>> that the indexers are too fast because the load (particularly disk
>> io)
>> >>> on
>> >>>> the solr cloud machines spikes through the roof making the entire
>> >>> cluster
>> >>>> unusable. It's kind of odd because the total index size is not even
>> >>>> large..ie, < 10GB. Are there any optimization/enhancements I could
>> try
>> >>> to
>> >>>> help alleviate these problems?
>> >>>>
>> >>>> I should note that for the above collection we have only have 1 shard
>> >>> thats
>> >>>> replicated across all machines so all machines have the full index.
>> >>>>
>> >>>> Would we benefit from switching to a ConcurrentUpdateSolrServer where
>> >>> all
>> >>>> updates get sent to 1 machine and 1 machine only? We could then
>> remove
>> >>> this
>> >>>> machine from our cluster than that handles user requests.
>> >>>>
>> >>>> Thanks for any input.
>> >>>
>> >>
>> >>
>>
>>
>


Re: Solr Cloud Bulk Indexing Questions

2014-01-22 Thread Software Dev
> A suggestion would be to hard commit much less often, ie every 10
> minutes, and see if there is a change.

Will try this.

> How much system RAM ? JVM Heap ? Enough space in RAM for system disk cache ?

We have 18GB of RAM, 12GB dedicated to Solr, but as of right now the total
index size is only 5GB.

> What is the size of your documents ? A few KB, MB, ... ?

Under 1MB.

> Ah, and what about network IO ? Could that be a limiting factor ?

Again, the total index size is only 5GB, so I don't know if this would be a
problem.






On Wed, Jan 22, 2014 at 12:26 AM, Andre Bois-Crettez
wrote:

> 1 node having more load should be the leader (because of the extra work
> of receiving and distributing updates, but my experiences show only a
> bit more CPU usage, and no difference in disk IO).
>
> A suggestion would be to hard commit much less often, ie every 10
> minutes, and see if there is a change.
> How much system RAM ? JVM Heap ? Enough space in RAM for system disk cache
> ?
> What is the size of your documents ? A few KB, MB, ... ?
> Ah, and what about network IO ? Could that be a limiting factor ?
>
>
> André
>
>
> On 2014-01-21 23:40, Software Dev wrote:
>
>> Any other suggestions?
>>
>>
>> On Mon, Jan 20, 2014 at 2:49 PM, Software Dev 
>> wrote:
>>
>>  4.6.0
>>>
>>>
>>> On Mon, Jan 20, 2014 at 2:47 PM, Mark Miller >> >wrote:
>>>
>>>  What version are you running?
>>>>
>>>> - Mark
>>>>
>>>> On Jan 20, 2014, at 5:43 PM, Software Dev 
>>>> wrote:
>>>>
>>>>  We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do
>>>>> all
>>>>> updates get sent to one machine or something?
>>>>>
>>>>>
>>>>> On Mon, Jan 20, 2014 at 2:42 PM, Software Dev <
>>>>>
>>>> static.void@gmail.com>wrote:
>>>>
>>>>> We commit have a soft commit every 5 seconds and hard commit every 30.
>>>>>>
>>>>> As
>>>>
>>>>> far as docs/second it would guess around 200/sec which doesn't seem
>>>>>>
>>>>> that
>>>>
>>>>> high.
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson <
>>>>>>
>>>>> erickerick...@gmail.com>wrote:
>>>>
>>>>> Questions: How often do you commit your updates? What is your
>>>>>>> indexing rate in docs/second?
>>>>>>>
>>>>>>> In a SolrCloud setup, you should be using a CloudSolrServer. If the
>>>>>>> server is having trouble keeping up with updates, switching to CUSS
>>>>>>> probably wouldn't help.
>>>>>>>
>>>>>>> So I suspect there's something not optimal about your setup that's
>>>>>>> the culprit.
>>>>>>>
>>>>>>> Best,
>>>>>>> Erick
>>>>>>>
>>>>>>> On Mon, Jan 20, 2014 at 4:00 PM, Software Dev <
>>>>>>>
>>>>>> static.void@gmail.com>
>>>>
>>>>> wrote:
>>>>>>>
>>>>>>>> We are testing our shiny new Solr Cloud architecture but we are
>>>>>>>> experiencing some issues when doing bulk indexing.
>>>>>>>>
>>>>>>>> We have 5 solr cloud machines running and 3 indexing machines
>>>>>>>>
>>>>>>> (separate
>>>>
>>>>> from the cloud servers). The indexing machines pull off ids from a
>>>>>>>>
>>>>>>> queue
>>>>
>>>>> then they index and ship over a document via a CloudSolrServer. It
>>>>>>>>
>>>>>>> appears
>>>>>>>
>>>>>>>> that the indexers are too fast because the load (particularly disk
>>>>>>>>
>>>>>>> io)
>>>>
>>>>> on
>>>>>>>
>>>>>>>> the solr cloud machines spikes through the roof making the entire
>>>>>>>>
>>>>>>> cluster
>>>>>>>
>>>>>>>> unusable. It's kind of odd because the total index size is not even
>>>>>>>> large..ie, < 10GB. Are there any optimization/enhancements I could
>>>>>>>>
>>>>>>> try
>>>>
>>>>> to
>>>>>>>
>>>>>>>> help alleviate these problems?
>>>>>>>>
>>>>>>>> I should note that for the above collection we have only have 1
>>>>>>>> shard
>>>>>>>>
>>>>>>> thats
>>>>>>>
>>>>>>>> replicated across all machines so all machines have the full index.
>>>>>>>>
>>>>>>>> Would we benefit from switching to a ConcurrentUpdateSolrServer
>>>>>>>> where
>>>>>>>>
>>>>>>> all
>>>>>>>
>>>>>>>> updates get sent to 1 machine and 1 machine only? We could then
>>>>>>>>
>>>>>>> remove
>>>>
>>>>> this
>>>>>>>
>>>>>>>> machine from our cluster than that handles user requests.
>>>>>>>>
>>>>>>>> Thanks for any input.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>> --
>> André Bois-Crettez
>>
>> Software Architect
>> Search Developer
>> http://www.kelkoo.com/
>>
>
> Kelkoo SAS
> Société par Actions Simplifiée
> Au capital de € 4.168.964,30
> Siège social : 8, rue du Sentier 75002 Paris
> 425 093 069 RCS Paris
>
> Ce message et les pièces jointes sont confidentiels et établis à
> l'attention exclusive de leurs destinataires. Si vous n'êtes pas le
> destinataire de ce message, merci de le détruire et d'en avertir
> l'expéditeur.
>


Re: Solr Cloud Bulk Indexing Questions

2014-01-23 Thread Software Dev
Thanks for the suggestions. After reading that document I feel even more
confused, though, because I always thought that hard commits should be less
frequent than soft commits.

Is there any way to configure the autoCommit and softCommit values on a per
request basis? The majority of the time we have a small flow of updates
coming in and we would like to see them ASAP. However, we occasionally
need to do some bulk indexing (once a week or less) and the need to see
those updates right away isn't as critical.

I would say 95% of the time we are in "Index-Light Query-Light/Heavy" mode
and the other 5% is "Index-Heavy Query-Light/Heavy" mode.

Thanks


On Wed, Jan 22, 2014 at 5:33 PM, Erick Erickson wrote:

> When you're doing hard commits, is it with openSeacher = true or
> false? It should probably be false...
>
> Here's a rundown of the soft/hard commit consequences:
>
>
> http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> I suspect (but, of course, can't prove) that you're over-committing
> and hitting segment
> merges without meaning to...
>
> FWIW,
> Erick
>
> On Wed, Jan 22, 2014 at 1:46 PM, Software Dev 
> wrote:
> > A suggestion would be to hard commit much less often, ie every 10
> > minutes, and see if there is a change.
> >
> > - Will try this
> >
> > How much system RAM ? JVM Heap ? Enough space in RAM for system disk
> cache ?
> >
> > - We have 18G of ram 12 dedicated to Solr but as of right now the total
> > index size is only 5GB
> >
> > Ah, and what about network IO ? Could that be a limiting factor ?
> >
> > - What is the size of your documents ? A few KB, MB, ... ?
> >
> > Under 1MB
> >
> > - Again, total index size is only 5GB so I dont know if this would be a
> > problem
> >
> >
> >
> >
> >
> >
> > On Wed, Jan 22, 2014 at 12:26 AM, Andre Bois-Crettez
> > wrote:
> >
> >> 1 node having more load should be the leader (because of the extra work
> >> of receiving and distributing updates, but my experiences show only a
> >> bit more CPU usage, and no difference in disk IO).
> >>
> >> A suggestion would be to hard commit much less often, ie every 10
> >> minutes, and see if there is a change.
> >> How much system RAM ? JVM Heap ? Enough space in RAM for system disk
> cache
> >> ?
> >> What is the size of your documents ? A few KB, MB, ... ?
> >> Ah, and what about network IO ? Could that be a limiting factor ?
> >>
> >>
> >> André
> >>
> >>
> >> On 2014-01-21 23:40, Software Dev wrote:
> >>
> >>> Any other suggestions?
> >>>
> >>>
> >>> On Mon, Jan 20, 2014 at 2:49 PM, Software Dev <
> static.void@gmail.com>
> >>> wrote:
> >>>
> >>>  4.6.0
> >>>>
> >>>>
> >>>> On Mon, Jan 20, 2014 at 2:47 PM, Mark Miller  >>>> >wrote:
> >>>>
> >>>>  What version are you running?
> >>>>>
> >>>>> - Mark
> >>>>>
> >>>>> On Jan 20, 2014, at 5:43 PM, Software Dev  >
> >>>>> wrote:
> >>>>>
> >>>>>  We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do
> >>>>>> all
> >>>>>> updates get sent to one machine or something?
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Jan 20, 2014 at 2:42 PM, Software Dev <
> >>>>>>
> >>>>> static.void@gmail.com>wrote:
> >>>>>
> >>>>>> We commit have a soft commit every 5 seconds and hard commit every
> 30.
> >>>>>>>
> >>>>>> As
> >>>>>
> >>>>>> far as docs/second it would guess around 200/sec which doesn't seem
> >>>>>>>
> >>>>>> that
> >>>>>
> >>>>>> high.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson <
> >>>>>>>
> >>>>>> erickerick...@gmail.com>wrote:
> >>>>>
> >>>>>> Questions: How often do you commit your updates? What is your
> >>>>>>>> indexing rate in docs/second?
> >>>>>>>>
> >>>>>>>&g
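
On the per-request question above: autoCommit and softCommit themselves live in
solrconfig.xml, but each SolrJ update request can carry its own commitWithin,
which is one way to keep the trickle updates visible quickly while letting bulk
loads ride a much less frequent hard commit. A rough sketch (ZooKeeper hosts,
collection and field names are hypothetical):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        docs.add(doc);

        // Trickle updates: ask Solr to make these searchable within 5 seconds.
        server.add(docs, 5000);

        // Bulk load: allow up to 10 minutes, so Solr is free to batch the work.
        server.add(docs, 10 * 60 * 1000);

        server.shutdown();
    }
}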

Re: Solr Cloud Bulk Indexing Questions

2014-01-23 Thread Software Dev
Also, any suggestions on debugging? What should I look for and how? Thanks


On Thu, Jan 23, 2014 at 10:01 AM, Software Dev wrote:

> Thanks for suggestions. After reading that document I feel even more
> confused though because I always thought that hard commits should be less
> frequent that hard commits.
>
> Is there any way to configure autoCommit, softCommit values on a per
> request basis? The majority of the time we have small flow of updates
> coming in and we would like to see them in ASAP. However we occasionally
> need to do some bulk indexing (once a week or less) and the need to see
> those updates right away isn't as critical.
>
> I would say 95% of the time we are in "Index-Light Query-Light/Heavy" mode
> and the other 5% is "Index-Heavy Query-Light/Heavy" mode.
>
> Thanks
>
>
> On Wed, Jan 22, 2014 at 5:33 PM, Erick Erickson 
> wrote:
>
>> When you're doing hard commits, is it with openSeacher = true or
>> false? It should probably be false...
>>
>> Here's a rundown of the soft/hard commit consequences:
>>
>>
>> http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>
>> I suspect (but, of course, can't prove) that you're over-committing
>> and hitting segment
>> merges without meaning to...
>>
>> FWIW,
>> Erick
>>
>> On Wed, Jan 22, 2014 at 1:46 PM, Software Dev 
>> wrote:
>> > A suggestion would be to hard commit much less often, ie every 10
>> > minutes, and see if there is a change.
>> >
>> > - Will try this
>> >
>> > How much system RAM ? JVM Heap ? Enough space in RAM for system disk
>> cache ?
>> >
>> > - We have 18G of ram 12 dedicated to Solr but as of right now the total
>> > index size is only 5GB
>> >
>> > Ah, and what about network IO ? Could that be a limiting factor ?
>> >
>> > - What is the size of your documents ? A few KB, MB, ... ?
>> >
>> > Under 1MB
>> >
>> > - Again, total index size is only 5GB so I dont know if this would be a
>> > problem
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Jan 22, 2014 at 12:26 AM, Andre Bois-Crettez
>> > wrote:
>> >
>> >> 1 node having more load should be the leader (because of the extra work
>> >> of receiving and distributing updates, but my experiences show only a
>> >> bit more CPU usage, and no difference in disk IO).
>> >>
>> >> A suggestion would be to hard commit much less often, ie every 10
>> >> minutes, and see if there is a change.
>> >> How much system RAM ? JVM Heap ? Enough space in RAM for system disk
>> cache
>> >> ?
>> >> What is the size of your documents ? A few KB, MB, ... ?
>> >> Ah, and what about network IO ? Could that be a limiting factor ?
>> >>
>> >>
>> >> André
>> >>
>> >>
>> >> On 2014-01-21 23:40, Software Dev wrote:
>> >>
>> >>> Any other suggestions?
>> >>>
>> >>>
>> >>> On Mon, Jan 20, 2014 at 2:49 PM, Software Dev <
>> static.void@gmail.com>
>> >>> wrote:
>> >>>
>> >>>  4.6.0
>> >>>>
>> >>>>
>> >>>> On Mon, Jan 20, 2014 at 2:47 PM, Mark Miller > >>>> >wrote:
>> >>>>
>> >>>>  What version are you running?
>> >>>>>
>> >>>>> - Mark
>> >>>>>
>> >>>>> On Jan 20, 2014, at 5:43 PM, Software Dev <
>> static.void@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>  We also noticed that disk IO shoots up to 100% on 1 of the nodes.
>> Do
>> >>>>>> all
>> >>>>>> updates get sent to one machine or something?
>> >>>>>>
>> >>>>>>
>> >>>>>> On Mon, Jan 20, 2014 at 2:42 PM, Software Dev <
>> >>>>>>
>> >>>>> static.void@gmail.com>wrote:
>> >>>>>
>> >>>>>> We commit have a soft commit every 5 seconds and hard commit every
>> 30.
>> >>>>>>>
>> >>>>>> As
>> >>>>>
>> >>>>>> far as docs/second it would guess around 200/s

Re: Solr Cloud Bulk Indexing Questions

2014-01-23 Thread Software Dev
Does maxWriteMBPerSec apply to NRTCachingDirectoryFactory? I only
see maxMergeSizeMB and maxCachedMB as configuration values.


On Thu, Jan 23, 2014 at 11:05 AM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Hi,
>
> Have you tried maxWriteMBPerSec?
>
> http://search-lucene.com/?q=maxWriteMBPerSec&fc_project=Solr
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Mon, Jan 20, 2014 at 4:00 PM, Software Dev  >wrote:
>
> > We are testing our shiny new Solr Cloud architecture but we are
> > experiencing some issues when doing bulk indexing.
> >
> > We have 5 solr cloud machines running and 3 indexing machines (separate
> > from the cloud servers). The indexing machines pull off ids from a queue
> > then they index and ship over a document via a CloudSolrServer. It
> appears
> > that the indexers are too fast because the load (particularly disk io) on
> > the solr cloud machines spikes through the roof making the entire cluster
> > unusable. It's kind of odd because the total index size is not even
> > large..ie, < 10GB. Are there any optimization/enhancements I could try to
> > help alleviate these problems?
> >
> > I should note that for the above collection we have only have 1 shard
> thats
> > replicated across all machines so all machines have the full index.
> >
> > Would we benefit from switching to a ConcurrentUpdateSolrServer where all
> > updates get sent to 1 machine and 1 machine only? We could then remove
> this
> > machine from our cluster than that handles user requests.
> >
> > Thanks for any input.
> >
>


SolrCloudServer questions

2014-01-31 Thread Software Dev
Can someone clarify what the following options are:

- updatesToLeaders
- shutdownLBHttpSolrServer
- parallelUpdates

Also, I remember in older versions of Solr there was a more compact, efficient
format that was used between SolrJ and Solr. Does this still
exist in the latest version of Solr? If so, is it the default?

Thanks
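
On the compact wire format: that is the javabin format, and SolrJ still uses it
by default for responses. Sending updates as javabin is opt-in on the client
side; a small sketch (the URL is hypothetical):

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class JavabinSetup {
    public static void main(String[] args) {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Responses already come back as javabin (BinaryResponseParser is the default);
        // this makes the update requests javabin as well instead of XML.
        server.setRequestWriter(new BinaryRequestWriter());

        server.shutdown();
    }
}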


Disabling Commit/Auto-Commit (SolrCloud)

2014-01-31 Thread Software Dev
Is there a way to disable commit/hard-commit at runtime? For example, we
usually have our hard commit and soft commit set really low, but when we do
bulk indexing we would like to disable this to increase performance. If
there isn't an easy way of doing this, would simply pushing a new
solrconfig to SolrCloud work?
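
As far as I know there is no switch to turn autoCommit off at runtime; it lives
in solrconfig.xml. Pushing a new solrconfig should work: upload it to ZooKeeper
(for example with zkcli.sh -cmd upconfig) and then RELOAD the collection so the
nodes pick it up. A sketch of the reload call from SolrJ (host and collection
name are hypothetical):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ReloadCollection {
    public static void main(String[] args) throws Exception {
        HttpSolrServer node = new HttpSolrServer("http://localhost:8983/solr");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("action", "RELOAD");
        params.set("name", "collection1");

        QueryRequest reload = new QueryRequest(params);
        reload.setPath("/admin/collections"); // Collections API endpoint
        node.request(reload);

        node.shutdown();
    }
}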


Re: SolrCloudServer questions

2014-01-31 Thread Software Dev
Which of any of these settings would be beneficial when bulk uploading?


On Fri, Jan 31, 2014 at 11:05 AM, Mark Miller  wrote:

>
>
> On Jan 31, 2014, at 1:56 PM, Greg Walters 
> wrote:
>
> > I'm assuming you mean CloudSolrServer here. If I'm wrong please ignore
> my response.
> >
> >> -updatesToLeaders
> >
> > Only send documents to shard leaders while indexing. This saves
> cross-talk between slaves and leaders which results in more efficient
> document routing.
>
> Right, but recently this has less of an effect because CloudSolrServer can
> now hash documents and directly send them to the right place. This option
> has become more historical. Just make sure you set the correct id field on
> the CloudSolrServer instance for this hashing to work (I think it defaults
> to "id").
>
> >
> >> shutdownLBHttpSolrServer
> >
> > CloudSolrServer uses a LBHttpSolrServer behind the scenes to distribute
> requests (that aren't updates directly to leaders). Where did you find
> this? I don't see this in the javadoc anywhere but it is a boolean in the
> CloudSolrServer class. It looks like when you create a new CloudSolrServer
> and pass it your own LBHttpSolrServer the boolean gets set to false and the
> CloudSolrServer won't shut down the LBHttpSolrServer when it gets shut down.
> >
> >> parallelUpdates
> >
> > The javadocs don't have any description for this one, but I checked out
> the code for CloudSolrServer and if parallelUpdates is set it looks like it
> executes update statements to multiple shards at the same time.
>
> Right, we should def add some javadoc, but this sends updates to shards in
> parallel rather than with a single thread. Can really increase update
> speed. Still not as powerful as using CloudSolrServer from multiple
> threads, but a nice improvement nonetheless.
>
>
> - Mark
>
> http://about.me/markrmiller
>
> >
> > I'm no dev but I can read so please excuse any errors on my part.
> >
> > Thanks,
> > Greg
> >
> > On Jan 31, 2014, at 11:40 AM, Software Dev 
> wrote:
> >
> >> Can someone clarify what the following options are:
> >>
> >> - updatesToLeaders
> >> - shutdownLBHttpSolrServer
> >> - parallelUpdates
> >>
> >> Also, I remember in older version of Solr there was an efficient format
> >> that was used between SolrJ and Solr that is more compact. Does this
> sill
> >> exist in the latest version of Solr? If so, is it the default?
> >>
> >> Thanks
> >
>
>
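
Putting the points above together, a bulk-indexing client along these lines is
one way to apply them (the ZooKeeper hosts, collection and field names are
hypothetical; setIdField and setParallelUpdates are the CloudSolrServer knobs
discussed in this thread):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");
        server.setIdField("id");           // field used to hash docs to the right shard
        server.setParallelUpdates(true);   // send the per-shard batches concurrently

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title_t", "bulk doc " + i);
            batch.add(doc);
            if (batch.size() == 250) {     // bulk add instead of one add() per document
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.shutdown();
    }
}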


Re: SolrCloudServer questions

2014-02-01 Thread Software Dev
Our use case is we have 3 indexing machines pulling off a Kafka queue and
they are all sending individual updates.


On Fri, Jan 31, 2014 at 12:54 PM, Mark Miller  wrote:

> Just make sure parallel updates is set to true.
>
> If you want to load even faster, you can use the bulk add methods, or if
> you need more fine grained responses, use the single add from multiple
> threads (though bulk add can also be done via multiple threads if you
> really want to try and push the max).
>
> - Mark
>
> http://about.me/markrmiller
>
> On Jan 31, 2014, at 3:50 PM, Software Dev 
> wrote:
>
> > Which of any of these settings would be beneficial when bulk uploading?
> >
> >
> > On Fri, Jan 31, 2014 at 11:05 AM, Mark Miller 
> wrote:
> >
> >>
> >>
> >> On Jan 31, 2014, at 1:56 PM, Greg Walters 
> >> wrote:
> >>
> >>> I'm assuming you mean CloudSolrServer here. If I'm wrong please ignore
> >> my response.
> >>>
> >>>> -updatesToLeaders
> >>>
> >>> Only send documents to shard leaders while indexing. This saves
> >> cross-talk between slaves and leaders which results in more efficient
> >> document routing.
> >>
> >> Right, but recently this has less of an affect because CloudSolrServer
> can
> >> now hash documents and directly send them to the right place. This
> option
> >> has become more historical. Just make sure you set the correct id field
> on
> >> the CloudSolrServer instance for this hashing to work (I think it
> defaults
> >> to "id").
> >>
> >>>
> >>>> shutdownLBHttpSolrServer
> >>>
> >>> CloudSolrServer uses a LBHttpSolrServer behind the scenes to distribute
> >> requests (that aren't updates directly to leaders). Where did you find
> >> this? I don't see this in the javadoc anywhere but it is a boolean in
> the
> >> CloudSolrServer class. It looks like when you create a new
> CloudSolrServer
> >> and pass it your own LBHttpSolrServer the boolean gets set to false and
> the
> >> CloudSolrServer won't shut down the LBHttpSolrServer when it gets shut
> down.
> >>>
> >>>> parellelUpdates
> >>>
> >>> The javadoc's done have any description for this one but I checked out
> >> the code for CloudSolrServer and if parallelUpdates it looks like it
> >> executes update statements to multiple shards at the same time.
> >>
> >> Right, we should def add some javadoc, but this sends updates to shards
> in
> >> parallel rather than with a single thread. Can really increase update
> >> speed. Still not as powerful as using CloudSolrServer from multiple
> >> threads, but a nice improvement non the less.
> >>
> >>
> >> - Mark
> >>
> >> http://about.me/markrmiller
> >>
> >>>
> >>> I'm no dev but I can read so please excuse any errors on my part.
> >>>
> >>> Thanks,
> >>> Greg
> >>>
> >>> On Jan 31, 2014, at 11:40 AM, Software Dev 
> >> wrote:
> >>>
> >>>> Can someone clarify what the following options are:
> >>>>
> >>>> - updatesToLeaders
> >>>> - shutdownLBHttpSolrServer
> >>>> - parallelUpdates
> >>>>
> >>>> Also, I remember in older version of Solr there was an efficient
> format
> >>>> that was used between SolrJ and Solr that is more compact. Does this
> >> sill
> >>>> exist in the latest version of Solr? If so, is it the default?
> >>>>
> >>>> Thanks
> >>>
> >>
> >>
>
>


Re: SolrCloudServer questions

2014-02-01 Thread Software Dev
Also, if we are seeing a huge CPU spike on the leader when doing a bulk
index, would changing any of these options help?


On Sat, Feb 1, 2014 at 2:59 PM, Software Dev wrote:

> Out use case is we have 3 indexing machines pulling off a kafka queue and
> they are all sending individual updates.
>
>
> On Fri, Jan 31, 2014 at 12:54 PM, Mark Miller wrote:
>
>> Just make sure parallel updates is set to true.
>>
>> If you want to load even faster, you can use the bulk add methods, or if
>> you need more fine grained responses, use the single add from multiple
>> threads (though bulk add can also be done via multiple threads if you
>> really want to try and push the max).
>>
>> - Mark
>>
>> http://about.me/markrmiller
>>
>> On Jan 31, 2014, at 3:50 PM, Software Dev 
>> wrote:
>>
>> > Which of any of these settings would be beneficial when bulk uploading?
>> >
>> >
>> > On Fri, Jan 31, 2014 at 11:05 AM, Mark Miller 
>> wrote:
>> >
>> >>
>> >>
>> >> On Jan 31, 2014, at 1:56 PM, Greg Walters 
>> >> wrote:
>> >>
>> >>> I'm assuming you mean CloudSolrServer here. If I'm wrong please ignore
>> >> my response.
>> >>>
>> >>>> -updatesToLeaders
>> >>>
>> >>> Only send documents to shard leaders while indexing. This saves
>> >> cross-talk between slaves and leaders which results in more efficient
>> >> document routing.
>> >>
>> >> Right, but recently this has less of an affect because CloudSolrServer
>> can
>> >> now hash documents and directly send them to the right place. This
>> option
>> >> has become more historical. Just make sure you set the correct id
>> field on
>> >> the CloudSolrServer instance for this hashing to work (I think it
>> defaults
>> >> to "id").
>> >>
>> >>>
>> >>>> shutdownLBHttpSolrServer
>> >>>
>> >>> CloudSolrServer uses a LBHttpSolrServer behind the scenes to
>> distribute
>> >> requests (that aren't updates directly to leaders). Where did you find
>> >> this? I don't see this in the javadoc anywhere but it is a boolean in
>> the
>> >> CloudSolrServer class. It looks like when you create a new
>> CloudSolrServer
>> >> and pass it your own LBHttpSolrServer the boolean gets set to false
>> and the
>> >> CloudSolrServer won't shut down the LBHttpSolrServer when it gets shut
>> down.
>> >>>
>> >>>> parellelUpdates
>> >>>
>> >>> The javadoc's done have any description for this one but I checked out
>> >> the code for CloudSolrServer and if parallelUpdates it looks like it
>> >> executes update statements to multiple shards at the same time.
>> >>
>> >> Right, we should def add some javadoc, but this sends updates to
>> shards in
>> >> parallel rather than with a single thread. Can really increase update
>> >> speed. Still not as powerful as using CloudSolrServer from multiple
>> >> threads, but a nice improvement non the less.
>> >>
>> >>
>> >> - Mark
>> >>
>> >> http://about.me/markrmiller
>> >>
>> >>>
>> >>> I'm no dev but I can read so please excuse any errors on my part.
>> >>>
>> >>> Thanks,
>> >>> Greg
>> >>>
>> >>> On Jan 31, 2014, at 11:40 AM, Software Dev > >
>> >> wrote:
>> >>>
>> >>>> Can someone clarify what the following options are:
>> >>>>
>> >>>> - updatesToLeaders
>> >>>> - shutdownLBHttpSolrServer
>> >>>> - parallelUpdates
>> >>>>
>> >>>> Also, I remember in older version of Solr there was an efficient
>> format
>> >>>> that was used between SolrJ and Solr that is more compact. Does this
>> >> sill
>> >>>> exist in the latest version of Solr? If so, is it the default?
>> >>>>
>> >>>> Thanks
>> >>>
>> >>
>> >>
>>
>>
>


How does Solr parse schema.xml?

2014-02-26 Thread Software Dev
Can anyone point me in the right direction? I'm trying to duplicate the
functionality of the analysis request handler so we can wrap a service
around it to return the terms given a string of text. We would like to read
the same schema.xml file to configure the analyzer, tokenizer, etc., but I
can't seem to find the class that actually does the parsing of that file.

Thanks
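
The class that actually parses schema.xml is org.apache.solr.schema.IndexSchema
(SolrCore builds one from the config). Rather than re-parsing the schema
yourself, another option is to keep calling the analysis handler through SolrJ;
a rough sketch, assuming a running core and a field type named text_general:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.FieldAnalysisRequest;
import org.apache.solr.client.solrj.response.AnalysisResponseBase;
import org.apache.solr.client.solrj.response.FieldAnalysisResponse;

public class AnalyzeText {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Hits the /analysis/field handler, which already knows the schema.
        FieldAnalysisRequest req = new FieldAnalysisRequest();
        req.addFieldType("text_general");
        req.setFieldValue("A quick brown fox");

        FieldAnalysisResponse resp = req.process(server);

        // Each phase is one stage of the analyzer chain with its output tokens.
        for (AnalysisResponseBase.AnalysisPhase phase
                : resp.getFieldTypeAnalysis("text_general").getIndexPhases()) {
            System.out.println(phase.getClassName() + " -> "
                + phase.getTokens().size() + " tokens");
        }

        server.shutdown();
    }
}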


Re: Does Solr flush to disk even before ramBufferSizeMB is hit?

2011-08-30 Thread roz dev
Thanks Shawn.

If Solr writes this info to disk as soon as possible (which is what I am
seeing) then the ramBuffer setting seems to be misleading.

Does anyone else have any thoughts on this?

-Saroj


On Mon, Aug 29, 2011 at 6:14 AM, Shawn Heisey  wrote:

> On 8/28/2011 11:18 PM, roz dev wrote:
>
>> I notice that even though InfoStream does not mention that data is being
>> flushed to disk, new segment files were created on the server.
>> Size of these files kept growing even though there was enough Heap
>> available
>> and 856MB Ram was not even used.
>>
>
> With the caveat that I am not an expert and someone may correct me, I'll
> offer this:  It's been my experience that Solr will write the files that
> constitute stored fields as soon as they are available, because that
> information is always the same and nothing will change in those files based
> on the next chunk of data.
>
> Thanks,
> Shawn
>
>


Re: DataImportHandler using new connection on each query

2011-09-02 Thread eks dev
I am not sure if the current version has this, but DIH used to reload
connections after some idle time:

// If the connection has been idle for longer than CONN_TIME_OUT,
// open a fresh connection, close the old one, and hand back the new one.
if (currTime - connLastUsed > CONN_TIME_OUT) {
    synchronized (this) {
        Connection tmpConn = factory.call();
        closeConnection();
        connLastUsed = System.currentTimeMillis();
        return conn = tmpConn;
    }
}

Where CONN_TIME_OUT = 10 seconds.



On Fri, Sep 2, 2011 at 12:36 AM, Chris Hostetter
 wrote:
>
> : However, I tested this against a slower SQL Server and I saw
> : dramatically worse results. Instead of re-using their database, each of
> : the sub-entities is recreating a connection each time the query runs.
>
> are you seeing any specific errors logged before these new connections are
> created?
>
> I don't *think* there's anything in the DIH JDBC/SQL code that causes it
> : to time out existing connections. Instead, I tested this against a slower SQL Server
> specific to the JDBC Driver you are using?
>
> Or maybe you are using the DIH "threads" option along with a JNDI/JDBC
> based pool of connections that is configured to create new Connections on
> demand, and with the fast DB it can reuse them but on the slow DB it does
> enough stuff in parallel to keep asking for new connections to be created?
>
>
> If it's DIH creating new connections over and over then i'm pretty sure
> you should see an INFO level log message like this for each connection...
>
>        LOG.info("Creating a connection for entity "
>                + context.getEntityAttribute(DataImporter.NAME) + " with URL: "
>                + url);
>
> ...are those messages different against your fast DB and your slow DB?
>
> -Hoss
>


Re: DataImportHandler using new connection on each query

2011-09-02 Thread eks dev
Take care: "running 10 hours" != "idling 10 seconds" and trying again.
Those are different cases.

It is not dropping *used* connections (good to know it works that
well, thanks for reporting!), just not reusing connections that have been
idle for more than 10 seconds.



On Fri, Sep 2, 2011 at 10:26 PM, Gora Mohanty  wrote:
> On Sat, Sep 3, 2011 at 1:38 AM, Shawn Heisey  wrote:
> [...]
>> I use DIH with MySQL.  When things are going well, a full rebuild will leave
>> connections open and active for over two hours.  This is the case with
>> 1.4.0, 1.4.1, 3.1.0, and 3.2.0.  Due to some kind of problem on the database
>> server, last night I had a rebuild going for more than 11 hours with no
>> problems, verified from the processlist on the server.
>
> Will second that. Have had DIH connections open to both
> mysql, and MS-SQL for 8-10h. Dropped connections could
> be traced to network issues, or some other exception.
>
> Regards,
> Gora
>


Re: DataImportHandler using new connection on each query

2011-09-02 Thread eks dev
Watch out: "running 10 hours" != "idling 10 seconds" and trying again.
Those are different cases.

It is not dropping *used* connections (good to know it works that
well, thanks for reporting!), just not reusing connections that have been
idle for more than 10 seconds.



On Fri, Sep 2, 2011 at 10:26 PM, Gora Mohanty  wrote:
> On Sat, Sep 3, 2011 at 1:38 AM, Shawn Heisey  wrote:
> [...]
>> I use DIH with MySQL.  When things are going well, a full rebuild will leave
>> connections open and active for over two hours.  This is the case with
>> 1.4.0, 1.4.1, 3.1.0, and 3.2.0.  Due to some kind of problem on the database
>> server, last night I had a rebuild going for more than 11 hours with no
>> problems, verified from the processlist on the server.
>
> Will second that. Have had DIH connections open to both
> mysql, and MS-SQL for 8-10h. Dropped connections could
> be traced to network issues, or some other exception.
>
> Regards,
> Gora
>


Which Solr / Lucene directory for ramfs?

2011-09-16 Thread eks dev
Probably a stupid question:

Which Directory implementation is best suited for an index
mounted on ramfs/tmpfs? I guess plain old FSDirectory (or mmap/nio)?


what is the default value of omitNorms and termVectors in solr schema

2011-09-18 Thread roz dev
Hi

As per this document, http://wiki.apache.org/solr/FieldOptionsByUseCase,
omitNorms and termVectors have to be "explicitly" specified in some cases.

I am wondering what is the default value of these settings if solr schema
definition does not state them.

*Example:*

[field definition stripped by the list archive]

In the above case, will Solr create norms for this field, and a term vector as
well?

Any ideas?

Thanks
Saroj


cache invalidation in slaves

2011-09-20 Thread roz dev
Hi All

Solr has different types of caches, such as filterCache, queryResultCache and
documentCache.
I know that if a commit is done then a new searcher is opened and new caches
are built. And this makes sense.

What happens when commits are happening on the master and slaves are pulling
all the delta updates?

Do slaves throw away their caches and rebuild them every time a new delta
index update is downloaded to the slave?


Thanks
Saroj


q and fq in solr 1.4.1

2011-09-20 Thread roz dev
Hi All

I am sure that the q vs. fq question has been answered several times.

But I still have a question which I would like answered:

If we have a Solr query like this:

q=*&fq=field_1:XYZ&fq=field_2:ABC&sortBy=field_3+asc

how does SolrIndexSearcher fire the query in 1.4.1?

Will it fire the query against the whole index first (because q=*) and then
filter the results against field_1 and field_2, or is it done in parallel?

And if we say to get only 20 rows at a time, will Solr do the following:
1) get all the docs (because q is set to *) and sort them by field_3
2) then filter the results by field_1 and field_2

Or will it apply the sorting after doing the filter?

Please let me know how Solr 1.4.1 works.

Thanks
Saroj


Production Issue: SolrJ client throwing this error even though field type is not defined in schema

2011-09-21 Thread roz dev
Hi All

We are getting this error in our Production Solr Setup.

Message: Element type "t_sort" must be followed by either attribute
specifications, ">" or "/>".
Solr version is 1.4.1

Stack trace indicates that solr is returning malformed document.


Caused by: org.apache.solr.client.solrj.SolrServerException: Error
executing query
at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
at 
com.gap.gid.search.impl.SearchServiceImpl.executeQuery(SearchServiceImpl.java:232)
... 15 more
Caused by: org.apache.solr.common.SolrException: parsing error
at 
org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:140)
at 
org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:101)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:481)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
... 17 more
Caused by: javax.xml.stream.XMLStreamException: ParseError at
[row,col]:[3,136974]
Message: Element type "t_sort" must be followed by either attribute
specifications, ">" or "/>".
at 
com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:594)
at 
org.apache.solr.client.solrj.impl.XMLResponseParser.readArray(XMLResponseParser.java:282)
at 
org.apache.solr.client.solrj.impl.XMLResponseParser.readDocument(XMLResponseParser.java:410)
at 
org.apache.solr.client.solrj.impl.XMLResponseParser.readDocuments(XMLResponseParser.java:360)
at 
org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:241)
at 
org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:125)
... 21 more


Re: Production Issue: SolrJ client throwing - Element type must be followed by either attribute specifications, ">" or "/>".

2011-09-22 Thread roz dev
Wanted to update the list with our finding.

We reduced the number of documents being retrieved from Solr and
this error did not appear again.
It might be that, due to the high number of documents, Solr is returning
incomplete documents.

-Saroj


On Wed, Sep 21, 2011 at 12:13 PM, roz dev  wrote:

> Hi All
>
> We are getting this error in our Production Solr Setup.
>
> Message: Element type "t_sort" must be followed by either attribute 
> specifications, ">" or "/>".
> Solr version is 1.4.1
>
> Stack trace indicates that solr is returning malformed document.
>
>
> Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing 
> query
>   at 
> org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
>   at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
>   at 
> com.gap.gid.search.impl.SearchServiceImpl.executeQuery(SearchServiceImpl.java:232)
>   ... 15 more
> Caused by: org.apache.solr.common.SolrException: parsing error
>   at 
> org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:140)
>   at 
> org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:101)
>   at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:481)
>   at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>   at 
> org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
>   ... 17 more
> Caused by: javax.xml.stream.XMLStreamException: ParseError at 
> [row,col]:[3,136974]
> Message: Element type "t_sort" must be followed by either attribute 
> specifications, ">" or "/>".
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:594)
>   at 
> org.apache.solr.client.solrj.impl.XMLResponseParser.readArray(XMLResponseParser.java:282)
>   at 
> org.apache.solr.client.solrj.impl.XMLResponseParser.readDocument(XMLResponseParser.java:410)
>   at 
> org.apache.solr.client.solrj.impl.XMLResponseParser.readDocuments(XMLResponseParser.java:360)
>   at 
> org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:241)
>   at 
> org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:125)
>   ... 21 more
>
>


Update ingest rate drops suddenly

2011-09-24 Thread eks dev
just looking for hints where to look for...

We were testing single threaded ingest rate on solr, trunk version on
atypical collection (a lot of small documents), and we noticed
something we are not able to explain.

Setup:
We use defaults for index settings, Windows 64-bit, JDK 7 U2, on SSD, a
machine with enough memory and 8 cores. The schema has 5 stored fields,
4 of them indexed with no positions and no norms.
Average net document size (optimized index size / number of documents)
is around 100 bytes.

On a test with 40 Mio document:
- we had update ingest rate  on first 4,4Mio documents @  incredible
34k records / second...
- then it dropped, suddenly to 20k records per second and this rate
remained stable (variance 1k) until...
- we hit 13Mio, where ingest rate dropped again really hard, from one
instant in time to another to 10k records per second.

it stayed there until we reached the end @40Mio (slightly reducing, to
ca 9k, but this is not long enough to see trend).

Nothing unusual happening with JVM memory (sawtooth 200-450M, fully
regular). CPU in turn was following the ingest rate trend, indicating
that we were waiting on something. No searches, no commits, nothing.

autoCommit was turned off. Updates were streaming directly from the database.

-
I did not expect something like this, knowing lucene merges in
background. Also, having such sudden drops in ingest rate is an
indication that we are not leaking something (the drop would have been
much more gradual). It is some caches, but why two really significant
drops? 33k/sec to 20k and then to 10k... We would love to keep it @34
k/second :)

I am not really acquainted with the new MergePolicy and flushing
settings, but I suspect this is something there we could tweak.

Could it be Windows is somehow, hmm, quirky with the Solr default
directory on win64/jvm (I think it is MMAP by default)... We did not
saturate IO with such small documents, I guess; it is just a couple
of gigs over 1-2 hours.

All in all, it works well, but are such hard drops in the update ingest rate
normal?

Thanks,
eks.


Re: Update ingest rate drops suddenly

2011-09-25 Thread eks dev
Thanks Otis,
we will look into these issues again, slightly deeper. Network
problems are not likely, but the DB, I do not know; this is a huge select
... we will try to scan the db, without indexing, just to see if it can
sustain it... But gut feeling says, nope, this is not the one.

IO saturation would surprise me, but you never know. Might be very
well that SSD is somehow having problems with this sustained
throughput.

8 Core... no, this was single update thread.

we left default index settings (do not tweak if it works :) -- ramBufferSizeMB
is at the default of 32.

32MB holds a lot of our documents (100 bytes average on-disk size).
Assuming a RAM efficiency of 50% (?), we land at ~100k buffered
documents. Yes, this is kind of smallish, as every ~3 seconds we
fill up the ramBuffer (our Analyzers surprised me with 30k+ records per
second).

Bumping it to 256 will do the job; ~24 seconds should be plenty of "idle" time
for IO-OS-JVM to sort out MMAP issues, if any (Windows was never the MMAP
performance champion when using it from Java, but once you dance
around it, it works OK)...


Max JVM heap on this test was 768m, memory never went above 500m,
using -XX:-UseParallelGC ... this is definitely not a GC problem.

cheers,
eks


On Sun, Sep 25, 2011 at 6:20 AM, Otis Gospodnetic
 wrote:
> eks,
>
> This is clear as day - you're using Winblows!  Kidding.
>
> I'd:
> * watch IO with something like vmstat 2 and see if the rate drops correlate 
> to increased disk IO or IO wait time
> * monitor the DB from which you were pulling the data - maybe the DB or the 
> server that runs it had issues
> * monitor the network over which you pull data from DB
>
> If none of the above reveals the problem I'd still:
> * grab all data you need to index and copy it locally
> * index everything locally
>
> Out of curiosity, how big is your ramBufferSizeMB and your -Xmx?
> And on that 8-core box you have ~8 indexing threads going?
>
> Otis
> 
> Sematext is Hiring -- http://sematext.com/about/jobs.html
>
>
>
>
>>
>>From: eks dev 
>>To: solr-user 
>>Sent: Saturday, September 24, 2011 3:18 PM
>>Subject: Update ingest rate drops suddenly
>>
>>just looking for hints where to look for...
>>
>>We were testing single threaded ingest rate on solr, trunk version on
>>atypical collection (a lot of small documents), and we noticed
>>something we are not able to explain.
>>
>>Setup:
>>We use defaults for index settings, windows 64 bit, jdk 7 U2. on SSD,
>>machine with enough memory and 8 cores.   Schema has 5 stored fields,
>>4 of them indexed no positions no norms.
>>Average net document size (optimized index size / number of documents)
>>is around 100 bytes.
>>
>>On a test with 40 Mio document:
>>- we had update ingest rate  on first 4,4Mio documents @  incredible
>>34k records / second...
>>- then it dropped, suddenly to 20k records per second and this rate
>>remained stable (variance 1k) until...
>>- we hit 13Mio, where ingest rate dropped again really hard, from one
>>instant in time to another to 10k records per second.
>>
>>it stayed there until we reached the end @40Mio (slightly reducing, to
>>ca 9k, but this is not long enough to see trend).
>>
>>Nothing unusual happening with jvm memory ( tooth-saw  200- 450M fully
>>regular). CPU in turn was  following the ingest rate trend, inicating
>>that we were waiting on something. No searches , no commits, nothing.
>>
>>autoCommit was turned off. Updates were streaming directly from the database.
>>
>>-
>>I did not expect something like this, knowing lucene merges in
>>background. Also, having such sudden drops in ingest rate is
>>indicative that we are not leaking something. (drop would have been
>>much more gradual). It is some caches, but why two really significant
>>drops? 33k/sec to 20k and than to 10k... We would love to keep it  @34
>>k/second :)
>>
>>I am not really acquainted with the new MergePolicy and flushing
>>settings, but I suspect this is something there we could tweak.
>>
>>Could it be windows is somehow, hmm, quirky with solr default
>>directory on win64/jvm (I think it is MMAP by default)... We did not
>>saturate IO with such a small documents I guess, It is a just couple
>>of Gig over 1-2 hours.
>>
>>All in all, it works good, but is having such hard update ingest rate
>>drops normal?
>>
>>Thanks,
>>eks.
>>
>>
>>


Re: Update ingest rate drops suddenly

2011-09-26 Thread eks dev
Just to bring closure to this one: we were slurping data from the
wrong DB (hardly a desktop-class machine)...

Solr did not cough on 41 Mio records @ 34k updates/sec, single threaded.
Great!



On Sat, Sep 24, 2011 at 9:18 PM, eks dev  wrote:
> just looking for hints where to look for...
>
> We were testing single threaded ingest rate on solr, trunk version on
> atypical collection (a lot of small documents), and we noticed
> something we are not able to explain.
>
> Setup:
> We use defaults for index settings, windows 64 bit, jdk 7 U2. on SSD,
> machine with enough memory and 8 cores.   Schema has 5 stored fields,
> 4 of them indexed no positions no norms.
> Average net document size (optimized index size / number of documents)
> is around 100 bytes.
>
> On a test with 40 Mio document:
> - we had update ingest rate  on first 4,4Mio documents @  incredible
> 34k records / second...
> - then it dropped, suddenly to 20k records per second and this rate
> remained stable (variance 1k) until...
> - we hit 13Mio, where ingest rate dropped again really hard, from one
> instant in time to another to 10k records per second.
>
> It stayed there until we reached the end @40Mio (slightly reducing, to
> ca 9k, but this is not long enough to see a trend).
>
> Nothing unusual happening with JVM memory (saw-tooth 200-450M, fully
> regular). CPU in turn was following the ingest rate trend, indicating
> that we were waiting on something. No searches, no commits, nothing.
>
> autoCommit was turned off. Updates were streaming directly from the database.
>
> -
> I did not expect something like this, knowing Lucene merges in the
> background. Also, having such sudden drops in ingest rate indicates
> that we are not leaking something (a leak would have caused a much
> more gradual drop). It is some caches, but why two really significant
> drops? 33k/sec to 20k and then to 10k... We would love to keep it at
> 34k/second :)
>
> I am not really acquainted with the new MergePolicy and flushing
> settings, but I suspect this is something there we could tweak.
>
> Could it be that Windows is somehow, hmm, quirky with the Solr default
> directory on win64/JVM (I think it is MMAP by default)... We did not
> saturate IO with such small documents, I guess; it is just a couple
> of GB over 1-2 hours.
>
> All in all, it works well, but are such hard drops in the update
> ingest rate normal?
>
> Thanks,
> eks.
>


Re: Production Issue: SolrJ client throwing this error even though field type is not defined in schema

2011-09-30 Thread roz dev
This issue disappeared when we reduced the number of documents which were
being returned from Solr.

Looks to be some issue with Tomcat or Solr, returning truncated responses.

-Saroj


On Sun, Sep 25, 2011 at 9:21 AM,  wrote:

> If I had to give a gentle nudge, I would ask you to validate your schema
> XML file. You can do so by finding any W3C XML validator website and
> just copy-pasting the text there to find out where it's malformed.
>
> Sent from my iPhone
>
> On Sep 24, 2011, at 2:01 PM, Erick Erickson 
> wrote:
>
> > You might want to review:
> >
> > http://wiki.apache.org/solr/UsingMailingLists
> >
> > There's really not much to go on here.
> >
> > Best
> > Erick
> >
> > On Wed, Sep 21, 2011 at 12:13 PM, roz dev  wrote:
> >> Hi All
> >>
> >> We are getting this error in our Production Solr Setup.
> >>
> >> Message: Element type "t_sort" must be followed by either attribute
> >> specifications, ">" or "/>".
> >> Solr version is 1.4.1
> >>
> >> Stack trace indicates that solr is returning malformed document.
> >>
> >>
> >> Caused by: org.apache.solr.client.solrj.SolrServerException: Error
> >> executing query
> >>at
> org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
> >>at
> org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
> >>at
> com.gap.gid.search.impl.SearchServiceImpl.executeQuery(SearchServiceImpl.java:232)
> >>... 15 more
> >> Caused by: org.apache.solr.common.SolrException: parsing error
> >>at
> org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:140)
> >>at
> org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:101)
> >>at
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:481)
> >>at
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
> >>at
> org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
> >>... 17 more
> >> Caused by: javax.xml.stream.XMLStreamException: ParseError at
> >> [row,col]:[3,136974]
> >> Message: Element type "t_sort" must be followed by either attribute
> >> specifications, ">" or "/>".
> >>at
> com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:594)
> >>at
> org.apache.solr.client.solrj.impl.XMLResponseParser.readArray(XMLResponseParser.java:282)
> >>at
> org.apache.solr.client.solrj.impl.XMLResponseParser.readDocument(XMLResponseParser.java:410)
> >>at
> org.apache.solr.client.solrj.impl.XMLResponseParser.readDocuments(XMLResponseParser.java:360)
> >>at
> org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:241)
> >>at
> org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:125)
> >>... 21 more
> >>
>


Re: capacity planning

2011-10-11 Thread eks dev
Re. "I have little experience with VM servers for search."

We had a huge performance penalty on VMs; CPU was the bottleneck.
We couldn't freely run measurements to figure out what the problem really
was (hosting was contracted by the customer...), but it was something pretty
scary, kind of 8-10 times slower than the advertised dedicated equivalent.
Whatever it's worth, if you can afford it, keep Lucene away from it. Lucene
is a highly optimized machine, and someone twiddling with context switches is
not welcome there.

Of course, if you are IO bound, it makes no big difference anyhow.

This is just my singular experience; it might be that the hosting team did not
configure it right, or something has changed in the meantime (the experience is
~4 years old), but we burnt our fingers so hard that I still remember it.




On Tue, Oct 11, 2011 at 7:49 PM, Toke Eskildsen wrote:

> Travis Low [t...@4centurion.com] wrote:
> > Toke, thanks.  Comments embedded (hope that's okay):
>
> Inline or top-posting? Long discussion, but for mailing lists I clearly
> prefer the former.
>
> [Toke: Estimate characters]
>
> > Yes.  We estimate each of the 23K DB records has 600 pages of text for
> the
> > combined documents, 300 words per page, 5 characters per word.  Which
> > coincidentally works out to about 21GB, so good guessing there. :)
>
> Heh. Lucky Guess indeed, although the factors were off. Anyway, 21GB does
> not sound scary at all.
>
> > The way it works is we have researchers modifying the DB records during
> the
> > day, and they may upload documents at that time.  We estimate 50-60
> uploads
> > throughout the day.  If possible, we'd like to index them as they are
> > uploaded, but if that would negatively affect the search, then we can
> > rebuild the index nightly.
> >
> > Which is better?
>
> The analyzing part is only CPU and you're running multi-core so as long as
> you only analyze using one thread you're safe there. That leaves us with
> I/O: Even for spinning drives, a daily load of just 60 updates of 1MB of
> extracted text each shouldn't have any real effect - with the usual caveat
> that large merges should be avoided by either optimizing at night or
> tweaking merge policy to avoid large segments. With such a relatively small
> index, (re)opening and warm up should be painless too.
>
> Summary: 300GB is a fair amount of data and takes some power to crunch.
> However, in the Solr/Lucene end your index size and your update rates are
> nothing to worry about. Usual caveat for advanced use and all that applies.
>
> [Toke: i7, 8GB, 1TB spinning, 256GB SSD]
>
> > We have a very beefy VM server that we will use for benchmarking, but
> your
> > specs provide a starting point.  Thanks very much for that.
>
> I have little experience with VM servers for search. Although we use a lot
> of virtual machines, we use dedicated machines for our searchers, primarily
> to ensure low latency for I/O. They might be fine for that too, but we
> haven't tried it yet.
>
> Glad to be of help,
> Toke


Index format difference between 4.0 and 3.4

2011-11-14 Thread roz dev
Hi All,

We are using Solr 1.4.1 in production and are considering an upgrade to
newer version.

It seems that Solr 3.x requires a complete rebuild of index as the format
seems to have changed.

Is Solr 4.0 index file format compatible with Solr 3.x format?

Please advise.

Thanks
Saroj


codec="Pulsing" per field broken?

2011-12-11 Thread eks dev
On the latest trunk, my schema.xml with a field type declaration
containing //codec="Pulsing"// does not work any more (it throws an
exception from FieldType). It used to work with an approx. one-month-old
trunk version.

I didn't dig deeper; it can be that the old schema.xml was broken and
worked by accident.



org.apache.solr.common.SolrException: Plugin Initializing failure for
[schema.xml] fieldType
at 
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:183)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:368)
at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:107)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:651)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:409)
at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:243)
at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
at 
org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
at 
org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
at 
org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
at org.mortbay.jetty.Server.doStart(Server.java:224)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at runjettyrun.Bootstrap.main(Bootstrap.java:86)
Caused by: java.lang.RuntimeException: schema fieldtype
storableCity(X.StorableField) invalid
arguments:{codec=Pulsing}
at org.apache.solr.schema.FieldType.setArgs(FieldType.java:177)
at 
org.apache.solr.schema.FieldTypePluginLoader.init(FieldTypePluginLoader.java:127)
at 
org.apache.solr.schema.FieldTypePluginLoader.init(FieldTypePluginLoader.java:43)
at 
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:180)
... 18 more


Re: codec="Pulsing" per field broken?

2011-12-11 Thread eks dev
Thanks Robert,

I've missed LUCENE-3490... Awesome!

On Sun, Dec 11, 2011 at 6:37 PM, Robert Muir  wrote:
> On Sun, Dec 11, 2011 at 11:34 AM, eks dev  wrote:
>> on the latest trunk, my schema.xml with field type declaration
>> containing //codec="Pulsing"// does not work any more (throws
>> exception from FieldType). It used to work wit approx. a month old
>> trunk version.
>>
>> I didn't dig deeper, can be that the old schema.xml  was broken and
>> worked by accident.
>>
>
> Hi,
>
> The short answer is, you should change this to //postingsFormat="Pulsing40"//
> See 
> http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/schema_codec.xml
>
> The longer answer is that the Codec API in lucene trunk was extended recently:
> https://issues.apache.org/jira/browse/LUCENE-3490
>
> Previously "Codec" only allowed you to customize the format of the
> postings lists.
> We are working to have it cover the entire index segment (at the
> moment nearly everything except deletes and encoding of compound files
> can be customized).
>
> For example, look at SimpleText now:
> http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/codecs/simpletext/
> As you see, it now implements plain-text stored fields, term vectors,
> norms, segments file, fieldinfos, etc.
> See Codec.java 
> (http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/codecs/Codec.java)
> or LUCENE-3490 for more details.
>
> Because of this, what you had before is now just "PostingsFormat", as
> Pulsing is just a wrapper around a postings implementation that
> inlines low frequency terms.
> Lucene's default Codec uses a per-field postings setup, so you can
> still configure the postings per-field, just differently.
>
> --
> lucidimagination.com
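For reference, a minimal schema.xml sketch of the per-field postings format Robert
describes above; the field and type names here are hypothetical, and "Pulsing40"
assumes a trunk/4.0-era build:

<!-- hedged sketch: postingsFormat replaces the old codec="Pulsing" attribute;
     field and type names are made up for illustration -->
<fieldType name="string_pulsing" class="solr.StrField" postingsFormat="Pulsing40"/>
<field name="city" type="string_pulsing" indexed="true" stored="true"/>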


hot deploy of newer version of solr schema in production

2012-01-23 Thread roz dev
Hi All,

I need community's feedback about deploying newer versions of solr schema
into production while existing (older) schema is in use by applications.

How do people perform these things? What has been the learning of people
about this.

Any thoughts are welcome.

Thanks
Saroj


Re: filter query from external list of Solr unique IDs

2010-10-16 Thread eks dev
If your index is read-only in production, can you add a mapping
unique_id -> Lucene docId in your kv store and build filters externally?
That would make the unique key obsolete in your production index, as you would
work at the Lucene doc-id level.

That way, you push the problem off to the update/optimize phase. The ugly part is a
lot of updates on your kv-store...

I am not really familiar with Solr, but working directly with Lucene this is
doable, even having a parallel index that has the unique ID as a stored field, and
another index with the indexed fields on the update master, and then having only
the index with the indexed fields in production.





On Fri, Oct 15, 2010 at 8:59 PM, Burton-West, Tom wrote:

> Hi Jonathan,
>
> The advantages of the obvious approach you outline are that it is simple,
> it fits in to the existing Solr model, it doesn't require any customization
> or modification to Solr/Lucene java code.  Unfortunately, it does not scale
> well.  We originally tried just what you suggest for our implementation of
> Collection Builder.  For a user's personal collection we had a table that
> maps the collection id to the unique Solr ids.
> Then when they wanted to search their collection, we just took their search
> and added a filter query like fq=(id:1 OR id:2 OR ...).  I seem to
> remember running into a limit on the number of OR clauses allowed. Even if
> you can set that limit larger, there are a number of efficiency issues.
>
> We ended up constructing a separate Solr index where we have a multi-valued
> collection number field. Unfortunately, until incremental field updating
> gets implemented, this means that every time someone adds a document to a
> collection, the entire document (including 700KB of OCR) needs to be
> re-indexed just to update the collection number field. This approach has
> allowed us to scale up to a total of something under 100,000 documents, but
> we don't think we can scale it much beyond that for various reasons.
>
> I was actually thinking of some kind of custom Lucene/Solr component that
> would for example take a query parameter such as &lookitUp=123 and the
> component might do a JDBC query against a database or kv store and return
> results in some form that would be efficient for Solr/Lucene to process. (Of
> course this assumes that a JDBC query would be more efficient than just
> sending a long list of ids to Solr).  The other part of the equation is
> mapping the unique Solr ids to internal Lucene ids in order to implement a
> filter query.   I was wondering if something like the unique id to Lucene id
> mapper in zoie might be useful or if that is too specific to zoie. So this
> may be totally off-base, since I haven't looked at the zoie code at all yet.
>
> In our particular use case, we might be able to build some kind of
> in-memory map after we optimize an index and before we mount it in
> production. In our workflow, we update the index and optimize it before we
> release it and once it is released to production there is no
> indexing/merging taking place on the production index (so the internal
> Lucene ids don't change.)
>
> Tom
>
>
>
> -Original Message-
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Friday, October 15, 2010 1:07 PM
> To: solr-user@lucene.apache.org
> Subject: RE: filter query from external list of Solr unique IDs
>
> Definitely interested in this.
>
> The naive obvious approach would be just putting all the ID's in the query.
> Like fq=(id:1 OR id:2 OR).  Or making it another clause in the 'q'.
>
> Can you outline what's wrong with this approach, to make it more clear
> what's needed in a solution?
> 
>


Re: can we configure spellcheck to be invoked after request processing?

2013-03-04 Thread roz dev
James,

You are right. I was setting up spell checker incorrectly.

It works correctly as you described.

Spell checker is invoked after the query component and it does not stop
Solr from executing query.

Thanks for correcting me.
Saroj
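For reference, a minimal solrconfig.xml sketch of the "last-components" wiring James
describes in the reply quoted below; the handler name and spellcheck component name
are the stock ones, but treat the snippet as an illustrative assumption, not the
exact production config:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">default</str>
  </lst>
  <!-- spellcheck runs after the query component, so it never changes the result set -->
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>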





On Fri, Mar 1, 2013 at 7:30 AM, Dyer, James wrote:

> I'm a little confused here because if you are searching q=jeap OR denim ,
> then you should be getting both documents back.  Having spellcheck
> configured does not affect your search results at all.  Having it in your
> request will sometimes result in spelling suggestions, usually if one or
> more terms you queried is not in the index.  But if all of your query terms
> are optional then you need only have 1 term match anything to get results.
>  You should get the same results regardless of whether or not you have
> spellcheck in the request.
>
> While spellcheck does not affect your query results, the results do affect
> spellcheck.  This is why you should put spellcheck in the "last-components"
> section of your request handler configuration.  This ensures that the query
> is run before spellcheck.
>
> James Dyer
> Ingram Content Group
> (615) 213-4311
>
>
> -Original Message-
> From: roz dev [mailto:rozde...@gmail.com]
> Sent: Thursday, February 28, 2013 6:33 PM
> To: solr-user@lucene.apache.org
> Subject: can we configure spellcheck to be invoked after request
> processing?
>
> Hi All,
> I may be asking a stupid question but please bear with me.
>
> Is it possible to configure Spell check to be invoked after Solr has
> processed the original query?
>
> My use case is :
>
> I am using DirectSpellChecker and have a document which has "Denim" as a
> term and there is another document which has "Jeap".
>
> I am issuing a Search as "Jean" or "Denim"
>
> I am finding that this Solr query is giving me ZERO results and suggesting
> "Jeap" as an alternative.
>
> I want Solr to try to run the query for "Jean" or "Denim" and if there are
> no results found then only suggest "Jeap" as an alternative
>
> Is this doable in Solr?
>
> Any suggestions.
>
> -Saroj
>
>


Can we manipulate termfreq to count as 1 for multiple matches?

2013-03-13 Thread roz dev
Hi All

I am wondering if there is a way to make the term frequency of a certain field
count as 1, even if there are multiple matches in that document.

Use Case is:

Let's say that I have a document with 2 fields

- Name and
- Description

And, there is a document with data like this

Document_1
Name = Blue Jeans
Description = This jeans is very soft.  Jeans is pretty nice.

Now, If I Search for "Jeans" then "Jeans" is found in 2 places in
Description field.

Term Frequency for Description is 2

I want Solr to count term frequency for Description as 1 even if "Jeans" is
found multiple times in this field.

For all other fields, i do want to get the term frequency, as it is.

Is this doable in Solr with any of the functions?

Any inputs are welcome.

Thanks
Saroj
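One possible sketch, assuming the Solr 4.x relevance functions termfreq() and map()
are available; this only produces a capped 0/1 value you can expose or plug into a
function query, it does not change the similarity's own tf, and whether it fits the
actual ranking needs here is untested:

# hedged sketch: collapse any non-zero term frequency in "description" to 1
# and return it as a pseudo-field for inspection
q=jeans
fl=id,score,capped_tf:map(termfreq(description,'jeans'),1,1000000,1)

How (or whether) to feed such a capped value back into scoring is a separate design
choice.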


Re: hot deploy of newer version of solr schema in production

2012-01-31 Thread roz dev
Thanks Jan for your inputs.

I am keen to know how people keep running live sites while there
is a breaking change which calls for complete re-indexing.
We want to build a new index, with the new schema (it may take a couple of
hours), without impacting the live e-commerce site.

any thoughts are welcome

Thanks
Saroj


On Tue, Jan 24, 2012 at 12:21 AM, Jan Høydahl  wrote:

> Hi,
>
> To be able to do a true hot deploy of newer schema without reindexing, you
> must carefully see to that none of your changes are breaking changes. So
> you should test the process on your development machine and make sure it
> works. Adding and deleting fields would work, but not changing the
> field-type or analysis of an existing field. Depending on from/to version,
> you may want to keep the old schema-version number.
>
> The process is:
> 1. Deploy the new schema, including all dependencies such as dictionaries
> 2. Do a RELOAD CORE http://wiki.apache.org/solr/CoreAdmin#RELOAD
>
> My preference is to do a more thorough upgrade of schema including new
> functionality and breaking changes, and then do a full reindex. The
> exception is if my index is huge and the reason for Solr upgrade or schema
> change is to fix a bug, not to use new functionality.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 24. jan. 2012, at 01:51, roz dev wrote:
>
> > Hi All,
> >
> > I need community's feedback about deploying newer versions of solr schema
> > into production while existing (older) schema is in use by applications.
> >
> > How do people perform these things? What has been the learning of people
> > about this.
> >
> > Any thoughts are welcome.
> >
> > Thanks
> > Saroj
>
>
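For reference, the RELOAD step Jan mentions can be triggered through the CoreAdmin
HTTP API; the host, port, and core name below are hypothetical:

curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1'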


reader/searcher refresh after replication (commit)

2012-02-21 Thread eks dev
Hi all,
I am a bit confused with IndexSearcher refresh lifecycles...
In a master/slave setup, I override the postCommit listener on the slave
(Solr trunk version) to read some user information stored in
userCommitData on the master.

--
@Override
public final void postCommit() {
  // This returns "stale" information that was present before
  // replication finished
  RefCounted<SolrIndexSearcher> refC = core.getNewestSearcher(true);
  Map<String, String> userData =
      refC.get().getIndexReader().getIndexCommit().getUserData();
}

I expected core.getNewestSearcher(true); to return a refreshed
SolrIndexSearcher, but it didn't.

When is this information going to be refreshed to the state from the
replicated index? I repeat, this is a postCommit listener.

What is the way to get the information from the last commit point?

Maybe like this?
core.getDeletionPolicy().getLatestCommit().getUserData();

Or do I need to explicitly open a new searcher (doesn't Solr do this behind
the scenes?):
core.openNewSearcher(false, false)

Not critical, since reopening a new searcher works, but I would like to
understand these lifecycles: when does Solr load the latest commit point?

Thanks, eks
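A minimal sketch of the "read user data from the latest commit point" option
mentioned above; it assumes trunk-era Solr/Lucene APIs and a hypothetical key name,
and IOException handling is omitted:

// hedged sketch, e.g. inside a listener; the key name is made up
IndexCommit latest = core.getDeletionPolicy().getLatestCommit();
Map<String, String> userData = latest.getUserData(); // may throw IOException
String updateMark = userData.get("incremental.update.mark");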


Re: reader/searcher refresh after replication (commit)

2012-02-21 Thread eks dev
Thanks Mark,
Hmm, I would like to have this information asap, not to wait until the
first search gets executed (which depends on the user). Is Solr going to create
a new searcher as part of the "replication transaction"?

Just to make it clear why I need it...
I have a simple master/many-slaves config where the master does "batch"
updates in big chunks (things users can wait longer to see on the search
side), but the slaves work in soft-commit mode internally, where I permit
them to run away slightly from the master. In order to know where the
"incremental update" should start, I read it from UserData.

Basically, ideally, before commit (after successful replication is
finished) ends, I would like to read in these counters to let
"incremental update" run from the right point...

I need to prevent updating the "replicated index" before I read this
information (duplicates can appear). Are there any "IndexWriter"
listeners around?


Thanks again,
eks.



On Tue, Feb 21, 2012 at 8:03 PM, Mark Miller  wrote:
> Post commit calls are made before a new searcher is opened.
>
> Might be easier to try to hook in with a new searcher listener?
>
> On Feb 21, 2012, at 8:23 AM, eks dev wrote:
>
>> Hi all,
>> I am a bit confused with IndexSearcher refresh lifecycles...
>> In a master slave setup, I override postCommit listener on slave
>> (solr trunk version) to read some user information stored in
>> userCommitData on master
>>
>> --
>> @Override
>> public final void postCommit() {
>> // This returnes "stale" information that was present before
>> replication finished
>> RefCounted refC = core.getNewestSearcher(true);
>> Map userData =
>> refC.get().getIndexReader().getIndexCommit().getUserData();
>> }
>> 
>> I expected core.getNewestSearcher(true); to return refreshed
>> SolrIndexSearcher, but it didn't
>>
>> When is this information going to be refreshed to the status from the
>> replicated index, I repeat this is postCommit listener?
>>
>> What is the way to get the information from the last commit point?
>>
>> Maybe like this?
>> core.getDeletionPolicy().getLatestCommit().getUserData();
>>
>> Or I need to explicitly open new searcher (isn't solr does this behind
>> the scenes?)
>> core.openNewSearcher(false, false)
>>
>> Not critical, reopening new searcher works, but I would like to
>> understand these lifecycles, when solr loads latest commit point...
>>
>> Thanks, eks
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>


Re: reader/searcher refresh after replication (commit)

2012-02-21 Thread eks dev
And drinks on me to those who decoupled implicit commit from close...
this was a tricky trap.

On Tue, Feb 21, 2012 at 9:10 PM, eks dev  wrote:
> Thanks Mark,
> Hmm, I would like to have this information asap, not to wait until the
> first search gets executed (depends on user) . Is solr going to create
> new searcher as a part of "replication transaction"...
>
> Just to make it clear why I need it...
> I have simple master, many slaves config where master does "batch"
> updates in big chunks (things user can wait longer to see on search
> side) but slaves work in soft commit mode internally where I permit
> them to run away slightly from master in order to know where
> "incremental update" should start, I read it from UserData 
>
> Basically, ideally, before commit (after successful replication is
> finished) ends, I would like to read in these counters to let
> "incremental update" run from the right point...
>
> I need to prevent updating "replicated index" before I read this
> information (duplicates can appear) are there any "IndexWriter"
> listeners around?
>
>
> Thanks again,
> eks.
>
>
>
> On Tue, Feb 21, 2012 at 8:03 PM, Mark Miller  wrote:
>> Post commit calls are made before a new searcher is opened.
>>
>> Might be easier to try to hook in with a new searcher listener?
>>
>> On Feb 21, 2012, at 8:23 AM, eks dev wrote:
>>
>>> Hi all,
>>> I am a bit confused with IndexSearcher refresh lifecycles...
>>> In a master slave setup, I override postCommit listener on slave
>>> (solr trunk version) to read some user information stored in
>>> userCommitData on master
>>>
>>> --
>>> @Override
>>> public final void postCommit() {
>>> // This returnes "stale" information that was present before
>>> replication finished
>>> RefCounted refC = core.getNewestSearcher(true);
>>> Map userData =
>>> refC.get().getIndexReader().getIndexCommit().getUserData();
>>> }
>>> 
>>> I expected core.getNewestSearcher(true); to return refreshed
>>> SolrIndexSearcher, but it didn't
>>>
>>> When is this information going to be refreshed to the status from the
>>> replicated index, I repeat this is postCommit listener?
>>>
>>> What is the way to get the information from the last commit point?
>>>
>>> Maybe like this?
>>> core.getDeletionPolicy().getLatestCommit().getUserData();
>>>
>>> Or I need to explicitly open new searcher (isn't solr does this behind
>>> the scenes?)
>>> core.openNewSearcher(false, false)
>>>
>>> Not critical, reopening new searcher works, but I would like to
>>> understand these lifecycles, when solr loads latest commit point...
>>>
>>> Thanks, eks
>>
>> - Mark Miller
>> lucidimagination.com
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>


Re: reader/searcher refresh after replication (commit)

2012-02-22 Thread eks dev
Yes, I consciously let my slaves run away from the master in order to
reduce update latency, but every now and then they sync up with the master,
which does the heavy lifting.

The price you pay is that slaves do not see the same documents as the
master, but this is the case anyhow with replication. In my setup a
slave may go ahead of the master with updates; this delta gets zeroed
after replication and the game starts again.

What you have to take into account with this is a very small time window
where you may "go back in time" on slaves (not seeing documents that
were already there), but we are talking about seconds and a couple out
of 200Mio documents (only those documents that were soft-committed on
the slave during replication, i.e. between the commit on the master and
the postCommit on the slave).

Why do you think something is strange here?

> What are you expecting a BeforeCommitListener could do for you, if one
> would exist?
Why should I be expecting something?

I just need to read the userCommitData as soon as replication is done,
and I am looking for a proper/easy way to do it (postCommitListener is
what I use now).

What makes me slightly nervous are those lifecycle questions, e.g.
when I issue an update command before and after the postCommit event, which
index gets updated: the one just replicated or the one that was there
just before replication?

There are definitely ways to optimize this, for example forcing the
replication handler to copy only delta files if the index gets updated on
both slave and master (there is already a TODO somewhere on the Solr replication
wiki, I think). Currently the replication handler copies the complete index if this
gets detected...

I am all ears if there are better proposals for low-latency
updates in a multi-server setup...


On Tue, Feb 21, 2012 at 11:53 PM, Em  wrote:
> Eks,
>
> that sounds strange!
>
> Am I getting you right?
> You have a master which indexes batch-updates from time to time.
> Furthermore you got some slaves, pulling data from that master to keep
> them up-to-date with the newest batch-updates.
> Additionally your slaves index own content in soft-commit mode that
> needs to be available as soon as possible.
> In consequence the slaves are not in sync with the master.
>
> I am not 100% certain, but chances are good that Solr's
> replication-mechanism only changes those segments that are not in sync
> with the master.
>
> What are you expecting a BeforeCommitListener could do for you, if one
> would exist?
>
> Kind regards,
> Em
>
> Am 21.02.2012 21:10, schrieb eks dev:
>> Thanks Mark,
>> Hmm, I would like to have this information asap, not to wait until the
>> first search gets executed (depends on user) . Is solr going to create
>> new searcher as a part of "replication transaction"...
>>
>> Just to make it clear why I need it...
>> I have simple master, many slaves config where master does "batch"
>> updates in big chunks (things user can wait longer to see on search
>> side) but slaves work in soft commit mode internally where I permit
>> them to run away slightly from master in order to know where
>> "incremental update" should start, I read it from UserData 
>>
>> Basically, ideally, before commit (after successful replication is
>> finished) ends, I would like to read in these counters to let
>> "incremental update" run from the right point...
>>
>> I need to prevent updating "replicated index" before I read this
>> information (duplicates can appear) are there any "IndexWriter"
>> listeners around?
>>
>>
>> Thanks again,
>> eks.
>>
>>
>>
>> On Tue, Feb 21, 2012 at 8:03 PM, Mark Miller  wrote:
>>> Post commit calls are made before a new searcher is opened.
>>>
>>> Might be easier to try to hook in with a new searcher listener?
>>>
>>> On Feb 21, 2012, at 8:23 AM, eks dev wrote:
>>>
>>>> Hi all,
>>>> I am a bit confused with IndexSearcher refresh lifecycles...
>>>> In a master slave setup, I override postCommit listener on slave
>>>> (solr trunk version) to read some user information stored in
>>>> userCommitData on master
>>>>
>>>> --
>>>> @Override
>>>> public final void postCommit() {
>>>> // This returnes "stale" information that was present before
>>>> replication finished
>>>> RefCounted refC = core.getNewestSearcher(true);
>>>> Map userData =
>>>> refC.get().getIndexReader().getIndexCommit().getUserData();
>>>> }
>>>> 
>>>> I expected core.getNewestSearcher(true); to return refreshed
>>>> 

SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher

2012-02-22 Thread eks dev
We started observing strange failures from ReplicationHandler when we
commit on the master (trunk version, 4-5 days old).
It works sometimes and sometimes not; we didn't dig deeper yet.

Looks like the real culprit hides behind:
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed

Looks familiar to somebody?


120222 154959 SEVERE SnapPull failed
:org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043)
at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source)
at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503)
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348)
at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source)
at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at 
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.store.AlreadyClosedException: this
IndexWriter is closed
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810)
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815)
at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984)
at 
org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254)
at 
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233)
at 
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223)
at 
org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095)
... 15 more


Re: Unusually long data import time?

2012-02-22 Thread eks dev
Devon, you ought to try to update from many threads (I do not know if
DIH can do it, check it), but Lucene does a great job if fed from many
update threads...

It depends on where your time gets lost, but it is usually a) the analysis chain
or b) the database.

If it is a) and your server has spare CPU cores, you can scale at roughly
the number-of-cores rate, as sketched below.
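A rough sketch of feeding Solr from several update threads with SolrJ, since DIH
itself may not parallelize; the SolrJ class names are the 4.x-era ones (3.x used
CommonsHttpSolrServer), and the document source and thread count are made-up
placeholders:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
  public static void main(String[] args) throws Exception {
    final SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
    ExecutorService pool = Executors.newFixedThreadPool(8); // e.g. one thread per core
    for (int t = 0; t < 8; t++) {
      final int threadId = t;
      pool.submit(new Runnable() {
        public void run() {
          try {
            List<SolrInputDocument> batch;
            while ((batch = fetchNextBatch(threadId)) != null) { // placeholder source
              solr.add(batch); // each thread sends its own batches
            }
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    solr.commit(); // commit once at the end instead of per batch
  }

  // placeholder: pull the next chunk of rows from the database, null when done
  static List<SolrInputDocument> fetchNextBatch(int threadId) {
    return null;
  }
}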

On Wed, Feb 22, 2012 at 7:41 PM, Devon Baumgarten
 wrote:
> Ahmet,
>
> I do not. I commented autoCommit out.
>
> Devon Baumgarten
>
>
>
> -Original Message-
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Wednesday, February 22, 2012 12:25 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Unusually long data import time?
>
>> Would it be unusual for an import of 160 million documents
>> to take 18 hours?  Each document is less than 1kb and I
>> have the DataImportHandler using the jdbc driver to connect
>> to SQL Server 2008. The full-import query calls a stored
>> procedure that contains only a select from my target table.
>>
>> Is there any way I can speed this up? I saw recently someone
>> on this list suggested a new user could get all their Solr
>> data imported in under an hour. I sure hope that's true!
>
> Do have autoCommit or autoSoftCommit configured in solrconfig.xml?


dih and solr cloud

2012-02-22 Thread eks dev
Out of curiosity, I am trying to see if the new cloud features can replace what
I use now...

How is this (batch) update forwarding solved at the cloud level?

Imagine a simple one-shard, one-replica case: if I fire up a DIH
update, is this going to be replicated to the replica shard?
If yes, is it:
- sent document by document (network-wise, imagine
100Mio+ update commands going to the replica for big batches)
- somehow batched into "packages" to reduce load
- distributed at the index level somehow



This is an important case, covered today with master/slave Solr replication, but
it is not mentioned at http://wiki.apache.org/solr/SolrCloud


Re: SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher

2012-02-22 Thread eks dev
thanks Mark, I will give it a go and report back...

On Thu, Feb 23, 2012 at 1:31 AM, Mark Miller  wrote:
> Looks like an issue around replication IndexWriter reboot, soft commits and 
> hard commits.
>
> I think I've got a workaround for it:
>
> Index: solr/core/src/java/org/apache/solr/handler/SnapPuller.java
> ===
> --- solr/core/src/java/org/apache/solr/handler/SnapPuller.java  (revision 
> 1292344)
> +++ solr/core/src/java/org/apache/solr/handler/SnapPuller.java  (working copy)
> @@ -499,6 +499,17 @@
>
>       // reboot the writer on the new index and get a new searcher
>       solrCore.getUpdateHandler().newIndexWriter();
> +      Future[] waitSearcher = new Future[1];
> +      solrCore.getSearcher(true, false, waitSearcher, true);
> +      if (waitSearcher[0] != null) {
> +        try {
> +         waitSearcher[0].get();
> +       } catch (InterruptedException e) {
> +         SolrException.log(LOG,e);
> +       } catch (ExecutionException e) {
> +         SolrException.log(LOG,e);
> +       }
> +     }
>       // update our commit point to the right dir
>       solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, false));
>
> That should allow the searcher that the following commit command prompts to 
> see the *new* IndexWriter.
>
> On Feb 22, 2012, at 10:56 AM, eks dev wrote:
>
>> We started observing strange failures from ReplicationHandler when we
>> commit on master trunk version 4-5 days old.
>> It works sometimes, and sometimes not didn't dig deeper yet.
>>
>> Looks like the real culprit hides behind:
>> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>>
>> Looks familiar to somebody?
>>
>>
>> 120222 154959 SEVERE SnapPull failed
>> :org.apache.solr.common.SolrException: Error opening new searcher
>>    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
>>    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
>>    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043)
>>    at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source)
>>    at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503)
>>    at 
>> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348)
>>    at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source)
>>    at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163)
>>    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>    at 
>> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>>    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>>    at 
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>    at 
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>    at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>    at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>    at java.lang.Thread.run(Thread.java:722)
>> Caused by: org.apache.lucene.store.AlreadyClosedException: this
>> IndexWriter is closed
>>    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810)
>>    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815)
>>    at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984)
>>    at 
>> org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254)
>>    at 
>> org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233)
>>    at 
>> org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223)
>>    at 
>> org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
>>    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095)
>>    ... 15 more
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>


Re: SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher

2012-02-23 Thread eks dev
It looks like it works with the patch; after a couple of hours of testing
under the same conditions I didn't see it happen (without it, it happened
approx. every 15 minutes).

I do not think it will happen again with this patch.

Thanks again, and my respect for your debugging capacity; my bug report
was really thin.


On Thu, Feb 23, 2012 at 8:47 AM, eks dev  wrote:
> thanks Mark, I will give it a go and report back...
>
> On Thu, Feb 23, 2012 at 1:31 AM, Mark Miller  wrote:
>> Looks like an issue around replication IndexWriter reboot, soft commits and 
>> hard commits.
>>
>> I think I've got a workaround for it:
>>
>> Index: solr/core/src/java/org/apache/solr/handler/SnapPuller.java
>> ===
>> --- solr/core/src/java/org/apache/solr/handler/SnapPuller.java  (revision 
>> 1292344)
>> +++ solr/core/src/java/org/apache/solr/handler/SnapPuller.java  (working 
>> copy)
>> @@ -499,6 +499,17 @@
>>
>>       // reboot the writer on the new index and get a new searcher
>>       solrCore.getUpdateHandler().newIndexWriter();
>> +      Future[] waitSearcher = new Future[1];
>> +      solrCore.getSearcher(true, false, waitSearcher, true);
>> +      if (waitSearcher[0] != null) {
>> +        try {
>> +         waitSearcher[0].get();
>> +       } catch (InterruptedException e) {
>> +         SolrException.log(LOG,e);
>> +       } catch (ExecutionException e) {
>> +         SolrException.log(LOG,e);
>> +       }
>> +     }
>>       // update our commit point to the right dir
>>       solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, 
>> false));
>>
>> That should allow the searcher that the following commit command prompts to 
>> see the *new* IndexWriter.
>>
>> On Feb 22, 2012, at 10:56 AM, eks dev wrote:
>>
>>> We started observing strange failures from ReplicationHandler when we
>>> commit on master trunk version 4-5 days old.
>>> It works sometimes, and sometimes not didn't dig deeper yet.
>>>
>>> Looks like the real culprit hides behind:
>>> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>>>
>>> Looks familiar to somebody?
>>>
>>>
>>> 120222 154959 SEVERE SnapPull failed
>>> :org.apache.solr.common.SolrException: Error opening new searcher
>>>    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
>>>    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
>>>    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043)
>>>    at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source)
>>>    at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503)
>>>    at 
>>> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348)
>>>    at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source)
>>>    at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163)
>>>    at 
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>    at 
>>> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>>>    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>>>    at 
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>>    at 
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>>    at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>>    at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>    at java.lang.Thread.run(Thread.java:722)
>>> Caused by: org.apache.lucene.store.AlreadyClosedException: this
>>> IndexWriter is closed
>>>    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810)
>>>    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815)
>>>    at 
>>> org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984)
>>>    at 
>>> org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254)
>>>    at 
>>> org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233)
>>>    at 
>>> org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223)
>>>    at 
>>> org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
>>>    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095)
>>>    ... 15 more
>>
>> - Mark Miller
>> lucidimagination.com
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>


Solr Cloud, Commits and Master/Slave configuration

2012-02-27 Thread roz dev
Hi All,

I am trying to understand features of Solr Cloud, regarding commits and
scaling.


   - If I am using Solr Cloud then do I need to explicitly call commit
   (hard commit)? Or is a soft commit okay, with Solr Cloud doing the job of
   writing to disk?


   - Do we still need to use a Master/Slave setup to scale searching? If we
   have to use a Master/Slave setup, then do I need to issue a hard commit to make
   my changes visible to slaves?
   - If I were to use NRT with a Master/Slave setup and soft commits, then
   will the slaves be able to see changes made on the master with a soft commit?

Any inputs are welcome.

Thanks

-Saroj
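For context on the hard/soft commit distinction the questions above touch on, a
minimal solrconfig.xml sketch assuming a trunk/4.x-era config; the interval values
are made-up examples, not recommendations:

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flushes the index to disk; openSearcher=false keeps it from
       changing search visibility on its own -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: makes new documents searchable without a full flush -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>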


Re: Solr Cloud, Commits and Master/Slave configuration

2012-02-28 Thread eks dev
SolrCloud is going to be great; the NRT feature is a really huge step
forward, as well as central configuration, elasticity ...

The only thing I do not yet understand is the treatment of cases that were
traditionally covered by a Master/Slave setup: batch updates.

If I get it right (?), updates to replicas are sent one by one,
meaning when one server receives an update, it gets forwarded to all
replicas. This is great for the reduced-update-latency case, but I do not
know how it is implemented if you hit it with a "batch" update. This
would cause a huge amount of update commands going to replicas. Not so
good for throughput.

- Master/slave does distribution at the segment level (no need to
replicate analysis, far less network traffic). Good for batch updates.
- SolrCloud distributes per update command (low latency, but chatty, and the
analysis step is done N_Servers times). Good for incremental updates.

Ideally, some sort of "batching" is going to be available in
SolrCloud, and some control over it, e.g. forward batches of 1000
documents (basically keep the update log slightly longer and forward it as
a batch update command). This would still cause duplicate analysis,
but would reduce network traffic.

Please bear in mind, this is more of a question than a statement; I
didn't look at the cloud code. It might be that I am completely wrong here!





On Tue, Feb 28, 2012 at 4:01 AM, Erick Erickson  wrote:
> As I understand it (and I'm just getting into SolrCloud myself), you can
> essentially forget about master/slave stuff. If you're using NRT,
> the soft commit will make the docs visible, you don't need to do a hard
> commit (unlike the master/slave days). Essentially, the update is sent
> to each shard leader and then fanned out into the replicas for that
> leader. All automatically. Leaders are elected automatically. ZooKeeper
> is used to keep the cluster information.
>
> Additionally, SolrCloud keeps a transaction log of the updates, and replays
> them if the indexing is interrupted, so you don't risk data loss the way
> you used to.
>
> There aren't really masters/slaves in the old sense any more, so
> you have to get out of that thought-mode (it's hard, I know).
>
> The code is under pretty active development, so any feedback is
> valuable
>
> Best
> Erick
>
> On Mon, Feb 27, 2012 at 3:26 AM, roz dev  wrote:
>> Hi All,
>>
>> I am trying to understand features of Solr Cloud, regarding commits and
>> scaling.
>>
>>
>>   - If I am using Solr Cloud then do I need to explicitly call commit
>>   (hard-commit)? Or, a soft commit is okay and Solr Cloud will do the job of
>>   writing to disk?
>>
>>
>>   - Do We still need to use  Master/Slave setup to scale searching? If we
>>   have to use Master/Slave setup then do i need to issue hard-commit to make
>>   my changes visible to slaves?
>>   - If I were to use NRT with Master/Slave setup with soft commit then
>>   will the slave be able to see changes made on master with soft commit?
>>
>> Any inputs are welcome.
>>
>> Thanks
>>
>> -Saroj


Re: Solr Cloud, Commits and Master/Slave configuration

2012-03-01 Thread eks dev
Thanks Mark,
Good, this is probably good enough to give it a try. My analyzers are
normally fast, so doing duplicate analysis (at each replica) is
probably not going to cost a lot if there is some decent "batching".

Can this be somehow controlled (depth of this buffer / time till flush,
or some such)? Which "events" trigger this flushing to replicas
(softCommit, commit, something new)?

What I found useful is to always think in terms of incremental (low
latency) and batch (high throughput) updates. I then just need some
knobs to tweak the behavior of this update process.

I would really like to move away from Master/Slave; Cloud makes a lot
of things way simpler for us users ... Will give it a try in a couple
of weeks.

Later we can even think about putting replication at segment level for
"extremely expensive analysis, batch cases", or "initial cluster
seeding" as a replication option. But this is then just an
optimization.

Cheers,
eks


On Thu, Mar 1, 2012 at 5:24 AM, Mark Miller  wrote:
> We actually do currently batch updates - we are being somewhat loose when we 
> say a document at a time. There is a buffer of updates per replica that gets 
> flushed depending on the requests coming through and the buffer size.
>
> - Mark Miller
> lucidimagination.com
>
> On Feb 28, 2012, at 3:38 AM, eks dev wrote:
>
>> SolrCluod is going to be great, NRT feature is really huge step
>> forward, as well as central configuration, elasticity ...
>>
>> The only thing I do not yet understand is treatment of cases that were
>> traditionally covered by Master/Slave setup. Batch update
>>
>> If I get it right (?), updates to replicas are sent one by one,
>> meaning when one server receives update, it gets forwarded to all
>> replicas. This is great for reduced update latency case, but I do not
>> know how is it implemented if you hit it with "batch" update. This
>> would cause huge amount of update commands going to replicas. Not so
>> good for throughput.
>>
>> - Master slave does distribution at segment level, (no need to
>> replicate analysis, far less network traffic). Good for batch updates
>> - SolrCloud does par update command (low latency, but chatty and
>> Analysis step is done N_Servers times). Good for incremental updates
>>
>> Ideally, some sort of "batching" is going to be available in
>> SolrCloud, and some cont roll over it, e.g. forward batches of 1000
>> documents (basically keep update log slightly longer and forward it as
>> a batch update command). This would still cause duplicate analysis,
>> but would reduce network traffic.
>>
>> Please bare in mind, this is more of a question than a statement,  I
>> didn't look at the cloud code. It might be I am completely wrong here!
>>
>>
>>
>>
>>
>> On Tue, Feb 28, 2012 at 4:01 AM, Erick Erickson  
>> wrote:
>>> As I understand it (and I'm just getting into SolrCloud myself), you can
>>> essentially forget about master/slave stuff. If you're using NRT,
>>> the soft commit will make the docs visible, you don't ned to do a hard
>>> commit (unlike the master/slave days). Essentially, the update is sent
>>> to each shard leader and then fanned out into the replicas for that
>>> leader. All automatically. Leaders are elected automatically. ZooKeeper
>>> is used to keep the cluster information.
>>>
>>> Additionally, SolrCloud keeps a transaction log of the updates, and replays
>>> them if the indexing is interrupted, so you don't risk data loss the way
>>> you used to.
>>>
>>> There aren't really masters/slaves in the old sense any more, so
>>> you have to get out of that thought-mode (it's hard, I know).
>>>
>>> The code is under pretty active development, so any feedback is
>>> valuable
>>>
>>> Best
>>> Erick
>>>
>>> On Mon, Feb 27, 2012 at 3:26 AM, roz dev  wrote:
>>>> Hi All,
>>>>
>>>> I am trying to understand features of Solr Cloud, regarding commits and
>>>> scaling.
>>>>
>>>>
>>>>   - If I am using Solr Cloud then do I need to explicitly call commit
>>>>   (hard-commit)? Or, a soft commit is okay and Solr Cloud will do the job 
>>>> of
>>>>   writing to disk?
>>>>
>>>>
>>>>   - Do We still need to use  Master/Slave setup to scale searching? If we
>>>>   have to use Master/Slave setup then do i need to issue hard-commit to 
>>>> make
>>>>   my changes visible to slaves?
>>>>   - If I were to use NRT with Master/Slave setup with soft commit then
>>>>   will the slave be able to see changes made on master with soft commit?
>>>>
>>>> Any inputs are welcome.
>>>>
>>>> Thanks
>>>>
>>>> -Saroj
>
>
>
>
>
>
>
>
>
>
>
>


Re: Solr Design question on spatial search

2012-03-02 Thread Venu Dev
So let's say x=10 miles. Now if I search for San then San Francisco, San Mateo 
should be returned because there is a retail store in San Francisco. But San 
Jose should not be returned because it is more than 10 miles away from San 
Francisco. Had there been a retail store in San Jose then it should be also 
returned when you search for San. I can restrict the queries to a country. 

Thanks,
~Venu
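A hedged sketch of the kind of spatial filter the SpatialSearch pointer quoted
below leads to; the field name, coordinates, and distance are hypothetical, and
this alone does not cover the "expand from matched city to nearby store" logic:

# hypothetical: cities indexed with a LatLonType "location" field, filtered to
# within ~16 km (10 miles) of one known store location
q=name:San*
fq={!geofilt sfield=location pt=37.7749,-122.4194 d=16}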

On Mar 2, 2012, at 5:57 AM, Erick Erickson  wrote:

> I don't see how this works, since your search for San could also return
> San Marino, Italy. Would you then return all retail stores in
> X miles of that city? What about San Salvador de Jujuy, Argentina?
> 
> And even in your example, San would match San Mateo. But should
> the search then return any stores within X miles of San Mateo?
> You have to stop somewhere
> 
> Is there any other information you have that restricts how far to expand the
> search?
> 
> Best
> Erick
> 
> On Thu, Mar 1, 2012 at 4:57 PM, Venu Gmail Dev  
> wrote:
>> I don't think Spatial search will fully fit into this. I have 2 approaches 
>> in mind but I am not satisfied with either one of them.
>> 
>> a) Have 2 separate indexes. First one to store the information about all the 
>> cities and second one to store the retail stores information. Whenever user 
>> searches for a city then I return all the matching cities from first index 
>> and then do a spatial search on each of the matched city in the second 
>> index. But this is too costly.
>> 
>> b) Index only the cities which have a nearby store. Do all the 
>> calculation(s) before indexing the data so that the search is fast. The 
>> problem that I see with this approach is that if a new retail store or a 
>> city is added then I would have to re-index all the data again.
>> 
>> 
>> On Mar 1, 2012, at 7:59 AM, Dirceu Vieira wrote:
>> 
>>> I believe that what you need is spatial search...
>>> 
>>> Have a look a the documention:  http://wiki.apache.org/solr/SpatialSearch
>>> 
>>> On Wed, Feb 29, 2012 at 10:54 PM, Venu Shankar 
>>> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> I have a design question for Solr.
>>>> 
>>>> I work for an enterprise which has a lot of retail stores (approx. 20K).
>>>> These retail stores are spread across the world.  My search requirement is
>>>> to find all the cities which are within x miles of a retail store.
>>>> 
>>>> So lets say if we have a retail Store in San Francisco and if I search for
>>>> "San" then San Francisco, Santa Clara, San Jose, San Juan, etc  should be
>>>> returned as they are within x miles from San Francisco. I also want to rank
>>>> the search results by their distance.
>>>> 
>>>> I can create an index with all the cities in it but I am not sure how do I
>>>> ensure that the cities returned in a search result have a nearby retail
>>>> store. Any suggestions ?
>>>> 
>>>> Thanks,
>>>> Venu,
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Dirceu Vieira Júnior
>>> ---
>>> +47 9753 2473
>>> dirceuvjr.blogspot.com
>>> twitter.com/dirceuvjr
>> 


Re: [SoldCloud] Slow indexing

2012-03-04 Thread eks dev
Hmm, looks like you are facing exactly the phenomenon I asked about.
See my question here:
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/61326

On Sun, Mar 4, 2012 at 9:24 PM, Markus Jelsma
 wrote:
> Hi,
>
> With auto-committing disabled we can now index many millions of documents in
> our test environment on a 5-node cluster with 5 shards and a replication
> factor of 2. The documents are uploaded from map/reduce. No significant
> changes were made to solrconfig and there are no update processors enabled.
> We are using a trunk revision from this weekend.
>
> The indexing speed is well below what we are used to seeing; we can easily
> index 5 million documents on a non-cloud-enabled Solr 3.x instance within
> an hour. What could be going on? There aren't many open TCP connections,
> the number of file descriptors is stable, and I/O is low but CPU time is
> high! Each node has two Solr cores, both writing to their dedicated disk.
>
> The indexing speed is stable: it was slow at the start and still is. It has now
> been running for well over 6 hours and only 3.5 million documents are indexed.
> Another strange detail is that the node receiving all incoming documents
> (we're not yet using a client-side Solr server pool) has a much larger disk
> usage than all the other nodes. This is peculiar, as we expected all replicas to
> be about the same size.
>
> The receiving node has slightly higher CPU than the other nodes but the
> thread dump shows a very large amount of threads of type
> cmdDistribExecutor-8-thread-292260 (295090) with 0-100ms CPU-time. At the
> top of the list these threads all have < 20ms time but near the bottom it
> rises to just over 100ms. All nodes have a couple of http-80-30 (121994)
> threads with very high CPU-time each.
>
> Is this a known issue? Did i miss something? Any ideas?
>
> Thanks
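
(For context, a sketch only with illustrative values: "auto-committing disabled" in a
setup like this usually just means the autoCommit block in solrconfig.xml is absent or
commented out, so nothing is committed -- and nothing becomes visible -- until the
indexing client sends an explicit commit:

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- no <autoCommit> section: commits happen only when the client asks -->
  </updateHandler>
)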


Re: Solr 4.0 and production environments

2012-03-07 Thread eks dev
I have been on lucene as a user since the project started, even before
solr came to life, many many years ago. And I was always using the trunk
version for pretty big customers, and *never* experienced any serious
problems. The worst thing that can happen is to notice a bug somewhere,
and if you have some reasonable testing for your product, you will see
it quickly.
But with this community, *you will definitely not have to wait long to
get it fixed*. Not only will they fix it, they will thank you for
bringing it up!

I can, as an old user, 100% vouch for what Robert said below.

Simply go for it, test your application a bit, and make your users happy.




On Wed, Mar 7, 2012 at 5:55 PM, Robert Muir  wrote:
> On Wed, Mar 7, 2012 at 11:47 AM, Dirceu Vieira  wrote:
>> Hi All,
>>
>> Has anybody started using Solr 4.0 in production environments? Is it stable
>> enough?
>> I'm planning to create a proof of concept using solr 4.0, we have some
>> projects that will gain a lot with features such as near real time search,
>> joins and others, that are available only on version 4.
>>
>> Is it too risky to think of using it right now?
>> What are your thoughts and experiences with that?
>>
>
> In general, we try to keep our 'trunk' (slated to be 4.0) in very
> stable condition.
>
> Really, it should be 'ready-to-release' at any time, of course 4.0 has
> had many drastic changes: both at the Lucene and Solr level.
>
> Before deciding what is stable, you should define stability: is it:
> * api stability: will i be able to upgrade to a more recent snapshot
> of 4.0 without drastic changes to my app?
> * index format stability: will i be able to upgrade to a more recent
> snapshot of 4.0 without re-indexing?
> * correctness: is 4.0 dangerous in some way that it has many bugs
> since much of the code is new?
>
> I think you should limit your concerns to only the first 2 items; as
> far as correctness goes, just look at the tests. For any open source
> project, you can easily judge its quality by its tests: this is a
> fact.
>
> For lucene/solr the testing strategy, in my opinion, goes above and
> beyond many other projects: for example random testing:
> http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011_presentations#dawid_weiss
>
> and the new solr cloud functionality also adds the similar chaosmonkey
> concept on top of this already.
>
> If you are worried about bugs, is a lucene/solr trunk snapshot less
> reliable than even a released version of alternative software? It's an
> interesting question. Look at their tests.
>
> --
> lucidimagination.com


Is there any performance cost of using lots of OR in the solr query

2012-04-04 Thread roz dev
Hi All,

I am working on an application which makes a few Solr calls to get the data.

At a high level, we have a requirement like this:


   - Make first call to Solr, to get the list of products which are
   children of a given category
   - Make 2nd solr call to get product documents based on a list of product
   ids

The 2nd query will look like:

q=document_type:SKU&fq=product_id:(34 OR 45 OR 56 OR 77)

We can have close to 100 product ids in the fq.

Is there a performance cost of doing these Solr calls which have lots of ORs?

As per slide #41 of the presentation "The Seven Deadly Sins of Solr", it is a
bad idea to have these kinds of queries.

http://www.slideshare.net/lucenerevolution/hill-jay-7-sins-of-solrpdf

But it is not made clear why this is bad.
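
(One illustrative detail, as a sketch only: a filter like the one above is cached in the
filterCache as a single entry keyed on the whole id list, so if the id combination rarely
repeats it mostly churns the cache. Since Solr 3.4 the cache local param can keep such a
filter out of the cache, e.g.

q=document_type:SKU&fq={!cache=false}product_id:(34 OR 45 OR 56 OR 77)
)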

Any inputs will be welcome.

Thanks

Saroj


Re: How to do custom sorting in Solr?

2012-06-10 Thread roz dev
Hi All

>
> I have an index which contains a Catalog of Products and Categories, with
> Solr 4.0 from trunk
>
> Data is organized like this:
>
> Category: Books
>
> Sub Category: Programming
>
> Products:
>
> Product # 1,  Price: Regular Sort Order:1
> Product # 2,  Price: Markdown, Sort Order:2
> Product # 3   Price: Regular, Sort Order:3
> Product # 4   Price: Regular, Sort Order:4
> 
> .
> ...
> Product # 100   Price: Regular, Sort Order:100
>
> Sub Category: Fiction
>
> Products:
>
> Product # 1,  Price: Markdown, Sort Order:1
> Product # 2,  Price: Regular, Sort Order:2
> Product # 3   Price: Regular, Sort Order:3
> Product # 4   Price: Markdown, Sort Order:4
> 
> .
> ...
> Product # 70   Price: Regular, Sort Order:70
>
>
> I want to query Solr and sort these products within each
> sub-category in such a way that products which are on markdown are at
> the bottom of the document list, and other products,
> which are at regular price, are sorted as per their sort order in their
> sub-category.
>
> Expected Results are
>
> Category: Books
>
> Sub Category: Programming
>
> Products:
>
> Product # 1,  Price: Regular Sort Order:1
> Product # 2,  Price: Markdown, Sort Order:101
> Product # 3   Price: Regular, Sort Order:3
> Product # 4   Price: Regular, Sort Order:4
> 
> .
> ...
> Product # 100   Price: Regular, Sort Order:100
>
> Sub Category: Fiction
>
> Products:
>
> Product # 1,  Price: Markdown, Sort Order:71
> Product # 2,  Price: Regular, Sort Order:2
> Product # 3   Price: Regular, Sort Order:3
> Product # 4   Price: Markdown, Sort Order:71
> 
> .
> ...
> Product # 70   Price: Regular, Sort Order:70
>
>
> My query is like this:
>
> q=*:*&fq=category:Books
>
> What are the options to implement custom sorting and how do I do it?
>
>
>- Define a Custom Function query?
>- Define a Custom Comparator? Or,
>- Define a Custom Collector?
>
>
> Please let me know the best way to go about it and any pointers to
> customize Solr 4.
>

Thanks
Saroj
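
(As an illustrative sketch only, assuming single-valued sub_category, is_markdown and
sort_order fields exist on the product documents, the simplest reading of the requirement
can be expressed with a plain multi-field sort and no custom code:

q=*:*&fq=category:Books&sort=sub_category asc,is_markdown asc,sort_order asc
)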


Re: How to do custom sorting in Solr?

2012-06-10 Thread roz dev
Thanks Erick for your quick feedback.

When products are assigned to a category or sub-category, they can be
in any order, and the price type can be regular or markdown.
So, regular and markdown products are intermingled as per their assignment, but
I want to sort them in such a way that we
ensure that all the products which are on markdown are at the bottom of the
list.

I can use these multiple sorts but I realize that they are costly in terms
of heap used, as they are using FieldCache.

I have an index with 2M docs and docs are pretty big. So, I don't want to
use them unless there is no other option.

I am wondering if I can define a custom function query which can be like
this:


   - check if product is on the markdown
   - if yes then change its sort order field to be the max value in the
   given sub-category, say 99
   - else, use the sort order of the product in the sub-category

I have been looking at existing function queries but do not have a good
handle on how to make one of my own.

- Another option could be to use a custom sort comparator, but I am not sure
how it works.

Any thoughts?
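
(A minimal sketch of the function-query idea above, assuming a numeric is_markdown field
holding 0 or 1 and sort_order values that stay below 10000: sorting by a computed value
pushes markdown products to the bottom without a custom comparator,

sort=sum(sort_order,product(is_markdown,10000)) asc

since sorting by a function query has been supported from Solr 3.1 on.)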


-Saroj




On Sun, Jun 10, 2012 at 5:02 AM, Erick Erickson wrote:

> Skimming this, two options come to mind:
>
> 1> Simply apply primary, secondary, etc sorts. Something like
>   &sort=subcategory asc,markdown_or_regular desc,sort_order asc
>
> 2> You could also use grouping to arrange things in groups and sort within
>  those groups. This has the advantage of returning some members
>  of each of the top N groups in the result set, which makes it easier
> to
>  get some of each group rather than having to analyze the whole
> list
>
> But your example is somewhat contradictory. You say
> "products which are on markdown, are at
> the bottom of the documents list"
>
> But in your examples, products on "markdown" are intermingled
>
> Best
> Erick
>
> On Sun, Jun 10, 2012 at 3:36 AM, roz dev  wrote:
> > Hi All
> >
> >>
> >> I have an index which contains a Catalog of Products and Categories,
> with
> >> Solr 4.0 from trunk
> >>
> >> Data is organized like this:
> >>
> >> Category: Books
> >>
> >> Sub Category: Programming
> >>
> >> Products:
> >>
> >> Product # 1,  Price: Regular Sort Order:1
> >> Product # 2,  Price: Markdown, Sort Order:2
> >> Product # 3   Price: Regular, Sort Order:3
> >> Product # 4   Price: Regular, Sort Order:4
> >> 
> >> .
> >> ...
> >> Product # 100   Price: Regular, Sort Order:100
> >>
> >> Sub Category: Fiction
> >>
> >> Products:
> >>
> >> Product # 1,  Price: Markdown, Sort Order:1
> >> Product # 2,  Price: Regular, Sort Order:2
> >> Product # 3   Price: Regular, Sort Order:3
> >> Product # 4   Price: Markdown, Sort Order:4
> >> 
> >> .
> >> ...
> >> Product # 70   Price: Regular, Sort Order:70
> >>
> >>
> >> I want to query Solr and sort these products within each of the
> >> sub-category in a such a way that products which are on markdown, are at
> >> the bottom of the documents list and other products
> >> which are on regular price, are sorted as per their sort order in their
> >> sub-category.
> >>
> >> Expected Results are
> >>
> >> Category: Books
> >>
> >> Sub Category: Programming
> >>
> >> Products:
> >>
> >> Product # 1,  Price: Regular Sort Order:1
> >> Product # 2,  Price: Markdown, Sort Order:101
> >> Product # 3   Price: Regular, Sort Order:3
> >> Product # 4   Price: Regular, Sort Order:4
> >> 
> >> .
> >> ...
> >> Product # 100   Price: Regular, Sort Order:100
> >>
> >> Sub Category: Fiction
> >>
> >> Products:
> >>
> >> Product # 1,  Price: Markdown, Sort Order:71
> >> Product # 2,  Price: Regular, Sort Order:2
> >> Product # 3   Price: Regular, Sort Order:3
> >> Product # 4   Price: Markdown, Sort Order:71
> >> 
> >> .
> >> ...
> >> Product # 70   Price: Regular, Sort Order:70
> >>
> >>
> >> My query is like this:
> >>
> >> q=*:*&fq=category:Books
> >>
> >> What are the options to implement custom sorting and how do I do it?
> >>
> >>
> >>- Define a Custom Function query?
> >>- Define a Custom Comparator? Or,
> >>- Define a Custom Collector?
> >>
> >>
> >> Please let me know the best way to go about it and any pointers to
> >> customize Solr 4.
> >>
> >
> > Thanks
> > Saroj
>


Re: How to do custom sorting in Solr?

2012-06-10 Thread roz dev
Yes, these documents have lots of unique values, as the same product could
be assigned to lots of other categories, each with a different sort
order.

We did some evaluation of heap usage and found that with the kind of queries we
generate, heap usage was going up to 24-26 GB. I could trace it to the fact
that
the fieldCache is creating an array of 2M size for each of the sort fields.

Since the same products are mapped to multiple categories, we incur significant
memory overhead. Therefore, any solution where memory consumption can be
reduced is a good one for me.
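
(Rough back-of-the-envelope figures under the stated 2M-doc assumption: an int sort field
held in the FieldCache costs about 2,000,000 x 4 bytes, roughly 8 MB, and a string sort
field costs more because of the ord array plus the unique term bytes. Reaching 24-26 GB
from sorting alone therefore points to a very large number of distinct per-category sort
fields being touched, which fits the one-sort-order-per-category mapping described here.)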

In fact, we have situations where same product is mapped to more than 1
sub-category in the same category like


Books
 -- Programming
  - Java in a nutshell
 -- Sale (40% off)
  - Java in a nutshell


So, another thought in my mind is to somehow use a second-pass collector to
group books appropriately in the Programming and Sale categories, with the right
sort order.

But I have no clue about that piece :(

-Saroj


On Sun, Jun 10, 2012 at 4:30 PM, Erick Erickson wrote:

> 2M docs is actually pretty small. Sorting is sensitive to the number
> of _unique_ values in the sort fields, not necessarily the number of
> documents.
>
> And sorting only works on fields with a single value (i.e. it can't have
> more than one token after analysis). So for each field you're only talking
> 2M values at the very maximum, assuming that the field in question has
> a unique value per document, which I doubt very much given your
> problem description.
>
> So with a corpus that size, I'd "just try it".
>
> Best
> Erick
>
> On Sun, Jun 10, 2012 at 7:12 PM, roz dev  wrote:
> > Thanks Erick for your quick feedback
> >
> > When Products are assigned to a category or Sub-Category then they can be
> > in any order and price type can be regular or markdown.
> > So, reg and markdown products are intermingled  as per their assignment
> but
> > I want to sort them in such a way that we
> > ensure that all the products which are on markdown are at the bottom of
> the
> > list.
> >
> > I can use these multiple sorts but I realize that they are costly in
> terms
> > of heap used, as they are using FieldCache.
> >
> > I have an index with 2M docs and docs are pretty big. So, I don't want to
> > use them unless there is no other option.
> >
> > I am wondering if I can define a custom function query which can be like
> > this:
> >
> >
> >   - check if product is on the markdown
> >   - if yes then change its sort order field to be the max value in the
> >   given sub-category, say 99
> >   - else, use the sort order of the product in the sub-category
> >
> > I have been looking at existing function queries but do not have a good
> > handle on how to make one of my own.
> >
> > - Another option could be use a custom sort comparator but I am not sure
> > about the way it works
> >
> > Any thoughts?
> >
> >
> > -Saroj
> >
> >
> >
> >
> > On Sun, Jun 10, 2012 at 5:02 AM, Erick Erickson  >wrote:
> >
> >> Skimming this, two options come to mind:
> >>
> >> 1> Simply apply primary, secondary, etc sorts. Something like
> >>   &sort=subcategory asc,markdown_or_regular desc,sort_order asc
> >>
> >> 2> You could also use grouping to arrange things in groups and sort
> within
> >>  those groups. This has the advantage of returning some members
> >>  of each of the top N groups in the result set, which makes it
> easier
> >> to
> >>  get some of each group rather than having to analyze the whole
> >> list
> >>
> >> But your example is somewhat contradictory. You say
> >> "products which are on markdown, are at
> >> the bottom of the documents list"
> >>
> >> But in your examples, products on "markdown" are intermingled
> >>
> >> Best
> >> Erick
> >>
> >> On Sun, Jun 10, 2012 at 3:36 AM, roz dev  wrote:
> >> > Hi All
> >> >
> >> >>
> >> >> I have an index which contains a Catalog of Products and Categories,
> >> with
> >> >> Solr 4.0 from trunk
> >> >>
> >> >> Data is organized like this:
> >> >>
> >> >> Category: Books
> >> >>
> >> >> Sub Category: Programming
> >> >>
> >> >> Products:
> >> >>
> >> >> Product # 1,  Price: Regular Sort Order:1
> >> >> Product # 2,  Price: Markdown, So

Re: Issue with field collapsing in solr 4 while performing distributed search

2012-06-11 Thread roz dev
I think that there is no way around doing custom logic in this case.

If the indexing process knows that documents have to be grouped, then they
had better be kept together.
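
(A rough sketch only, and explicitly not an official SolrCloud mechanism: one way to keep
a group on a single shard at index time is to pick the target core by hashing the group
field. Host names, core names and the "groupid" field below are made up.)

import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class GroupAwareIndexer {
    // Hypothetical shard core URLs -- adjust to the actual cluster layout.
    private final HttpSolrServer[] shards = {
        new HttpSolrServer("http://host1:8983/solr/collection1"),
        new HttpSolrServer("http://host2:8983/solr/collection1")
    };

    public void add(SolrInputDocument doc) throws SolrServerException, IOException {
        // All documents with the same group value hash to the same shard,
        // so grouping (and ngroups) never has to cross shard boundaries.
        String group = String.valueOf(doc.getFieldValue("groupid"));
        int target = (group.hashCode() & 0x7fffffff) % shards.length;
        shards[target].add(doc);
    }
}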

-Saroj


On Mon, Jun 11, 2012 at 6:37 AM, Nitesh Nandy  wrote:

> Martijn,
>
> How do we add a custom algorithm for distributing documents in Solr Cloud?
> According to this discussion
>
> http://lucene.472066.n3.nabble.com/SolrCloud-how-to-index-documents-into-a-specific-core-and-how-to-search-against-that-core-td3985262.html
>  , Mark discourages users from using custom distribution mechanism in Solr
> Cloud.
>
> Load balancing is not an issue for us at the moment. In that case, how
> should we implement a custom partitioning algorithm?
>
>
> On Mon, Jun 11, 2012 at 6:23 PM, Martijn v Groningen <
> martijn.v.gronin...@gmail.com> wrote:
>
> > The ngroups value returns the number of groups that have matched the
> > query. However, if you want ngroups to be correct in a distributed
> > environment, you need
> > to put documents belonging to the same group into the same shard.
> > Groups can't cross shard boundaries. I guess you need to do
> > some manual document partitioning.
> >
> > Martijn
> >
> > On 11 June 2012 14:29, Nitesh Nandy  wrote:
> > > Version: Solr 4.0 (svn build 30th may, 2012) with Solr Cloud  (2 slices
> > and
> > > 2 shards)
> > >
> > > The setup was done as per the wiki:
> > http://wiki.apache.org/solr/SolrCloud
> > >
> > > We are doing distributed search. While querying, we use field
> collapsing
> > > with "ngroups" set as true as we need the number of search results.
> > >
> > > However, there is a difference between the number of "result list" groups returned
> > > and the "ngroups" value returned.
> > >
> > > Ex:
> > >
> >
> http://localhost:8983/solr/select?q=message:blah%20AND%20userid:3&&group=true&group.field=id&group.ngroups=true
> > >
> > >
> > > The response XML looks like
> > >
> > > <response>
> > >   <lst name="responseHeader">
> > >     <int name="status">0</int>
> > >     <int name="QTime">46</int>
> > >     <lst name="params">
> > >       <str name="group.field">id</str>
> > >       <str name="group.ngroups">true</str>
> > >       <str name="group">true</str>
> > >       <str name="q">messagebody:monit AND usergroupid:3</str>
> > >     </lst>
> > >   </lst>
> > >   <lst name="grouped">
> > >     <lst name="id">
> > >       <int name="matches">10</int>
> > >       <int name="ngroups">9</int>
> > >       <arr name="groups">
> > >         <lst>
> > >           <str name="groupValue">320043</str>
> > >           ...
> > >         </lst>
> > >         <lst>
> > >           <str name="groupValue">398807</str>
> > >           ...
> > >         </lst>
> > >         <lst>
> > >           <str name="groupValue">346878</str>
> > >           ...
> > >         </lst>
> > >         <lst>
> > >           <str name="groupValue">346880</str>
> > >           ...
> > >         </lst>
> > >       </arr>
> > >     </lst>
> > >   </lst>
> > > </response>
> > >
> > > So you can see that the ngroups value returned is 9 and the actual number
> > > of groups returned is 4.
> > >
> > > Why do we have this discrepancy between the ngroups, matches and actual number
> > > of groups? Is this an open issue?
> > >
> > >  Any kind of help is appreciated.
> > >
> > > --
> > > Regards,
> > >
> > > Nitesh Nandy
> >
> >
> >
> > --
> > Met vriendelijke groet,
> >
> > Martijn van Groningen
> >
>
>
>
> --
> Regards,
>
> Nitesh Nandy
>


SolrJ Question about Bad Request Root cause error

2011-01-11 Thread roz dev
Hi All

We are using SolrJ client (v 1.4.1) to integrate with our solr search
server.
We notice that whenever a SolrJ request does not match the Solr schema, we
get a Bad Request exception, which makes sense.

org.apache.solr.common.SolrException: Bad Request

But the SolrJ client does not provide any clue about why the request is bad.

Is there any way to get the root cause on the client side?
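
(A minimal sketch with made-up names showing the behaviour described above against SolrJ
1.4.x: the catch block is all the client gets, while the actual reason stays in the
server log.)

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;

public class BadRequestDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL and field name, for illustration only.
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "demo-1");
        doc.addField("field_not_in_schema", "some value");
        try {
            server.add(doc);
            server.commit();
        } catch (SolrException e) {
            // Typically prints only "Bad Request"; the unknown field is
            // reported only in the Solr server log.
            System.err.println(e.getMessage());
        }
    }
}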

Of course, the Solr server logs have enough info to know that the data is bad, but it
would be great
to have the same info in the exception generated by SolrJ.

Any thoughts? Is there any plan to add this in future releases?

Thanks,
Saroj


Question about http://wiki.apache.org/solr/Deduplication

2011-03-24 Thread eks dev
Hi,
The use case I am trying to figure out is about preserving IDs without
re-indexing on a duplicate, instead adding the new ID to a list of
document id "aliases".

Example:
Input collection:
"id":1, "text":"dummy text 1", "signature":"A"
"id":2, "text":"dummy text 1", "signature":"A"

I add the first document into an empty index; the text is going to be indexed,
the ID is going to be "1", so far so good.

Now the question: if I add the second document with id == "2", instead of
deleting/indexing this new document, I would like to store id == 2 in
a multivalued field "id".

At the end, I would have one less document indexed and both IDs are
going to be "searchable" (and stored as well)...

Is it possible in solr to have a multivalued "id"? Or do I need to make my
own "mv_ID" for this? Any ideas on how to achieve this efficiently?

My target is not to add new documents if the signature matches, but to
have all IDs indexed and stored.
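
(As a sketch of the "own mv_ID" option, with made-up field names: since the uniqueKey
field itself cannot be multivalued, schema.xml could keep id single-valued and collect
the duplicates' ids in a separate multivalued field:

  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="id_aliases" type="string" indexed="true" stored="true" multiValued="true"/>
)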

Thanks,
eks

