Re: Using fq as OR

2014-05-26 Thread Dmitry Kan
Erick,

correct me if the following's wrong, but if you have a custom query parser
configured to preprocess your searches, you'd need to send the
corresponding bit of the search in the q= parameter, rather than fq=
parameter. In that sense, q and fq are not exactly equal.

Dmitry


On Thu, May 22, 2014 at 5:57 PM, Erick Erickson wrote:

> Hmmm, not quite.
>
> AFAIK, anything you can put in a q clause can also be put in an fq
> clause. So it's not a matter of whether your search is precise or not
> that you should use for determining whether to use a q or fq clause.
> What _should_ influence this is whether docs that satisfy the clause
> should contribute to ranking.
>
> fq clauses do NOT contribute to ranking. They determine whether the
> doc is returned at all.
> q clauses contribute to the ranking.
>
> Additionally, the results of fq clauses are cached and may be re-used.
>
> That said, since fq clauses are often used in conjunction with
> faceting, they are very often used more precisely. But it's still a
> matter of caching and ranking that should determine where the clause
> goes.
>
> FWIW,
> Erick
>
> On Wed, May 21, 2014 at 9:09 PM, manju16832003 
> wrote:
> > The *fq* is used for searching more deterministic results something like
> > WHERE type={}
> > Where as *q* is something like WHERE type like '%%'
> >
> > user *fq*, if your are sure of what your going to search
> > use *q*, if not sure what your trying to search
> >
> > If you are using fq and if you do not get any matching documents, solr
> > throws 0 or error message
> > where q would try to match nearest documents for your search query
> >
> > That's what I have experienced so far. :-).
> >
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/Using-fq-as-OR-tp4137411p4137525.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


about analyzer and tokenizer

2014-05-26 Thread rachun
Dear all,


How can I do this...
I index the document => Macbook,
then when I query "mac book" I should get the result back.

This is my schema setting...


   





  


Any suggestion would be much appreciated.
Chun.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/about-analyzer-and-tokenizer-tp4138129.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Full Indexing fails on Solr-Probable connection issue.HELP!

2014-05-26 Thread Aniket Bhoi
On Thu, May 22, 2014 at 9:31 PM, Shawn Heisey  wrote:

> On 5/22/2014 8:31 AM, Aniket Bhoi wrote:
> > On Thu, May 22, 2014 at 7:13 PM, Shawn Heisey  wrote:
> >
> >> On 5/22/2014 1:53 AM, Aniket Bhoi wrote:
> >>> Details:
> >>>
> >>> *Solr Version:*
> >>> Solr Specification Version: 3.4.0.2012.01.23.14.08.01
> >>> Solr Implementation Version: 3.4
> >>> Lucene Specification Version: 3.4
> >>> Lucene Implementation Version: 3.4
> >>>
> >>> *Tomcat version:*
> >>> Apache Tomcat/6.0.18
> >>>
> >>> *OS details:*
> >>> SUSE Linux Enterprise Server 11 (x86_64)
> >>> VERSION = 11
> >>> PATCHLEVEL = 1
> >>>
> >>> While running indexing on this server,It failed.
> >>>
> >>> Log excerpt:
> >>>
> >>> Caused by:
> org.apache.solr.handler.dataimport.DataImportHandlerException:
> >>> com.microsoft.sqlserver.jdbc.SQLServerException: The result set is
> >> closed.
> >>>
> >>> Out intial hypothesis was that there is a problem with the connection
> >>> thread,so we made changes to the context.xml and added
> >>> validationQuery,testOnBorrow etc..to make sure the thread doesnt time
> >> out.
> >>> We also killed a lot of sleeping sessions from the server to the
> >> database.
> >>> All of the above still didnt work
> >> I have reduced your log excerpt to what I think is the important part.
> >>
> >> Removing the multithreaded support as others have suggested is a good
> >> idea, but what I think is really happening here is that Solr is engaging
> >> in a multi-tier merge, so it stops indexing for a while ... and
> >> meanwhile, JDBC times out and closes your database connection because of
> >> inactivity.  When the largest merge tier finishes, indexing tries to
> >> resume, which it can't do because the database connection is closed.
> >>
> >> The solution is to allow more simultaneous merges to happen, which
> >> allows indexing to continue while a multi-tier merge is underway.  This
> >> is my indexConfig section from solrconfig.xml:
> >>
> >> <indexConfig>
> >>   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> >>     <int name="maxMergeAtOnce">35</int>
> >>     <int name="segmentsPerTier">35</int>
> >>     <int name="maxMergeAtOnceExplicit">105</int>
> >>   </mergePolicy>
> >>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
> >>     <int name="maxThreadCount">1</int>
> >>     <int name="maxMergeCount">6</int>
> >>   </mergeScheduler>
> >>   <ramBufferSizeMB>48</ramBufferSizeMB>
> >>   <infoStream file="INFOSTREAM.txt">false</infoStream>
> >> </indexConfig>
> >>
> >> The important part for your purposes is the mergeScheduler config, and
> >> in particular, maxMergeCount.  Increase that to 6.  If you are using
> >> standard spinning hard disks, do not increase maxThreadCount beyond 1.
> >> If you are using SSD, you can safely increase that a small amount, but I
> >> don't think I'd go above 2 or 3.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
> > I may be missing something ,or looking in the wrong place,But I cannot
> find
> > an indexConfig section or any other mentioned detail above in the
> > solrconfig.xml file
>
> Solr will work without one, in which case it will simply use the
> defaults.  With older 3.x versions the mergeScheduler config will
> actually need to go in an indexDefaults section.  The mainIndex and
> indexDefaults sections were deprecated in 3.6 and removed entirely in 4.x.
>
> https://issues.apache.org/jira/browse/SOLR-1052
>
> If you don't have indexDefaults either, you may need to add the config
> as a top level element under <config>.  If you do this, here's what you
> should add:
>
> <indexDefaults>
>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>     <int name="maxThreadCount">1</int>
>     <int name="maxMergeCount">6</int>
>   </mergeScheduler>
> </indexDefaults>
>
> I think we should probably change the default value that Solr uses for
> maxMergeCount.  This problem comes up fairly often.  As long as
> maxThreadCount is 1, I cannot think of a really good reason to limit
> maxMergeCount at the level that we currently do.
>
> Thanks,
> Shawn
>
>

I changed the solrconfig.xml file and included the changes you
suggested. However, this didn't work out; the indexing still fails. I have
also added this

*maxActive="100" minIdle="10" maxWait="1" initialSize="10"
logAbandoned="true" validationQuery="select 1" testOnBorrow="true"
testOnReturn="true" validationQueryTimeout="30" removeAbandoned="true"
removeAbandonedTimeout="3600"*

to the *context.xml* file. This didn't work either. The thing to note is that
after I changed solrconfig.xml to add the merge config changes, the indexing
failed after 6 hours. Earlier, it used to fail after 1 hour.


Re: about analyzer and tokenizer

2014-05-26 Thread Dmitry Kan
Hi Chun,

You can use the edge ngram filter [1] on your tokens; it will produce all
possible letter sequences within a certain (configurable) length range, like:
ma, ac, bo, ok, mac, boo, ook, book etc.
Then when querying, both "mac" and "book" should hit in the sequence and you
should get the macbook document back. This comes at the price of increasing
your index size, though.

[1]
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-EdgeN-GramFilter
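
To make that concrete, here is a minimal, untested field type sketch along
those lines (the field type name and the 2-15 gram sizes are made up, and it
uses the plain NGramFilterFactory - which emits grams from anywhere in the
token, not just the leading edge - on the index side only):

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With that, "macbook" is indexed as (among other grams) "mac" and "book", so
the two-word query can match without any change on the query side.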




On Mon, May 26, 2014 at 12:26 PM, rachun  wrote:

> Dear all,
>
>
> How can I do this...
> I index the document  => Macbook
> then when I query mac book I should get the result.
>
> This is my schema setting...
>
>  positionIncrementGap="100">
>   
> 
> 
> 
> 
>  words="lang/stopwords_th.txt"/>
>   
> 
>
> Any suggest would be very appreciate.
> Chun.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/about-analyzer-and-tokenizer-tp4138129.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


Re: Combining Solr score with customized user ratings for a document

2014-05-26 Thread rulinma
Good. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Combining-Solr-score-with-customized-user-ratings-for-a-document-tp4040200p4138135.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Full Indexing fails on Solr-Probable connection issue.HELP!

2014-05-26 Thread Aniket Bhoi
Another thing I have noted is that the exception always follows a commit
operation. Log excerpt below:

INFO: SolrDeletionPolicy.onCommit: commits:num=2
commit{dir=/opt/solr/cores/calls/data/index,segFN=segments_2qt,version=1347458723267,generation=3557,filenames=[_3z9.tii,
_3z3.fnm, _3z9.nrm, _3za.prx, _3z9.fdt, _3z9.fnm, _3z9.fdx, _3z3.frq,
_3za.nrm, segments_2qt, _3z3.fdx, _3z9.prx, _3z3.fdt, _3za.fdx, _3z9.frq,
_3z3.prx, _3za.fdt, _3z3.tii, _3za.tis, _3za.fnm, _3z3.nrm, _3z9.tis,
_3za.tii, _3za.frq, _3z3.tis]
commit{dir=/opt/solr/cores/calls/data/index,segFN=segments_2qu,version=1347458723269,generation=3558,filenames=[_3zb.fdt,
_3z9.tii, _3z3.fnm, _3z9.nrm, _3zb.tii, _3zb.tis, _3zb.fdx, _3za.prx,
_3z9.fdt, _3z9.fnm, _3z9.fdx, _3zb.frq, _3z3.frq, _3za.nrm, segments_2qu,
_3z3.fdx, _3zb.prx, _3z9.prx, _3zb.fnm, _3z3.fdt, _3za.fdx, _3z9.frq,
_3z3.prx, _3za.fdt, _3zb.nrm, _3z3.tii, _3za.tis, _3za.fnm, _3z3.nrm,
_3z9.tis, _3za.tii, _3za.frq, _3z3.tis]
May 24, 2014 5:49:05 AM org.apache.solr.core.SolrDeletionPolicy
updateCommits
INFO: newest commit = 1347458723269
May 24, 2014 5:49:05 AM org.apache.solr.search.SolrIndexSearcher 
INFO: Opening Searcher@423dbcca main
May 24, 2014 5:49:05 AM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@423dbcca main from Searcher@19c19869 main
fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
May 24, 2014 5:49:05 AM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@423dbcca main
fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
May 24, 2014 5:49:05 AM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@423dbcca main from Searcher@19c19869 main
filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
May 24, 2014 5:49:05 AM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@423dbcca main
filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
May 24, 2014 5:49:05 AM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@423dbcca main from Searcher@19c19869 main
queryResultCache{lookups=1,hits=1,hitratio=1.00,inserts=3,evictions=0,size=3,warmupTime=2,cumulative_lookups=47,cumulative_hits=46,cumulative_hitratio=0.97,cumulative_inserts=1,cumulative_evictions=0}
May 24, 2014 5:49:05 AM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@423dbcca main
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=3,evictions=0,size=3,warmupTime=2,cumulative_lookups=47,cumulative_hits=46,cumulative_hitratio=0.97,cumulative_inserts=1,cumulative_evictions=0}
May 24, 2014 5:49:05 AM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@423dbcca main from Searcher@19c19869 main
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=40,evictions=0,size=40,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
May 24, 2014 5:49:05 AM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@423dbcca main
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
May 24, 2014 5:49:05 AM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener sending requests to Searcher@423dbcca main
May 24, 2014 5:49:05 AM org.apache.solr.core.SolrCore execute
INFO: [calls] webapp=null path=null
params={start=0&event=newSearcher&q=*:*&rows=20} hits=40028 status=0
QTime=2
May 24, 2014 5:49:05 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
May 24, 2014 5:49:05 AM org.apache.solr.core.SolrCore execute
INFO: [calls] webapp=null path=null
params={start=0&event=newSearcher&q=banking&rows=20} hits=636 status=0
QTime=3
May 24, 2014 5:49:05 AM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener done.
May 24, 2014 5:49:05 AM
org.apache.solr.handler.component.SpellCheckComponent$SpellCheckerListener
newSearcher
INFO: Index is not optimized therefore skipping building spell check index
for: default
May 24, 2014 5:49:05 AM org.apache.solr.core.SolrCore registerSearcher
INFO: [calls] Registered new searcher Searcher@423dbcca main
May 24, 2014 5:49:05 AM org.apache.solr.search.SolrIndexSearcher close
INFO: Closing Searcher@19c19869 main
fieldValueCache{lookups=0,hits=0,hitratio=0.00,ins

how to apply multiplcative Boost in multivalued field

2014-05-26 Thread Aman Tandon
HI,

I am confused about how to apply a multiplicative boost on a multivalued field.




Suppose in plid the values go like 111, 1234, 2345, 4567, 2335, 9876, 67

I am applying the filters on the plid like *..&fq=plid:(111 1234 2345 4567
2335 9876 67)*

Now I need to apply a boost on the first three plid values as well; since plid
is a multivalued field, please help me out here.

With Regards
Aman Tandon


Re: How does query on AND work

2014-05-26 Thread Per Steffensen
Do not know if this is a special-case. I guess an AND-query where one 
side hits 500-1000 and the other side hits billions is a special-case. 
But this way of carrying out the query might also be an optimization in 
less uneven cases.
It does not require that the "lots of hits" part of the query is a 
range-query, and it does not necessarily require that the field used in 
this part has DocValues (you can go fetch the values from the "slow" store). 
But I guess it has to be a very uneven case for this approach to be 
faster on a non-DocValues field.


I think this can be generalized. I think of it as something similar to 
being able to "hint" relational databases not to use a specific index. 
I do not know that much about Solr/Lucene query-syntax, but I believe 
"filter-queries" (fq) are kinda queries that will be AND'ed onto the 
real query (q), and in order not to have to change the query-syntax too 
much (adding hints or something), I guess a first step for a feature 
doing what I am doing here could be to introduce something similar to 
"filter-queries" - queries that will be carried out on the result of (q 
+ fqs) but looking at the values of the documents in that result instead 
of intersecting with doc-sets found from the index. Let's call it 
"post-query-value-filter"s (yes, we can definitely come up with a 
better/shorter name)


1) q=no_dlng_doc_ind_sto:() AND timestamp_dlng_doc_ind_sto:([ TO ])
2) q=no_dlng_doc_ind_sto:(),fq=timestamp_dlng_doc_ind_sto:([ TO ])
3) q=no_dlng_doc_ind_sto:(),post-query-value-filter=timestamp_dlng_doc_ind_sto:([ TO ])


1) and 2) both use index on both no_dlng_doc_ind_sto and 
timestamp_dlng_doc_ind_sto. 3) uses only index on no_dlng_doc_ind_sto 
and does the time-interval filter part by fetching values (using 
DocValue if possible) for timestamp_dlng_doc_ind_sto for each of the 
docs found through the no_dlng_doc_ind_sto-index to see if this doc 
should really be included.


There are some things that I did not initially tell about actually 
wanting to do a facet search etc. Well, here is the full story: 
http://solrlucene.blogspot.dk/2014/05/performance-of-and-queries-with-uneven.html


Regards, Per Steffensen

On 23/05/14 17:37, Toke Eskildsen wrote:

Per Steffensen [st...@designware.dk] wrote:

* It IS more efficient to just use the index for the
"no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that
part and then fetch timestamp-doc-values for those doc-ids to filter out
the docs that does not match the "timestamp_dlng_doc_ind_sto"-part of
the query.

Thank you for the follow up. It sounds rather special-case though, with 
requirement of DocValues for the range-field. Do you think this can be 
generalized?

- Toke Eskildsen





sort by spatial distance in faceting

2014-05-26 Thread Aman Tandon
Hi,

Is it possible to sort the facet results returned by geospatial
distance instead of by result count?

Currently I am faceting on city, which returns the top facets based on
the number of docs matched for each city.

e.g.:
Delhi,400
Noida, 380
.
.
.
etc.

If the user selects a city, then the facets should be ordered by geospatial
distance instead of by result count. Is this possible with Solr 4.7.x?

With Regards
Aman Tandon


Re: about analyzer and tokenizer

2014-05-26 Thread Jack Krupansky
Unfortunately Solr and Lucene do not provide a truly clean out of the box 
solution for this obvious use case, but you can approximate it by using 
index-time synonyms, so that "mac book" will also index as "macbook" and 
"macbook" will also index as "mac book". Your SYNONYMS.TXT file would 
contain:


macbook,mac book

Only use the synonyms filter at index time. The standard query parsers don't 
support phrases for synonyms.
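
For illustration, a rough field type sketch along those lines (the field type
name is made up, and the synonym filter is wired into the index-time analyzer
only, as described above):

<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

At index time "Macbook" then also produces the tokens "mac" and "book", so a
two-word query can match it.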


-- Jack Krupansky

-Original Message- 
From: rachun

Sent: Monday, May 26, 2014 5:26 AM
To: solr-user@lucene.apache.org
Subject: about analyzer and tokenizer

Dear all,


How can I do this...
I index the document  => Macbook
then when I query mac book I should get the result.

This is my schema setting...


 
   
   
   
   
   
 


Any suggest would be very appreciate.
Chun.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/about-analyzer-and-tokenizer-tp4138129.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Using SolrCloud with RDBMS or without

2014-05-26 Thread Ali Nazemian
Hi everybody,

I was wondering which scenario (or combination) would be better for my
application, from the aspect of performance, scalability and high
availability. Here is my application:

Suppose I am going to have more than 10m documents and it grows every day
(probably in 1 year it will reach more than 100m docs). I want to use Solr
as a tool for indexing these documents, but the problem is I have some data
fields that could change frequently (not too much, but they could change).

Scenarios:

1- Using SolrCloud as database for all data. (even the one that could be
changed)

2- Using SolrCloud as database for static data and using RDBMS (such as
oracle) for storing dynamic fields.

3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all
data.

Best regards.

-- 
A.Nazemian


Re: Solr - Cores not initialised

2014-05-26 Thread Jack Krupansky
Usually a message like "SolrCore 'corexxx' is not available due to init 
failure" means that you had a syntax error in your schema.xml or 
solrconfig.xml so that it could not be successfully processed by Solr (which 
is done in the "init" method.)


Do a diff between your schema and config files before and after your change 
to see what your mistake might be. Feel free to post the two diffs here if 
your mistake is not obvious.


The message "s XML file does not appear to have any style information 
associated with it" suggests that you mangled some of the XML elements. It 
appears that you mangled that message as well! Feel free to post the 
complete message here as well.


-- Jack Krupansky

-Original Message- 
From: Manikandan Saravanan

Sent: Monday, May 26, 2014 1:52 AM
To: solr-user@lucene.apache.org
Cc: Varuna Venkatesh
Subject: Solr - Cores not initialised

Hi,

I’m running Solr 4.6.0 on an Ubuntu box. I recently made the following 
changes:


1. I edited Schema.xml to index my data by a column called timestamp.
2. I then ran the reload procedure as mentioned here 
https://wiki.apache.org/solr/CoreAdmin#RELOAD


After that, when I restarted Solr, I get a big red alert saying the 
following:
core0: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
No such core: core0
core1: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
No such core: core1
If I try to visit http://<Address>:8983/solr/core0/admin or http://<Address>:8983/solr/core1/admin, I get this


s XML file does not appear to have any style information associated with it. 
The document tree is shown below.




SolrCore 'core0' is not available due to init failure: No such core: core0


org.apache.solr.common.SolrException: SolrCore 'core0' is not available due 
to init failure: No such core: core0 at 
org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:818) at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289) 
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197) 
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) 
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) 
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) 
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) 
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) 
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075) 
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384) 
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) 
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009) 
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) 
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) 
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) 
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) 
at org.eclipse.jetty.server.Server.handle(Server.java:368) at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489) 
at 
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) 
at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942) 
at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004) 
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640) at 
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at 
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) 
at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) 
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) 
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) 
at java.lang.Thread.run(Thread.java:744) Caused by: 
org.apache.solr.common.SolrException: No such core: core0 at 
org.apache.solr.core.CoreContainer.reload(CoreContainer.java:675) at 
org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:717) 
at 
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:178) 
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) 
at 
org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:662) 
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248) 
... 26 more


500



--
Manikandan Saravanan
Architect - Technology
TheSocialPeople 



Re: Using SolrCloud with RDBMS or without

2014-05-26 Thread Jack Krupansky
You could also consider DataStax Enterprise, which integrates Apache 
Cassandra as the primary database and Solr for indexing and query.


See:
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise

-- Jack Krupansky

-Original Message- 
From: Ali Nazemian

Sent: Monday, May 26, 2014 9:50 AM
To: solr-user@lucene.apache.org
Subject: Using SolrCloud with RDBMS or without

Hi everybody,

I was wondering which scenario (or the combination) would be better for my
application. From the aspect of performance, scalability and high
availability. Here is my application:

Suppose I am going to have more than 10m documents and it grows every day.
(probably in 1 years it reaches to more than 100m docs. I want to use Solr
as tool for indexing these documents but the problem is I have some data
fields that could change frequently. (not too much but it could change)

Scenarios:

1- Using SolrCloud as database for all data. (even the one that could be
changed)

2- Using SolrCloud as database for static data and using RDBMS (such as
oracle) for storing dynamic fields.

3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all
data.

Best regards.

--
A.Nazemian 



Re: Using SolrCloud with RDBMS or without

2014-05-26 Thread Ali Nazemian
The reason I ruled out Cassandra is that Cassandra seems to shine when you
have a very write-heavy workload. In my case it is true that I have some
update operations, but read operations far outnumber the writes. By the way,
there are probably more possible scenarios for my application. My question
would be: which one is probably the best?
Best regards.


On Mon, May 26, 2014 at 6:27 PM, Jack Krupansky wrote:

> You could also consider DataStax Enterprise, which integrates Apache
> Cassandra as the primary database and Solr for indexing and query.
>
> See:
> http://www.datastax.com/what-we-offer/products-services/
> datastax-enterprise
>
> -- Jack Krupansky
>
> -Original Message- From: Ali Nazemian
> Sent: Monday, May 26, 2014 9:50 AM
> To: solr-user@lucene.apache.org
> Subject: Using SolrCloud with RDBMS or without
>
>
> Hi everybody,
>
> I was wondering which scenario (or the combination) would be better for my
> application. From the aspect of performance, scalability and high
> availability. Here is my application:
>
> Suppose I am going to have more than 10m documents and it grows every day.
> (probably in 1 years it reaches to more than 100m docs. I want to use Solr
> as tool for indexing these documents but the problem is I have some data
> fields that could change frequently. (not too much but it could change)
>
> Scenarios:
>
> 1- Using SolrCloud as database for all data. (even the one that could be
> changed)
>
> 2- Using SolrCloud as database for static data and using RDBMS (such as
> oracle) for storing dynamic fields.
>
> 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all
> data.
>
> Best regards.
>
> --
> A.Nazemian
>



-- 
A.Nazemian


Compression vs FieldCache for doc ids retrieval

2014-05-26 Thread jim ferenczi
Dear Solr users,

we migrated our solution from Solr 4.0 to Solr 4.3 and we noticed a
degradation of the search performance. We compared the two versions and
found out that most of the time is spent in the decompression of the
retrievable fields in Solr 4.3. The block compression of the documents is a
great feature for us because it reduces the size of our index but we don’t
have enough resources (I mean cpus) to safely migrate to the new version.
In order to reduce the cost of the decompression we tried a simple patch in
the BinaryResponseWriter; during the first phase of the distributed search
the response writer gets the documents from the index reader to only
extract the doc ids of the top N results. Our patch uses the field cache to
get the doc ids during the first phase and thus replaces a full
decompression of 16k blocks (for a single document) by a simple get in an
array (the field cache or the doc values). Thanks to this patch we are now
able to handle the same number of QPS as before (with Solr 4.0). Of
course the document cache could help as well, but not as much as one
would have thought (mainly because we have a lot of deep paging queries).

I am sure that the idea we implemented is not new, but I haven’t seen any
Jira about it. Should we create one? (I mean, does it have a chance to be
included in a future release of Solr, or is anybody already working on
this?)

Cheers,

Jim


Re: How does query on AND work

2014-05-26 Thread Alexandre Rafalovitch
Did not follow the whole story, but a "post-query-value-filter" does exist in
Solr. Have you tried searching for pretty much that expression, and maybe
something about cost-based filters?
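
For reference, a rough sketch of what that can look like, reusing Per's
timestamp field name with placeholder bounds FROM and TO (cache=false plus a
cost of 100 or more asks Solr to run an eligible query, such as frange, as a
post filter over the documents already matched by q and the other fqs):

fq={!frange l=FROM u=TO cache=false cost=200}field(timestamp_dlng_doc_ind_sto)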

Regards,
Alex
On 26/05/2014 6:49 pm, "Per Steffensen"  wrote:

> Do not know if this is a special-case. I guess an AND-query where one side
> hits 500-1000 and the other side hits billions is a special-case. But this
> way of carrying out the query might also be an optimization in less uneven
> cases.
> It does not require that the "lots of hits"-part of the query is a
> range-query, and it does not necessarily require that the field used in
> this part is DocValue (you can go fetch the values from "slow" store). But
> I guess it has to be a very uneven case if this approach should be faster
> on a non-DocValue field.
>
> I think this can be generalized. I think of it as something similar as
> being able to "hint" relational databases not to use an specific index. I
> do not know that much about Solr/Lucene query-syntax, but I believe
> "filter-queries" (fq) are kinda queries that will be AND'ed onto the real
> query (q), and in order not to have to change the query-syntax too much
> (adding hits or something), I guess a first step for a feature doing what I
> am doing here, could be introduce something similar to "filter-queries" -
> queries that will be carried out on the result of (q + fqs) but looking a
> the values of the documents in that result instead of intersecting with
> doc-sets found from index. Lets call it "post-query-value-filter"s (yes, we
> can definitely come up with a better/shorter name)
>
> 1) q=no_dlng_doc_ind_sto:() AND timestamp_dlng_doc_ind_sto:([
> TO ])
> 2) q=no_dlng_doc_ind_sto:(),fq=timestamp_dlng_doc_ind_sto:([
> TO ])
> 3) q=no_dlng_doc_ind_sto:(),post-query-value-filter=
> timestamp_dlng_doc_ind_sto:([ TO ])
>
> 1) and 2) both use index on both no_dlng_doc_ind_sto and
> timestamp_dlng_doc_ind_sto. 3) uses only index on no_dlng_doc_ind_sto and
> does the time-interval filter part by fetching values (using DocValue if
> possible) for timestamp_dlng_doc_ind_sto for each of the docs found through
> the no_dlng_doc_ind_sto-index to see if this doc should really be included.
>
> There are some things that I did not initially tell about actually wanting
> to do a facet search etc. Well, here is the full story:
> http://solrlucene.blogspot.dk/2014/05/performance-of-and-
> queries-with-uneven.html
>
> Regards, Per Steffensen
>
> On 23/05/14 17:37, Toke Eskildsen wrote:
>
>> Per Steffensen [st...@designware.dk] wrote:
>>
>>> * It IS more efficient to just use the index for the
>>> "no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that
>>> part and then fetch timestamp-doc-values for those doc-ids to filter out
>>> the docs that does not match the "timestamp_dlng_doc_ind_sto"-part of
>>> the query.
>>>
>> Thank you for the follow up. It sounds rather special-case though, with
>> requirement of DocValues for the range-field. Do you think this can be
>> generalized?
>>
>> - Toke Eskildsen
>>
>>
>


MergeReduceIndexerTool takes a lot of time for a limited number of documents

2014-05-26 Thread Costi Muraru
Hey guys,

I'm using the MergeReduceIndexerTool to import data into a SolrCloud
cluster made out of 3 decent machines.
Looking in the JobTracker, I can see that the mapper jobs finish quite
fast. The reduce jobs get to ~80% quite fast as well. It is here that
they get stuck for a long period of time (picture + log attached).
I'm only trying to insert ~80k documents with 10-50 different fields
each. Why is this happening? Am I not setting something correctly? Is it
the fact that most of the documents have different field names, or too
many of them for that matter?
Any tips are gladly appreciated.

Thanks,
Costi

From the reduce logs:
60208 [main] INFO  org.apache.solr.update.UpdateHandler  - start
commit{,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false,prepareCommit=false}
60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[IW][main]: commit: start
60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[IW][main]: commit: enter lock
60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[IW][main]: commit: now prepare
60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[IW][main]: prepareCommit: flush
60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[IW][main]:   index before flush
60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[DW][main]: main startFullFlush
60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[DW][main]: anyChanges? numDocsInRam=25603 deletes=true
hasTickets:false pendingChangesInFullFlush: false
60209 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[DWFC][main]: addFlushableState DocumentsWriterPerThread
[pendingDeletes=gen=0 25602 deleted terms (unique count=25602)
bytesUsed=5171604, segment=_0, aborting=false, numDocsInRAM=25603,
deleteQueue=DWDQ: [ generation: 0 ]]
61542 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[DWPT][main]: flush postings as segment _0 numDocs=25603
61664 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
125115 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
199408 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
271088 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
336754 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
417810 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
479495 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
552357 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
621450 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
683173 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads

This is the run command I'm using:
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar
org.apache.solr.hadoop.MapReduceIndexerTool \
 --log4j /home/cmuraru/solr/log4j.properties \
 --morphline-file morphline.conf \
 --output-dir hdfs://nameservice1:8020/tmp/outdir \
 --verbose --go-live --zk-host localhost:2181/solr \
 --collection collection1 \
hdfs://nameservice1:8020/tmp/indir


Re: Using SolrCloud with RDBMS or without

2014-05-26 Thread Shawn Heisey
On 5/26/2014 7:50 AM, Ali Nazemian wrote:
> I was wondering which scenario (or the combination) would be better for my
> application. From the aspect of performance, scalability and high
> availability. Here is my application:
> 
> Suppose I am going to have more than 10m documents and it grows every day.
> (probably in 1 years it reaches to more than 100m docs. I want to use Solr
> as tool for indexing these documents but the problem is I have some data
> fields that could change frequently. (not too much but it could change)

Choosing which database software to use to hold your data is a problem
with many possible solutions.  Everyone will have a different answer for
you.  Each solution has strengths and weaknesses, and in the end, only
you can really know what your requirements are.

> Scenarios:
> 
> 1- Using SolrCloud as database for all data. (even the one that could be
> changed)

If you choose to use Solr as a NoSQL, I would strongly recommend that
you have two Solr installs.  The first install would be purely for data
storage and would have no indexed fields.  If you can get machines with
enough RAM, it would also probably be preferable to use a single index
(or SolrCloud with one shard) for that install.  The other install would
be for searching.  Sharding would not be an issue on that index.  The
reason that I make this recommendation is that when you use Solr for
searching, you have to do a complete reindex if you change your search
schema.  It's difficult to reindex if the search index is also your
canonical data source.

> 2- Using SolrCloud as database for static data and using RDBMS (such as
> oracle) for storing dynamic fields.

I don't think it would be a good idea to have two canonical data
sources.  Pick one.  As already mentioned, Solr is better as a search
technology, serving up pointers to data in another data source, than as
a database.

If you want to use RDBMS technology, why would you spend all that money
on Oracle?  Just use one of the free databases.  Our really large Solr
index comes from a database.  At one time that database was in Oracle.
When my employer purchased the company with that database, we thought we
were obtaining a full Oracle license.  It turns out we weren't.  It
would have cost about half a million dollars to buy that license, so we
switched to MySQL.

Since making that move to MySQL, performance is actually *better*.  The
source table for our data has 96 million rows right now, growing at a
rate of a few million per year.  This is completely in line with your
100 million document requirement.  For the massive table that feeds
Solr, we might switch to MongoDB, but that has not been decided yet.

Later we switched from EasyAsk to Solr, a move that has *also* given us
better performance.  Because both MySQL and Solr are free, we've
achieved a substantial cost savings.

> 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all
> data.

I have no experience with this technology, but I think that if you are
thinking about a database on HDFS, you're probably actually talking
about HBase, the Apache implementation of Google's BigTable.

Thanks,
Shawn



Re: pdfs

2014-05-26 Thread Erick Erickson
Brian:

Yeah, if you can share the PDF that would be great. Parsing via Tika should
not bring down Solr, although I suppose there could be something in Tika
that is pathologically bad.

You could also try using Tika itself in SolrJ and indexing from a client. That
might let you
1> more gracefully handle this without shutting down Solr
2> use different versions of Tika.

Personally I like offloading the document parsing to clients anyway since it
lessens the load on the Solr server and scales much better, but YMMV.

It's not actually very difficult, here's a skeleton (rip out the DB parts)
http://searchhub.org/2012/02/14/indexing-with-solrj/

Best,
Erick

On Sun, May 25, 2014 at 2:07 AM, Siegfried Goeschl  wrote:
> Sorry typo :- can you send me the PDF by email directly :-)
>
> Siegfried Goeschl
>
> On 25 May 2014, at 10:06, Siegfried Goeschl  wrote:
>
>> Hi Brian,
>>
>> can you send me the email? I would like to play around :-)
>>
>> Have you opened a JIRA for PdfBox? If not I willl open one if I can 
>> reproduce the issue …
>>
>> Thanks in advance
>>
>> Siegfried Goeschl
>>
>>
>> On 25 May 2014, at 04:18, Brian McDowell  wrote:
>>
>>> Our feeding (indexing) tool halts because Solr becomes unresponsive after
>>> getting some really bad pdfs. There are levels of pdf "badness." Some just
>>> will not parse and that's fine, but others are more problematic in that our
>>> Operations team has to restart Solr because it just hangs and accepts no
>>> more documents. I actually have identified a pdf that will bring down Solr
>>> every time. Does anyone think that doing pre-validation using the pdfbox
>>> jar will work? Or, will trying to validate just hang as well? Any help is
>>> appreciated.
>>>
>>>
>>> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky 
>>> wrote:
>>>
 Yeah, I recall running into infinite loop issues with PDFBox in Solr years
 ago. They keep fixing these issues, but they keep popping up again. Sigh.

 -- Jack Krupansky

 -Original Message- From: Siegfried Goeschl
 Sent: Thursday, May 22, 2014 4:35 AM
 To: solr-user@lucene.apache.org
 Subject: Re: pdfs


 Hi folks,

 for a small customer project I'm running SOLR with embedded Tikka.

 * memory consumption is an issue but can be handled
 * there is an issue with PDFBox hitting an infinite loop which causes
 excessive CPU usage - requires SOLR restart but happens only once
 withing 400.000 documents (PDF, Word, ect) but is seems a little bit
 erratic since I was never able to track the problem back to a particular
 PDF document

 Having said that we wire SOLR with Nagios to get an alarm when CPU
 consumption goes through the roof

 If you doing really serious stuff I would recommend
 * moving the document extraction stuff out of SOLR
 * provide monitoring and recovery and stuck document extractions
 ** killing worker threads
 ** using external processed and kill them when spinning out of control

 Cheers,

 Siegfried Goeschl

 On 22.05.14 06:46, Jack Krupansky wrote:

> Yeah, PDF extraction has always been at least somewhat problematic. It
> has improved over the years, but still not likely to be perfect.
>
> That said, I'm not aware of any specific PDF extraction issue that would
> bring down Solr - as opposed to causing a 500 status with an exception
> in PDF extraction, with the exception of memory usage. Some PDF
> documents, especially those which are graphic-intense can require a lot
> of memory. The rest of Solr could be adversely affected if all available
> JVM heap is consumed. The solution is to give the JVM more heap space.
>
> So, what is your specific symptom?
>
> -- Jack Krupansky
>
> -Original Message- From: Brian McDowell
> Sent: Thursday, May 22, 2014 12:24 AM
> To: solr-user@lucene.apache.org
> Subject: pdfs
>
> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
> Solr completely so that it actually needs to be manually restarted. We are
> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
> problem because the release notes associated with the new tika version and
> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
> this issue is causing us to reevaluate using Solr. Any help on this matter
> would be greatly appreciated. Thank you!
>


>>
>


ExtractingRequestHandler indexing zip files

2014-05-26 Thread marotosg
Hi,

I am using ExtractingRequestHandler to be able to index different type of
documents (doc,pdf,txt,html)
but when I try to index compressed files like zip files solr returns the
name of the file inside the field
which I am using to map the content.

Any idea is this is actually working?

I tried with Solr4.5 and Solr4.8

Regards
Sergio




--
View this message in context: 
http://lucene.472066.n3.nabble.com/ExtractingRequestHandler-indexing-zip-files-tp4138172.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud Nodes autoSoftCommit and (temporary) missing documents

2014-05-26 Thread Erick Erickson
Siegfried's comment is spot-on. Your filter query will not be re-used
unless you submit two within the same millisecond! Here's more than
you want to know about why.

http://searchhub.org/2012/02/23/date-math-now-and-filter-queries/
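
For example, rounding the timestamp in the filter (a sketch using the
timestamp field from Michael's mail; NOW/MINUTE truncates to the current
minute, so every query issued within the same minute produces an identical
fq string and can reuse the cached filter entry):

fq=timestamp:[* TO NOW/MINUTE]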

Best,
Erick

On Sun, May 25, 2014 at 10:56 AM, Siegfried Goeschl  wrote:
> Hi folks,
>
> I think that the timestamp should be rounded down to a minute (or whatever) 
> to avoid trashing the filter query cache
>
> Cheers,
>
> Siegfried Goeschl
>
> On 25 May 2014, at 18:19, Steve McKay  wrote:
>
>> Solr can add the filter for you:
>>
>> <lst name="appends">
>>   <str name="fq">timestamp:[* TO NOW-30SECOND]</str>
>> </lst>
>>
>> Increasing soft commit frequency isn't a bad idea, though. I'd probably do 
>> both. :)
>>
>> On May 23, 2014, at 6:51 PM, Michael Tracey  wrote:
>>
>>> Hey all,
>>>
>>> I've got a number of nodes (Solr 4.4 Cloud) that I'm balancing with HaProxy 
>>> for queries.  I'm indexing pretty much constantly, and have autoCommit and 
>>> autoSoftCommit on for Near Realtime Searching.  All works nicely, except 
>>> that occasionally the auto-commit cycles are far enough off that one node 
>>> will return a document that another node doesn't.  I don't want to have to 
>>> add something like this: timestamp:[* TO NOW-30MINUTE] to every query to 
>>> make sure that all the nodes have the record.  Ideas? autoSoftCommit more 
>>> often?
>>>
>>> 
>>>  10
>>>  720
>>>  false
>>> 
>>>
>>> 
>>>  3
>>>  5000
>>> 
>>>
>>> Thanks,
>>>
>>> M.
>>
>


Re: Using fq as OR

2014-05-26 Thread Erick Erickson
Dmitry:

You have a valid point. That said I'm pretty sure you could have the
filter query use your custom parser by something like
fq={!customparser} whatever

Of course if you were doing something in your custom qparser that
needed both halves, that wouldn't work either..

Best,
Erick

On Mon, May 26, 2014 at 12:14 AM, Dmitry Kan  wrote:
> Erick,
>
> correct me if the following's wrong, but if you have a custom query parser
> configured to preprocess your searches, you'd need to send the
> corresponding bit of the search in the q= parameter, rather than fq=
> parameter. In that sense, q and fq are not exactly equal.
>
> Dmitry
>
>
> On Thu, May 22, 2014 at 5:57 PM, Erick Erickson 
> wrote:
>
>> Hmmm, not quite.
>>
>> AFAIK, anything you can put in a q clause can also be put in an fq
>> clause. So it's not a matter of whether your search is precise or not
>> that you should use for determining whether to use a q or fq clause.
>> What _should_ influence this is whether docs that satisfy the clause
>> should contribute to ranking.
>>
>> fq clauses do NOT contribute to ranking. They determine whether the
>> doc is returned at all.
>> q clauses contribute to the ranking.
>>
>> Additionally, the results of fq clauses are cached and may be re-used.
>>
>> That said, since fq clauses are often used in conjunction with
>> faceting, they are very often used more precisely. But it's still a
>> matter of caching and ranking that should determine where the clause
>> goes.
>>
>> FWIW,
>> Erick
>>
>> On Wed, May 21, 2014 at 9:09 PM, manju16832003 
>> wrote:
>> > The *fq* is used for searching more deterministic results something like
>> > WHERE type={}
>> > Where as *q* is something like WHERE type like '%%'
>> >
>> > user *fq*, if your are sure of what your going to search
>> > use *q*, if not sure what your trying to search
>> >
>> > If you are using fq and if you do not get any matching documents, solr
>> > throws 0 or error message
>> > where q would try to match nearest documents for your search query
>> >
>> > That's what I have experienced so far. :-).
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://lucene.472066.n3.nabble.com/Using-fq-as-OR-tp4137411p4137525.html
>> > Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>
>
> --
> Dmitry Kan
> Blog: http://dmitrykan.blogspot.com
> Twitter: http://twitter.com/dmitrykan


Re: MergeReduceIndexerTool takes a lot of time for a limited number of documents

2014-05-26 Thread Erick Erickson
The MapReduceIndexerTool is really intended for very large data sets,
and by today's standards 80K doesn't qualify :).

Basically, MRIT creates N sub-indexes, then merges them, which it
may do in a tiered fashion. That is, it may merge gen1 to gen2, then
merge gen2 to gen3 etc. Which is great when indexing a bazillion
documents into 20 shards, but all that copying around may take
more time than you really gain for 80K docs.

Also be aware that MRIT does NOT update docs with the same ID, this
is due to the inherent limitation of the Lucene mergeIndex process.

How long is "a long time"? Attachments tend to get filtered out, so if you
want us to see the graph you might paste it somewhere and provide a link.

Best,
Erick

On Mon, May 26, 2014 at 8:51 AM, Costi Muraru  wrote:
> Hey guys,
>
> I'm using the MergeReduceIndexerTool to import data into a SolrCloud
> cluster made out of 3 decent machines.
> Looking in the JobTracker, I can see that the mapper jobs finish quite
> fast. The reduce jobs get to ~80% quite fast as well. It is here where
> they get stucked for a long period of time (picture + log attached).
> I'm only trying to insert ~80k documents with 10-50 different fields
> each. Why is this happening? Am I not setting something correctly? Is
> the fact that most of the documents have different field names, or too
> many for that matter?
> Any tips are gladly appreciated.
>
> Thanks,
> Costi
>
> From the reduce logs:
> 60208 [main] INFO  org.apache.solr.update.UpdateHandler  - start
> commit{,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false,prepareCommit=false}
> 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> [IW][main]: commit: start
> 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> [IW][main]: commit: enter lock
> 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> [IW][main]: commit: now prepare
> 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> [IW][main]: prepareCommit: flush
> 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> [IW][main]:   index before flush
> 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> [DW][main]: main startFullFlush
> 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> [DW][main]: anyChanges? numDocsInRam=25603 deletes=true
> hasTickets:false pendingChangesInFullFlush: false
> 60209 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> [DWFC][main]: addFlushableState DocumentsWriterPerThread
> [pendingDeletes=gen=0 25602 deleted terms (unique count=25602)
> bytesUsed=5171604, segment=_0, aborting=false, numDocsInRAM=25603,
> deleteQueue=DWDQ: [ generation: 0 ]]
> 61542 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> [DWPT][main]: flush postings as segment _0 numDocs=25603
> 61664 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> heart beat for 1 threads
> 125115 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> heart beat for 1 threads
> 199408 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> heart beat for 1 threads
> 271088 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> heart beat for 1 threads
> 336754 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> heart beat for 1 threads
> 417810 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> heart beat for 1 threads
> 479495 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> heart beat for 1 threads
> 552357 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> heart beat for 1 threads
> 621450 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> heart beat for 1 threads
> 683173 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> heart beat for 1 threads
>
> This is the run command I'm using:
> hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar
> org.apache.solr.hadoop.MapReduceIndexerTool \
>  --log4j /home/cmuraru/solr/log4j.properties \
>  --morphline-file morphline.conf \
>  --output-dir hdfs://nameservice1:8020/tmp/outdir \
>  --verbose --go-live --zk-host localhost:2181/solr \
>  --collection collection1 \
> hdfs://nameservice1:8020/tmp/indir


Re: Using SolrCloud with RDBMS or without

2014-05-26 Thread Erick Erickson
What you haven't told us is where the data comes from. But until
you put some numbers to it, it's hard to decide.

I tend to prefer storing the data somewhere else, filesystem, whatever
and indexing to Solr when data changes. Even if that means re-indexing
the entire corpus. I don't like going to more complicated solutions until
that proves untenable.

Backup/restore solutions for filesystems, DBs, whatever are are a very
mature technology, I rely on that first to store my original source.

Now you can re-index at will.

So let's claim your data comes in from some stream somewhere. I'd
1> store it to the file system.
2> write a program to pull it off the file system and index.
3> Your comment about MapReduceIndexerTool is germane. You can re-index
all that data very quickly. And it'll find files on your file system
for you too!

But I wouldn't even go there until I'd tried indexing my 10M docs straight
with SolrJ or similar. If you can index your 10M docs in 1 hour and, by
extrapolation, your 100M docs in 10 hours, is that good enough?
I don't know, it's your problem space after all ;). And is it acceptable to not
see changes to the schema until tomorrow morning? If so, there's no need to get
more complicated.

Best,
Erick

On Mon, May 26, 2014 at 9:00 AM, Shawn Heisey  wrote:
> On 5/26/2014 7:50 AM, Ali Nazemian wrote:
>> I was wondering which scenario (or the combination) would be better for my
>> application. From the aspect of performance, scalability and high
>> availability. Here is my application:
>>
>> Suppose I am going to have more than 10m documents and it grows every day.
>> (probably in 1 years it reaches to more than 100m docs. I want to use Solr
>> as tool for indexing these documents but the problem is I have some data
>> fields that could change frequently. (not too much but it could change)
>
> Choosing which database software to use to hold your data is a problem
> with many possible solutions.  Everyone will have a different answer for
> you.  Each solution has strengths and weaknesses, and in the end, only
> you can really know what your requirements are.
>
>> Scenarios:
>>
>> 1- Using SolrCloud as database for all data. (even the one that could be
>> changed)
>
> If you choose to use Solr as a NoSQL, I would strongly recommend that
> you have two Solr installs.  The first install would be purely for data
> storage and would have no indexed fields.  If you can get machines with
> enough RAM, it would also probably be preferable to use a single index
> (or SolrCloud with one shard) for that install.  The other install would
> be for searching.  Sharding would not be an issue on that index.  The
> reason that I make this recommendation is that when you use Solr for
> searching, you have to do a complete reindex if you change your search
> schema.  It's difficult to reindex if the search index is also your
> canonical data source.
>
>> 2- Using SolrCloud as database for static data and using RDBMS (such as
>> oracle) for storing dynamic fields.
>
> I don't think it would be a good idea to have two canonical data
> sources.  Pick one.  As already mentioned, Solr is better as a search
> technology, serving up pointers to data in another data source, than as
> a database.
>
> If you want to use RDBMS technology, why would you spend all that money
> on Oracle?  Just use one of the free databases.  Our really large Solr
> index comes from a database.  At one time that database was in Oracle.
> When my employer purchased the company with that database, we thought we
> were obtaining a full Oracle license.  It turns out we weren't.  It
> would have cost about half a million dollars to buy that license, so we
> switched to MySQL.
>
> Since making that move to MySQL, performance is actually *better*.  The
> source table for our data has 96 million rows right now, growing at a
> rate of a few million per year.  This is completely in line with your
> 100 million document requirement.  For the massive table that feeds
> Solr, we might switch to MongoDB, but that has not been decided yet.
>
> Later we switched from EasyAsk to Solr, a move that has *also* given us
> better performance.  Because both MySQL and Solr are free, we've
> achieved a substantial cost savings.
>
>> 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all
>> data.
>
> I have no experience with this technology, but I think that if you are
> thinking about a database on HDFS, you're probably actually talking
> about HBase, the Apache implementation of Google's BigTable.
>
> Thanks,
> Shawn
>


How to Configure Solr For Test Purposes?

2014-05-26 Thread Furkan KAMACI
Hi;

I run Solr within my test suite. I delete documents or atomically update
them and check whether it works or not. I know that I have to set up
hard/soft commit timing for my test Solr. However, even though I have these settings:

 
 <autoCommit>
   <maxTime>1</maxTime>
   <openSearcher>true</openSearcher>
 </autoCommit>

 <autoSoftCommit>
   <maxTime>1</maxTime>
 </autoSoftCommit>

and even if I wait (Thread.sleep()) to give Solr some time, *sometimes* my
tests fail. I get a failure even if I increase the wait time. Example of a
piece of code that sometimes fails:

for (int i = 0; i < dummyDocumentSize; i++) {
 deleteById("id" + i);
 dummyDocumentSize--;
 queryResponse = query(solrParams);
 assertTrue(queryResponse.getResults().size() == dummyDocumentSize);
  }

In debug mode, if I wait for Solr to reflect the changes, I see that I do not
get the error. What do you think? What kind of configuration should I have for
this kind of purpose?

Thanks;
Furkan KAMACI


Re: “ClientAbortException: java.io.IOException” in solr query

2014-05-26 Thread Shawn Heisey
On 8/3/2013 7:18 AM, Alexandre Rafalovitch wrote:
> The client closed the web-browser page or stopped loading or some other
> timeout/connection close. Then, the server tries to write to no-longer
> existing connection and fails.
> 
> If you control the client, then you might have some sort of timeout value,
> which kills connections after very long queries.
> 
> Regards,
>Alex.
> 
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
> 
> 
> On Fri, Aug 2, 2013 at 11:46 PM, aniljayanti wrote:
> 
>> Hi,
>>
>> I am generating a Solr index using apache-tomcat-7.0.19 and Solr 3.3.
>> The index was generated successfully with a count of 3350128 records. Now I
>> am testing my Solr search performance continuously by hitting it with
>> different search queries.
>>
>> While testing, some search queries fail, and I get the error below in the
>> Tomcat error logs.
>>
>> org.apache.solr.common.SolrException log
>> SEVERE: ClientAbortException:  java.io.IOException

I second what Alexandre said.  Here's Tomcat's own javadoc saying the
same thing:

https://tomcat.apache.org/tomcat-5.5-doc/catalina/docs/api/org/apache/catalina/connector/ClientAbortException.html

The specific part of this javadoc that is relevant here: "Wrap an
IOException identifying it as being caused by an abort of a request by a
remote client."

The client making the query chose to disconnect before Solr had
responded.  You may need to increase the timeout on the client.  The
only thing you can do on the Solr end is make the query respond faster,
which is a performance issue.  Usually (but not always), performance
issues are caused by some variation of "not enough memory."

http://wiki.apache.org/solr/SolrPerformanceProblems
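If the client making the query happens to be SolrJ, the read timeout can be
raised on the client object so it waits longer before giving up on a slow
query. A minimal sketch, assuming a SolrJ 4.x client; the URL and timeout
values are illustrative, and other client libraries have equivalent settings:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

// URL and timeouts below are examples only, not taken from the original post
HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr/core1");
server.setConnectionTimeout(5000);  // ms allowed to establish the connection
server.setSoTimeout(120000);        // ms to wait for a response before the client aborts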

Thanks,
Shawn



Re: MergeReduceIndexerTool takes a lot of time for a limited number of documents

2014-05-26 Thread Costi Muraru
Hey Erick,

The job reducers began to die with "Error: Java heap space" after being stuck
at ~80% for 1h and 22 minutes.

I did a few more tests:

Test 1.
80,000 documents
Each document had *20* fields. The field names were *the same* for all the
documents. Values were different.
Job status: successful
Execution time: 33 seconds.

Test 2.
80,000 documents
Each document had *20* fields. The field names were *different* for all the
documents. Values were also different.
Job status: successful
Execution time: 643 seconds.

Test 3.
80,000 documents
Each document had *50* fields. The field names were *the same* for all the
documents. Values were different.
Job status: successful
Execution time: 45.96 seconds.

Test 4.
80,000 documents
Each document had *50* fields. The field names were *different* for all the
documents. Values were also different.
Job status: failed
Execution time: after 1h reducers failed.
Unfortunately, this is my use case.

My guess is that the reduce time (to perform the merges) depends on whether the
field names are the same across the documents. If they are different, the
merge time increases dramatically. I don't know the internals of the Solr
merge operation, but is it possible that it tries to group the fields with
the same name across all the documents?
In the first case, when the field names are the same across documents, the
number of buckets is equal to the number of unique field names, which is 20.
In the second case, where all the field names are different (my use case),
it creates a lot more buckets (80k documents * 50 different field names = 4
million buckets) and the process slows down significantly.
Is this assumption correct, and is there any way to get around it?

Thanks again for reaching out. Hope this is more clear now.

This is what one of the 80k documents looks like (JSON format):
{
"id" : "442247098240414508034066540706561683636",
"items" : {
   "IT49597_1180_i" : 76,
   "IT25363_1218_i" : 4,
   "IT12418_1291_i" : 95,
   "IT55979_1051_i" : 31,
   "IT9841_1224_i" : 36,
   "IT40463_1010_i" : 87,
   "IT37932_1346_i" : 11,
   "IT17653_1054_i" : 37,
   "IT59414_1025_i" : 96,
   "IT51080_1133_i" : 5,
   "IT7369_1395_i" : 90,
   "IT59974_1245_i" : 25,
   "IT25374_1345_i" : 75,
   "IT16825_1458_i" : 28,
   "IT56643_1050_i" : 76,
   "IT46274_1398_i" : 50,
   "IT47411_1275_i" : 11,
   "IT2791_1000_i" : 97,
   "IT7708_1053_i" : 96,
   "IT46622_1112_i" : 90,
   "IT47161_1382_i" : 64
   }
}
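In case the field-name explosion really is the cause, one idea I am considering
(purely illustrative, the field and value names below are made up) is to fold
the per-item data into a single shared multiValued field, so that every
document uses the same small set of field names. A rough SolrJ sketch:

import org.apache.solr.common.SolrInputDocument;

// Instead of one dynamic field per item, store "itemName:value" pairs in one
// multiValued string field, e.g. "item_kv_ss".
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "442247098240414508034066540706561683636");
doc.addField("item_kv_ss", "IT49597_1180:76");
doc.addField("item_kv_ss", "IT25363_1218:4");
// ... one value per item instead of one field per item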

Costi


On Mon, May 26, 2014 at 7:45 PM, Erick Erickson wrote:

> The MapReduceIndexerTool is really intended for very large data sets,
> and by today's standards 80K doesn't qualify :).
>
> Basically, MRIT creates N sub-indexes, then merges them, which it
> may do in a tiered fashion. That is, it may merge gen1 to gen2, then
> merge gen2 to gen3 etc. Which is great when indexing a bazillion
> documents into 20 shards, but all that copying around may take
> more time than you really gain for 80K docs.
>
> Also be aware that MRIT does NOT update docs with the same ID, this
> is due to the inherent limitation of the Lucene mergeIndex process.
>
> How long is "a long time"? attachments tend to get filtered out, so if you
> want us to see the graph you might paste it somewhere and provide a link.
>
> Best,
> Erick
>
> On Mon, May 26, 2014 at 8:51 AM, Costi Muraru 
> wrote:
> > Hey guys,
> >
> > I'm using the MergeReduceIndexerTool to import data into a SolrCloud
> > cluster made out of 3 decent machines.
> > Looking in the JobTracker, I can see that the mapper jobs finish quite
> > fast. The reduce jobs get to ~80% quite fast as well. It is here where
> > they get stucked for a long period of time (picture + log attached).
> > I'm only trying to insert ~80k documents with 10-50 different fields
> > each. Why is this happening? Am I not setting something correctly? Is
> > the fact that most of the documents have different field names, or too
> > many for that matter?
> > Any tips are gladly appreciated.
> >
> > Thanks,
> > Costi
> >
> > From the reduce logs:
> > 60208 [main] INFO  org.apache.solr.update.UpdateHandler  - start
> >
> commit{,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false,prepareCommit=false}
> > 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> > [IW][main]: commit: start
> > 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> > [IW][main]: commit: enter lock
> > 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> > [IW][main]: commit: now prepare
> > 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> > [IW][main]: prepareCommit: flush
> > 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> > [IW][main]:   index before flush
> > 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> > [DW][main]: main startFullFlush
> > 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> > [DW][main]: anyChanges? numDocsInRam=25603 deletes=true
> > hasTickets:false pendingCha

Re: How to Configure Solr For Test Purposes?

2014-05-26 Thread Shawn Heisey
On 5/26/2014 10:57 AM, Furkan KAMACI wrote:
> Hi;
> 
> I run Solr within my Test Suite. I delete documents or atomically update
> them and check whether if it works or not. I know that I have to setup a
> hard/soft commit timing for my test Solr. However even I have that settings:
> 
>  <autoCommit>
>    <maxTime>1</maxTime>
>    <openSearcher>true</openSearcher>
>  </autoCommit>
> 
>  <autoSoftCommit>
>    <maxTime>1</maxTime>
>  </autoSoftCommit>

I hope you know that this is BAD configuration.  Doing automatic commits
on an interval of 1 millisecond is asking for a whole host of problems.
 In some cases, this could do a commit after every single document that
is indexed, which is NOT recommended at all.  The openSearcher setting
of "true" on autoCommit makes it even worse.  There's no reason to do
both autoSoftCommit and autoCommit with openSearcher=true.  I don't know
which one "wins" between autoCommit and autoSoftCommit if they both have
the same config, but I would guess the hard commit does.

> and even I wait (Thread.sleep()) for a time to wait Solr *sometimes* my
> tests are failed. I get fail error even I increase wait time.  Example of a
> sometimes failed code piece:
> 
> for (int i = 0; i < dummyDocumentSize; i++) {
>  deleteById("id" + i);
>  dummyDocumentSize--;
>  queryResponse = query(solrParams);
>  assertTrue(queryResponse.getResults().size() == dummyDocumentSize);
>   }
> 
> at debug mode if I wait for Solr to reflect changes I see that I do not get
> error. What do you think, what kind of configuration I should have for such
> kind of purposes?

Chances are that commits are going to take longer than 1 millisecond.
If you're actively indexing, the system is going to be trying to stack
up lots of commits at the same time.  The maxWarmingSearchers value will
limit the number of new searchers that can be opened, but it will not
stop the commits themselves.  When lots of commits are going on, each
one will take *even longer* to complete, which probably explains the
problem.
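
One way to avoid depending on millisecond autocommits in a test suite is to not
use time-based commits at all and instead issue an explicit, blocking commit
from the test after each change, and only then query. A minimal SolrJ sketch,
adapted from the loop you posted (the server and solrParams objects are assumed
to exist already):

import static org.junit.Assert.assertTrue;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

int expected = dummyDocumentSize;
for (int i = 0; i < dummyDocumentSize; i++) {
    server.deleteById("id" + i);
    server.commit(true, true);  // waitFlush, waitSearcher: block until a new searcher is open
    expected--;
    QueryResponse queryResponse = server.query(solrParams);
    assertTrue(queryResponse.getResults().size() == expected);
}

With explicit commits like this you could drop autoCommit and autoSoftCommit
from the test solrconfig entirely.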

Thanks,
Shawn



Solr Deduplicate - Class Not Found Exception

2014-05-26 Thread Manikandan Saravanan
Hi,

I’m running Nutch 2 on a Hadoop 1.2.1 cluster with 2 nodes. I’m running Solr 4 
separately on a box and I replaced Solr’s schema with Nutch’s Solr-4 schema. 
When I run a crawl, I get the following error at the end of the job

14/05/26 14:08:32 INFO solr.SolrDeleteDuplicates: SolrDeleteDuplicates: 
starting...
14/05/26 14:08:32 INFO solr.SolrDeleteDuplicates: SolrDeleteDuplicates: Solr 
url: http://10.130.231.16:8983/solr/nutch
14/05/26 14:08:33 WARN mapred.JobClient: No job jar file set.  User classes may 
not be found. See JobConf(Class) or JobConf#setJar(String).
14/05/26 14:08:33 INFO mapred.JobClient: Running job: job_201405261214_0014
14/05/26 14:08:34 INFO mapred.JobClient:  map 0% reduce 0%
14/05/26 14:08:43 INFO mapred.JobClient: Task Id : 
attempt_201405261214_0014_m_00_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
at 
org.apache.hadoop.mapreduce.JobContext.getInputFormatClass(JobContext.java:187)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:722)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassNotFoundException: 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
... 8 more

14/05/26 14:08:43 WARN mapred.JobClient: Error reading task 
outputnutch-two-qontifi
14/05/26 14:08:43 WARN mapred.JobClient: Error reading task 
outputnutch-two-qontifi
14/05/26 14:08:44 INFO mapred.JobClient: Task Id : 
attempt_201405261214_0014_m_01_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
at 
org.apache.hadoop.mapreduce.JobContext.getInputFormatClass(JobContext.java:187)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:722)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassNotFoundException: 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
... 8 more

14/05/26 14:08:50 INFO mapred.JobClient: Task Id : 
attempt_201405261214_0014_m_01_1, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
at 
org.apache.hadoop.mapreduce.JobContext.getInputFormatClass(JobContext.java:187)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:722)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:3

Re: Solr Deduplicate - Class Not Found Exception

2014-05-26 Thread Shawn Heisey
On 5/26/2014 12:20 PM, Manikandan Saravanan wrote:
> I’m running Nutch 2 on a Hadoop 1.2.1 cluster with 2 nodes. I’m running Solr 
> 4 separately on a box and I replaced Solr’s schema with Nutch’s Solr-4 
> schema. When I run a crawl, I get the following error at the end of the job
> 
> 14/05/26 14:08:32 INFO solr.SolrDeleteDuplicates: SolrDeleteDuplicates: 
> starting...
> 14/05/26 14:08:32 INFO solr.SolrDeleteDuplicates: SolrDeleteDuplicates: Solr 
> url: http://10.130.231.16:8983/solr/nutch
> 14/05/26 14:08:33 WARN mapred.JobClient: No job jar file set.  User classes 
> may not be found. See JobConf(Class) or JobConf#setJar(String).
> 14/05/26 14:08:33 INFO mapred.JobClient: Running job: job_201405261214_0014
> 14/05/26 14:08:34 INFO mapred.JobClient:  map 0% reduce 0%
> 14/05/26 14:08:43 INFO mapred.JobClient: Task Id : 
> attempt_201405261214_0014_m_00_0, Status : FAILED
> java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat
>   at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
>   at 
> org.apache.hadoop.mapreduce.JobContext.getInputFormatClass(JobContext.java:187)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:722)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat

I am not subscribed to the nutch mailing list, so I have removed that
list from the recipients here.

If you look at the last line that I quoted above, you'll see that the
exception is caused by the inability of Java to locate a class, and that
the class is a Nutch class.  I just built Nutch 2.2.1 on my server, and
the strange thing here is that this class seems to be part of the main
apache nutch jar, so I have no idea how you are using nutch without this
class being present.

Because this is a nutch class that is missing and not a Solr class, the
Solr mailing list can't really provide much help.

Thanks,
Shawn



Re: Wordbreak spellchecker excessive breaking.

2014-05-26 Thread S.L
Anyone ?


On Sat, May 24, 2014 at 5:21 PM, S.L  wrote:

>
> I am using the Solr wordbreak spellchecker and the issue is that when I search
> for a term like "mob ile", expecting that the wordbreak spellchecker would
> actually return a suggestion for "mobile", it instead breaks the search term
> into letters like "m o b". I have two issues with this behavior.
>
>  1. How can I make Solr combine "mob ile" into "mobile"?
>  2. Notwithstanding the fact that my search term "mob ile" is being
> broken incorrectly into individual letters, I realize that the wordbreak
> is needed in certain cases. How do I control the wordbreak so that it does
> not break the term into letters like "m o b", which seems like excessive
> breaking to me?
>
> Thanks.
>
>


Re: Using SolrCloud with RDBMS or without

2014-05-26 Thread Ali Nazemian
Dear Erick,
Thank you for your reply.
Some parts of the documents come from the Nutch crawler and the other parts
come from processing those documents.
I really need it to be as fast as possible and 10 hours for indexing is not
acceptable for my application.
Regards.


On Mon, May 26, 2014 at 9:25 PM, Erick Erickson wrote:

> What you haven't told us is where the data comes from. But until
> you put some numbers to it, it's hard to decide.
>
> I tend to prefer storing the data somewhere else, filesystem, whatever
> and indexing to Solr when data changes. Even if that means re-indexing
> the entire corpus. I don't like going to more complicated solutions until
> that proves untenable.
>
> Backup/restore solutions for filesystems, DBs, whatever are are a very
> mature technology, I rely on that first to store my original source.
>
> Now you can re-index at will.
>
> So let's claim your data comes in from some stream somewhere. I'd
> 1> store it to the file system.
> 2> write a program to pull it off the file system and index.
> 3> Your comment about MapReduceIndexerTool is germane. You can re-index
> all that data very quickly. And it'll find files on your file system
> for you too!
>
> But I wouldn't even go there until I'd tried
> indexing my 10M docs straight with SolrJ or similar. If you can index
> your 10M docs
> in 1 hour and, by extrapolation your 100M docs in 10 hours, is that good
> enough?
> I don't know, it's your problem space after all ;). And is it acceptable
> to not
> see changes to the schema until tomorrow morning? If so, there's no need
> to get
> more complicated
>
> Best,
> Erick
>
> On Mon, May 26, 2014 at 9:00 AM, Shawn Heisey  wrote:
> > On 5/26/2014 7:50 AM, Ali Nazemian wrote:
> >> I was wondering which scenario (or the combination) would be better for
> my
> >> application. From the aspect of performance, scalability and high
> >> availability. Here is my application:
> >>
> >> Suppose I am going to have more than 10m documents and it grows every
> day.
> >> (probably in 1 years it reaches to more than 100m docs. I want to use
> Solr
> >> as tool for indexing these documents but the problem is I have some data
> >> fields that could change frequently. (not too much but it could change)
> >
> > Choosing which database software to use to hold your data is a problem
> > with many possible solutions.  Everyone will have a different answer for
> > you.  Each solution has strengths and weaknesses, and in the end, only
> > you can really know what your requirements are.
> >
> >> Scenarios:
> >>
> >> 1- Using SolrCloud as database for all data. (even the one that could be
> >> changed)
> >
> > If you choose to use Solr as a NoSQL, I would strongly recommend that
> > you have two Solr installs.  The first install would be purely for data
> > storage and would have no indexed fields.  If you can get machines with
> > enough RAM, it would also probably be preferable to use a single index
> > (or SolrCloud with one shard) for that install.  The other install would
> > be for searching.  Sharding would not be an issue on that index.  The
> > reason that I make this recommendation is that when you use Solr for
> > searching, you have to do a complete reindex if you change your search
> > schema.  It's difficult to reindex if the search index is also your
> > canonical data source.
> >
> >> 2- Using SolrCloud as database for static data and using RDBMS (such as
> >> oracle) for storing dynamic fields.
> >
> > I don't think it would be a good idea to have two canonical data
> > sources.  Pick one.  As already mentioned, Solr is better as a search
> > technology, serving up pointers to data in another data source, than as
> > a database.
> >
> > If you want to use RDBMS technology, why would you spend all that money
> > on Oracle?  Just use one of the free databases.  Our really large Solr
> > index comes from a database.  At one time that database was in Oracle.
> > When my employer purchased the company with that database, we thought we
> > were obtaining a full Oracle license.  It turns out we weren't.  It
> > would have cost about half a million dollars to buy that license, so we
> > switched to MySQL.
> >
> > Since making that move to MySQL, performance is actually *better*.  The
> > source table for our data has 96 million rows right now, growing at a
> > rate of a few million per year.  This is completely in line with your
> > 100 million document requirement.  For the massive table that feeds
> > Solr, we might switch to MongoDB, but that has not been decided yet.
> >
> > Later we switched from EasyAsk to Solr, a move that has *also* given us
> > better performance.  Because both MySQL and Solr are free, we've
> > achieved a substantial cost savings.
> >
> >> 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for
> all
> >> data.
> >
> > I have no experience with this technology, but I think that if you are
> > thinking about a database on HDFS, you're probably actually talking
>

Re: Using SolrCloud with RDBMS or without

2014-05-26 Thread Ali Nazemian
Dear Shawn,
Hi, and thank you for your reply.
Could you please tell me about the performance and scalability of the
mentioned solutions? Suppose I have a SolrCloud with 4 different machines.
Would it scale linearly if I add another 4 machines to that? I mean, when
the number of documents increases from 10m to 100m.
Regards.


On Mon, May 26, 2014 at 8:30 PM, Shawn Heisey  wrote:

> On 5/26/2014 7:50 AM, Ali Nazemian wrote:
> > I was wondering which scenario (or the combination) would be better for
> my
> > application. From the aspect of performance, scalability and high
> > availability. Here is my application:
> >
> > Suppose I am going to have more than 10m documents and it grows every
> day.
> > (probably in 1 years it reaches to more than 100m docs. I want to use
> Solr
> > as tool for indexing these documents but the problem is I have some data
> > fields that could change frequently. (not too much but it could change)
>
> Choosing which database software to use to hold your data is a problem
> with many possible solutions.  Everyone will have a different answer for
> you.  Each solution has strengths and weaknesses, and in the end, only
> you can really know what your requirements are.
>
> > Scenarios:
> >
> > 1- Using SolrCloud as database for all data. (even the one that could be
> > changed)
>
> If you choose to use Solr as a NoSQL, I would strongly recommend that
> you have two Solr installs.  The first install would be purely for data
> storage and would have no indexed fields.  If you can get machines with
> enough RAM, it would also probably be preferable to use a single index
> (or SolrCloud with one shard) for that install.  The other install would
> be for searching.  Sharding would not be an issue on that index.  The
> reason that I make this recommendation is that when you use Solr for
> searching, you have to do a complete reindex if you change your search
> schema.  It's difficult to reindex if the search index is also your
> canonical data source.
>
> > 2- Using SolrCloud as database for static data and using RDBMS (such as
> > oracle) for storing dynamic fields.
>
> I don't think it would be a good idea to have two canonical data
> sources.  Pick one.  As already mentioned, Solr is better as a search
> technology, serving up pointers to data in another data source, than as
> a database.
>
> If you want to use RDBMS technology, why would you spend all that money
> on Oracle?  Just use one of the free databases.  Our really large Solr
> index comes from a database.  At one time that database was in Oracle.
> When my employer purchased the company with that database, we thought we
> were obtaining a full Oracle license.  It turns out we weren't.  It
> would have cost about half a million dollars to buy that license, so we
> switched to MySQL.
>
> Since making that move to MySQL, performance is actually *better*.  The
> source table for our data has 96 million rows right now, growing at a
> rate of a few million per year.  This is completely in line with your
> 100 million document requirement.  For the massive table that feeds
> Solr, we might switch to MongoDB, but that has not been decided yet.
>
> Later we switched from EasyAsk to Solr, a move that has *also* given us
> better performance.  Because both MySQL and Solr are free, we've
> achieved a substantial cost savings.
>
> > 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all
> > data.
>
> I have no experience with this technology, but I think that if you are
> thinking about a database on HDFS, you're probably actually talking
> about HBase, the Apache implementation of Google's BigTable.
>
> Thanks,
> Shawn
>
>


-- 
A.Nazemian


Re: Using SolrCloud with RDBMS or without

2014-05-26 Thread Shawn Heisey
On 5/26/2014 1:48 PM, Ali Nazemian wrote:
> Dear Shawn,
> Hi and thank you for you reply.
> Could you please tell me about the performance and scalability of the
> mentioned solutions? Suppose I have a SolrCloud with 4 different machine.
> Would it scale linearly if I add another 4 machines to that? I mean when
> the documents number increases from 10m to 100m documents.

I am completely unable to give you any kind of definitive answer to
that.  The only way to estimate what kind of performance and scalability
to expect with your data is to actually build a test system with your data.

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Thanks,
Shawn



RE: Using SolrCloud with RDBMS or without

2014-05-26 Thread Susheel Kumar
A few things would help here, if you can clarify what is acceptable in terms of
indexing hours and what the use case for indexing is:

- Are you looking to re-index all data (say 100m) frequently, so that you need
the indexing time to be on the lower side (<10 or <5 hours etc.)? If so, how
many hours would you consider reasonable?

- Or can you afford to not re-index all data and instead index incrementally?
(Not sure how frequently your fields change, as mentioned by you.)

Also, as Erick pointed out, using SolrJ with parallelism you can achieve fast
indexing. We recently had a use case where we indexed around 10m docs from a
database in less than half an hour.
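
As a rough sketch of what I mean by parallelism (the URL, field names and sizes
below are illustrative, not our actual code), ConcurrentUpdateSolrServer buffers
documents and streams them to Solr from several background threads:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

// queue size (10000 docs) and thread count (4) are examples; tune for your hardware
ConcurrentUpdateSolrServer server =
    new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 10000, 4);

for (int i = 0; i < 10000000; i++) {        // stand-in for iterating your JDBC result set
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc" + i);
    doc.addField("title_s", "title " + i);  // illustrative fields
    server.add(doc);                        // buffered and sent by the background threads
}
server.commit();
server.shutdown();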



Thanks,

Susheel



-Original Message-
From: Ali Nazemian [mailto:alinazem...@gmail.com]
Sent: Monday, May 26, 2014 2:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Using SolrCloud with RDBMS or without



Dear Erick,

Thank you for you reply.

Some parts of documents come from Nutch crawler and the other parts come from 
processing those documents.

I really need it to be as fast as possible and 10 hours for indexing is not 
acceptable for my application.

Regards.





On Mon, May 26, 2014 at 9:25 PM, Erick Erickson 
mailto:erickerick...@gmail.com>>wrote:



> What you haven't told us is where the data comes from. But until you

> put some numbers to it, it's hard to decide.

>

> I tend to prefer storing the data somewhere else, filesystem, whatever

> and indexing to Solr when data changes. Even if that means re-indexing

> the entire corpus. I don't like going to more complicated solutions

> until that proves untenable.

>

> Backup/restore solutions for filesystems, DBs, whatever are are a very

> mature technology, I rely on that first to store my original source.

>

> Now you can re-index at will.

>

> So let's claim your data comes in from some stream somewhere. I'd

> 1> store it to the file system.

> 2> write a program to pull it off the file system and index.

> 3> Your comment about MapReduceIndexerTool is germane. You can

> 3> re-index

> all that data very quickly. And it'll find files on your file system

> for you too!

>

> But I wouldn't even go there until I'd tried indexing my 10M docs

> straight with SolrJ or similar. If you can index your 10M docs in 1

> hour and, by extrapolation your 100M docs in 10 hours, is that good

> enough?

> I don't know, it's your problem space after all ;). And is it

> acceptable to not see changes to the schema until tomorrow morning? If

> so, there's no need to get more complicated

>

> Best,

> Erick

>

> On Mon, May 26, 2014 at 9:00 AM, Shawn Heisey 
> mailto:s...@elyograg.org>> wrote:

> > On 5/26/2014 7:50 AM, Ali Nazemian wrote:

> >> I was wondering which scenario (or the combination) would be better

> >> for

> my

> >> application. From the aspect of performance, scalability and high

> >> availability. Here is my application:

> >>

> >> Suppose I am going to have more than 10m documents and it grows

> >> every

> day.

> >> (probably in 1 years it reaches to more than 100m docs. I want to

> >> use

> Solr

> >> as tool for indexing these documents but the problem is I have some

> >> data fields that could change frequently. (not too much but it

> >> could change)

> >

> > Choosing which database software to use to hold your data is a

> > problem with many possible solutions.  Everyone will have a

> > different answer for you.  Each solution has strengths and

> > weaknesses, and in the end, only you can really know what your requirements 
> > are.

> >

> >> Scenarios:

> >>

> >> 1- Using SolrCloud as database for all data. (even the one that

> >> could be

> >> changed)

> >

> > If you choose to use Solr as a NoSQL, I would strongly recommend

> > that you have two Solr installs.  The first install would be purely

> > for data storage and would have no indexed fields.  If you can get

> > machines with enough RAM, it would also probably be preferable to

> > use a single index (or SolrCloud with one shard) for that install.

> > The other install would be for searching.  Sharding would not be an

> > issue on that index.  The reason that I make this recommendation is

> > that when you use Solr for searching, you have to do a complete

> > reindex if you change your search schema.  It's difficult to reindex

> > if the search index is also your canonical data source.

> >

> >> 2- Using SolrCloud as database for static data and using RDBMS

> >> (such as

> >> oracle) for storing dynamic fields.

> >

> > I don't think it would be a good idea to have two canonical data

> > sources.  Pick one.  As already mentioned, Solr is better as a

> > search technology, serving up pointers to data in another data

> > source, than as a database.

> >

> > If you want to use RDBMS technology, why would you spend all that

> > money on Oracle?  Just use one of the free databases.  Our really

> > large Solr index comes from a database.  At one time tha

Re: How to Configure Solr For Test Purposes?

2014-05-26 Thread Furkan KAMACI
Hi Shawn;

I know that it is a bad practice, but I commit at most 5 documents and
there will never be more than 5 documents at any time in any test method. It
is just for test purposes, to see that my API works. I want to have
automated tests.

What do you suggest for my purpose? If a test case fails, would re-running it
a few times be a solution? What kind of configuration do you suggest for my
Solr instance?

Thanks;
Furkan KAMACI
On 26 May 2014 21:03, "Shawn Heisey"  wrote:

> On 5/26/2014 10:57 AM, Furkan KAMACI wrote:
> > Hi;
> >
> > I run Solr within my Test Suite. I delete documents or atomically update
> > them and check whether if it works or not. I know that I have to setup a
> > hard/soft commit timing for my test Solr. However even I have that
> settings:
> >
> >  <autoCommit>
> >    <maxTime>1</maxTime>
> >    <openSearcher>true</openSearcher>
> >  </autoCommit>
> >
> >  <autoSoftCommit>
> >    <maxTime>1</maxTime>
> >  </autoSoftCommit>
>
> I hope you know that this is BAD configuration.  Doing automatic commits
> on an interval of 1 millisecond is asking for a whole host of problems.
>  In some cases, this could do a commit after every single document that
> is indexed, which is NOT recommended at all.  The openSearcher setting
> of "true" on autoCommit makes it even worse.  There's no reason to do
> both autoSoftCommit and autoCommit with openSearcher=true.  I don't know
> which one "wins" between autoCommit and autoSoftCommit if they both have
> the same config, but I would guess the hard commit does.
>
> > and even I wait (Thread.sleep()) for a time to wait Solr *sometimes* my
> > tests are failed. I get fail error even I increase wait time.  Example
> of a
> > sometimes failed code piece:
> >
> > for (int i = 0; i < dummyDocumentSize; i++) {
> >  deleteById("id" + i);
> >  dummyDocumentSize--;
> >  queryResponse = query(solrParams);
> >  assertTrue(queryResponse.getResults().size() ==
> dummyDocumentSize);
> >   }
> >
> > at debug mode if I wait for Solr to reflect changes I see that I do not
> get
> > error. What do you think, what kind of configuration I should have for
> such
> > kind of purposes?
>
> Chances are that commits are going to take longer than 1 millisecond.
> If you're actively indexing, the system is going to be trying to stack
> up lots of commits at the same time.  The maxWarmingSearchers value will
> limit the number of new searchers that can be opened, but it will not
> stop the commits themselves.  When lots of commits are going on, each
> one will take *even longer* to complete, which probably explains the
> problem.
>
> Thanks,
> Shawn
>
>


RE: Re: Internals about "Too many values for UnInvertedField faceting on field xxx"

2014-05-26 Thread 张月祥
Thanks a lot.

> There are only 256 byte arrays to hold all of the ord data, and the
> pointers into those arrays are only 24 bits long.  That gets you back
> to 32 bits, or 4GB of ord data max.  It's practically less since you
> only have to overflow one array before the exception is thrown.

What does the ord data mean? The term ID, the term-document relation, or the
document-term relation?






Re: ExtractingRequestHandler indexing zip files

2014-05-26 Thread Alexandre Rafalovitch
A zip file can contain many files and directories in a nested
structure. With files of any type and size.

What would you expect Solr to do facing a generic Zip file?

And what would you like it to do for _your_ - one assumes more
restricted - scenario?

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Mon, May 26, 2014 at 11:21 PM, marotosg  wrote:
> Hi,
>
> I am using ExtractingRequestHandler to be able to index different types of
> documents (doc, pdf, txt, html),
> but when I try to index compressed files like zip files, Solr returns the
> name of the file inside the field
> which I am using to map the content.
>
> Any idea if this is actually supposed to work?
>
> I tried with Solr4.5 and Solr4.8
>
> Regards
> Sergio
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/ExtractingRequestHandler-indexing-zip-files-tp4138172.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: about analyzer and tokenizer

2014-05-26 Thread rachun
Thank you very much for your suggestions, both of you.
I will experiment more to figure out which approach best matches my case.

Chun.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/about-analyzer-and-tokenizer-tp4138129p4138227.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: sort by spatial distance in faceting

2014-05-26 Thread david.w.smi...@gmail.com
Hi Aman,

That’s an interesting feature request that I haven’t heard before.

First reaction:  Heliosearch (a fork of Solr that is kept up to date with
changes from Solr) is extremely close to supporting such a thing because it
supports sorting facets by Heliosearch-specific aggregation functions.
http://heliosearch.org/solr-facet-functions/   However, none of its
aggregation functions are spatial oriented.  If this feature is important
enough to you, you could very well add it.  It would likely involve
encoding the coordinate into the place name to avoid unnecessary redundant
calculations that would be needed if another field were used.

Second reaction: You could do a secondary search just for these facet
values that works using Result Grouping (AKA Field Collapsing). Add to each
document the coordinates of the city indexed using a LatLonType field.  On
this request, sort the documents using geodist(), and group on the city
name.  Perhaps you can even get away with returning no documents per group
if Solr lets you — you don’t need the doc data after all.  The main thing I
don’t like about this approach is that it’s going to internally calculate
the distance very redundantly since all documents for a city are going to
have the coordinate.  Well, see if it’s fast enough and give it a try.
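
A sketch of that second idea with SolrJ, assuming an added LatLonType field
named city_coords and a string field named city (both names are made up here,
as is the pt value):

import org.apache.solr.client.solrj.SolrQuery;

SolrQuery q = new SolrQuery("*:*");   // or the user's actual query
q.set("sfield", "city_coords");       // LatLonType field holding the city coordinate
q.set("pt", "28.61,77.20");           // the user's location
q.set("sort", "geodist() asc");       // nearest city first
q.set("group", true);
q.set("group.field", "city");
q.set("group.limit", 1);              // or 0, if Solr lets you skip the doc data entirely
q.setRows(10);                        // number of nearest cities (groups) to return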

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, May 26, 2014 at 8:31 AM, Aman Tandon wrote:

> Hi,
>
> Is it possible to sort the results returned by faceting by geospatial
> distance instead of by result count?
>
> Currently I am faceting on city, which returns the top facets based on
> the docs matched for that particular city.
>
> e.g.:
> Delhi,400
> Noida, 380
> .
> .
> .
> etc.
>
> If the user selects a city, then the facets should be ordered by geospatial
> distance instead of result count. Is this possible with Solr 4.7.x?
>
> With Regards
> Aman Tandon
>


Solr shut down by itself

2014-05-26 Thread rachun
Dear all,

Could anyone tell me what is wrong with this?
How can I fix this problem?


INFO  - 2014-05-27 03:08:00.252; org.eclipse.jetty.server.Server; Graceful
shutdown SocketConnector@0.0.0.0:8983
INFO  - 2014-05-27 03:08:00.254; org.eclipse.jetty.server.Server; Graceful
shutdown
o.e.j.w.WebAppContext{/solr,file:/root/solr-4.6.0/example/solr-webapp/webapp/},/root/solr-4.6.0/example/webapps/solr.war
INFO  - 2014-05-27 03:08:01.259; org.apache.solr.core.CoreContainer;
Shutting down CoreContainer instance=803126531
INFO  - 2014-05-27 03:08:01.264; org.apache.solr.core.SolrCore;
[collection1]  CLOSING SolrCore org.apache.solr.core.SolrCore@3c9db373
INFO  - 2014-05-27 03:08:01.265;
org.apache.solr.update.DirectUpdateHandler2; closing
DirectUpdateHandler2{commits=0,autocommit maxTime=15000ms,autocommits=0,soft
autocommits=0,optimizes=1,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=1,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0}
INFO  - 2014-05-27 03:08:01.267; org.apache.solr.update.SolrCoreState;
Closing SolrCoreState
INFO  - 2014-05-27 03:08:01.268;
org.apache.solr.update.DefaultSolrCoreState; SolrCoreState ref count has
reached 0 - closing IndexWriter
INFO  - 2014-05-27 03:08:01.268;
org.apache.solr.update.DefaultSolrCoreState; closing IndexWriter with
IndexWriterCloser
INFO  - 2014-05-27 03:08:01.290; org.apache.solr.core.SolrCore;
[collection1] Closing main searcher on request.
INFO  - 2014-05-27 03:08:01.426;
org.apache.solr.core.CachingDirectoryFactory; Closing
NRTCachingDirectoryFactory - 2 directories currently being tracked
INFO  - 2014-05-27 03:08:01.448;
org.apache.solr.core.CachingDirectoryFactory; looking to close
/root/solr-4.6.0/example/solr/collection1/data/index
[CachedDir<>]
INFO  - 2014-05-27 03:08:01.448;
org.apache.solr.core.CachingDirectoryFactory; Closing directory:
/root/solr-4.6.0/example/solr/collection1/data/index
INFO  - 2014-05-27 03:08:01.449;
org.apache.solr.core.CachingDirectoryFactory; looking to close
/root/solr-4.6.0/example/solr/collection1/data
[CachedDir<>]
INFO  - 2014-05-27 03:08:01.450;
org.apache.solr.core.CachingDirectoryFactory; Closing directory:
/root/solr-4.6.0/example/solr/collection1/data
INFO  - 2014-05-27 03:08:01.455;
org.eclipse.jetty.server.handler.ContextHandler; stopped
o.e.j.w.WebAppContext{/solr,file:/root/solr-4.6.0/example/solr-webapp/webapp/},/root/solr-4.6.0/example/webapps/solr.war


thank you very much,
Chun.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-shut-down-by-itself-tp4138233.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr shut down by itself

2014-05-26 Thread Alexandre Rafalovitch
INFO  - 2014-05-27 03:08:00.252; org.eclipse.jetty.server.Server; Graceful
shutdown SocketConnector@0.0.0.0:8983

That's the first line. Looks like a normal non-aborted shutdown. I
would actually look at the messages before that first line. Also, why
do you think it was abnormal? Is that something that keeps happening
randomly? What's the pattern? That's probably your best clue.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Tue, May 27, 2014 at 11:17 AM, rachun  wrote:
> Dear all,
>
> Could anyone tell me what wrong with this?
> How can I fix this problem?
>
>
> INFO  - 2014-05-27 03:08:00.252; org.eclipse.jetty.server.Server; Graceful
> shutdown SocketConnector@0.0.0.0:8983
> INFO  - 2014-05-27 03:08:00.254; org.eclipse.jetty.server.Server; Graceful
> shutdown
> o.e.j.w.WebAppContext{/solr,file:/root/solr-4.6.0/example/solr-webapp/webapp/},/root/solr-4.6.0/example/webapps/solr.war
> INFO  - 2014-05-27 03:08:01.259; org.apache.solr.core.CoreContainer;
> Shutting down CoreContainer instance=803126531
> INFO  - 2014-05-27 03:08:01.264; org.apache.solr.core.SolrCore;
> [collection1]  CLOSING SolrCore org.apache.solr.core.SolrCore@3c9db373
> INFO  - 2014-05-27 03:08:01.265;
> org.apache.solr.update.DirectUpdateHandler2; closing
> DirectUpdateHandler2{commits=0,autocommit maxTime=15000ms,autocommits=0,soft
> autocommits=0,optimizes=1,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=1,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0}
> INFO  - 2014-05-27 03:08:01.267; org.apache.solr.update.SolrCoreState;
> Closing SolrCoreState
> INFO  - 2014-05-27 03:08:01.268;
> org.apache.solr.update.DefaultSolrCoreState; SolrCoreState ref count has
> reached 0 - closing IndexWriter
> INFO  - 2014-05-27 03:08:01.268;
> org.apache.solr.update.DefaultSolrCoreState; closing IndexWriter with
> IndexWriterCloser
> INFO  - 2014-05-27 03:08:01.290; org.apache.solr.core.SolrCore;
> [collection1] Closing main searcher on request.
> INFO  - 2014-05-27 03:08:01.426;
> org.apache.solr.core.CachingDirectoryFactory; Closing
> NRTCachingDirectoryFactory - 2 directories currently being tracked
> INFO  - 2014-05-27 03:08:01.448;
> org.apache.solr.core.CachingDirectoryFactory; looking to close
> /root/solr-4.6.0/example/solr/collection1/data/index
> [CachedDir<>]
> INFO  - 2014-05-27 03:08:01.448;
> org.apache.solr.core.CachingDirectoryFactory; Closing directory:
> /root/solr-4.6.0/example/solr/collection1/data/index
> INFO  - 2014-05-27 03:08:01.449;
> org.apache.solr.core.CachingDirectoryFactory; looking to close
> /root/solr-4.6.0/example/solr/collection1/data
> [CachedDir<>]
> INFO  - 2014-05-27 03:08:01.450;
> org.apache.solr.core.CachingDirectoryFactory; Closing directory:
> /root/solr-4.6.0/example/solr/collection1/data
> INFO  - 2014-05-27 03:08:01.455;
> org.eclipse.jetty.server.handler.ContextHandler; stopped
> o.e.j.w.WebAppContext{/solr,file:/root/solr-4.6.0/example/solr-webapp/webapp/},/root/solr-4.6.0/example/webapps/solr.war
>
>
> thank you very much,
> Chun.
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-shut-down-by-itself-tp4138233.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Grouping on a multi-valued field

2014-05-26 Thread Bhoomit Vasani
 Hi,

Does the latest release of Solr support grouping on a multi-valued field?

According to this
https://wiki.apache.org/solr/FieldCollapsing#Known_Limitations it doesn't,
but the doc was last updated 14 months ago...
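
For context, a grouping request from SolrJ looks like the sketch below (the
field name here is illustrative):

import org.apache.solr.client.solrj.SolrQuery;

SolrQuery q = new SolrQuery("*:*");
q.set("group", true);
q.set("group.field", "category_s");  // per the wiki limitation above, this is expected to be single-valued
q.set("group.limit", 3);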

-- 
-- 
Thanks & Regards,
Bhoomit Vasani | SE @ Mygola
WE are LIVE !
91-8892949849


Re: Applying boosting for keyword search

2014-05-26 Thread manju16832003
Hi Jack,
Thank you for the suggestions. :-)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Applying-boosting-for-keyword-search-tp4137523p4138239.html
Sent from the Solr - User mailing list archive at Nabble.com.