Re: Jetty returning HTTP error code 413

2012-05-22 Thread Sai
Hi Alexandre,

Can you please let me know how you fixed this issue? I am also getting this
error when I pass a very large query to Solr.

A reply is highly appreciated.

Thanks,
Sai
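A note for the archive: HTTP 413 ("Request Entity Too Large") from Jetty usually means the query string pushed the request line or headers past the container's configured limit. Besides raising Jetty's limits, a common workaround is to send long queries as a POST body instead of in the URL. A rough sketch of that decision, with an assumed local Solr URL and an illustrative length threshold:

```python
from urllib.parse import urlencode

# Illustrative cap; containers commonly limit the request line to a few KB.
MAX_URL_LEN = 4096

def plan_request(base_url, params):
    """Return (method, url, body): GET with everything in the URL when it
    fits, otherwise POST with the query as a form-encoded body."""
    query = urlencode(params)
    url = base_url + "?" + query
    if len(url) <= MAX_URL_LEN:
        return ("GET", url, None)
    return ("POST", base_url, query)

# A very large boolean query, e.g. thousands of OR'ed ids.
big_q = " OR ".join("id:%d" % i for i in range(2000))
method, url, body = plan_request("http://localhost:8983/solr/select",
                                 {"q": big_q, "wt": "json"})
```

With a client library this becomes, for example, a form-encoded POST to the select handler; SolrJ likewise lets the caller choose POST for queries.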



Re: Solr 4.5.1 and Illegal to have multiple roots (start tag in epilog?). (perhaps SOLR-4327 bug?)

2013-10-25 Thread Sai Gadde
We were trying to migrate from 4.0 to 4.5 and faced a similar issue as well.
I saw the ticket raised by Chris and tried setting formdataUploadLimitInKB
to a higher value, but that did not resolve the issue.

We use Solr 4.0.0 currently and no additional container settings are
required. But it is very strange, since when I tested with a single instance
there was no problem at all. How come it is so difficult for two Solr
instances to communicate with each other! I expect a Solr cloud setup to
be independent of container configuration.

Anyway, thanks Chris for the info; we will try these Tomcat settings and see
if this issue goes away.
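For reference, the limit being discussed lives on the requestParsers element inside requestDispatcher in solrconfig.xml; a sketch of the setting (the 2 GB value is the one suggested in SOLR-5331, not a recommendation):

```xml
<requestDispatcher>
  <!-- Limits are in KB: 2048000 KB is roughly 2 GB -->
  <requestParsers enableRemoteStreaming="true"
                  multipartUploadLimitInKB="2048000"
                  formdataUploadLimitInKB="2048000" />
</requestDispatcher>
```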


On Fri, Oct 25, 2013 at 4:35 PM, Chris Geeringh  wrote:

> Hi Michael,
>
> I opened that ticket, and it looks like there is indeed a buffer or limit I
> was exceeding. As per the ticket I guess the stream is cut off at that
> limit, and is then malformed. I am using Tomcat, and since increasing some
> limits on the connector, I haven't had any issues since. I'll close that
> ticket.
>
> <Connector connectionTimeout="6"
>            redirectPort="8443" maxPostSize="104857600"
>            maxHttpHeaderSize="819200" maxThreads="1"/>
>
> Hope that helps.
>
> Cheers,
> Chris
>
>
> On 25 October 2013 03:48, Michael Tracey  wrote:
>
> > Hey Solr-users,
> >
> > I've got a single solr 4.5.1 node with 96GB ram, a 65GB index (105
> million
> > records) and a lot of daily churn of newly indexed files (auto softcommit
> > and commits).  I'm trying to bring another matching node into the mix,
> and
> > am getting these errors on the new node:
> >
> > org.apache.solr.common.SolrException;
> > org.apache.solr.common.SolrException: Illegal to have multiple roots
> (start
> > tag in epilog?).
> >
> > On the old server, still running, I'm getting:
> >
> > shard update error StdNode: http://server1:
> /solr/collection/:org.apache.solr.client.solrj.SolrServerException:
> > Server refused connection at: http://server2:/solr/collection
> >
> > The new core never actually comes online; it stays in recovery mode. The
> > other two tiny cores (100,000+ records each, not updated frequently)
> > work just fine.
> >
> > is this SOLR-4327 bug?  https://issues.apache.org/jira/browse/SOLR-5331
> > And if so, how can I get the new node up and running so I can get back in
> > production with some redundancy and speed?
> >
> > I'm running an external zookeeper, and that is all running just fine.
> >  Also internal Solrj/jetty with little to no modifications.
> >
> > Any ideas would be appreciated, thanks,
> >
> > M.
> >
>


Solr 4.5.1 replication Bug? "Illegal to have multiple roots (start tag in epilog?)."

2013-10-28 Thread Sai Gadde
)
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
at 
org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:213)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
... 22 more



We tried with 4.5.0 first and then with 4.5.1 later. Both produce the
exact same error.


Any ideas on how to resolve this? Or is this a bug?

It looks like a common error, since it affects cloud setups, and there
must be a workaround, but we cannot figure it out. Any help is appreciated.


Thanks in advance

Sai


Re: Solr 4.5.1 replication Bug? "Illegal to have multiple roots (start tag in epilog?)."

2013-10-28 Thread Sai Gadde
Hi Michael,

I downgraded to Solr 4.4.0 and this issue is gone. No additional settings
or tweaks were needed.

This is not a fix or solution, I guess, but in our case we wanted something
working and we were running out of time.

I will watch this thread if there are any suggestions, but possibly we will
stay with 4.4.0 for some time.

Regards
Sai


On Tue, Oct 29, 2013 at 4:36 AM, Michael Tracey  wrote:

> Hey, this is Michael, who was having the exact error on the Jetty side
> with an update.  I've upgraded jetty from the 4.5.1 embedded version (in
> the example directory) to version 9.0.6, which means I had to upgrade my
> OpenJDK from 1.6 to 1.7.0_45.  Also, I added the suggested (very large)
> settings to my solrconfig.xml:
>
> <requestParsers formdataUploadLimitInKB="2048000" multipartUploadLimitInKB="2048000" />
>
> but I am still getting the errors when I put a second server in the cloud.
> Single servers (external zookeeper, but no cloud partner) works just fine.
>
> I suppose my next step is to try Tomcat, but according to your post, it
> will not help!
>
> Any help is appreciated,
>
> M.
>
> - Original Message -
> From: "Sai Gadde" 
> To: solr-user@lucene.apache.org
> Sent: Monday, October 28, 2013 7:10:41 AM
> Subject: Solr 4.5.1 replication Bug? "Illegal to have multiple roots
> (start tag in epilog?)."
>
> we have a similar error as this thread.
>
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg90748.html
>
> Tried the Tomcat settings from this post. We used the exact settings
> specified there. We merge 500 documents at a time. I am creating a new
> thread because Michael is using Jetty whereas we use Tomcat.
>
>
> formdataUploadLimitInKB and multipartUploadLimitInKB limits are set to a very
> high value (2 GB), as suggested in the following thread:
> https://issues.apache.org/jira/browse/SOLR-5331
>
>
> We use out-of-the-box Solr 4.5.1, no customization done. If we merge
> documents via SolrJ to a single server it works perfectly fine.
>
>
> But as soon as we add another node to the cloud we get the
> following error while merging documents.
>
>
>
> This is the error we are getting on the server where merging is happening
> (10.10.10.116 - the IP is irrelevant, just for clarity). 10.10.10.119
> is the new node here. This server gets a RemoteSolrException:
>
>
> shard update error StdNode:
>
> http://10.10.10.119:8980/solr/mycore/:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException
> :
> Illegal to have multiple roots (start tag in epilog?).
>  at [row,col {unknown-source}]: [1,12468]
> at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:425)
> at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
> at
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:401)
> at
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:1)
> at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
> Source)
> at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
> Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> Source)
> at java.lang.Thread.run(Unknown Source)
>
>
>
>
>
> On the other server 10.10.10.119 we get following error
>
>
> org.apache.solr.common.SolrException: Illegal to have multiple roots
> (start tag in epilog?).
>  at [row,col {unknown-source}]: [1,12468]
> at
> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:176)
> at
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
> at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
> at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
> at
> org.apache.catalina.c

Re: Solr 4.5.1 replication Bug? "Illegal to have multiple roots (start tag in epilog?)."

2013-10-29 Thread Sai Gadde
I just opened a JIRA issue https://issues.apache.org/jira/browse/SOLR-5402

SOLR-5331 was closed and I could not open it again, so I created a new one.

Thanks
Sai


On Wed, Oct 30, 2013 at 5:49 AM, Mark Miller  wrote:

> Has someone filed a JIRA issue with the current known info yet?
>
> - Mark
>
> > On Oct 29, 2013, at 12:36 AM, Sai Gadde  wrote:
> >
> > Hi Michael,
> >
> > I downgraded to Solr 4.4.0 and this issue is gone. No additional settings
> > or tweaks are done.
> >
> > This is not a fix or solution I guess but, in our case we wanted
> something
> > working and we were running out of time.
> >
> > I will watch this thread if there are any suggestions but, possibly we
> will
> > stay with 4.4.0 for sometime.
> >
> > Regards
> > Sai
> >
> >
> >> On Tue, Oct 29, 2013 at 4:36 AM, Michael Tracey 
> wrote:
> >>
> >> Hey, this is Michael, who was having the exact error on the Jetty side
> >> with an update.  I've upgraded jetty from the 4.5.1 embedded version (in
> >> the example directory) to version 9.0.6, which means I had to upgrade my
> >> OpenJDK from 1.6 to 1.7.0_45.  Also, I added the suggested (very large)
> >> settings to my solrconfig.xml:
> >> <requestParsers
> >> formdataUploadLimitInKB="2048000" multipartUploadLimitInKB="2048000" />
> >>
> >> but I am still getting the errors when I put a second server in the
> cloud.
> >> Single servers (external zookeeper, but no cloud partner) works just
> fine.
> >>
> >> I suppose my next step is to try Tomcat, but according to your post, it
> >> will not help!
> >>
> >> Any help is appreciated,
> >>
> >> M.
> >>
> >> - Original Message -
> >> From: "Sai Gadde" 
> >> To: solr-user@lucene.apache.org
> >> Sent: Monday, October 28, 2013 7:10:41 AM
> >> Subject: Solr 4.5.1 replication Bug? "Illegal to have multiple roots
> >> (start tag in epilog?)."
> >>
> >> we have a similar error as this thread.
> >>
> >> http://www.mail-archive.com/solr-user@lucene.apache.org/msg90748.html
> >>
> >> Tried the Tomcat settings from this post. We used the exact settings
> >> specified there. We merge 500 documents at a time. I am creating a new
> >> thread because Michael is using Jetty whereas we use Tomcat.
> >>
> >>
> >> formdataUploadLimitInKB and multipartUploadLimitInKB limits are set to
> very
> >> high value 2GB. As suggested in the following thread.
> >> https://issues.apache.org/jira/browse/SOLR-5331
> >>
> >>
> >> We use out of the box Solr 4.5.1 no customization done. If we merge
> >> documents via SolrJ to a single server it is perfectly working fine.
> >>
> >>
> >> But as soon as we add another node to the cloud we are getting
> >> following while merging documents.
> >>
> >>
> >>
> >> This is the error we are getting on the server where merging is happening
> >> (10.10.10.116 - the IP is irrelevant, just for clarity). 10.10.10.119
> >> is the new node here. This server gets a RemoteSolrException:
> >>
> >>
> >> shard update error StdNode:
> >>
> >>
> http://10.10.10.119:8980/solr/mycore/:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException
> >> :
> >> Illegal to have multiple roots (start tag in epilog?).
> >> at [row,col {unknown-source}]: [1,12468]
> >>at
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:425)
> >>at
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
> >>at
> >>
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:401)
> >>at
> >>
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:1)
> >>at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
> >>at java.util.concurrent.FutureTask.run(Unknown Source)
> >>at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
> >> Source)
> >>at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
> >>at java.util.concurrent.FutureTask.run(Unknown Source)
> >>at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
> >> Source)
> >>at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unkno

Boost function for custom sorting.

2014-10-01 Thread sai suman
Hi,

I have some records which include a source_id field, which is an integer, and
a datetime field. I want the records to be ordered such that adjacent
records do not have the same source id. It should perform some sort of
round robin on the records with the source_id as the key, and they should be
sorted by date.

For example:
[1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,4] after the sort should give -->
[1,2,3,4,1,2,3,1,2,3,1,2,3,1,2,2], and these should be sorted by datetime,
meaning the first '1' should be the latest '1' and the last '1' should be
the oldest '1'. I was wondering if it is possible to write a boost
function for such a requirement.
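Solr has no built-in round-robin sort, so short of a custom component this is usually done client-side over the fetched results. A sketch of the interleaving described above, assuming each result is a dict with a source_id field and the input is already sorted newest-first by datetime:

```python
from collections import OrderedDict, deque

def round_robin(records):
    """Interleave records by source_id, preserving each source's own
    order (records are assumed already sorted newest-first)."""
    buckets = OrderedDict()  # source_id -> deque of that source's records
    for rec in records:
        buckets.setdefault(rec["source_id"], deque()).append(rec)
    out = []
    while buckets:
        # take one record from each remaining source, in turn
        for sid in list(buckets):
            out.append(buckets[sid].popleft())
            if not buckets[sid]:
                del buckets[sid]
    return out

docs = [{"source_id": s} for s in [1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,4]]
order = [d["source_id"] for d in round_robin(docs)]
# -> [1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 2]
```

Each source keeps its internal order, so the first '1' emitted is the newest '1', matching the requirement above.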

Thanks in advance!

Sai


Tweaking boosts for more search results variety

2013-09-04 Thread Sai Gadde
Our index is aggregated content from various sites on the web. We want a good
user experience by showing multiple sites in the search results. In our
setup we are seeing most of the results from the same site at the top.

Here is some information regarding queries and schema
site - String field. We have about 1000 sites in the index.
sitetype - String field. We have 3 site types.
omitNorms="true" for both fields.
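In schema.xml terms, the two fields described above would look something like this (a sketch; the indexed/stored attributes are assumptions):

```xml
<field name="site"     type="string" indexed="true" stored="true" omitNorms="true"/>
<field name="sitetype" type="string" indexed="true" stored="true" omitNorms="true"/>
```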

Doc count varies largely based on site and sitetype, by a factor of
10 to 1000.
Total index size is about 5 million docs.
Solr Version: 4.0

In our queries we have a fixed and preferential boost for certain sites.
sitetype has different and fixed boosts for 3 possible values. We turned
off Inverse Document Frequency (IDF) for these boosts to work properly.
Other text fields are boosted based on search keywords only.

With this setup we often see a bunch of hits from a single site, followed by
the next, and so on.
Is there any solution to see results from a variety of sites and still keep
the preferential boosts in place?


Re: Tweaking boosts for more search results variety

2013-09-06 Thread Sai Gadde
Thank you Jack for the suggestion.

We can try grouping by site. But considering that the number of sites is only
about 1000 against an index size of 5 million, one can expect most of the
hits to be hidden, and for certain specific keywords only a handful of
actual results could be displayed if results are grouped by site.

We already group on a signature field to identify duplicate content in
these 5 million+ docs. But there the number of duplicates is only about
3-5% maximum.

Is there any workaround for these limitations with grouping?

Thanks
Shyam
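For anyone following along, the grouping Jack suggested maps onto a handful of request parameters; a sketch of building such a query (the collection URL and the per-group limit are assumptions):

```python
from urllib.parse import urlencode

params = {
    "q": "solr",
    "group": "true",          # enable result grouping (field collapsing)
    "group.field": "site",    # collapse hits by the site field
    "group.limit": 2,         # show at most 2 documents per site
    "group.ngroups": "true",  # also report the number of distinct sites
}
query_string = urlencode(params)
url = "http://localhost:8983/solr/collection1/select?" + query_string
```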



On Thu, Sep 5, 2013 at 9:16 PM, Jack Krupansky wrote:

> The grouping (field collapsing) feature somewhat addresses this - group by
> a "site" field and then if more than one or a few top pages are from the
> same site they get grouped or collapsed so that you can see more sites in a
> few results.
>
> See:
> http://wiki.apache.org/solr/**FieldCollapsing<http://wiki.apache.org/solr/FieldCollapsing>
> https://cwiki.apache.org/**confluence/display/solr/**Result+Grouping<https://cwiki.apache.org/confluence/display/solr/Result+Grouping>
>
> -- Jack Krupansky
>
> -Original Message- From: Sai Gadde
> Sent: Thursday, September 05, 2013 2:27 AM
> To: solr-user@lucene.apache.org
> Subject: Tweaking boosts for more search results variety
>
>
> Our index is aggregated content from various sites on the web. We want good
> user experience by showing multiple sites in the search results. In our
> setup we are seeing most of the results from same site on the top.
>
> Here is some information regarding queries and schema
>site - String field. We have about 1000 sites in index
>sitetype - String field.  we have 3 site types
> omitNorms="true" for both the fields
>
> Doc count varies largely based on site and sitetype by a factor of 10 -
> 1000 times
> Total index size is about 5 million docs.
> Solr Version: 4.0
>
> In our queries we have a fixed and preferential boost for certain sites.
> sitetype has different and fixed boosts for 3 possible values. We turned
> off Inverse Document Frequency (IDF) for these boosts to work properly.
> Other text fields are boosted based on search keywords only.
>
> With this setup we often see a bunch of hits from a single site followed by
> next etc.,
> Is there any solution to see results from variety of sites and still keep
> the preferential boosts in place?
>


Re: Tweaking boosts for more search results variety

2013-09-08 Thread Sai Gadde
Sorry for the delayed response.

The limitation in this scenario is that we have 5 million indexed documents
from only about 1000 sites. If results are grouped by site we will not be able
to show more than a couple of pages for a lot of search keywords.


Ex: Search for "Solr" has 1000 matches but only from 20 sites.
In these 20 sites
10 sites are of sitetype A - boost 5
7 sites are of sitetype B - boost 2
3 sites are of sitetype C - boost 1

Limitation 1: If these are grouped by site only 20 results would be
displayed in 2 pages (10 per page).

We still want to display all the results. For a better user experience,
"ideally" we would like to have 10 results on page 1 from 10 distinct
sites of sitetype A (which already has the higher boost), or in a real-world
scenario from 7-8 distinct sites. In our case we see something like 7 matches
on a page from a single site.

Limitation 2: Inverse document frequency (IDF) would have helped here, but
in that case our preferential boost for sitetypes is ignored and some
results from sitetype C would come out on top due to the IDF boost.

What we want to achieve is some way to control the variety of sites displayed
in search results with the preferential boosts still in place.
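Absent such a feature in Solr itself (the SOLR-1311 re-sort component mentioned later in this thread is one take on it), a client-side pass over the top N results can cap how many consecutive hits one site gets while otherwise preserving the boost-driven order. A sketch (the field name and window size are assumptions):

```python
def diversify(docs, site_key="site", max_run=1):
    """Greedy re-rank: keep the original (boost-driven) order, but when
    the next doc would extend a run longer than max_run from one site,
    pull forward the highest-ranked doc from a different site."""
    pool = list(docs)
    out = []
    while pool:
        # length of the current trailing run of same-site docs
        run_site = out[-1][site_key] if out else None
        run_len = 0
        for d in reversed(out):
            if d[site_key] == run_site:
                run_len += 1
            else:
                break
        pick = 0
        if run_len >= max_run:
            # first pooled doc from a different site, if any remains
            for i, d in enumerate(pool):
                if d[site_key] != run_site:
                    pick = i
                    break
        out.append(pool.pop(pick))
    return out
```

This only diversifies the page that was fetched, so over-fetching (e.g. rows=30 to render 10) helps; when only one site's docs remain, runs are unavoidable and the original order is kept.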

Thanks in advance




On Sun, Sep 8, 2013 at 6:36 AM, Furkan KAMACI wrote:

> What do you mean with "these limitations"? Do you want to make multiple
> groupings at the same time?
>
>
> 2013/9/6 Sai Gadde 
>
> > Thank you Jack for the suggestion.
> >
> > We can try group by site. But considering that number of sites are only
> > about 1000 against the index size of 5 million, One can expect most of
> the
> > hits would be hidden and for certain specific keywords only a handful of
> > actual results could be displayed if results are grouped by site.
> >
> > we already group on a signature field to identify duplicate content in
> > these 5 million+ docs. But here the number of duplicates are only about
> > 3-5% maximum.
> >
> > Is there any workaround for these limitations with grouping?
> >
> > Thanks
> > Shyam
> >
> >
> >
> > On Thu, Sep 5, 2013 at 9:16 PM, Jack Krupansky  > >wrote:
> >
> > > The grouping (field collapsing) feature somewhat addresses this - group
> > by
> > > a "site" field and then if more than one or a few top pages are from
> the
> > > same site they get grouped or collapsed so that you can see more sites
> > in a
> > > few results.
> > >
> > > See:
> > > http://wiki.apache.org/solr/**FieldCollapsing<
> > http://wiki.apache.org/solr/FieldCollapsing>
> > > https://cwiki.apache.org/**confluence/display/solr/**Result+Grouping<
> > https://cwiki.apache.org/confluence/display/solr/Result+Grouping>
> > >
> > > -- Jack Krupansky
> > >
> > > -Original Message- From: Sai Gadde
> > > Sent: Thursday, September 05, 2013 2:27 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Tweaking boosts for more search results variety
> > >
> > >
> > > Our index is aggregated content from various sites on the web. We want
> > good
> > > user experience by showing multiple sites in the search results. In our
> > > setup we are seeing most of the results from same site on the top.
> > >
> > > Here is some information regarding queries and schema
> > >site - String field. We have about 1000 sites in index
> > >sitetype - String field.  we have 3 site types
> > > omitNorms="true" for both the fields
> > >
> > > Doc count varies largely based on site and sitetype by a factor of 10 -
> > > 1000 times
> > > Total index size is about 5 million docs.
> > > Solr Version: 4.0
> > >
> > > In our queries we have a fixed and preferential boost for certain
> sites.
> > > sitetype has different and fixed boosts for 3 possible values. We
> turned
> > > off Inverse Document Frequency (IDF) for these boosts to work properly.
> > > Other text fields are boosted based on search keywords only.
> > >
> > > With this setup we often see a bunch of hits from a single site
> followed
> > by
> > > next etc.,
> > > Is there any solution to see results from variety of sites and still
> keep
> > > the preferential boosts in place?
> > >
> >
>


Re: Tweaking boosts for more search results variety

2013-09-10 Thread Sai Gadde
Perfect. This is exactly what we need!

I wish there were an option for a plugin, or some feature like this in the
mainstream Solr release.

Still, this is a great resource for us. Thanks Marc for pointing us to very
useful information.

Thanks all for the help.




On Tue, Sep 10, 2013 at 5:30 PM, Marc Sturlese wrote:

> This is totally deprecated but maybe can be helpful if you want to re-sort
> some documents
> https://issues.apache.org/jira/browse/SOLR-1311
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Tweaking-boosts-for-more-search-results-variety-tp4088302p4089044.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


get-colt

2010-08-05 Thread Sai . Thumuluri
Hi - I am trying to compile the Solr source, and during the "ant dist" step
the build times out on:

get-colt:
  [get] Getting:
http://repo1.maven.org/maven2/colt/colt/1.2.0/colt-1.2.0.jar
  [get] To:
/opt/solr/apache-solr-1.4.0/contrib/clustering/lib/downloads/colt-1.2.0.
jar

After a while, the step fails, giving the following message:

BUILD FAILED
/opt/solr/apache-solr-1.4.0/common-build.xml:356: The following error
occurred while executing this line:
/opt/solr/apache-solr-1.4.0/common-build.xml:219: The following error
occurred while executing this line:
/opt/solr/apache-solr-1.4.0/contrib/clustering/build.xml:79:
java.net.ConnectException: Connection timed out

Any help is greatly appreciated.

Sai Thumuluri




RE: get-colt

2010-08-05 Thread Sai . Thumuluri
This is the message I am getting 

Error getting
http://repo1.maven.org/maven2/colt/colt/1.2.0/colt-1.2.0.jar

-Original Message-
From: sai.thumul...@verizonwireless.com
[mailto:sai.thumul...@verizonwireless.com] 
Sent: Thursday, August 05, 2010 1:15 PM
To: solr-user@lucene.apache.org
Subject: get-colt

Hi - I am trying to compile Solr source and during "ant dist" step, the
build times out on 

get-colt:
  [get] Getting:
http://repo1.maven.org/maven2/colt/colt/1.2.0/colt-1.2.0.jar
  [get] To:
/opt/solr/apache-solr-1.4.0/contrib/clustering/lib/downloads/colt-1.2.0.
jar

After a while, the step fails, giving the following message:

BUILD FAILED
/opt/solr/apache-solr-1.4.0/common-build.xml:356: The following error
occurred while executing this line:
/opt/solr/apache-solr-1.4.0/common-build.xml:219: The following error
occurred while executing this line:
/opt/solr/apache-solr-1.4.0/contrib/clustering/build.xml:79:
java.net.ConnectException: Connection timed out

Any help is greatly appreciated.

Sai Thumuluri




RE: get-colt

2010-08-05 Thread Sai . Thumuluri
Got it working - had to manually copy the jar files under the contrib
directories

-Original Message-
From: sai.thumul...@verizonwireless.com
[mailto:sai.thumul...@verizonwireless.com] 
Sent: Thursday, August 05, 2010 2:00 PM
To: solr-user@lucene.apache.org
Subject: RE: get-colt

This is the message I am getting 

Error getting
http://repo1.maven.org/maven2/colt/colt/1.2.0/colt-1.2.0.jar

-Original Message-
From: sai.thumul...@verizonwireless.com
[mailto:sai.thumul...@verizonwireless.com] 
Sent: Thursday, August 05, 2010 1:15 PM
To: solr-user@lucene.apache.org
Subject: get-colt

Hi - I am trying to compile Solr source and during "ant dist" step, the
build times out on 

get-colt:
  [get] Getting:
http://repo1.maven.org/maven2/colt/colt/1.2.0/colt-1.2.0.jar
  [get] To:
/opt/solr/apache-solr-1.4.0/contrib/clustering/lib/downloads/colt-1.2.0.
jar

After a while, the step fails, giving the following message:

BUILD FAILED
/opt/solr/apache-solr-1.4.0/common-build.xml:356: The following error
occurred while executing this line:
/opt/solr/apache-solr-1.4.0/common-build.xml:219: The following error
occurred while executing this line:
/opt/solr/apache-solr-1.4.0/contrib/clustering/build.xml:79:
java.net.ConnectException: Connection timed out

Any help is greatly appreciated.

Sai Thumuluri




Dismax Request handler and Solrconfig.xml

2010-09-28 Thread Thumuluri, Sai
Hi,

I am using Solr 1.4.1 with Nutch to index some of our intranet content.
In solrconfig.xml, the default request handler is set to "standard". I am
planning to change that to use dismax as the request handler, but when I
set default="true" for dismax, Solr does not return any results - I get
results only when I comment out "dismax".

This works
  

 
   explicit
   10
   *
   title^20.0 pagedescription^15.0
   2.1
 
  

DOES NOT WORK
  

 dismax
 explicit

THIS WORKS
  


 explicit

Please let me know what I am doing wrong here. 
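One likely culprit, for the record: with defType switched to dismax, a bare or missing q returns nothing unless q.alt is set, and only one handler may carry default="true". A hedged sketch of a single default handler (element names follow the stock 1.4 example solrconfig.xml; the boosts are copied from the config above):

```xml
<requestHandler name="search" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="fl">*</str>
    <str name="qf">title^20.0 pagedescription^15.0</str>
    <!-- dismax has no implicit match-all; q.alt answers when q is absent -->
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>
```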

Sai Thumuluri
Sr. Member - Application Staff
IT Intranet & Knowledge Mgmt. Systems
614 560-8041 (Desk)
614 327-7200 (Mobile)




RE: Dismax Request handler and Solrconfig.xml

2010-09-28 Thread Thumuluri, Sai
I removed default=true from standard request handler

-Original Message-
From: Luke Crouch [mailto:lcro...@geek.net] 
Sent: Tuesday, September 28, 2010 12:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Dismax Request handler and Solrconfig.xml

Are you removing the standard default requestHandler when you do this?
Or
are you specifying two requestHandler's with default="true" ?

-L

On Tue, Sep 28, 2010 at 11:14 AM, Thumuluri, Sai <
sai.thumul...@verizonwireless.com> wrote:

> Hi,
>
> I am using Solr 1.4.1 with Nutch to index some of our intranet
content.
> In Solrconfig.xml, default request handler is set to "standard". I am
> planning to change that to use dismax as the request handler but when
I
> set "default=true" for dismax - Solr does not return any results - I
get
> results only when I comment out "dismax".
>
> This works
>   default="true">
>
> 
>   explicit
>   10
>   *
>   title^20.0 pagedescription^15.0
>   2.1
> 
>  
>
> DOES NOT WORK
>   default="true">
>
> dismax
> explicit
>
> THIS WORKS
>   default="true">
>
> 
> 
name="echoParams">explicit
>
> Please let me know what I am doing wrong here.
>
> Sai Thumuluri
> Sr. Member - Application Staff
> IT Intranet & Knowledge Mgmt. Systems
> 614 560-8041 (Desk)
> 614 327-7200 (Mobile)
>
>
>


RE: Dismax Request handler and Solrconfig.xml

2010-09-28 Thread Thumuluri, Sai
Can I please get some help here? I am on a tight timeline to get this
done - any ideas/suggestions would be greatly appreciated.

-Original Message-
From: Thumuluri, Sai [mailto:sai.thumul...@verizonwireless.com] 
Sent: Tuesday, September 28, 2010 12:15 PM
To: solr-user@lucene.apache.org
Subject: Dismax Request handler and Solrconfig.xml
Importance: High

Hi,

I am using Solr 1.4.1 with Nutch to index some of our intranet content.
In solrconfig.xml, the default request handler is set to "standard". I am
planning to change that to use dismax as the request handler, but when I
set default="true" for dismax, Solr does not return any results - I get
results only when I comment out "dismax".

This works
  

 
   explicit
   10
   *
   title^20.0 pagedescription^15.0
   2.1
 
  

DOES NOT WORK
  

 dismax
 explicit

THIS WORKS
  


 explicit




RE: Looking for help with Solr implementation

2010-11-13 Thread Thumuluri, Sai
Please refrain from using this mailing list for soliciting, and take it offline.


-Original Message-
From: AC [mailto:acanuc...@yahoo.com]
Sent: Sat 11/13/2010 1:12 AM
To: solr-user@lucene.apache.org
Subject: Re: Looking for help with Solr implementation
 
Hey Jean-Sebastien,

Thanks for the reply.  It sounds like your experience is exactly what is needed 
for my project.  


To give you some background, this project is a personal project related to
the biomedical field that I'm trying to get off the ground.


The site is www.antibodyreview.com     

It is a portal site for researchers in the biotech industry specifically 
focused 
on antibodies - not sure how up you may be on biomedical research :)  

Anyway I have collected a lot of information about proteins and antibodies from 
various sources which people can search and browse.  The site is and will be 
free to access by anyone.  

  
The current search uses MySQL but our requirements for how the site needs to 
operate cannot be properly handled by MySQL.  Searches can take ~8-10 sec and 
this is clearly not acceptable.  


If you try the default search on the index page you can see how slow it is.  
Suggested terms to try:  Akt, p53, PTEN, AIF.   


So there are several different items indexed in solr that we want to search:

1. Protein Information (~42,000 MySQL DB records)
2. Products (expect to host >200,000 product records, currently ~20,000 
products) http://www.antibodyreview.com/products.php  (current product search 
is 
faceted but also takes way too long)
3. Articles (text from ~120,000 articles) Article search can be accessed from 
the protein pages and advanced search page: 
http://www.antibodyreview.com/advsearch.php
4. Images (~100,000 image captions) Image search is found on this page  
http://www.antibodyreview.com/gallery.php

The current solr search which has been set-up can be seen on this page: 
www.antibodyreview.com/proteins3.php  (search bar on this page uses solr).  It 
is clearly much faster and meets our needs so it seems clear that using solr is 
the solution to the search issue.  


The last programmer mentioned that he had indexed all the data and it is now
just a matter of setting up the search queries in Solr. The most complicated
query to set up will be the products, as it requires faceted search. The other
searches are fairly routine or have more limited facets/options.


If it looks like there is mutual interest I can share with you a document that
he created explaining how things have been set up, which should help you get
started.


Please let me know what you think. 

Regards,

Abe


 




From: Jean-Sebastien Vachon 
To: solr-user@lucene.apache.org
Sent: Fri, November 12, 2010 7:09:06 PM
Subject: Re: Looking for help with Solr implementation

Hi,

If you're still looking for someone, I might be interested in getting more
information about your project. From your initial message it does not seem to
be a lot of work, so I might be willing to give you some time.

I've been working with Solr for the last 7 months at my full-time job, and I'm
currently managing a Solr-based project that uses field collapsing, faceting,
custom scoring with function queries, and a custom query handler.

Contact me if you're interested

- Original Message - From: "AC" 
To: 
Sent: Thursday, November 11, 2010 7:43 PM
Subject: Looking for help with Solr implementation


Hi,

Not sure if this is the correct place to post but I'm looking for someone to
help finish a Solr install on our LAMP based website. This would be a paid
project.


The programmer that started the project got too busy with his full-time job to
finish the project. Solr has been installed and a basic search is working but
we need to configure it to work across the site and also set-up faceted
search. I tried posting on some popular freelance sites but haven't been able
to find anyone with real Solr expertise / experience.


If you think you can help me with this project please let me know and I can
supply more details.


Regards,

Abe


  



Re: Solr Cloud index refreshes after restart

2013-01-04 Thread Sai Gadde
Hi Erick,

The issue was with ZooKeeper: when we tried to force full replication by
cleaning the data dir in ZooKeeper, it caused the index removal.

Our index always replicated fully even on a short outage or restart. I think
"too far out of date" could be the reason. We felt ZooKeeper was to blame
here. We continuously add documents to the index on the leader node. Usually
we would have 1k-2k more docs by the time the server restarts. We only do
soft commits and use commitWithin calls while indexing.

Is there a way to change this "too far out of date" property through the
Solr config?

Thanks
Shyam
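
For reference, the peer-sync threshold in question corresponds to the update
log's numRecordsToKeep setting in solrconfig.xml (available in recent Solr 4.x
releases; the value below is illustrative, the default is 100):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <!-- Keep more tlog records so a restarted replica can peer-sync
         instead of falling back to full replication (illustrative value). -->
    <int name="numRecordsToKeep">5000</int>
  </updateLog>
</updateHandler>
```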

On Jan 4, 2013 8:48 PM, "Erick Erickson"  wrote:
>
> That is very odd. Have there been any hard commits performed at all? Even
> if not, there should still be an index directory.
>
> Solr will do a full replication if the replica is too far out of date, but
> that shouldn't
> create (I don't think) a new index directory unless it's a misleading
> message.
> Is the cluster still receiving updates while the instance is down? "too
far
> out
> of date" is about 100 documents currently.
>
> Are you sure you aren't just seeing a full replication happen? When you
say
> "only replicates new documents" how long are you waiting?
>
> If none of this is germane, we need more details on how you're bringing
the
> nodes
> up and down. Because this shouldn't be happening as you describe. Also,
> there have been a lot of changes since 4.0, if you have the bandwidth you
> might
> try with a current build.
>
> Best
> Erick
>
>
> On Fri, Jan 4, 2013 at 2:02 AM, Sai Gadde  wrote:
>
> > I have a single collection and shard in my Solr cloud setup with 3
> > nodes. zookeeper ensemble running on three different machines.
> >
> > When we restart one of the server other than leader in the cloud the
index
> > directory is getting deleted in that Solr instance. Index starts with
'0'
> > documents and the instance only replicates new documents.
> >
> > These are the messages from solr admin panel logging. Solr version:
4.0.0
> >
> > 10:48:26WARNINGSolrCoreNew index directory detected: old=null
> > new=/solr/mycore/data/index/10:48:26WARNINGSolrCore[mycore] Solr index
> > directory '/solr/mycore/data/index' doesn't exist. Creating new index...
> >
> > Any help regarding this issue would be appreciated.
> >
> > Thanks
> > Shyam
> > gadde@gmail.com
> >


Re: Solr Cloud index refreshes after restart

2013-01-06 Thread Sai Gadde
We made some cache config changes; that is when we noticed incomplete
replicas. We also bootstrap the configuration from a script every time the
server restarts.

Would cache config changes cause any issue with SolrCloud replication,
especially when different nodes have different config (cache settings in this
case) in a cloud? And what would be the best way to handle configuration
changes that are backward compatible with the index?

Example for the cache changes made




We do not optimize the index or force merge on our index. Restarting a node
works well and the index is replicated properly but, as mentioned before, if
the down time is even 5-10 minutes it copies the whole index since we are
adding docs continuously.

Thanks
Shyam


On Sun, Jan 6, 2013 at 10:25 PM, Erick Erickson wrote:

> Not at this point, the limit is, I think, 100 documents.
> I actually spoke imprecisely. Over that limit, an old-style
> replication happens which _may_ cause a full index copy,
> but usually will only move over the most recent segments
> that have changed. If you're optimizing, this will
> be the whole index (and you shouldn't optimize or
> forceMerge as optimize is called now).
>
> Why do you want to force a full replication? If you have
> a suspicious replica, just shut it down and delete it's index
> directory and start it up back up again perhaps?
>
> Best
> Erick
>
>
> On Sat, Jan 5, 2013 at 1:33 AM, Sai Gadde  wrote:
>
> > Hi Erick,
> >
> > The issue was with zookeeper when we tried to force full replication by
> > cleaning the datadir in zookeeper, caused the index removal.
> >
> > Our index always replicated full even on short outage or restart. I think
> > "too far out of date" could be the reason. We felt zookeeper was to blame
> > here. We continuously add documents to index on the leader node. Usually
> we
> > would have 1k - 2k docs more by time time server restarts. We only do
> > softcommits and use commit within call while indexing.
> >
> > Is there a way to change the this "too far out of date" property through
> > solr config?
> >
> > Thanks
> > Shyam
> >
> > On Jan 4, 2013 8:48 PM, "Erick Erickson" 
> wrote:
> > >
> > > That is very odd. Have there been any hard commits performed at all?
> Even
> > > if not, there should still be an index directory.
> > >
> > > Solr will do a full replication if the replica is too far out of date,
> > but
> > > that shouldn't
> > > create (I don't think) a new index directory unless it's a misleading
> > > message.
> > > Is the cluster still receiving updates while the instance is down? "too
> > far
> > > out
> > > of date" is about 100 documents currently.
> > >
> > > Are you sure you aren't just seeing a full replication happen? When you
> > say
> > > "only replicates new documents" how long are you waiting?
> > >
> > > If none of this is germane, we need more details on how you're bringing
> > the
> > > nodes
> > > up and down. Because this shouldn't be happening as you describe. Also,
> > > there have been a lot of changes since 4.0, if you have the bandwidth
> you
> > > might
> > > try with a current build.
> > >
> > > Best
> > > Erick
> > >
> > >
> > > On Fri, Jan 4, 2013 at 2:02 AM, Sai Gadde  wrote:
> > >
> > > > I have a single collection and shard in my Solr cloud setup with 3
> > > > nodes. zookeeper ensemble running on three different machines.
> > > >
> > > > When we restart one of the server other than leader in the cloud the
> > index
> > > > directory is getting deleted in that Solr instance. Index starts with
> > '0'
> > > > documents and the instance only replicates new documents.
> > > >
> > > > These are the messages from solr admin panel logging. Solr version:
> > 4.0.0
> > > >
> > > > 10:48:26WARNINGSolrCoreNew index directory detected: old=null
> > > > new=/solr/mycore/data/index/10:48:26WARNINGSolrCore[mycore] Solr
> index
> > > > directory '/solr/mycore/data/index' doesn't exist. Creating new
> > index...
> > > >
> > > > Any help regarding this issue would be appreciated.
> > > >
> > > > Thanks
> > > > Shyam
> > > > gadde@gmail.com
> > > >
> >
>


Re: Large transaction logs

2013-01-10 Thread Sai Gadde
Thanks for the info. I was thinking that autocommit would be propagated
through the cloud like the explicit commit command. If it is not logged
into the tlogs as mentioned, we can just set autocommit and forget about it.

Thanks
Shyam
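
For reference, a minimal sketch of such an autoCommit block in solrconfig.xml
(the interval values are illustrative assumptions, not a recommendation):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Periodic hard commit truncates the transaction log;
       openSearcher=false leaves visibility to the soft commits. -->
  <autoCommit>
    <maxTime>60000</maxTime>            <!-- illustrative: every 60s -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>             <!-- illustrative: every 1s -->
  </autoSoftCommit>
</updateHandler>
```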


On Thu, Jan 10, 2013 at 8:15 PM, Tomás Fernández Löbbe <
tomasflo...@gmail.com> wrote:

> Yes, you must issue hard commits. You can use autocommit and use
> openSearcher=false. Autocommit is not distributed, it has to be configured
> in every node (which will automatically be, because you are using the exact
> same solrconfig for all your nodes).
>
> Other option is to issue an explicit hard commit command, those ARE
> distributed across all shards and replicas. You should also use
> "openSearcher=false" option for explicit hard commits (the searcher is now
> being opened by the soft commits).
>
> Both options are fine. Personally I prefer autocommit because then you can
> just "forget" about commits.
>
> Tomás
>
>
> On Thu, Jan 10, 2013 at 7:51 AM, gadde  wrote:
>
> > we have a SolrCloud with 3 nodes. we add documents to leader node and use
> > commitwithin(100secs) option in SolrJ to add documents. AutoSoftCommit in
> > SolrConfig is 1000ms.
> >
> > Transaction logs on replicas grew bigger than the index and we ran out of
> > disk space in few days. Leader's tlogs are very small in few hundred MBs.
> >
> > The following post suggest hard commit is required for "relieving the
> > memory
> > pressure of the transactionlog"
> >
> >
> http://lucene.472066.n3.nabble.com/SolrCloud-is-softcommit-cluster-wide-for-the-collection-td4021584.html#a4021631
> >
> > what is the best way to do a hard commit on this setup in SolrCloud?
> >
> > a. Through autoCommit in SolrConfig? which would cause hard commit on all
> > the nodes at different times
> > b. Trigger hard commit on leader while updating through SolrJ?
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Large-transaction-logs-tp4032144.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>


RE: Does edismax support wildcard queries?

2010-11-18 Thread Thumuluri, Sai
It does support wildcard queries - we are using that feature from
edismax

-Original Message-
From: Swapnonil Mukherjee [mailto:swapnonil.mukher...@gettyimages.com] 
Sent: Thursday, November 18, 2010 1:39 AM
To: solr-user@lucene.apache.org
Subject: Does edismax support wildcard queries?

Hi Everybody,

We have started to use the dismax query handler, but one serious
limitation of it is that it does not support wildcard queries. I think
I have 2 ways to overcome this problem:

1. Apply some old patches to the dismax parser itself from here
https://issues.apache.org/jira/browse/SOLR-756
2. Or start using the Solr trunk, which will allow me to switch to
edismax. I am especially hopeful about moving to Solr trunk and using
edismax, as I believe this will help me support fuzzy search in future as
well.

So my question is: does edismax support wildcard queries? I could not
tell by looking at the source code.

Thanks
Swapnonil Mukherjee





Index MS office

2011-02-02 Thread Thumuluri, Sai
Good Morning,

 I am planning to get started on indexing MS Office documents using Apache
Solr - can someone please point me to where I should start? 

Thanks,
Sai Thumuluri




Solr suggestions

2011-02-11 Thread Thumuluri, Sai
Good Morning, 
I have implemented Solr 1.4.1 in our UAT environment and I get weird
suggestions for any misspellings. For instance when I search for
"cabinet award winders" as opposed to "cabinet award winners", I get a
suggestion of "cabinet abarc pindeks
<http://nextgen-uat.sdc.vzwcorp.com/search/apachesolr_search/cabinet%20a
barc%20pindeks> ". How can I get more meaningful suggestions? Any help
is greatly appreciated. 

Thanks,
Sai Thumuluri




RE: Solr suggestions

2011-02-11 Thread Thumuluri, Sai
Please let me know if there is any other information that could help. 

My request handler config is
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>

<requestHandler name="partitioned" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">body^1.0 title^20.0 ts_vid_9_names^10.0
      ts_vid_10_names^10.0 name^3.0 taxonomy_names^2.0 tags_h1^5.0
      tags_h2_h3^3.0 tags_h4_h5_h6^2.0 tags_inline^1.0</str>
    <str name="pf">body</str>
    <int name="ps">2</int>
    <str name="mm">3</str>
    <str name="q.alt">*:*</str>
    <!-- Highlighting defaults -->
    <str name="hl">true</str>
    <str name="hl.fl">body</str>
    <int name="hl.snippets">3</int>
    <bool name="hl.mergeContiguous">true</bool>
    <str name="f.body.hl.alternateField">body</str>
    <int name="f.body.hl.maxAlternateFieldLength">256</int>
  </lst>
</requestHandler>

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, February 11, 2011 9:53 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr suggestions

Well, you have to tell us how you're accessing the info and what's
in your index.

Please include the relevant schema file definitions and the calls you're
making to get spelling suggestions.

Best
Erick

On Fri, Feb 11, 2011 at 8:55 AM, Thumuluri, Sai
 wrote:
> Good Morning,
> I have implemented Solr 1.4.1 in our UAT environment and I get weird
> suggestions for any misspellings. For instance when I search for
> "cabinet award winders" as opposed to "cabinet award winners", I get a
> suggestion of "cabinet abarc pindeks
>
<http://nextgen-uat.sdc.vzwcorp.com/search/apachesolr_search/cabinet%20a
> barc%20pindeks> ". How can I get more meaningful suggestions? Any help
> is greatly appreciated.
>
> Thanks,
> Sai Thumuluri
>
>
>


Solr multi cores or not

2011-02-16 Thread Thumuluri, Sai
Hi, 

I have a need to index multiple applications using Solr; I also have the
need to share indexes or run a search query across these application
indexes. Is Solr multi-core the way to go? My server config is
2 virtual CPUs @ 1.8 GHz and about 32GB of memory. What is the
recommendation?

Thanks,
Sai Thumuluri




RE: Solr multi cores or not

2011-02-17 Thread Thumuluri, Sai
We have 3 applications and they need to have different relevancy models,
synonyms, stop words etc. 

App A - content size - 20 GB - MySQL and Drupal based app
App B - # of documents ~ 400K; index size ~ 25 GB - primarily a portal
with links to different applications, data sources include crawl pages
and db sources
App C - PeopleSoft based application - underlying Oracle DB ~ content
size ~ 10 GB 

App A - approx 60k hits/week
App B - approx 1 million hits/week
App C - approx 250k hits/wk

Frequency of updates
App A - near real time indexing - every 20 minutes
App B - every 2 hours
App C - daily

All applications need personalization based on appl specific biz rules.
Yes, we must enforce security and Clients are in our control

The reason our server (a virtual machine) was configured that way is that when
we first installed, we were told to throw a lot of memory at Solr. App A
runs on our production server and it hardly stresses the server -
our CPUs are under 4% and our memory is hardly troubled. 

Our business need now is that all three apps want to use Solr for
their search needs, with the ability to share indexes. I need to not
only separate the indexes, but also selectively query across the
applications. 
 
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Wednesday, February 16, 2011 6:25 PM
To: solr-user@lucene.apache.org
Cc: Thumuluri, Sai
Subject: Re: Solr multi cores or not

Hi,

That depends (as usual) on your scenario. Let me ask some questions:

1. what is the sum of documents for your applications?
2. what is the expected load in queries/minute
3. what is the update frequency in documents/minute and how many
documents per 
commit?
4. how many different applications do you have?
5. are the query demands for the business the same (or very similar) for
all 
applications?
6. can you easily upgrade hardware or demand more machines?
7. must you enforce security between applications and are the clients
not 
under your control?

I'm puzzled though, you have so much memory but so little CPU. What
about the 
disks? Size? Spinning or SSD?

Cheers,

> Hi,
> 
> I have a need to index multiple applications using Solr, I also have
the
> need to share indexes or run a search query across these application
> indexes. Is solr multi-core - the way to go?  My server config is
> 2virtual CPUs @ 1.8 GHz and has about 32GB of memory. What is the
> recommendation?
> 
> Thanks,
> Sai Thumuluri


RE: Solr multi cores or not

2011-02-18 Thread Thumuluri, Sai
Thank you, I will go the multi-core route and see how that works out. I
guess, if we have to run queries across the cores, I may have to just
run separate queries. 

-Original Message-
From: Marc SCHNEIDER [mailto:marc.schneide...@gmail.com] 
Sent: Friday, February 18, 2011 8:01 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr multi cores or not

Hi,

It depends on what kind of data you are indexing across your multiple
applications.
If app1 has many fields to be indexed and app2 too, and if these fields are
different, then it would probably be better to have multiple cores.
If you have a lot of common fields between app1 and app2 then one index is
probably the best choice, as it will avoid configuring / implementing
several indexes. In this case you can also have a differentiating field
(like 'type') so that you can get the data corresponding to your app.
It really depends on your data structure.

Hope this helps,
Marc.

On Wed, Feb 16, 2011 at 9:45 PM, Thumuluri, Sai <
sai.thumul...@verizonwireless.com> wrote:

> Hi,
>
> I have a need to index multiple applications using Solr, I also have
the
> need to share indexes or run a search query across these application
> indexes. Is solr multi-core - the way to go?  My server config is
> 2virtual CPUs @ 1.8 GHz and has about 32GB of memory. What is the
> recommendation?
>
> Thanks,
> Sai Thumuluri
>
>
>


RE: [ANNOUNCE] Web Crawler

2011-03-02 Thread Thumuluri, Sai
Dominique, does your crawler support NTLM2 authentication? We have content 
under SiteMinder which uses NTLM2, and that is posing challenges with Nutch.

-Original Message-
From: Dominique Bejean [mailto:dominique.bej...@eolya.fr] 
Sent: Wednesday, March 02, 2011 6:22 AM
To: solr-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler

Aditya,

The crawler is not open source and won't be in the near future. Anyway, 
I have to change the license because it can be used for any personal or 
commercial project.

Sincerely,

Dominique

Le 02/03/11 10:02, findbestopensource a écrit :
> Hello Dominique Bejean,
>
> Good job.
>
> We identified almost 8 open source web crawlers 
> http://www.findbestopensource.com/tagged/webcrawler   I don't know how 
> far yours would be different from the rest.
>
> Your license states that it is not open source but it is free for 
> personal use.
>
> Regards
> Aditya
> www.findbestopensource.com 
>
>
> On Wed, Mar 2, 2011 at 5:55 AM, Dominique Bejean 
> mailto:dominique.bej...@eolya.fr>> wrote:
>
> Hi,
>
> I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java
> Web Crawler. It includes :
>
>   * a crawler
>   * a document processing pipeline
>   * a solr indexer
>
> The crawler has a web administration in order to manage web sites
> to be crawled. Each web site crawl is configured with a lot of
> possible parameters (no all mandatory) :
>
>   * number of simultaneous items crawled by site
>   * recrawl period rules based on item type (html, PDF, ...)
>   * item type inclusion / exclusion rules
>   * item path inclusion / exclusion / strategy rules
>   * max depth
>   * web site authentication
>   * language
>   * country
>   * tags
>   * collections
>   * ...
>
> The pipeline includes various ready-to-use stages (text
> extraction, language detection, Solr ready-to-index XML writer, ...).
>
> All is very configurable and extendible either by scripting or
> java coding.
>
> With scripting technology, you can help the crawler to handle
> javascript links or help the pipeline to extract relevant title
> and cleanup the html pages (remove menus, header, footers, ..)
>
> With java coding, you can develop your own pipeline stage
>
> The Crawl Anywhere web site provides good explanations and screen
> shots. All is documented in a wiki.
>
> The current version is 1.1.4. You can download and try it out from
> here : www.crawl-anywhere.com 
>
>
> Regards
>
> Dominique
>
>


Solr under Tomcat

2011-03-02 Thread Thumuluri, Sai
Good Morning, 
We have deployed Solr 1.4.1 under Tomcat and it works great, however I
cannot find where the index (directory) is created. I set solr home in
web.xml under /webapps/solr/WEB-INF/, but not sure where the data
directory is. I have a need where I need to completely index the site
and it would help for me to stop solr, delete index directory and
restart solr prior to re-indexing the content. 

Thanks,
Sai Thumuluri




RE: Solr under Tomcat

2011-03-02 Thread Thumuluri, Sai
Thank you - I found it. 

-Original Message-
From: rajini maski [mailto:rajinima...@gmail.com] 
Sent: Thursday, March 03, 2011 12:03 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr under Tomcat

Sai,

 The index directory will be in your Solr_home//Conf//data
directory.
The path for this directory needs to be given wherever you want
by changing the data-dir path in the config XML that is present in the same
//conf folder. You need to stop the tomcat service to delete this directory
and
then restart tomcat. Tomcat itself generates the data folder at the
path
specified in the config if the folder is not available. The folder usually
has
two sub-folders: index and spell-check.

Regards,
Rajani Maski




On Wed, Mar 2, 2011 at 7:39 PM, Thumuluri, Sai <
sai.thumul...@verizonwireless.com> wrote:

> Good Morning,
> We have deployed Solr 1.4.1 under Tomcat and it works great, however I
> cannot find where the index (directory) is created. I set solr home in
> web.xml under /webapps/solr/WEB-INF/, but not sure where the data
> directory is. I have a need where I need to completely index the site
> and it would help for me to stop solr, delete index directory and
> restart solr prior to re-indexing the content.
>
> Thanks,
> Sai Thumuluri
>
>
>


Index content behind siteminder

2011-05-24 Thread Thumuluri, Sai
Good morning, I am trying to index some PDFs which are protected by
siteminder, any ideas as to how I can go about it? I am using Solr 1.4



Direct hits using Solr

2010-05-17 Thread Sai . Thumuluri
Hi, is there a way to have Solr return a URL that is not part of the index? We 
have a need that the search engine return a specific URL for a specific search 
term, and that result is supposed to be the first result (per Biz) among the 
result set. The URL is an external URL and there is no intent to index the 
contents of that site.  

any help towards feasibility of this issue is greatly appreciated

Thanks,
Sai Thumuluri


RE: Direct hits using Solr

2010-05-17 Thread Sai . Thumuluri
How do I index a URL without indexing the content? Basically our requirement 
is that we have certain search terms for which there needs to be a URL that 
should come right on top. I tried to use the elevate option within Solr - but 
from what I know, I need to have the id of an indexed document to elevate a 
particular URL. 

Sai Thumuluri 

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Monday, May 17, 2010 6:12 AM
To: solr-user@lucene.apache.org
Subject: Re: Direct hits using Solr

> We have a need that
> search engine return a specific URL for a specific search
> term and that result is supposed to be the first result (per
> Biz) among the result set. 

This part seems like http://wiki.apache.org/solr/QueryElevationComponent

> The URL is an external URL and
> there is no intent to index contents of that site.  

Can you explain in more detail? Even if you don't index content of that site, 
you may have to index that URL.
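
For reference, a QueryElevationComponent elevate.xml entry maps query text to
the unique key of an indexed document (values below are placeholders), which
is why the URL does need some indexed document to point at:

```xml
<elevate>
  <query text="special search term">
    <!-- "doc-123" must be the uniqueKey of a document already in the index -->
    <doc id="doc-123"/>
  </query>
</elevate>
```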




  


RE: Direct hits using Solr

2010-05-17 Thread Sai . Thumuluri
Thank you Erik, I will follow this route

Sai Thumuluri 

-Original Message-
From: Erik Hatcher [mailto:erik.hatc...@gmail.com] 
Sent: Monday, May 17, 2010 10:22 AM
To: solr-user@lucene.apache.org
Subject: Re: Direct hits using Solr

Sai - this seems to be best built into your application tier above  
Solr, such that you have a database of special terms and URL mappings  
and simply present them above the results returned from Solr.

Erik
http://www.lucidimagination.com

On May 17, 2010, at 3:11 PM, sai.thumul...@verizonwireless.com wrote:

> How do I index an URL without indexing the content? Basically our  
> requirement is that - we have certain search terms for which there  
> need to be a URL that should come right on top. I tried to use  
> elevate option within Solr - but from what I know - I need to have  
> an id of the indexed content for me to elevate a particular URL.
>
> Sai Thumuluri
>
> -Original Message-
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Monday, May 17, 2010 6:12 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Direct hits using Solr
>
>> We have a need that
>> search engine return a specific URL for a specific search
>> term and that result is supposed to be the first result (per
>> Biz) among the result set.
>
> This part seems like
http://wiki.apache.org/solr/QueryElevationComponent
>
>> The URL is an external URL and
>> there is no intent to index contents of that site.
>
> Can you explain in more detail? Even if you don't index content of  
> that site, you may have to index that URL.
>
>
>
>
>



Rebuild an index

2010-05-28 Thread Sai . Thumuluri
Hi, 
We use Drupal as the CMS and Solr for our search engine needs and are
planning to have Solr Master-Slave replication setup across the data
centers. I am in the process of testing my replication - what is the
best means to delete the index on the Solr slave and then replicate a
fresh copy from Master?  We use Solr 1.3.

Thanks,
Sai Thumuluri

My Master solrconfig.xml is 

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,synonyms.txt,stopwords.txt,elevate.xml</str>
  </lst>
</requestHandler>

And my slave solrconfig.xml

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://masterURL:8080/solr/replication</str>
    <str name="pollInterval">01:00:00</str>
  </lst>
</requestHandler>

Spellcheck and Solrconfig

2010-06-10 Thread Sai . Thumuluri
Hi,
We use Solr along with Drupal for our content management needs. The
solrconfig.xml that we have from Drupal mentions that "we do not
spellcheck by default" and here is our request handler from
solrconfig.xml. 

First question - why is it recommended that we do not spellcheck by
default?
Secondly - if we add spellcheck in the  tag - will
spellcheck be enabled?

We are using basic Solr and Drupal configurations - only now - we are
looking at tweaking solrconfig and schema files. Any help is greatly
appreciated. 

Thanks,
Sai

<requestHandler name="partitioned" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">body^1.0 title^10.0 name^3.0 taxonomy_names^2.0
      tags_h1^5.0 tags_h2_h3^3.0 tags_h4_h5_h6^2.0 tags_inline^1.0</str>
    <str name="pf">body^2.0</str>
    <int name="ps">15</int>
    <str name="mm">2&lt;-35%</str>
    <str name="q.alt">*:*</str>
    <!-- Highlighting defaults -->
    <str name="hl">true</str>
    <str name="hl.fl">body</str>
    <int name="hl.snippets">3</int>
    <bool name="hl.mergeContiguous">true</bool>
    <str name="f.body.hl.alternateField">body</str>
    <int name="f.body.hl.maxAlternateFieldLength">256</int>
    <!-- We do not spellcheck by default -->
    <str name="spellcheck">false</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">false</str>
    <int name="spellcheck.count">1</int>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
    <str>elevator</str>
  </arr>
</requestHandler>


joins

2015-08-13 Thread Kiran Sai Veerubhotla
does solr support joins?

we have a use case where two collections have to be joined and the join has
to be on the faceted results of the two collections. is this possible?


caches with faceting

2015-08-20 Thread Kiran Sai Veerubhotla
I have used the JSON Facet API and noticed that it relies heavily on the
filter cache.

The index is optimized, all my fields have docValues='true', the number of
documents is 2.6 million, and we always facet on almost all the documents
with 'fq'.

The sizes of documentCache and queryResultCache are very minimal (< 10) - is
that ok? I understand that documentCache stores the documents that are
fetched from disk (segments merged) and its size is set to 2000.

fieldCache is always zero - is that because of docValues?

ver 5.2.1
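
For context, a request of the kind described might look like this under the
JSON Facet API (field names are illustrative):

```json
{
  "query": "*:*",
  "filter": ["category:books"],
  "facet": {
    "top_authors": { "type": "terms", "field": "author", "limit": 10 }
  }
}
```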


Re: caches with faceting

2015-08-21 Thread Kiran Sai Veerubhotla
Kindly help on this

On Thu, Aug 20, 2015 at 2:46 PM, Kiran Sai Veerubhotla 
wrote:

> i have used json facet api and noticed that its relying heavily on filter
> cache.
>
> index is optimized and all my fields are with docValues='true'  and the
> number of documents are 2.6 million and always faceting on almost all the
> documents with 'fq'
>
> the size of documentCache and queryResultCache are very minimal < 10 ? is
> it ok ? i understand that documentCache stores the documents that are
> fetched from disk(segment merged) and the size is set to 2000
>
> fieldCache is always zero is it because of docValues?
>
> ver 5.2.1
>
>
>
>


Re: caches with faceting

2015-08-21 Thread Kiran Sai Veerubhotla
thank you Yonik

On Fri, Aug 21, 2015 at 12:43 PM, Yonik Seeley  wrote:

> On Thu, Aug 20, 2015 at 3:46 PM, Kiran Sai Veerubhotla
>  wrote:
> > i have used json facet api and noticed that its relying heavily on filter
> > cache.
>
> Yes.  The root domain (the set of documents that match the base query
> and filters) is cached in the filter cache.
> For sub-facets, the set of documents that matches a particular bucket
> also utilizes the filter cache.
>
> > index is optimized and all my fields are with docValues='true'  and the
> > number of documents are 2.6 million and always faceting on almost all the
> > documents with 'fq'
> >
> > the size of documentCache and queryResultCache are very minimal < 10 ? is
> > it ok ? i understand that documentCache stores the documents that are
> > fetched from disk(segment merged) and the size is set to 2000
>
> If your document size is large at all, you could probably reduce the
> size of the doc cache with little impact.
>
> > fieldCache is always zero is it because of docValues?
>
> Right.
>
> > ver 5.2.1
>
> Version 5.3 is out now.  The official "latest version" link hasn't
> been changed yet, but I maintain a list of download links for
> different versions here:
> http://yonik.com/download/
>
> -Yonik
>


Collapse & Expand

2015-08-21 Thread Kiran Sai Veerubhotla
how can i use collapse & expand on the docValues with json facet api?


Solr cloud clusterstate.json update query ?

2015-05-05 Thread Sai Sreenivas K
Could you clarify the following questions?
1. Is there a way to avoid all the nodes simultaneously getting into
a recovery state when bulk indexing happens? Is there an API to disable
replication on one node for a while?

2. We recently changed the host name on the nodes in solr.xml, but the old
host entries still exist in clusterstate.json, marked as active, though
live_nodes has the correct information. Who updates clusterstate.json if
a node goes down in an ungraceful fashion, without notifying its down
state?

Thanks,
Sai Sreenivas K


Reg: Indexing Date Fields

2010-04-15 Thread Venkata Sai Krishna Vepakomma
Hi,

1) How do I query for data between 2 date ranges? I have specified the 
following field definition in schema.xml.

   

I have long values for the date fields. When I query with long values, I 
always get all the results.

2) For indexing to work efficiently and for querying between date ranges, 
is it OK to use long values, or do I need to use the 'Date' type with specific 
formats?

Please Let me know your thoughts.

Thanks & Regards
Venkat
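
A hedged note for this thread: a plain (non-trie, non-sortable) long field in
Solr 1.x compares values as text, which is one common reason range queries
appear to match everything; a trie numeric type or a real date type supports
ranges properly. A sketch, with illustrative names:

```xml
<!-- schema.xml: a range-queryable numeric field for millisecond timestamps -->
<fieldType name="tlong" class="solr.TrieLongField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>
<field name="created_ms" type="tlong" indexed="true" stored="true"/>

<!-- or a real date field -->
<field name="created" type="date" indexed="true" stored="true"/>

<!-- example queries:
     created_ms:[1262304000000 TO 1293840000000]
     created:[2010-01-01T00:00:00Z TO 2010-12-31T23:59:59Z] -->
```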