Re: EmbeddedSolrServer and StreamingUpdateSolrServer

2012-04-17 Thread pcrao
Hi Mikhail Khludnev,

You are partially right: we have two separate processes accessing the same
Lucene Directory, but they do not run simultaneously. They run one after the
other, and the second starts only after the first one has completed. The
commit from the EmbeddedServer is successful, and I am posting the log below.

---

INFO: [] webapp=null path=/update/extract
params={stream.type=text%2Fhtml&collectionName=docs} status=0 QTime=5
Apr 5, 2012 7:28:34 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start
commit(optimize=false,waitSearcher=true,expungeDeletes=false,softCommit=false)
Apr 5, 2012 7:28:34 AM org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2
   
commit{dir=/opt/solr/home/data/docs_index/index,segFN=segments_4v,version=1333471748253,generation=175,filenames=[_5a.fdt,
_5a_0.tip, _5a.fdx, _5a.tvf, _5a.tvx, segments_4v, _5a.tvd, _5a_0.prx,
_5a.per, _5a_0.frq, _5a_0.tim, _5a.fnm]
   
commit{dir=/opt/solr/home/data/docs_index/index,segFN=segments_4w,version=1333471748256,generation=176,filenames=[_5b.fnm,
_5b.tvd, _5b.tvf, _5b_0.tip, _5b.nrm, _5b_0.tim, _5b.fdx, _5b_0.prx,
_5b_0.frq, segments_4w, _5b.tvx, _5b.per, _5b.fdt]
Apr 5, 2012 7:28:34 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1333471748256
Apr 5, 2012 7:28:34 AM org.apache.solr.search.SolrIndexSearcher 
INFO: Opening Searcher@17c232ee main
Apr 5, 2012 7:28:34 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
Apr 5, 2012 7:28:34 AM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@17c232ee
main{DirectoryReader(segments_4w:1333471748256 _5b(4.0):Cv1000)} from
Searcher@658f7386 main{DirectoryReader(segments_4v:1333471748253
_5a(4.0):Cv16787)}
   
fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 5, 2012 7:28:34 AM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@17c232ee
main{DirectoryReader(segments_4w:1333471748256 _5b(4.0):Cv1000)}
   
fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 5, 2012 7:28:34 AM org.apache.solr.core.SolrCore registerSearcher
INFO: [] Registered new searcher Searcher@17c232ee
main{DirectoryReader(segments_4w:1333471748256 _5b(4.0):Cv1000)}
Apr 5, 2012 7:28:34 AM org.apache.solr.search.SolrIndexSearcher close
INFO: Closing Searcher@658f7386 main
   
fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 5, 2012 7:28:34 AM org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {commit=} 0 344
Apr 5, 2012 7:28:34 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=/update params={commit=true&waitSearcher=true}
status=0 QTime=344
Apr 5, 2012 7:28:34 AM org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {add=[009658]} 0 9
Apr 5, 2012 7:28:34 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=/update/extract
params={stream.type=text%2Fhtml&collectionName=docs} status=0 QTime=9
-

Please let me know your thoughts.

Thanks,
PC Rao.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/EmbeddedSolrServer-and-StreamingUpdateSolrServer-tp3889073p3916521.html
Sent from the Solr - User mailing list archive at Nabble.com.


Unable to execute query (Transactions not supported) on running fullimport

2012-04-17 Thread nigmail
I am using the full import option with the files as mentioned below:


*data-config.xml*

[...]

*schema.xml*

[...]

*solrconfig.xml*

[...]
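The archive stripped the XML above. A minimal data-config.xml consistent with
the error below would look roughly like this (the Informix JDBC URL,
credentials and field mapping are assumptions; the query is copied verbatim
from the error message):

    <dataConfig>
      <!-- driver/url/user/password are placeholders, not from the original post -->
      <dataSource type="JdbcDataSource"
                  driver="com.informix.jdbc.IfxDriver"
                  url="jdbc:informix-sqli://localhost:9088/mydb:INFORMIXSERVER=myserver"
                  user="user" password="password"/>
      <document>
        <entity name="empresas"
                query="SELECT nif ID, tipo_ent TIPO_ENT, tipo_dir TIPO DIR FROM cdt_empresas">
          <!-- column-to-field mapping assumed -->
          <field column="ID" name="id"/>
        </entity>
      </document>
    </dataConfig>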

The query runs perfectly in Informix, but on running the full-import option I
get the error mentioned below:

(response header: status 0, QTime 851; config data-config.xml; command
full-import; debug mode)

org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to execute query: SELECT nif ID, tipo_ent TIPO_ENT, tipo_dir TIPO DIR
FROM cdt_empresas Processing Document # 1 at
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.(JdbcDataSource.java:253)
at
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
at
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
at
org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:188)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
at
org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:206)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326) at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at
org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.sql.SQLException: Transactions not supported at
com.informix.util.IfxErrMsg.getSQLException(IfxErrMsg.java:373) at
com.informix.jdbc.IfxSqliConnect.setAutoCommit(IfxSqliConnect.java:1820) at
org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:172)
at
org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:128)
at
org.apache.solr.handler.dataimport.JdbcDataSource.getConnection(JdbcDataSource.java:363)
at
org.apache.solr.handler.dataimport.JdbcDataSource.access$200(JdbcDataSource.java:39)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.(JdbcDataSource.java:240)
... 33 more

(remainder of the DIH response: status idle; "Configuration Re-loaded
successfully"; 2012-04-12 11:46:46 "Indexing failed. Rolled back all
changes."; 2012-04-12 11:46:47; "This response format is experimental. It is
likely to change in the future.")
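The root cause at the bottom ("Transactions not supported", thrown from
IfxSqliConnect.setAutoCommit, which DIH's JdbcDataSource calls) can be
reproduced outside Solr: Informix databases created without transaction
logging reject setAutoCommit. A minimal sketch, with a placeholder JDBC URL
and credentials:

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.DriverManager;

    public class InformixAutoCommitCheck {
        public static void main(String[] args) throws Exception {
            Class.forName("com.informix.jdbc.IfxDriver");
            // URL, user and password below are placeholders
            Connection conn = DriverManager.getConnection(
                    "jdbc:informix-sqli://localhost:9088/mydb:INFORMIXSERVER=myserver",
                    "user", "password");
            DatabaseMetaData meta = conn.getMetaData();
            // prints false for a non-logged Informix database
            System.out.println("supportsTransactions = " + meta.supportsTransactions());
            // the call that throws "Transactions not supported" in the stack trace above
            conn.setAutoCommit(false);
            conn.close();
        }
    }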

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unable-to-execute-query-Transactions-not-supported-on-running-fullimport-tp3916529p3916529.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr TransformerException, SocketException: Broken pipe

2012-04-17 Thread JM
> 
> Hi Guys,
> We are experiencing SEVERE exceptions in SOLR (stacktrace below)
> Please let me know if anyone has experienced this and have some insight /
> pointers on to where and what should I look for to resolve this.
> ERROR [solr.servlet.SolrDispatchFilter] - : java.io.IOException: XSLT
> transformation error
> 
> After the exception the SOLR goes in Unstable state and the response time
> increases from less than 50ms to more than 5000ms.
> 
> Thanks,
> Bhawna

Hello,

I have the same strange exception on my index. Did you solve it already, or do
you have any useful advice?

best regards
JM




Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-17 Thread Jan Høydahl
Hi,

I think Katta integration is nice, but it is not very real-time. What if you 
want both?
Perhaps a Katta/SolrCloud integration could make the two frameworks play 
together, so that some shards in SolrCloud may be marked as "static" while 
others are "realtime". SolrCloud will handle indexing the realtime shards as 
today, but indexing the static shards will be handled by Katta. If Katta adds a 
shard it will tell SolrCloud by updating the ZK tree, and SolrCloud will pick 
up the shard and start serving search for it.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 17. apr. 2012, at 02:42, Jason Rutherglen wrote:

> One of the big weaknesses of Solr Cloud (and ES?) is the lack of the
> ability to redistribute shards across servers.  Meaning: splitting a
> shard that has grown too large, while taking live updates.
> 
> How do you plan on elastically adding more servers without this feature?
> 
> Cassandra and HBase handle elasticity in their own ways.  Cassandra
> has successfully implemented the Dynamo model and HBase uses the
> traditional BigTable 'split'.  Both systems are complex, though each is at
> a singular level of maturity.
> 
> Also Cassandra [successfully] implements multiple data center support,
> is that available in SC or ES?
> 
> On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
>  wrote:
>> Hello Ali,
>> 
>>> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
>> 
>>> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
>>> crawled + indexed every *4 weeks, *with a search latency of less than 0.5
>>> seconds.
>> 
>> 
>> That's fine.  Whether it's doable with any tech will depend on how much 
>> hardware you give it, among other things.
>> 
>>> Needless to mention, the search index needs to scale to 5Billion pages. It
>>> is also possible that I might need to store multiple indexes -- one for
>>> crawled content, and one for ancillary data that is also very large. Each
>>> of these indices would likely require a logically distributed and
>>> replicated index.
>> 
>> 
>> Yup, OK.
>> 
>>> However, I would like for such a system to be homogenous with the Hadoop
>>> infrastructure that is already installed on the cluster (for the crawl). In
>>> other words, I would much prefer if the replication and distribution of the
>>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
>>> using another scalability framework (such as SolrCloud). In addition, it
>>> would be ideal if this environment was flexible enough to be dynamically
>>> scaled based on the size requirements of the index and the search traffic
>>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
>>> enough to automatically provision additional processing power into the
>>> cluster without requiring server re-starts).
>> 
>> 
>> There is no such thing just yet.
>> There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
>> automatically index HBase content, but that was either not completed or not 
>> committed into HBase.
>> 
>>> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
>>> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
>>> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
>>> mature enough and would be the right architectural choice to go along with
>>> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
>>> above.
>> 
>> 
>> Here is a summary on all of them:
>> * Search on HBase - I assume you are referring to the same thing I mentioned 
>> above.  Not ready.
>> * Solandra - uses Cassandra+Solr, plus DataStax now has a different 
>> (commercial) offering that combines search and Cassandra.  Looks good.
>> * Lily - data stored in HBase cluster gets indexed to a separate Solr 
>> instance(s)  on the side.  Not really integrated the way you want it to be.
>> * ElasticSearch - solid at this point, the most dynamic solution today, can 
>> scale well (we are working on a many-B documents index and hundreds of
>> nodes with ElasticSearch right now), etc.  But again, not integrated with 
>> Hadoop the way you want it.
>> * IndexTank - has some technical weaknesses, not integrated with Hadoop, not 
>> sure about its future considering LinkedIn uses Zoie and Sensei already.
>> * And there is SolrCloud, which is coming soon and will be solid, but is 
>> again not integrated.
>> 
>> If I were you and I had to pick today - I'd pick ElasticSearch if I were 
>> completely open.  If I had Solr bias I'd give SolrCloud a try first.
>> 
>>> Lastly, how much hardware (assuming a medium sized EC2 instance) would you
>>> estimate my needing with this setup, for regular web-data (HTML text) at
>>> this scale?
>> 
>> I don't know off the top of my head, but I'm guessing several hundred for
>> serving search requests.
>> 
>> HTH,
>> 
>> Otis
>> --
>> Search Analytic

Re: Can I use Field Aliasing/Renaming on Solr3.3?

2012-04-17 Thread Jan Høydahl
You'll have to upgrade to 3.6. Upgrading is really easy and should be 100% 
back-compat. Just keep your old config and drop in the new solr.war, then 
you'll get the new features.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 17. apr. 2012, at 04:16, bing wrote:

> Hi, all, 
> 
> I am working on Solr 3.3. Recently I found a new feature (Field
> Aliasing/Renaming) in Solr 3.6, and I want to use it in Solr 3.3. Can I do
> that, and how? 
> 
> Thank you. 
> 
> Best Regards, 
> Bing
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Can-I-use-Field-Aliasing-Renaming-on-Solr3-3-tp3916103p3916103.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Jira 1540

2012-04-17 Thread Ramprakash Ramamoorthy
I am using solr to perform a distributed search. I am using version 1.3 to
accommodate older indices in the already existing system. I am able to
perform a search over a single shard, even faceting and highlighting works.

However, when it comes to distributed search, I get an exception 500. The
stacktrace is almost similar to the one mentioned in
https://issues.apache.org/jira/browse/SOLR-1540 . The issue is also almost
the same.

Any idea/fix to tackle?

-- 
With Thanks and Regards,
Ramprakash Ramamoorthy,
SASTRA University.


Wrong categorization with DIH

2012-04-17 Thread Ramo Karahasan
Hi,

 

I currently face the following issue:

Testing the following sql statement which is also used in SOLR (DIH) leads
to a wrong categorization in solr:

select p.title as title, p.id, p.category_id, p.pic_thumb, c.name as
category, c.id as category_id from product p, category c WHERE p.category_id
= c.id AND p.id = 3091328

 

This returns in my sql client:

Apple MacBook Pro MD313D/A 33,8 cm (13,3 Zoll) Notebook (Intel Core
i5-2435M, 2,4GHz, 4GB RAM, 500GB HDD, Intel HD 3000, Mac OS), 3091328, 1003,
http://m-d.ww.cdn.com/images/I/41teWbp-uAL._SL75_.jpg, Computer, 1003

 

As you see, the categoryid 1003 points to "Computer"

 

Via the Solr search admin I get the following result when searching for
id:3091328 

Sport

1003

 

Which is wrong. ID 1003 points to the correct category name "Computer" in my
sql client

 

 

Does anyone know what's going wrong here?

I'm using Solr 3.5 on Tomcat 6.

 

Thanks,

Ramo



Re: Can Solr solve this simple problem?

2012-04-17 Thread Jan Høydahl
Hi,

You have many basic questions about search. Can I recommend one of the books? 
http://lucene.apache.org/solr/books.html
Also, you'll find a lot of answers on the Solr WIKI: 
http://wiki.apache.org/solr/ if you're not aware of it.

I think Solr may solve your performance problems well.
Whether it's the right tool for the job depends on several factors.
Also, sometimes it is useful to step back and think fresh. Perhaps the reason 
why you implemented things like you did was technical reasons driven by your DB 
capabilities.
When re-implementing on top of Solr, perhaps there are better ways to do what 
you REALLY wanted instead of limiting yourself to the ORDER BY syntax etc. 
One of Solr's strengths is relevancy and FunctionQueries and it can do amazing 
things :)

Further answers below..

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 17. apr. 2012, at 07:20, Alexandr Bocharov wrote:

> Thanks for your reply :)
> I have some new questions now:
> 1. How stable is trunk version? Has anyone used it on any kind of highload
> project in production?
It's stable. Used in production many places. Soon expected in alpha or beta 
release
> 2. Does version 3.6 support near real time index update?
No
> 3. What is scheme of Solr index storing? Is it all in memory for each shard
> or in disk with caching for frequently asked queries in memory?
On disk but with many caching optimizations
> 4. The best practice for index updating is - to do delta imports each 5
> minutes for example, and once a day - full rebuild index, does it take long
> time for ~100 mln users? Am I right?
You can do deltas only, as often as you choose. Solr will handle the backend 
details
> 5. Do sharding and replication have native support in Solr, so everything
> I need to care about is the config file for nodes? Are there any limitations
> on using such sorting if we use sharding?
Yes, sharding and replication are natively supported. See the Wiki
> The reason why we want to move from our DB search scheme (data is sharded
> into small tables on several servers and managed in code) is that:
> 1. the response time of our search isn't what we need (3-5 s now in
> production, we want <1 s)
> 2. a growing amount of data
> 3. we want automatic clustering of any amount of data, and search over it,
> without needing to care about how the data is stored or whether it is durable
> 
> That's why we are also looking at other solutions with autosharding of huge
> amounts of data and the ability to run such queries and sorting (thinking
> about MySQL Cluster, but it's not stable yet, or Oracle Cluster). If anyone
> can give advice on such technology, I'll be glad to hear it.
What do you expect from "Autosharding"?
> 
> 2012/4/17 Jan Høydahl 
> 
>>> Hi everyone :)
>> 
>> Hi :)
>> 
>>> So, these are my 3 questions:
>>> 1. Does Solr provide searching among different count fields with
>> different
>>> types like in WHERE condition?
>> 
>> Yes. As long as these are not full-text you should use filter queries for
>> these, e.g.
>> &q=*:*
>> &fq=country:USA
>> &fq=language:SPA
>> &fq=age:[30 TO 40]
>> &fq=(bool_field1:1 OR bool_field2:1)
>> 
>> The reason why I put multiple "fq" instead of one long is to optimize for
>> caching of filters
>> 
>>> 2. Does Solr provide such sorting, that depends on other fields (like
>> sums
>>> in ORDER BY), other words - does it provide any kind of function, which
>> is
>>> used to sort results from q1?
>> 
>> Yes. In the trunk version you can sort by a function, which can do sums and
>> all kinds of crazy things:
>> &sort=sum(product(has_photo,10),if(exists(query($agequery)),50,0))
>> asc&agequery=age:[53 TO *]
>> See http://wiki.apache.org/solr/FunctionQuery for more functions
>> 
>> But you could also do much of this through boost queries:
>> &sort=score desc
>> &bq=language:FRA^50
>> &bq=age:[53 TO *]^20
>> 
>>> 3. Does Solr provide realtime index updating or updating every N minutes?
>> 
>> Sure, there is Near Real-time indexing in TRUNK (coming 4.0)
>> 
>> Jan



Re: Jira 1540

2012-04-17 Thread Jan Høydahl
Simply try using Solr3.6 to read your old 1.3 indices. Chances are that it will 
work - without the exceptions :)

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 17. apr. 2012, at 11:08, Ramprakash Ramamoorthy wrote:

> I am using solr to perform a distributed search. I am using version 1.3 to
> accommodate older indices in the already existing system. I am able to
> perform a search over a single shard, even faceting and highlighting works.
> 
> However, when it comes to distributed search, I get an exception 500. The
> stacktrace is almost similar to the one mentioned in
> https://issues.apache.org/jira/browse/SOLR-1540 . The issue is also almost
> the same.
> 
> Any idea/fix to tackle?
> 
> -- 
> With Thanks and Regards,
> Ramprakash Ramamoorthy,
> SASTRA University.



Re: Wrong categorization with DIH

2012-04-17 Thread Gora Mohanty
On 17 April 2012 14:47, Ramo Karahasan  wrote:
> Hi,
>
>
>
> I currently face the following issue:
>
> Testing the following sql statement which is also used in SOLR (DIH) leads
> to a wrong categorization in solr:
>
> select p.title as title, p.id, p.category_id, p.pic_thumb, c.name as
> category, c.id as category_id from product p, category c WHERE p.category_id
> = c.id AND p.id = 3091328
>
>
>
> This returns in my sql client:
>
> Apple MacBook Pro MD313D/A 33,8 cm (13,3 Zoll) Notebook (Intel Core
> i5-2435M, 2,4GHz, 4GB RAM, 500GB HDD, Intel HD 3000, Mac OS), 3091328, 1003,
> http://m-d.ww.cdn.com/images/I/41teWbp-uAL._SL75_.jpg, Computer, 1003
>
>
>
> As you see, the categoryid 1003 points to "Computer"
>
>
>
> Via the Solr search admin I get the following result when searching for
> id:3091328
>
> Sport
>
> 1003
[...]

Please share with us the rest of the DIH configuration file, i.e.,
the part where these data are saved to the Solr index.

Regards,
Gora


AW: Wrong categorization with DIH

2012-04-17 Thread Ramo Karahasan
I've figured out that this wrong categorization happens when doing a delta
import.

I'm doing the delta import as described here:
http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport

My data-properties.xml looks like:

[...]
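The XML above was stripped by the archive. For reference, the pattern on that
wiki page is a single entity whose query covers both the full and delta
cases; the table and column names below are placeholders:

    <entity name="product"
            query="SELECT * FROM product
                   WHERE '${dataimporter.request.clean}' != 'false'
                      OR last_modified &gt; '${dataimporter.last_index_time}'"/>

With clean=false only rows modified since the last import are selected; with
clean=true the whole table is re-imported.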

So I run the delta import command like:
http://localhost:8983/solr/core0/dataimport?command=full-import&clean=false

If I do this, the categories are set wrong as described below.

If I do a full import with
http://localhost:8983/solr/core0/dataimport?command=full-import&clean=true
and delete the index first, then the categorization is correct.

Is there a caching issue somewhere?

Thanks,
Ramo

-----Original Message-----
From: Gora Mohanty [mailto:g...@mimirtech.com]
Sent: Tuesday, April 17, 2012 11:34
To: solr-user@lucene.apache.org
Subject: Re: Wrong categorization with DIH

On 17 April 2012 14:47, Ramo Karahasan 
wrote:
> [...]

Please share with us the rest of the DIH configuration file, i.e., the part
where these data are saved to the Solr index.

Regards,
Gora



making query in query result

2012-04-17 Thread halil
Hi List,

I want to make a query within a query result which was produced previously. I
googled the net but could not find anything. How can I do that? I need a
starting point.


thanks in advance,


-halil agin.


Re: Can Solr solve this simple problem?

2012-04-17 Thread Alexandr Bocharov
Thanks for your replies, you're a good expert :)
I've read the Solr documentation at a basic level; I've been familiar with it
for around 2 days. The documentation is very huge at first sight :). My
company and I are deciding whether to use Solr or another solution.
Maybe you're right about re-implementing our sorting functions as something
new.

1. If the index is stored on disk, how is good performance achieved (the
index changes frequently, ~50,000-100,000 records are updated every 10
minutes, so maybe caching won't be effective)?
2. What can you say about Solr's semantic search capabilities? Are there any
examples of it in production?
3. Can you please give some example projects/sites with Solr 4.0 in
production?


2012/4/17 Jan Høydahl 

> [...]


[Solr 4.0] what is stored in .tim index file format?

2012-04-17 Thread Lyuba Romanchuk
Hi,

I have an index of ~31G where:
27% of the index size is .fdt files (8.5G)
20% - .fdx files (6.2G)
37% - .frq files (11.6G)
16% - .tim files (5G)

I didn't manage to find the description for .tim files. Can you help me
with this?

Thank you.
Best regards,
Lyuba


Re: Can Solr solve this simple problem?

2012-04-17 Thread Jan Høydahl
1. Just trust that Lucene will perform :)
   Incremental updates are actually stored in separate new index segments with
their own caches, so all the old existing data is left untouched with its
caches in place.

2. Please explain what you expect from "semantic search" which is an overloaded 
word.

3. On http://wiki.apache.org/solr/PublicServers the only one saying so 
explicitly is Jeeran - I'm sure others can fill in with more examples

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 17. apr. 2012, at 12:10, Alexandr Bocharov wrote:

> [...]

Re: making query in query result

2012-04-17 Thread halil
I think the answer is the "nested query". Thanks...

On Tue, Apr 17, 2012 at 12:52 PM, halil  wrote:

> Hi List,
>
> I want to make a query within a query result which was produced previously.
> I googled the net but could not find anything. How can I do that? I need a
> starting point.
>
>
> thanks in advance,
>
>
> -halil agin.
>


Re: making query in query result

2012-04-17 Thread Jeevanandam

Halil -

I'm describing the scenario with sample queries below:

query 1: (cat:"electronics") - let's say it returns 25 docs in the search
result
query 2: (features:"power") - will be applied on the above result, i.e.
'query 1' (25 docs)

so the final result is refined to 16 docs.

If the above scenario matches your need, please try it like this:

q=(cat%3A"electronics")
&fq=(features:"power")

"fq" means Filter Query


-Jeevanandam


On 17-04-2012 4:22 pm, halil wrote:

Hi List,

I want to make a query within a query result which was produced previously.
I googled the net but could not find anything. How can I do that? I need a
starting point.


thanks in advance,


-halil agin.


SolrCloud: Programmatically create multiple collections?

2012-04-17 Thread ravi
Hi,

I have recently started experimenting with SolrCloud. I want to use it for
the requirements mentioned below:

 - create one collection per client
 - create several shards per collection (say 1 shard for each day of month)
 - all of the collections would follow the same schema
 - may need to add more machines to the cluster to index faster or provide
quick turnaround for queries (I think this is possible with ZooKeeper)
 - programmatically access (and create, if possible) shard names and add data
to a given shard

I was looking into tutorials online and could get a sense that one needs to
create collections and shards manually, and there is no programmatic way to
create them dynamically from code, say, using SolrJ.

I now know that adding shards to an already running cluster is not possible
( http://wiki.apache.org/solr/SolrCloud Re-sizing your cluster ).

Is there something wrong in the way I want to organise my indexes? Am I
missing something very obvious?

Thanks!
Ravi

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-Programmatically-create-multiple-collections-tp3916927p3916927.html
Sent from the Solr - User mailing list archive at Nabble.com.


ConcurrentUpdateSolrServer - catching errors

2012-04-17 Thread ads_green
I'm trying to bulk update a number of documents using the
ConcurrentUpdateSolrServer.

Now under normal circumstances, this is working fine.
I create the ConcurrentUpdateSolrServer with suitable queue and connection
pool sizes and then submit documents to be added using multiple threads.
Watching the Solr output I can see the docs being added.

I'm not using autocommit as I want an "all or nothing" update - as such
after submitting all documents to the ConcurrentUpdateSolrServer I issue a
"blockUntilFinished()" and then a "commit()".

All good.

The problem I'm having now is when one of the documents contains an error
(they shouldn't, but it happens).
- The Solr console shows the error details for the invalid field and returns a
400 status code.
- Eventually this gets caught by the ConcurrentUpdateSolrServer
"handleException(Throwable err)" method.

After this point any attempt to stop what's going on seems to get stuck:
- calling ConcurrentUpdateSolrServer.shutdownNow() doesn't seem to stop
anything.
- ConcurrentUpdateSolrServer.blockUntilFinished() will never return after
receiving the 400 error.
- the ConcurrentUpdateSolrServer instance refuses to send any more
communication to the solr server.

My question is - what is the best way to handle document errors when using
ConcurrentUpdateSolrServer?

TIA



--
View this message in context: 
http://lucene.472066.n3.nabble.com/ConcurrentUpdateSolrServer-catching-errors-tp3916803p3916803.html
Sent from the Solr - User mailing list archive at Nabble.com.
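
A minimal sketch of the "all or nothing" flow described above, assuming SolrJ
4.x: the URL, queue size and thread count are placeholders, and the
overridable error hook on ConcurrentUpdateSolrServer is assumed to be
handleError(Throwable), which the post refers to as "handleException". The
idea is to record the first failure and decide between commit and rollback
after blockUntilFinished():

    import java.util.concurrent.atomic.AtomicReference;
    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class AllOrNothingUpdate {
        public static void main(String[] args) throws Exception {
            final AtomicReference<Throwable> failure = new AtomicReference<Throwable>();
            // queue size and thread count are placeholders
            ConcurrentUpdateSolrServer server =
                    new ConcurrentUpdateSolrServer("http://localhost:8983/solr", 1000, 4) {
                        @Override
                        public void handleError(Throwable ex) {
                            failure.compareAndSet(null, ex); // remember the first failure
                        }
                    };
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            server.add(doc);               // submit documents (possibly from many threads)
            server.blockUntilFinished();   // wait for the background queue to drain
            if (failure.get() == null) {
                server.commit();           // all adds succeeded
            } else {
                server.rollback();         // discard the uncommitted adds
            }
            server.shutdown();
        }
    }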


Elevation together with grouping

2012-04-17 Thread notebook99

Hi,

Is it possible to use query elevation together with result grouping?

I tried 

http://localhost:8983/solr/elevate?enableElevation=true&fl=score%2C[elevated]%2Cid%2Cname&forceElevation=true&group.field=manu&group=on&indent=on&q=ipod&wt=json


but the results ignored the elevation:

{
  "responseHeader":{
"status":0,
"QTime":2,
"params":{
  "enableElevation":"true",
  "fl":"score,[elevated],id,name",
  "indent":"on",
  "q":"ipod",
  "forceElevation":"true",
  "group.field":"manu",
  "group":"on",
  "wt":"json"}},
  "grouped":{
"manu":{
  "matches":2,
  "groups":[{
  "groupValue":"belkin",
  "doclist":{"numFound":1,"start":0,"maxScore":0.7698604,"docs":[
  {
"id":"F8V7067-APL-KIT",
"name":"Belkin Mobile Power Cord for iPod w/ Dock",
"score":0.7698604,
"[elevated]":false}]
  }},
{
  "groupValue":"inc",
  "doclist":{"numFound":1,"start":0,"maxScore":0.28869766,"docs":[
  {
"id":"MA147LL/A",
"name":"Apple 60 GB iPod with Video Playback Black",
"score":0.28869766,
"[elevated]":true}]
  }}]}}}
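
For reference, the elevation rule being exercised here is presumably along
the lines of the stock example elevate.xml entry for "ipod" (the doc id
matches the one marked elevated above):

    <elevate>
      <query text="ipod">
        <doc id="MA147LL/A"/>  <!-- the doc reported with "[elevated]":true -->
      </query>
    </elevate>

The response does mark that document as elevated, but the groups are still
ordered by each group's top score, which suggests the grouping component is
not honoring forceElevation.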


Best Regards,
  Michael 


Re: Difference between two solr indexes

2012-04-17 Thread nutchsolruser
I'm Also seeking solution for similar problem. 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Difference-between-two-solr-indexes-tp3916328p3917050.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Merge results of search query to multiple cores

2012-04-17 Thread Erick Erickson
Why do you have these in two cores? Why not
have them in a single core (two different
document types) and do a join?

Best
Erick

On Tue, Apr 17, 2012 at 7:28 AM, nitinkhosla79  wrote:
> I have setup solr with multiple cores. Each core has its own schema(with
> common unique id).
> Example:
> Core0: Id, name
> Core1: Id, type.
>
> I am looking for a way to generate the resultset as Id, name, type.
> Is there any  solution for such problems?
> It is more like a database join query.
> 
> Problem: I had to keep 2 cores, because one index is going to be very
> large (it indexes many PDF documents) and the other is going to be very small
> but very frequently updated (for every indexed document).
> So combining the schemas might make it impossible (with respect to the time
> needed) to index millions of documents.
> 
> In previous posts (from 2010) it has been mentioned that this is not possible
> with Solr.  Wondering if this has changed.
> This link below mentions on using 3rd core and then using shard to merge the
> results. But for me, it does not merge but shows the results EITHER from
> core0 OR from core1   for the SAME id which is present in both core0/core1.
> http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set
>
> Please suggest further.
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Merge-results-of-search-query-to-multiple-cores-tp3916976p3916976.html
> Sent from the Solr - User mailing list archive at Nabble.com.
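
For reference, the single-core join Erick suggests is available from Solr 4.0
via the {!join} query parser. With a hypothetical doctype field distinguishing
the two document types (all field names here are placeholders):

    q=doctype:main AND name:foo
    fq={!join from=Id to=Id}(doctype:meta AND type:pdf)

This returns "main" documents (Id, name) whose Id also appears on a "meta"
document (Id, type) matching type:pdf.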


Re: A tool for frequent re-indexing...

2012-04-17 Thread Ravish Bhagdev
Thanks.  This is useful to know as well.

I was actually after SolrEntityProcessor, which I failed to notice until it
was pointed out by a previous reply, because I'm still using 1.4.

Cheers,
Ravish

On Fri, Apr 6, 2012 at 11:01 AM, Valeriy Felberg
wrote:

> I've implemented something like described in
> https://issues.apache.org/jira/browse/SOLR-3246. The idea is to add an
> update request processor at the end of the update chain in the core
> you want to copy. The processor converts the SolrInputDocument to XML
> (there is some utility method for doing this) and dumps the XML into a
> file which can be fed into Solr again with curl. If you have many
> documents you will probably want to distribute the XML files into
> different directories using some common prefix in the id field.
>
> On Fri, Apr 6, 2012 at 5:18 AM, Ahmet Arslan  wrote:
> >> I am considering writing a small tool that would read from
> >> one solr core
> >> and write to another as a means of quick re-indexing of
> >> data.  I have a
> >> large-ish set (hundreds of thousands) of documents that I've
> >> already parsed
> >> with Tika and I keep changing bits and pieces in schema and
> >> config to try
> >> new things often.  Instead of having to go through the
> >> process of
> >> re-indexing from docs (and some DBs), I thought it may be
> >> much more faster
> >> to just read from one core and write into new core with new
> >> schema, analysers and/or settings.
> >>
> >> I was wondering if anyone else has done anything similar
> >> already?  It would
> >> be handy if I can use this sort of thing to spin off another
> >> core write to
> >> it and then swap the two cores discarding the older one.
> >
> > You might find these relevant :
> >
> > https://issues.apache.org/jira/browse/SOLR-3246
> >
> > http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor
> >
> >
>
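
A minimal sketch of the processor Valeriy describes above (the class name,
dump directory and reliance on an "id" field are assumptions;
ClientUtils.toXML is the utility method he mentions):

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import org.apache.solr.client.solrj.util.ClientUtils;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    // Hypothetical processor sketching the SOLR-3246 approach: dump every added
    // document as XML so it can be re-posted to a new core with curl.
    public class XmlDumpUpdateProcessor extends UpdateRequestProcessor {
        private final File dumpDir;

        public XmlDumpUpdateProcessor(File dumpDir, UpdateRequestProcessor next) {
            super(next);
            this.dumpDir = dumpDir;
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            String id = cmd.getSolrInputDocument().getFieldValue("id").toString();
            String xml = ClientUtils.toXML(cmd.getSolrInputDocument());
            FileWriter out = new FileWriter(new File(dumpDir, id + ".xml"));
            try {
                out.write("<add>" + xml + "</add>"); // wrap so curl can post the file directly
            } finally {
                out.close();
            }
            super.processAdd(cmd); // continue the normal update chain
        }
    }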


How does SolrCloud distribute data among shards of the same cluster?

2012-04-17 Thread emma1023
How solrcloud manage distribute data among shards of the same cluster when
you query? Is it distribute the data equally? What is the basis? Which part
of the code that I can find about it?Thank you so much!


--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-sorlcloud-distribute-data-among-shards-of-the-same-cluster-tp3917323p3917323.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Distributed FacetComponent NullPointer Exception

2012-04-17 Thread Jamie Johnson
I'm noticing that this issue seems to be occurring with facet fields
which have some unexpected characters.  For instance the query that I
see going across the wire is as follows

facet=true&tie=0.1&ids=3F2504E0-4F89-11D3-9A0C-0305E82C3301&qf=%0a++author^0.5+type^0.5+content_mvtxt^10++subject_phonetic^1+subject_txt^20%0a+++&q.alt=*:*&distrib=false&Test+%0a%0a%0a%0a?+%0a%0a%0a%0aDaily+News,Test+Association,Toyota,U.S.,Washington+Post&rows=10&rows=10&NOW=1334670761188&shard.url=JamiesMac.local:8502/solr/shard5-core1/&fl=*,score&q=bob&facet.field={!terms%3D$organization__terms}organization&isShard=true

Now there is an obvious issue here with our data having these \n
characters in it which I will be fixing shortly (plan to use a set of
Character replace filters to remove extra white space).  I am assuming
that this is causing our issue, but it would be nice if someone could
confirm.


On Tue, Apr 17, 2012 at 12:08 AM, Jamie Johnson  wrote:
> I created https://issues.apache.org/jira/browse/SOLR-3362 to track this.
>
> On Mon, Apr 16, 2012 at 11:18 PM, Jamie Johnson  wrote:
>> doing some debugging this is the relevant block in FacetComponent
>>
>>          String name = shardCounts.getName(j);
>>          long count = ((Number)shardCounts.getVal(j)).longValue();
>>          ShardFacetCount sfc = dff.counts.get(name);
>>          sfc.count += count;
>>
>>
>> the issue is sfc is null.  I don't know if that should or should not
>> occur, but if I add a check (if(sfc == null)continue;) then I think it
>> would work.  Is this appropriate?
>>
>> On Mon, Apr 16, 2012 at 10:45 PM, Jamie Johnson  wrote:
>>> worth noting: the error goes away at times depending on the number
>>> of facets asked for.
>>>
>>> On Mon, Apr 16, 2012 at 10:38 PM, Jamie Johnson  wrote:
 I found (what appears to be) the issue I am experiencing here
 http://lucene.472066.n3.nabble.com/NullPointerException-with-distributed-facets-td3528165.html
 but there were no responses to it.  I've included the stack trace I am
 seeing, any ideas why this would happen?


 SEVERE: java.lang.NullPointerException
        at 
 org.apache.solr.handler.component.FacetComponent.refineFacets(FacetComponent.java:489)
        at 
 org.apache.solr.handler.component.FacetComponent.handleResponses(FacetComponent.java:278)
        at 
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307)
        at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1550)
        at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
        at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
        at 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        at 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
        at 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
        at 
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
        at 
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
        at 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
        at 
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
        at 
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
        at 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
        at 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
        at 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
        at 
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
        at 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
        at org.eclipse.jetty.server.Server.handle(Server.java:351)
        at 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
        at 
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
        at 
 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
        at 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
        at 
 org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
        at 
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnectio
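
For reference, the guard proposed above would turn the quoted block into
roughly the following fragment (context reconstructed, not a full patch
against FacetComponent.java):

    String name = shardCounts.getName(j);
    long count = ((Number) shardCounts.getVal(j)).longValue();
    ShardFacetCount sfc = dff.counts.get(name);
    if (sfc == null) {
        continue; // the shard returned a refined term the coordinator never recorded
    }
    sfc.count += count;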

Re: Distributed FacetComponent NullPointer Exception

2012-04-17 Thread Yonik Seeley
facet.field={!terms=$organization__terms}organization

This is referring to another request parameter that Solr should have
added (organization__terms) .  Did you cut-n-paste all of the
parameters below?

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10



On Tue, Apr 17, 2012 at 10:13 AM, Jamie Johnson  wrote:
> I'm noticing that this issue seems to be occurring with facet fields
> which have some unexpected characters.  For instance the query that I
> see going across the wire is as follows
>
> facet=true&tie=0.1&ids=3F2504E0-4F89-11D3-9A0C-0305E82C3301&qf=%0a++author^0.5+type^0.5+content_mvtxt^10++subject_phonetic^1+subject_txt^20%0a+++&q.alt=*:*&distrib=false&Test+%0a%0a%0a%0a?+%0a%0a%0a%0aDaily+News,Test+Association,Toyota,U.S.,Washington+Post&rows=10&rows=10&NOW=1334670761188&shard.url=JamiesMac.local:8502/solr/shard5-core1/&fl=*,score&q=bob&facet.field={!terms%3D$organization__terms}organization&isShard=true
>
> Now there is an obvious issue here with our data having these \n
> characters in it which I will be fixing shortly (plan to use a set of
> Character replace filters to remove extra white space).  I am assuming
> that this is causing our issue, but would be nice if someone could
> confirm.
>
>
> On Tue, Apr 17, 2012 at 12:08 AM, Jamie Johnson  wrote:
>> I created to track this.  https://issues.apache.org/jira/browse/SOLR-3362
>>
>> On Mon, Apr 16, 2012 at 11:18 PM, Jamie Johnson  wrote:
>>> doing some debugging this is the relevant block in FacetComponent
>>>
>>>          String name = shardCounts.getName(j);
>>>          long count = ((Number)shardCounts.getVal(j)).longValue();
>>>          ShardFacetCount sfc = dff.counts.get(name);
>>>          sfc.count += count;
>>>
>>>
>>> the issue is sfc is null.  I don't know if that should or should not
>>> occur, but if I add a check (if(sfc == null)continue;) then I think it
>>> would work.  Is this appropriate?
>>>
>>> On Mon, Apr 16, 2012 at 10:45 PM, Jamie Johnson  wrote:
 worth notingthe error goes away at times depending on the number
 of facets asked for.

 On Mon, Apr 16, 2012 at 10:38 PM, Jamie Johnson  wrote:
> I found (what appears to be) the issue I am experiencing here
> http://lucene.472066.n3.nabble.com/NullPointerException-with-distributed-facets-td3528165.html
> but there were no responses to it.  I've included the stack trace I am
> seeing, any ideas why this would happen?
>
>
> SEVERE: java.lang.NullPointerException
>        at 
> org.apache.solr.handler.component.FacetComponent.refineFacets(FacetComponent.java:489)
>        at 
> org.apache.solr.handler.component.FacetComponent.handleResponses(FacetComponent.java:278)
>        at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307)
>        at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1550)
>        at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
>        at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
>        at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
>        at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
>        at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
>        at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
>        at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
>        at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
>        at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
>        at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
>        at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
>        at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
>        at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
>        at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
>        at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
>        at org.eclipse.jetty.server.Server.handle(Server.java:351)
>        at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
>        at 
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
>        at 
> org.eclipse.jetty.server.AbstractHttpConnectio

Re: Distributed FacetComponent NullPointer Exception

2012-04-17 Thread Jamie Johnson
I tried to clean it up a little bit, but removed too much.  Here is an
example which more closely shows what is in the log.  I changed the
actual data, hopefully I didn't mess it up.

facet=true&tie=0.1&ids=urn:sha256:0bea3adf1415c6c737063122d8abd343b24167bdfc7134a3efaef79b263c0a43&qf=%0a++author^0.5+type^0.5+content_mvtxt^10++subject_phonetic^1+subject_txt^20%0a+++&q.alt=*:*&distrib=false&wt=javabin&se_organizationname_umvs__terms=Test+%0a%0a%0a%0a?+%0a%0a%0a%0aTest+Daily+News,The+Test+Association,Toyota,Washington+Post&version=2&rows=10&rows=10&NOW=1334673365414&shard.url=JamiesMac.local:8502/solr/shard5-core1/&fl=*,score&q=bob&facet.field={!terms%3D$se_organizationname_umvs__terms}se_organizationname_umvs&isShard=true

The only thing I changed (or tried to change) was reducing the number of
terms listed in the se_organizationname_umvs__terms field.

On Tue, Apr 17, 2012 at 10:22 AM, Yonik Seeley
 wrote:
> facet.field={!terms=$organization__terms}organization
>
> This is referring to another request parameter that Solr should have
> added (organization__terms) .  Did you cut-n-paste all of the
> parameters below?
>
> -Yonik
> lucenerevolution.com - Lucene/Solr Open Source Search Conference.
> Boston May 7-10
>
>
>
> On Tue, Apr 17, 2012 at 10:13 AM, Jamie Johnson  wrote:
>> I'm noticing that this issue seems to be occurring with facet fields
>> which have some unexpected characters.  For instance the query that I
>> see going across the wire is as follows
>>
>> facet=true&tie=0.1&ids=3F2504E0-4F89-11D3-9A0C-0305E82C3301&qf=%0a++author^0.5+type^0.5+content_mvtxt^10++subject_phonetic^1+subject_txt^20%0a+++&q.alt=*:*&distrib=false&Test+%0a%0a%0a%0a?+%0a%0a%0a%0aDaily+News,Test+Association,Toyota,U.S.,Washington+Post&rows=10&rows=10&NOW=1334670761188&shard.url=JamiesMac.local:8502/solr/shard5-core1/&fl=*,score&q=bob&facet.field={!terms%3D$organization__terms}organization&isShard=true
>>
>> Now there is an obvious issue here with our data having these \n
>> characters in it which I will be fixing shortly (plan to use a set of
>> Character replace filters to remove extra white space).  I am assuming
>> that this is causing our issue, but would be nice if someone could
>> confirm.
>>
>>
>> On Tue, Apr 17, 2012 at 12:08 AM, Jamie Johnson  wrote:
> >>> I created SOLR-3362 to track this: https://issues.apache.org/jira/browse/SOLR-3362
>>>
>>> On Mon, Apr 16, 2012 at 11:18 PM, Jamie Johnson  wrote:
 doing some debugging this is the relevant block in FacetComponent

          String name = shardCounts.getName(j);
          long count = ((Number)shardCounts.getVal(j)).longValue();
          ShardFacetCount sfc = dff.counts.get(name);
          sfc.count += count;


 the issue is sfc is null.  I don't know if that should or should not
 occur, but if I add a check (if(sfc == null)continue;) then I think it
 would work.  Is this appropriate?
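
For reference, the guard described above would look like the sketch below
(written against the block quoted from FacetComponent; whether silently
skipping the term is semantically correct is exactly the open question
tracked in SOLR-3362):

          String name = shardCounts.getName(j);
          long count = ((Number) shardCounts.getVal(j)).longValue();
          // look up the globally tracked count for this term
          ShardFacetCount sfc = dff.counts.get(name);
          if (sfc == null) {
            // the shard returned a term that was never registered in
            // dff.counts -- skip it instead of hitting the NPE
            continue;
          }
          sfc.count += count;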

 On Mon, Apr 16, 2012 at 10:45 PM, Jamie Johnson  wrote:
> worth noting... the error goes away at times depending on the number
> of facets asked for.
>
> On Mon, Apr 16, 2012 at 10:38 PM, Jamie Johnson  wrote:
>> I found (what appears to be) the issue I am experiencing here
>> http://lucene.472066.n3.nabble.com/NullPointerException-with-distributed-facets-td3528165.html
>> but there were no responses to it.  I've included the stack trace I am
>> seeing, any ideas why this would happen?
>>
>>
>> SEVERE: java.lang.NullPointerException
>>        at 
>> org.apache.solr.handler.component.FacetComponent.refineFacets(FacetComponent.java:489)
>>        at 
>> org.apache.solr.handler.component.FacetComponent.handleResponses(FacetComponent.java:278)
>>        at 
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307)
>>        at 
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1550)
>>        at 
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
>>        at 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
>>        at 
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
>>        at 
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
>>        at 
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
>>        at 
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
>>        at 
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
>>        at 
>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
>>        at 
>> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
>>   

what's best to use for monitoring solr 3.6 farm on redhat/tomcat

2012-04-17 Thread Robert Petersen
Hello solr users,

 

Is there any lightweight tool of choice for monitoring multiple solr
boxes for memory consumption, heap usage, and other statistics?  We have
a pretty large farm of RHEL servers running solr now.  Up until
migrating from 1.4 to 3.6 we were running the lucid gaze component on
each box for these stats, but it doesn't function under solr 3.x, and it
was cumbersome anyway since we had to hit each box separately.  What do
the rest of you use to keep tabs on your servers?  We're running
solr 3.6 in tomcat on RHEL.
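
In the absence of a dedicated tool, a plain JMX client can pull heap numbers
from each box. A minimal sketch, assuming each Tomcat is started with
-Dcom.sun.management.jmxremote.port=9010 (plus ssl/authenticate settings);
that flag is an assumption, not part of the setup described above:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class HeapCheck {
        public static void main(String[] args) throws Exception {
            // args[0] = hostname of the solr box to poll
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + args[0] + ":9010/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            MemoryMXBean mem = ManagementFactory.newPlatformMXBeanProxy(
                conn, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            System.out.println(args[0] + " heap: " + mem.getHeapMemoryUsage());
            jmxc.close();
        }
    }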

 

Red Hat Enterprise Linux Server release 5.3 (Tikanga)

Apache Tomcat Version 6.0.20

java.runtime.version = 1.6.0_25-b06

java.vm.name = Java HotSpot(TM) 64-Bit Server VM

 

 

Thanks,

 

Robert (Robi) Petersen

Senior Software Engineer

Site Search Specialist

 



Solr and TREC Enterprise Track 2007

2012-04-17 Thread obadayh
Dear All, 

Can anybody tell me how to index the TREC Enterprise Track 2007 collection with Solr?

Thanx 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-TREC-Enterprise-Track-2007-tp3917893p3917893.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Problem with faceting on a boolean field

2012-04-17 Thread Yonik Seeley
On Tue, Apr 17, 2012 at 2:22 PM, Kissue Kissue  wrote:
> Hi,
>
> I am faceting on a boolean field called "usedItem". There are a total of
> 607601 items in the index and they all have value for "usedItem" set to
> false.
>
> However when i do a search for *:* and faceting on "usedItem", the num
> found is set correctly to 607601 but i get the facet result below:
>
> <lst name="usedItem"><int name="false">17971</int></lst>

You can verify by changing the query from *:* to usedItem:false  (or
adding an additional fq to that effect).
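
A SolrJ sketch of that check (assumes a 3.6+/4.x client; the URL and core
location are illustrative):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("usedItem:false"); // the extra fq suggested above
    q.setRows(0);                       // only numFound is needed
    QueryResponse rsp = server.query(q);
    System.out.println("usedItem:false matches " + rsp.getResults().getNumFound());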

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10


Re: Difference between two solr indexes

2012-04-17 Thread Pawel Rog
If there are only 100'000 documents, dump all document ids and diff them.
If you're using a linux-based system you can use simple command-line tools for this.
Something like the following can be helpful (note the large rows value so the
CSV dump isn't truncated at the default 10 rows):

curl "http://your.hostA:port/solr/index/select?q=*:*&fl=id&wt=csv&rows=1000000" > /tmp/idsA
curl "http://your.hostB:port/solr/index/select?q=*:*&fl=id&wt=csv&rows=1000000" > /tmp/idsB
diff /tmp/idsA /tmp/idsB | grep "<\|>" | awk '{print $2;}' | sed
's/\(.*\)/<id>\1<\/id>/g' > /tmp/ids_to_delete.xml

Now you have the file. Wrap its contents in "<delete>" and
"</delete>" and upload the file into solr using curl:
curl -X POST -d @/tmp/ids_to_delete.xml "http://your.hostA:port
/solr/index/update"
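
If a SolrJ client is handier than hand-built XML, the same deletion could be
sketched like this (URL and ids are illustrative):

    import java.util.Arrays;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    SolrServer server = new HttpSolrServer("http://your.hostA:port/solr/index");
    // ids found in A but not in B, as produced by the diff above
    server.deleteById(Arrays.asList("id1", "id2", "id3"));
    server.commit();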

On Tue, Apr 17, 2012 at 2:09 PM, nutchsolruser wrote:

> I'm Also seeking solution for similar problem.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Difference-between-two-solr-indexes-tp3916328p3917050.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


HTML Indexing error

2012-04-17 Thread Chambeda
Hi All,

I am trying to parse some text that contains embedded HTML elements and am
getting the following error:

FATAL: Solr returned an error #400 Unexpected close tag </...>;
expected </...>. (The tag names were stripped by the mail archive.)

My set up is as follows:

schema.xml (the field type and analyzer definition were stripped by the mail archive):

XML snippet (the embedded HTML markup was stripped by the mail archive):
1 Bose's best bookshelf speakers are updated to provide an even more
spacious, natural listening experience. They're great for stereo, or as a
front- or rear-channel solution for home theater. Learn more about Bose
products and proprietary technologies in our Bose Store.

According to the documentation the HTML tags should be removed correctly.

Anything I am missing?


--
View this message in context: 
http://lucene.472066.n3.nabble.com/HTML-Indexing-error-tp3918174p3918174.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How sorlcloud distribute data among shards of the same cluster?

2012-04-17 Thread Mark Miller

On Apr 17, 2012, at 9:56 AM, emma1023 wrote:

It hashes the id. The doc distribution is fairly even - but sizes may be fairly 
different.

> How solrcloud manage distribute data among shards of the same cluster when
> you query? Is it distribute the data equally? What is the basis? Which part
> of the code that I can find about it?Thank you so much!
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-sorlcloud-distribute-data-among-shards-of-the-same-cluster-tp3917323p3917323.html
> Sent from the Solr - User mailing list archive at Nabble.com.

- Mark Miller
lucidimagination.com
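
The id hashing mentioned above boils down to mapping each document id onto a
shard. A toy sketch of the idea (an illustration only, not SolrCloud's actual
implementation, which lives in its distributed update handling code):

    // toy illustration of hash-based doc-to-shard assignment
    int numShards = 4;
    String docId = "doc42";
    int hash = docId.hashCode();
    int shard = (hash & Integer.MAX_VALUE) % numShards; // mask off the sign bit
    System.out.println(docId + " -> shard" + (shard + 1));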













Re: SolrCloud: Programmatically create multiple collections?

2012-04-17 Thread Mark Miller

On Apr 17, 2012, at 7:07 AM, ravi wrote:

> Hi,
> 
> I have recently started experimenting with solrCloud. I want to use it for
> below mentioned requirements:
> 
> - create one collection per client 
> - create several shards per collection (say 1 shard for each day of month)
> - all of the collection would follow the same schema. 
> - may need to add more machines to the cluster to index faster or provide
> quick turnaround for queries.  (think this is possible with zookeeper)
> - programmatic access ( and create, if possible) shard names and add data
> to that shard. 
> 
> I was looking into tutorial online and could get a sense that one needs to
> create collections and shards manually and there is no programmatic way to
> create them dynamically from code, say, using solrj. 

You can - see http://wiki.apache.org/solr/SolrCloud#Creating_cores_via_CoreAdmin
http://wiki.apache.org/solr/CoreAdmin#CoreAdminHandler
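
A SolrJ sketch of the CoreAdmin route from the first link (core name and
instance dir are made up; per that page, the raw HTTP API also accepts
collection/shard parameters for the cloud case):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;

    // talk to the CoreAdmin endpoint of one node
    SolrServer admin = new HttpSolrServer("http://localhost:8983/solr");
    CoreAdminRequest.createCore("client1_day01", "client1_day01", admin);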


> 
> I now know that adding shards to an already running cluster is not possible.
> ( http://wiki.apache.org/solr/SolrCloud Re-sizing your cluster )
> 
> Is there something wrong in the way i want to organise my indexes? Am i
> missing something very obvious?
> 
> Thanks!
> Ravi
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrCloud-Programmatically-create-multiple-collections-tp3916927p3916927.html
> Sent from the Solr - User mailing list archive at Nabble.com.

- Mark Miller
lucidimagination.com













Re: [Solr 4.0] what is stored in .tim index file format?

2012-04-17 Thread Robert Muir
This is the term dictionary for 4.0's default codec (currently uses
BlockTree implementation)

.tim is the on-disk portion of the terms (similar in function to .tis
in previous releases)
.tip is the in-memory "terms index" (similar in function to .tii in
previous releases)
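
If you want to see which segments the .tim/.tip data belongs to, Lucene's
CheckIndex tool will dump per-segment details (a sketch; the index path is
illustrative):

    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.FSDirectory;

    CheckIndex checker = new CheckIndex(FSDirectory.open(new File("/opt/solr/home/data/index")));
    checker.setInfoStream(System.out);          // print per-segment diagnostics
    CheckIndex.Status status = checker.checkIndex();
    System.out.println("index clean? " + status.clean);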

On Tue, Apr 17, 2012 at 6:37 AM, Lyuba Romanchuk
 wrote:
> Hi,
>
> I have index ~31G where
> 27% of the index size is .fdt files (8.5G)
> 20% - .fdx files (6.2G)
> 37% - .frq files (11.6G)
> 16% - .tim files (5G)
>
> I didn't manage to find the description for .tim files. Can you help me
> with this?
>
> Thank you.
> Best regards,
> Lyuba



-- 
lucidimagination.com


Re: what's best to use for monitoring solr 3.6 farm on redhat/tomcat

2012-04-17 Thread Otis Gospodnetic
Hi Robert,

Have a look at SPM for 
Solr: http://sematext.com/spm/solr-performance-monitoring/index.html

It has all Solr metrics, works with 3.*, has a bunch of system metrics,
filtering, alerting, email subscriptions, no loss of granularity, and you can
use it to monitor other types of systems (e.g. HBase, ElasticSearch, Sensei...)
and, starting with the next version, pretty much any Java app (not necessarily
a webapp).

Otis

Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html



>
> From: Robert Petersen 
>To: solr-user@lucene.apache.org 
>Sent: Tuesday, April 17, 2012 12:02 PM
>Subject: what's best to use for monitoring solr 3.6 farm on redhat/tomcat
> 
>Hello solr users,
>
>
>
>Is there any lightweight tool of choice for monitoring multiple solr
>boxes for memory consumption, heap usage, and other statistics?  We have
>a pretty large farm of RHEL servers running solr now and up until
>migrating from 1.4 to 3.6 we were running the lucid gaze component on
>each box for these stats... and this doesn't function under solr 3.x and
>this was cumbersome anyway as we had to hit each box separately.  What
>do the rest of you guys use to keep tabs on your servers?  We're running
>solr 3.6 in tomcat on RHEL
>
>
>
>Red Hat Enterprise Linux Server release 5.3 (Tikanga)
>
>Apache Tomcat Version 6.0.20
>
>java.runtime.version = 1.6.0_25-b06
>
>java.vm.name = Java HotSpot(TM) 64-Bit Server VM
>
>
>
>
>
>Thanks,
>
>
>
>Robert (Robi) Petersen
>
>Senior Software Engineer
>
>Site Search Specialist
>
>
>
>
>
>

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-17 Thread Otis Gospodnetic
I think Jason is right - there is no index splitting in ES and SolrCloud, so 
one has to think ahead, "overshard", and then count on redistributing shards 
from oversubscribed nodes to other nodes.  No resharding on demand and no 
index/shard splitting yet.

Otis 

Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html



>
> From: Jason Rutherglen 
>To: solr-user@lucene.apache.org 
>Sent: Monday, April 16, 2012 8:42 PM
>Subject: Re: Options for automagically Scaling Solr (without needing 
>distributed index/replication) in a Hadoop environment
> 
>One of the big weaknesses of Solr Cloud (and ES?) is the lack of the
>ability to redistribute shards across servers, meaning splitting a
>shard while taking live updates as it grows too large.
>
>How do you plan on elastically adding more servers without this feature?
>
>Cassandra and HBase handle elasticity in their own ways.  Cassandra
>has successfully implemented the Dynamo model and HBase uses the
>traditional BigTable 'split'.  Both systems are complex though are at
>a singular level of maturity.
>
>Also Cassandra [successfully] implements multiple data center support,
>is that available in SC or ES?
>
>On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
> wrote:
>> Hello Ali,
>>
>>> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
>>
>>> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
>>> crawled + indexed every *4 weeks, *with a search latency of less than 0.5
>>> seconds.
>>
>>
>> That's fine.  Whether it's doable with any tech will depend on how much 
>> hardware you give it, among other things.
>>
>>> Needless to mention, the search index needs to scale to 5Billion pages. It
>>> is also possible that I might need to store multiple indexes -- one for
>>> crawled content, and one for ancillary data that is also very large. Each
>>> of these indices would likely require a logically distributed and
>>> replicated index.
>>
>>
>> Yup, OK.
>>
>>> However, I would like for such a system to be homogenous with the Hadoop
>>> infrastructure that is already installed on the cluster (for the crawl). In
>>> other words, I would much prefer if the replication and distribution of the
>>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
>>> using another scalability framework (such as SolrCloud). In addition, it
>>> would be ideal if this environment was flexible enough to be dynamically
>>> scaled based on the size requirements of the index and the search traffic
>>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
>>> enough to automatically provision additional processing power into the
>>> cluster without requiring server re-starts).
>>
>>
>> There is no such thing just yet.
>> There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
>> automatically index HBase content, but that was either not completed or not 
>> committed into HBase.
>>
>>> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
>>> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
>>> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
>>> mature enough and would be the right architectural choice to go along with
>>> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
>>> above.
>>
>>
>> Here is a summary on all of them:
>> * Search on HBase - I assume you are referring to the same thing I mentioned 
>> above.  Not ready.
>> * Solandra - uses Cassandra+Solr, plus DataStax now has a different 
>> (commercial) offering that combines search and Cassandra.  Looks good.
>> * Lily - data stored in HBase cluster gets indexed to a separate Solr 
>> instance(s)  on the side.  Not really integrated the way you want it to be.
>> * ElasticSearch - solid at this point, the most dynamic solution today, can 
>> scale well (we are working on a many-B documents index and hundreds of 
>> nodes with ElasticSearch right now), etc.  But again, not integrated with 
>> Hadoop the way you want it.
>> * IndexTank - has some technical weaknesses, not integrated with Hadoop, not 
>> sure about its future considering LinkedIn uses Zoie and Sensei already.
>> * And there is SolrCloud, which is coming soon and will be solid, but is 
>> again not integrated.
>>
>> If I were you and I had to pick today - I'd pick ElasticSearch if I were 
>> completely open.  If I had Solr bias I'd give SolrCloud a try first.
>>
>>> Lastly, how much hardware (assuming a medium sized EC2 instance) would you
>>> estimate my needing with this setup, for regular web-data (HTML text) at
>>> this scale?
>>
>> I don't know off the topic of my head, but I'm guessing several hundred for 
>> serving search requests.
>>
>> HTH,
>>
>> Otis
>> --
>> Search Analytics - http://sematext.com/search-analytics/index.html
>>
>> Scalable Performance Monitoring - 

Populating a filter cache by means other than a query

2012-04-17 Thread Chris Collins
Hi, I am a long time Lucene user but new to solr.  I would like to use
something like the filterCache, but build such a cache not from a query but
from custom code.  I will ask my question using techniques and vocab I am
familiar with; I'm not sure it's actually the right way, so I apologize if it's
just the wrong approach.

The scenario is that I would like to filter a result set by a set of labeled 
documents, I will call that set L.
L contains app-specific document IDs that are indexed as literals in the
Lucene field "myid".
I would imagine I could build an OpenBitSet by enumerating the termdocs and
looking for the intersecting ids in my label set.
Now I have my bitset, which I assume I could use in a filter.
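
A minimal Lucene 3.x-style sketch of that first approach (reader and labelSet
are assumed to be in scope):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.util.OpenBitSet;

    // build a bitset of the docs whose "myid" term is in the label set L
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    TermDocs td = reader.termDocs();
    for (String id : labelSet) {
        td.seek(new Term("myid", id));
        while (td.next()) {
            bits.set(td.doc());
        }
    }
    td.close();
    // OpenBitSet extends DocIdSet, so it can back a Filter's getDocIdSet()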

Another approach would be to implement a hits collector, compute a fieldcache 
from that myid field and look for the intersection in a hashtable of L at 
scoring time, throwing out results that are not contained in the hashtable.  

Of course I am working within the confines / concepts that SOLR has laid out.
Without going completely off the reservation, is there a neat way of doing such
a thing with SOLR?

Glad to clarify if my question makes absolutely no sense.

Best

C

Re: How sorlcloud distribute data among shards of the same cluster?

2012-04-17 Thread emma1023
Thanks for your reply. In Solr 3.x, we need to manually hash the doc id to
a server. How does solrcloud do this instead? I am working on a project
using solrcloud, but we need to monitor how solrcloud distributes the
data. I cannot find which part of the source code does this. Is it
in the cloud part? Thanks.


On Tue, Apr 17, 2012 at 3:16 PM, Mark Miller-3 [via Lucene] <
ml-node+s472066n3918192...@n3.nabble.com> wrote:

>
> On Apr 17, 2012, at 9:56 AM, emma1023 wrote:
>
> It hashes the id. The doc distribution is fairly even - but sizes may be
> fairly different.
>
> > How solrcloud manage distribute data among shards of the same cluster
> when
> > you query? Is it distribute the data equally? What is the basis? Which
> part
> > of the code that I can find about it?Thank you so much!
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/How-sorlcloud-distribute-data-among-shards-of-the-same-cluster-tp3917323p3917323.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>
>
>


--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-sorlcloud-distribute-data-among-shards-of-the-same-cluster-tp3917323p3918348.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-17 Thread Jason Rutherglen
> redistributing shards from oversubscribed nodes to other nodes

Redistributing shards on a live system is not possible, however, because
the updates in flight will likely be lost.  It is also not simple
technology to build from the ground up.

As it is today, one would need to schedule downtime; for multi-terabyte
live realtime systems that is not acceptable and will cause the system
to miss SLAs.

Solr Cloud seems limited to a simple hashing algorithm for sending
updates to the appropriate shard.  This is precisely what Dynamo (and
Cassandra) solves, e.g., elastically and dynamically rearranging the
hash 'ring' both logically and physically.

In addition, there is the potential for data loss which Cassandra has
the technology for.

On Tue, Apr 17, 2012 at 1:33 PM, Otis Gospodnetic
 wrote:
> I think Jason is right - there is no index splitting in ES and SolrCloud, so 
> one has to think ahead, "overshard", and then count on redistributing shards 
> from oversubscribed nodes to other nodes.  No resharding on demand and no 
> index/shard splitting yet.
>
> Otis
> 
> Performance Monitoring SaaS for Solr - 
> http://sematext.com/spm/solr-performance-monitoring/index.html
>
>
>
>>
>> From: Jason Rutherglen 
>>To: solr-user@lucene.apache.org
>>Sent: Monday, April 16, 2012 8:42 PM
>>Subject: Re: Options for automagically Scaling Solr (without needing 
>>distributed index/replication) in a Hadoop environment
>>
>>One of the big weaknesses of Solr Cloud (and ES?) is the lack of the
>>ability to redistribute shards across servers, meaning splitting a
>>shard while taking live updates as it grows too large.
>>
>>How do you plan on elastically adding more servers without this feature?
>>
>>Cassandra and HBase handle elasticity in their own ways.  Cassandra
>>has successfully implemented the Dynamo model and HBase uses the
>>traditional BigTable 'split'.  Both systems are complex though are at
>>a singular level of maturity.
>>
>>Also Cassandra [successfully] implements multiple data center support,
>>is that available in SC or ES?
>>
>>On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
>> wrote:
>>> Hello Ali,
>>>
 I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
>>>
 using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
 crawled + indexed every *4 weeks, *with a search latency of less than 0.5
 seconds.
>>>
>>>
>>> That's fine.  Whether it's doable with any tech will depend on how much 
>>> hardware you give it, among other things.
>>>
 Needless to mention, the search index needs to scale to 5Billion pages. It
 is also possible that I might need to store multiple indexes -- one for
 crawled content, and one for ancillary data that is also very large. Each
 of these indices would likely require a logically distributed and
 replicated index.
>>>
>>>
>>> Yup, OK.
>>>
 However, I would like for such a system to be homogenous with the Hadoop
 infrastructure that is already installed on the cluster (for the crawl). In
 other words, I would much prefer if the replication and distribution of the
 Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
 using another scalability framework (such as SolrCloud). In addition, it
 would be ideal if this environment was flexible enough to be dynamically
 scaled based on the size requirements of the index and the search traffic
 at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
 enough to automatically provision additional processing power into the
 cluster without requiring server re-starts).
>>>
>>>
>>> There is no such thing just yet.
>>> There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
>>> automatically index HBase content, but that was either not completed or not 
>>> committed into HBase.
>>>
 However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
 be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
 Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
 mature enough and would be the right architectural choice to go along with
 a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
 above.
>>>
>>>
>>> Here is a summary on all of them:
>>> * Search on HBase - I assume you are referring to the same thing I 
>>> mentioned above.  Not ready.
>>> * Solandra - uses Cassandra+Solr, plus DataStax now has a different 
>>> (commercial) offering that combines search and Cassandra.  Looks good.
>>> * Lily - data stored in HBase cluster gets indexed to a separate Solr 
>>> instance(s)  on the side.  Not really integrated the way you want it to be.
>>> * ElasticSearch - solid at this point, the most dynamic solution today, can 
>>> scale well (we are working on a many-B documents index and hundreds of 
>>> nodes with ElasticSearch right now), etc.  But 

RE: what's best to use for monitoring solr 3.6 farm on redhat/tomcat

2012-04-17 Thread Robert Petersen
Wow that looks like just what the doctor ordered!  Thanks Otis

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Tuesday, April 17, 2012 1:29 PM
To: solr-user@lucene.apache.org
Subject: Re: what's best to use for monitoring solr 3.6 farm on redhat/tomcat

Hi Robert,

Have a look at SPM for 
Solr: http://sematext.com/spm/solr-performance-monitoring/index.html

It has all Solr metrics, works with 3.*, has a bunch of system metrics, 
filtering, alerting, email subscriptions, no loss of granularity, and you can 
use it to monitor other types of systems (e.g. HBase, ElasticSearch, Sensei...) 
and, starting with the next versions pretty much any Java app (not necessarily 
a webapp).

Otis

Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html



>
> From: Robert Petersen 
>To: solr-user@lucene.apache.org 
>Sent: Tuesday, April 17, 2012 12:02 PM
>Subject: what's best to use for monitoring solr 3.6 farm on redhat/tomcat
> 
>Hello solr users,
>
>
>
>Is there any lightweight tool of choice for monitoring multiple solr
>boxes for memory consumption, heap usage, and other statistics?  We have
>a pretty large farm of RHEL servers running solr now and up until
>migrating from 1.4 to 3.6 we were running the lucid gaze component on
>each box for these stats... and this doesn't function under solr 3.x and
>this was cumbersome anyway as we had to hit each box separately.  What
>do the rest of you guys use to keep tabs on your servers?  We're running
>solr 3.6 in tomcat on RHEL
>
>
>
>Red Hat Enterprise Linux Server release 5.3 (Tikanga)
>
>Apache Tomcat Version 6.0.20
>
>java.runtime.version = 1.6.0_25-b06
>
>java.vm.name = Java HotSpot(TM) 64-Bit Server VM
>
>
>
>
>
>Thanks,
>
>
>
>Robert (Robi) Petersen
>
>Senior Software Engineer
>
>Site Search Specialist
>
>
>
>
>
>


Different solr config under tomcat.

2012-04-17 Thread mizayah
Is there a way to use a different path for solrconfig.xml, like
solrconfig_slave.xml for instance, under tomcat?

I don't want to run cores. I know how to do it with cores, but I want to have
a single instance.
Is there any parameter I can use to tell tomcat to use
solrconfig_slave.xml?

Please help.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Different-solr-config-under-tomcat-tp3918465p3918465.html
Sent from the Solr - User mailing list archive at Nabble.com.


Hide results for dataimport - initArgs

2012-04-17 Thread Adolfo Carreno
Hi all. For security reasons I want to hide the result of a dataimport command,
specifically the section "initArgs", in order to hide the connection parameters
of the database. I removed the tag "datasource" from the config.xml and moved it
into solrconfig.xml, into the requestHandler defined for the dataimport
operation. This makes it possible to hide the user, password, and database url
from the dataimport.jsp app.
Now, every time I invoke /dataimport?command=yyy, the output XML displays this
information (reconstructed here; the mail archive stripped the tags):

<lst name="responseHeader"><int name="status">0</int><int name="QTime">3</int></lst>
<lst name="initArgs">
  <lst name="defaults">
    <str name="config">dih-tenant1-jdbc.xml</str>
    <str name="driver">org.postgresql.Driver</str>
    <str name="url">jdbc:postgresql://localhost/db1</str>
    <str name="user">udb1</str>
    <str name="password">pdb1</str>
  </lst>
  <lst name="invariants"><str name="...">tenant1</str></lst>
</lst>
<str name="command">status</str>
<str name="status">idle</str>
<lst name="importResponse"/>
This response format is experimental. It is likely to change in the future.

I wonder if there is a way to remove completely the initArgs section from the
output XML, or a way to mask this information. I'm working in a very restricted
environment, and we don't want this information to be shown in any XML output.
Thanks for your help!!! Adolfo

Re: Hide results for dataimport - initArgs

2012-04-17 Thread Tomás Fernández Löbbe
I guess this should be possible by setting "echoParams" to none or
explicit as an invariant. For example:

  <requestHandler name="/dataimport"
      class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="invariants">
      <str name="echoParams">none</str>
    </lst>
...
  </requestHandler>

I haven't tried it, but I think that should work.


On Tue, Apr 17, 2012 at 6:20 PM, Adolfo Carreno wrote:

> Hi all.For security reasons I want to hide the result of a dataimport
> command, specifically the section "initArgs", in order to hide the
> connection parameters of the database. I removed from the config.xml the
> tag "datasource", and moved into the solrconfig.xml, in the requestHandler
> defined for the dataimport operation. This allows to hide the user,
> password, and database url from the dataimport.jsp app.
> Now, everytime that I invoke the /dataimport?command=yyy, in the output
> XML is displaying this information:
> 0 name="QTime">3 name="config">dih-tenant1-jdbc.xml name="driver">org.postgresql.Driver name="url">jdbc:postgresql://localhost/db1 name="user">udb1pdb1 name="invariants">tenant1 name="command">statusidle name="importResponse"/>This
> response format is experimental. It is likely to change in the
> future.
> I wonder if there is a way to remove completely the initArrg section from
> the output XML, or a way to mask this information. I'm working in a very
> restricted environment, and we don't want this information to be shown in
> any XML output.
> Thanks for your help!!!Adolfo


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-17 Thread Lukáš Vlček
Hi,

speaking about ES, I think it would be fair to mention that one has to
specify the number of shards upfront when the index is created - that is
correct. However, it is possible to give an index one or more aliases, which
basically means that you can add new indices on the fly and give them the same
alias, which is then used to search against. Given that you can add/remove
indices, nodes, and aliases on the fly, I think there is a way to handle a
growing data set with ease. If anyone is interested, such a scenario has been
discussed in detail on the ES mailing list.

Regards,
Lukas

On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> One of the big weaknesses of Solr Cloud (and ES?) is the lack of the
> ability to redistribute shards across servers, meaning splitting a
> shard while taking live updates as it grows too large.
>
> How do you plan on elastically adding more servers without this feature?
>
> Cassandra and HBase handle elasticity in their own ways.  Cassandra
> has successfully implemented the Dynamo model and HBase uses the
> traditional BigTable 'split'.  Both systems are complex though are at
> a singular level of maturity.
>
> Also Cassandra [successfully] implements multiple data center support,
> is that available in SC or ES?
>
> On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
>  wrote:
> > Hello Ali,
> >
> >> I'm trying to setup a large scale *Crawl + Index + Search
> *infrastructure
> >
> >> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web
> pages*,
> >> crawled + indexed every *4 weeks, *with a search latency of less than
> 0.5
> >> seconds.
> >
> >
> > That's fine.  Whether it's doable with any tech will depend on how much
> hardware you give it, among other things.
> >
> >> Needless to mention, the search index needs to scale to 5Billion pages.
> It
> >> is also possible that I might need to store multiple indexes -- one for
> >> crawled content, and one for ancillary data that is also very large.
> Each
> >> of these indices would likely require a logically distributed and
> >> replicated index.
> >
> >
> > Yup, OK.
> >
> >> However, I would like for such a system to be homogenous with the Hadoop
> >> infrastructure that is already installed on the cluster (for the
> crawl). In
> >> other words, I would much prefer if the replication and distribution of
> the
> >> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead
> of
> >> using another scalability framework (such as SolrCloud). In addition, it
> >> would be ideal if this environment was flexible enough to be dynamically
> >> scaled based on the size requirements of the index and the search
> traffic
> >> at the time (i.e. if it is deployed on an Amazon cluster, it should be
> easy
> >> enough to automatically provision additional processing power into the
> >> cluster without requiring server re-starts).
> >
> >
> > There is no such thing just yet.
> > There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt
> to automatically index HBase content, but that was either not completed or
> not committed into HBase.
> >
> >> However, I'm not sure which Solr-based tool in the Hadoop ecosystem
> would
> >> be ideal for this scenario. I've heard mention of Solr-on-HBase,
> Solandra,
> >> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of
> these is
> >> mature enough and would be the right architectural choice to go along
> with
> >> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling
> aspects
> >> above.
> >
> >
> > Here is a summary on all of them:
> > * Search on HBase - I assume you are referring to the same thing I
> mentioned above.  Not ready.
> > * Solandra - uses Cassandra+Solr, plus DataStax now has a different
> (commercial) offering that combines search and Cassandra.  Looks good.
> > * Lily - data stored in HBase cluster gets indexed to a separate Solr
> instance(s)  on the side.  Not really integrated the way you want it to be.
> > * ElasticSearch - solid at this point, the most dynamic solution today,
> can scale well (we are working on a many-B documents index and hundreds
> of nodes with ElasticSearch right now), etc.  But again, not integrated
> with Hadoop the way you want it.
> > * IndexTank - has some technical weaknesses, not integrated with Hadoop,
> not sure about its future considering LinkedIn uses Zoie and Sensei already.
> > * And there is SolrCloud, which is coming soon and will be solid, but is
> again not integrated.
> >
> > If I were you and I had to pick today - I'd pick ElasticSearch if I were
> completely open.  If I had Solr bias I'd give SolrCloud a try first.
> >
> >> Lastly, how much hardware (assuming a medium sized EC2 instance) would
> you
> >> estimate my needing with this setup, for regular web-data (HTML text) at
> >> this scale?
> >
> > I don't know off the topic of my head, but I'm guessing several hundred
> for serving search requests.
> >
> > HTH,
> >
> > Otis
> > --
> > Searc

Re: Hide results for dataimport - initArgs

2012-04-17 Thread Adolfo Carreno
Thanks Tomás for your response. Unfortunately it didn't work; the datasource
information is still presented in each dataimport output (reconstructed here;
the mail archive stripped the tags):

<lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int></lst>
<lst name="initArgs">
  <lst name="defaults">
    <str name="config">dih-tenant1-jdbc.xml</str>
    <str name="driver">org.postgresql.Driver</str>
    <str name="url">jdbc:postgresql://localhost/cloud</str>
    <str name="user">udb1</str>
    <str name="password">pdb1</str>
  </lst>
  <lst name="invariants"><str name="...">nbc</str><str name="echoParams">none</str></lst>
</lst>
<str name="command">status</str>
<str name="status">idle</str>
This response format is experimental. It is likely to change in the future.

--- On Tue, 4/17/12, Tomás Fernández Löbbe  wrote:

From: Tomás Fernández Löbbe 
Subject: Re: Hide results for dataimport - initArgs
To: solr-user@lucene.apache.org
Date: Tuesday, April 17, 2012, 7:43 PM

I guess this should be possible by setting the "echoParams"=none or
explicit as an invariant. For example:

  
     
       none
     
...
  

I haven't tried it, but I think that should work.


On Tue, Apr 17, 2012 at 6:20 PM, Adolfo Carreno wrote:

> Hi all.For security reasons I want to hide the result of a dataimport
> command, specifically the section "initArgs", in order to hide the
> connection parameters of the database. I removed from the config.xml the
> tag "datasource", and moved into the solrconfig.xml, in the requestHandler
> defined for the dataimport operation. This allows to hide the user,
> password, and database url from the dataimport.jsp app.
> Now, everytime that I invoke the /dataimport?command=yyy, in the output
> XML is displaying this information:
> 0 name="QTime">3 name="config">dih-tenant1-jdbc.xml name="driver">org.postgresql.Driver name="url">jdbc:postgresql://localhost/db1 name="user">udb1pdb1 name="invariants">tenant1 name="command">statusidle name="importResponse"/>This
> response format is experimental. It is likely to change in the
> future.
> I wonder if there is a way to remove completely the initArrg section from
> the output XML, or a way to mask this information. I'm working in a very
> restricted environment, and we don't want this information to be shown in
> any XML output.
> Thanks for your help!!!Adolfo


SOLR 4 / Date Query: Spurious Results: Is it me or ... ?

2012-04-17 Thread vybe3142
I wrote a custom handler that uses externally injected metadata (bypassing
Tika et al.).

WRT Dates, I see them associated with the correct docs when retrieving all
docs:

BUT: 

looking at the schema analyzer, things look weird:
1. Top terms = -1
2. The dates are all mixed up, with some spurious 1970 dates thrown in (I can
get rid of the 1970 dates if I use type "date" vs "tdate")
3. Multi-valued values (there should only be one per doc, as per the input
data, even though the schema allows more).

Any ideas what, if anything, I'm doing wrong?

See pic http://lucene.472066.n3.nabble.com/file/n3918636/Capture.jpg 

Here's my SOLR schema (the field definition was stripped by the mail archive):

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-4-Date-Query-Spurious-Results-Is-it-me-or-tp3918636p3918636.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Difference between two solr indexes

2012-04-17 Thread search engn dev
Thanks Pawel Rog for the much-needed reply; I'll give it a try and let you know.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Difference-between-two-solr-indexes-tp3916328p3918996.html
Sent from the Solr - User mailing list archive at Nabble.com.


Haystack - Solr recommended solr directory location

2012-04-17 Thread BillB1951
I'm using django-haystack and I am a little confused about where to put
/solr and its schema.xml, solr.xml, and solrconfig.xml files.

I currently have /solr in the following longer path --- /home/mydir/
solr/apache-solr3.6.0/example/

I'm thinking about moving the guts up a level (getting rid of /apache-
solr3.6.0/) leaving --- /home/mydir/solr/example/

It looks like the Jetty servelet is all under the example directory, so my
thought is to either change the directory name "example" to "mysite", or
save a copy of "example" as "mysite" at the same level. In either case I end
up with --- /home/mydir/solr/mysite/

I read somewhere that schema.xml, solr.xml, and solrconfig.xml should be in
the /conf directory under the "solr home" directory. So I guess I should
create --- /home/mydir/solr/mysite/conf and place my files there.

I have added the following to my setting file:

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.solr_backend.SolrEngine',
        'URL': 'http://127.0.0.1:8983/solr/mysite',
    },
}

If I set everything up as outlined above should things work? If someone has
a clean elegant setup that works and they'd like to share --- I'm very open
to suggestions.

Thank you for the help.

btw - I have worked through the solr tutorial (so solr is working), and I
have installed haystack and run syncdb (so it appears haystack is ready to go
also).


-
BillB1951
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Haystack-Solr-recommended-solr-directory-location-tp3919030p3919030.html
Sent from the Solr - User mailing list archive at Nabble.com.